SCALING RECOMMENDATION MODELS WITH DATA-AWARE ARCHITECTURES AND
HARDWARE EFFICIENT IMPLEMENTATIONS
by
Keshav Balasubramanian
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Keshav Balasubramanian
Acknowledgements
There are many people who have been pivotal to my success as a graduate student over the last five years.
I’d like to first and foremost thank my advisor, Professor Murali Annavaram. I would not have reached this
point without his support, guidance and flexible style of advising. Prof. Annavaram allowed me to explore
my own interests early on and always seemed to have the right suggestion anytime I hit a dead end with
my research. But perhaps more admirable than his ability to always know how to break his students out
of a research rut and always provide the right amount of input (never too little, never too much) is the
genuine care he demonstrates for his students, often treating them like one would their own family.
And for that I will always remember Professor Annavaram as a person, just as much as I will remember
him as my advisor.
I would also like to thank Prof. Greg Ver Steeg and Prof. Aram Galstyan for their collaboration over
the last five years. I’ve always thought of the class I took with them on Representation Learning in the
fall of 2019 as the spark to my career as a researcher. And what stemmed from that was many years
of collaboration with many of their students, some of whom I now call friends. In that vein, I’d like to
mention and thank my two closest collaborators during my time at USC, Abdulla Alshabanah and Elan
Markowitz. I can unequivocally say that I would not be where I am today if not for the discussions,
debugging sessions and brilliant ideas of both Abdulla and Elan. I feel blessed to have been able to foster
such a good professional and personal relationship with both of them. I very much look forward to reading
their dissertations and attending their defenses as well.
I’d also like to thank the many other collaborators I have worked with who, while not mentioned by name, are people I will always have gratitude towards. I’d like to thank my lab mates for their
support when I needed it and extend my sincerest gratitude to my friend Rachel Lamar for her support
and many crucial edits to many of my academic papers.
Finally, I’d like to thank my family, my aunt, uncle and cousins in San Diego and my mother and father
back in India, for their support and creating the most comfortable environment I could’ve asked for as a
graduate student.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Background
2.1 Recommendation Models
2.1.1 Collaborative Filtering
2.1.2 Matrix Factorization
2.2 Deep Learning Recommendation Models
2.2.1 DLRM Architecture
2.2.1.1 Modelling Content Information
2.2.1.2 Modelling Collaborative Information
2.3 Interactions as a Bipartite Graph
Chapter 3: cDLRM: Look Ahead Caching for Scalable Training of Recommendation Models
3.1 Introduction
3.2 Basic Approaches to addressing memory footprint and their shortcomings
3.3 cDLRM Preliminaries, Terminology and Notation
3.3.1 Caching Preliminaries
3.3.2 Embedding table cache structure
3.3.3 Lookahead Windows
3.4 cDLRM Overview
3.5 Training cDLRM on a single GPU
3.5.1 Prefetching Process
3.5.2 Caching+Training Process
3.5.2.1 Preloading
3.5.2.2 Forward Propagation
3.5.2.3 Backpropagation and Parameter Update
3.5.3 Eviction Process
3.6 Data Parallel Training with Caching
3.6.1 Maintaining Cache Coherency in Data Parallel Training
3.6.1.1 Coherency at the beginning of a lookahead window
3.6.1.2 Coherency after individual parameter updates
3.7 Experimental Evaluation
3.7.1 Datasets
3.7.2 Experimental Setup
3.7.3 Experimental Results
3.7.3.1 Model Accuracy on a single GPU
3.7.3.2 Caching overheads and training speed on a single GPU
3.7.3.3 Accuracy and Scalability of cDLRM on multiple GPUs
3.8 Related Work
Chapter 4: Graph Traversal with Tensor Functionals: A Meta-algorithm for Scalable Learning
4.1 Introduction
4.2 Related Work
4.3 Graph Traversal via Tensor Functionals (GTTF)
4.3.1 Data Structure
4.3.2 Stochastic Traversal Functional Algorithm
4.3.3 Some Specializations of AccumulateFn & BiasFn
4.3.3.1 Message Passing: Graph Convolutional variants
4.3.3.2 Node Embeddings
4.4 Theoretical Analysis
4.4.1 Estimating k-th power of transition matrix
4.4.2 Unbiased Learning
4.4.3 Complexity Analysis
4.5 Example code for instantiating GTTF
4.6 Experiments
4.6.1 Node Embeddings for Link Prediction
4.6.2 Message Passing for Node Classification
4.6.3 Experiments comparing against Sampling methods for Message Passing
4.6.4 Runtime and Memory comparison against optimized Software Frameworks
4.7 Discussion
Chapter 5: Biased User History Synthesis for Personalized Long-tail Item Recommendation
5.1 Introduction
5.2 Related Work
5.2.1 Long-tail Recommendation
5.2.1.1 Model-Agnostic
5.2.1.2 Model-Specific
5.2.2 Personalization in Recommendation Systems
5.3 Base Recommendation Systems
5.4 Biased User History Synthesis
5.4.1 Tail item biased Sampling
5.4.2 Synthesis Models
5.4.2.1 Mean Synthesis
5.4.2.2 User-Attn Synthesis
5.4.2.3 GRU Synthesis
5.5 Training
5.5.1 Objective Function
5.5.2 Negative Sampling
5.6 Efficient Inference Through Decoupling Towers
5.7 Theoretical Motivation
5.7.1 Information Theory Interpretation
5.8 Graph Convolution Interpretation and Implementation with GTTF
5.9 Experiments
5.9.1 Datasets
5.9.1.1 MovieLens-1M
5.9.1.2 BookCrossing
5.9.2 Baselines
5.9.3 Evaluation Criteria
5.9.4 Main Results
5.9.5 Ablation and Sensitivity
5.9.6 Case Studies
5.9.6.1 Case Study 1: Biased User History Synthesis on other base recommendation models
5.9.6.2 Case Study 2: Training time analysis
5.9.6.3 Case Study 3: Energy consumption of Biased User History Synthesis
5.9.6.4 Case Study 4: Effect of Biased User History Synthesis on final item embeddings
5.9.7 Hyperparameters
Chapter 6: Future Work
Chapter 7: Conclusions
Bibliography
List of Tables
2.1 Summary of important notation
3.1 Dataset Details
4.1 summary. Tasks are LP, SSC, FSC, for link prediction, semi- and fully-supervised classification. Split indicates the train/validate/test partitioning, with (a) = [3], (b) = to be released, (c) = [30], (d) = [92]; (e) = [37].
4.2 Results of node embeddings on Link Prediction. Left: Test ROC-AUC scores. Right: Mean Rank on the right for consistency with pytorch-biggraph. *OOM = Out of Memory.
4.3 Node classification tasks. Left: test accuracy scores on semi-supervised classification (SSC) of citation networks. Middle: test micro-F1 scores for large fully-supervised classification. Right: test accuracy on an SSC task, showing only scalable baselines. We bold the highest value per column.
4.4 Performance of GTTF against frameworks DGL and PyG. Left: Speed is the per epoch time in seconds when training GraphSAGE. Memory is the memory in GB used when training GCN. All experiments conducted using an AMD Ryzen 3 1200 Quad-Core CPU and an Nvidia GTX 1080Ti GPU. Right: Training curve for GTTF and PyG implementations of Node2Vec.
5.1 Statistics of the datasets used in the experiments.
5.2 Comparison against baselines on MovieLens-1M by HR@10 and NDCG@10
5.3 Comparison against baselines on BookCrossing by HR@100 and NDCG@100
5.4 Ablation on base models on MovieLens-1M
List of Figures
2.1 A DLRM Content Modelling Block. $x_e$ is the dense-sparse fused representation of entity $e$.
2.2 Basic Two-Tower Neural Network Architecture to explicitly model content information and implicitly model collaborative information concurrently. The user and item content modelling blocks follow the architecture in Figure 2.1.
3.1 Performance comparison of simple approaches to reducing memory. DLRM Baseline refers to the hybrid data and model parallel system.
3.2 cDLRM Block Diagram
3.3 (a) Single GPU cDLRM Performance (b) Caching cost (c) Cache Performance (All data from Terabyte dataset, Bot MLP: 13-512-256-64, Top MLP: 512-512-256-1, Embedding Dim: 64)
3.4 (a) cDLRM multigpu scaling; (b) Caching overhead with 32K batch size; (c) Sensitivity of convergence accuracy on cache aggregation frequency λ. Cache Size=150,00 ways / 16 sets. Lookahead size=500
4.1 (c)&(d) Depict our data structure & traversal algorithm on a toy graph in (a)&(b).
5.1 Long Tail Distribution of MovieLens-1M
5.2 Biased User History Synthesis built on top of a Two Tower Neural Network architecture. A user is usually connected to a distribution of head and tail items (blue box on the left). $p_1 = p_2 = \dots = p_n$ if sampling without bias (uniform); $p_i$ is characterized by Eq. 5.1 if sampling with bias.
5.3 (a) Ablation Study on Sample Size for User-Attn Synthesis (T=0.01) for MovieLens-1M. (b) Ablation Study on Softmax Temperature for User-Attn Synthesis (15 neighbors) for MovieLens-1M. (c) Ablation Study on Sample Size for GRU Synthesis (T=0.01) for MovieLens-1M. (d) Ablation Study on Softmax Temperature for GRU Synthesis (15 neighbors) for MovieLens-1M.
5.4 (a) Batch energy consumption during training a TTNN model for MovieLens-1M. (b) Batch energy consumption during training a WnD model for MovieLens-1M. (c) Batch energy consumption during training a DeepFM model for MovieLens-1M.
5.5 (a) Energy consumed during training to achieve the best recommendation performance of a TTNN model for MovieLens-1M. (b) Energy consumed during training to achieve the best recommendation performance of a WnD model for MovieLens-1M. (c) Energy consumed during training to achieve the best recommendation performance of a DeepFM for MovieLens-1M.
5.6 Epoch training time for $M_{\mathrm{base}}$ (green), baseline models (red) and $M_{\mathrm{BUHS}}$ variants (blue) for MovieLens-1M
5.7 Visualizations of final embeddings of items in the history of randomly selected users in MovieLens-1M
Abstract
Recommendation models play a pivotal role in driving economic activity on the internet. Currently, state-of-the-art approaches for addressing the task of effective recommendation at internet-scale predominantly employ Deep Neural Networks (DNNs) due to their demonstrated efficacy in capturing intricate distributions within large datasets across diverse domains. Nonetheless, training recommendation models at such a massive scale poses several unique challenges. We focus on two significant challenges: (1) The dominance of embedding tables in the memory footprint of modern recommendation models results in excessive memory utilization by GPUs during training. Consequently, the number of GPUs required by the training system is determined by the model size rather than computational load. (2) The process of scaling recommendation models to train on real-world, internet-scale data gives rise to a long-tail distribution in item popularity, causing a vast majority of less popular items to be seldom recommended.

To tackle the first challenge, we propose a lookahead-caching-based training algorithm, named cDLRM, which effectively decouples the memory and computational demands of DNN-based recommendation models. cDLRM enables training with no loss in model quality, while allowing the compute demands of the model to dictate the required number of GPUs in the training infrastructure. To address the second challenge, we first introduce Graph Traversal with Tensor Functionals (GTTF), an efficient meta-algorithm that facilitates fast and effective stochastic graph representation learning. Subsequently, we present Biased User History Synthesis (BUHS), a recommendation model architecture built using GTTF. BUHS demonstrates awareness of the long-tail distribution in item popularity and effectively mitigates the bias towards recommending only popular items while also enabling increased personalization of recommendation. In addition to experiments that demonstrate the performance benefits over the state-of-the-art approaches, we substantiate our empirical results with appropriate theoretical analysis.
Chapter 1
Introduction
Recommendation systems have become an integral tool for driving user engagement in many technology
platforms. From personalized news recommendation [49] to job candidate recommendation [25], these
systems play an important role in improving user experience and maximizing platform viability. A variety
of different machine learning techniques have been successfully used in constructing recommendation
systems, including statistical learning techniques such as Bayesian modelling [22, 53], kernel methods
[87], and matrix-factorization based methods [73]. In recent years, the success of Deep Learning and Deep
Neural Networks (DNNs) in modelling complex data distributions in a variety of domains has created
new avenues to improve the accuracy of recommendation models using DNNs [18, 4]. This dissertation
focuses on modern Deep Learning Recommendation Models (DLRMs) and the problems that arise when
we attempt to scale them to learn on very large, internet-scale datasets.
Recommendation models are trained using user-item interaction data, where each data point is represented as a pair (u, i). In this representation, u denotes a user on a specific platform, and i corresponds to
an item that the user has engaged with on the same platform. The nature of items can differ depending on
the platform, encompassing various types such as ads [99, 97], videos [18], songs [5], and more. Furthermore, the definition of an interaction can also vary across platforms. For instance, social media platforms
that display ads may consider an interaction as the simple act of clicking on a presented ad. On the other
hand, music streaming platforms might define an interaction as the act of listening to a song for a specific
duration.
A recommendation model trains on the past user-item interactions to learn the likelihood that a user
will interact with an item in the future. In order to learn these probabilities accurately the model must find
representations of users and items such that semantically similar users or items have similar representations. Each user and item is characterized by a combination of continuous (dense) and categorical (sparse)
features. Continuous features are represented as real-valued vectors, while categorical features are represented as non-negative integer vectors. Continuous or dense features are typically encoded as real-valued
vectors because the underlying data they represent is expressed as real-valued scalars or vectors. Some examples of such data include product descriptions represented as floating-point feature vectors after being fed
through a language model, thumbnail images represented as floating-point feature maps after being fed
through an image processing model, as well as demographic information such as income that is measured
on a continuous scale. Categorical or sparse features are encoded as integer-valued vectors because the
underlying data they represent is encoded as a discrete set of one or more categories or classes. Some
examples of sparse features include movie genre, user and item identifiers (encoded through enumeration)
as well as more complex, composite features such as keyword pairs and tuples.
Sparse features are mapped to a continuous vector space using learnable embedding tables [61] in
modern recommendation systems. Thus embedding tables play a critical role in learning the sparse representations in recommendation models. Embedding tables are simply numerical representations of items
(or users) such that two related items may have similar embedding value. Here the notion of "related" is
dependent on how the two items were interacted with in the past. For instance, the model must learn
to represent a baseball and a sports jersey with similar representations. More details about the embedding
tables are provided in Chapter 2.
Both the dense and sparse features are used in deep learning based recommendation systems. In general, dense features are provided as inputs to fully connected neural network layers, while sparse features
are looked up in embedding tables to obtain their learned representation values. Subsequently, the transformed dense features and the mapped representations of sparse features are fed into a variable downstream neural network architecture [62]. More details of this architecture are described shortly.
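To make the embedding-lookup step concrete, the following is a minimal PyTorch illustration; the feature name and table sizes below are purely illustrative and are not taken from the dissertation.

```python
# Minimal illustration (hypothetical feature and sizes): a categorical value, encoded
# as an integer id, selects a learnable row of an embedding table.
import torch
import torch.nn as nn

genre_table = nn.Embedding(num_embeddings=1000, embedding_dim=16)  # 1000 categories, 16-dim vectors
genre_ids = torch.tensor([3, 17, 3])     # sparse feature values for three samples
genre_vecs = genre_table(genre_ids)      # shape (3, 16); identical ids share the same row
```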
The number of users, items and interactions can reach tens of billions at internet-scale [94]. Scaling
DLRMs to learn from such large datasets presents a number of unique modelling, system and implementation challenges. In this dissertation, we address two specific challenges.
The first is the challenge of optimizing hardware utilization in DLRM implementations. As DLRMs
are scaled to process billions of data points, where each data point is a user-item interaction, a significant
portion of the model’s memory usage is attributed to numerous embedding tables, resulting in memory
footprints reaching several hundred gigabytes [62, 60]. Simultaneously, the dense neural networks and
downstream architectures necessitate substantial computational power to achieve the high level of expressiveness required for effective learning at such a scale. Consequently, training a large, internet-scale
DLRM becomes a resource-intensive task, demanding both significant memory and compute capabilities.
Presently, the best training systems store all model parameters, including the embedding tables, in
GPU memory. However, the size of embedding tables often surpasses the capacity of even the largest
GPUs. As a result, training setups commonly utilize multiple GPUs, ranging from tens to hundreds, to
distribute the model parameters [59]. This implementation utilizes a hybrid system that combines partial
data parallelism and partial model parallelism. Data parallelism involves replicating the dense networks
across all utilized GPUs during training to enhance performance. Conversely, model parallelism is utilized
to address the memory demands of the embedding tables. These tables are divided and allocated across
the GPUs using various allocation and sharding strategies to achieve efficient model parallelism.
The primary factor determining the number of GPUs required for training is the memory capacity
needed to accommodate the embedding tables. However, this system suffers from a significant drawback:
as the size of the embedding tables increases, the number of GPUs needed to store them also grows. GPUs
with large integrated memory are considerably more expensive compared to the more affordable discrete
DRAM DIMMs found on CPUs. Additionally, from a computational perspective, a threshold exists known
as the compute saturation point. Beyond this point, increasing the number of GPUs utilized no longer yields
performance improvements during the data parallel computation of the dense neural networks within the
model. Compute saturation occurs when the available parallelism across the GPU resources surpasses the
parallelism required by the dense network computations. Consequently, this system exhibits poor scalability concerning the number of GPUs employed, utilizing them primarily for model storage rather than
achieving performance gains. As a result, this approach proves highly cost-inefficient, often necessitating
expensive compute infrastructure to facilitate model training.
The second challenge we aim to tackle pertains to model bias and overfitting, which become prominent
as the number of items expands to an internet-scale level. Recommendation models handling a vast array
of items encounter the common issue of imbalanced user interactions. Specifically, a large proportion of
items, referred to as tail items, receive minimal user engagement, while a small subset of popular items,
known as head items, dominate user interactions [20, 50]. This phenomenon, known as the long-tail item
problem, introduces model bias, as the model tends to overfit on the head items and disregards the relevant
tail items [77]. Consequently, a feedback loop emerges, perpetuating repetitive recommendations of only
popular items. Addressing this challenge is crucial to improve user experience and foster diversity in
recommendations.
To overcome the first challenge, we propose a DLRM training system called cDLRM [6], published in the Proceedings of RecSys 2021. In cDLRM,
all embedding tables are stored in CPU DRAM, while only a small cache of each table is maintained in
GPU memory. This approach leverages the concept of lookahead caching, where a group of CPU threads
pre-processes training batches and caches the necessary embedding table entries in GPU memory. The
training process remains entirely contained within the GPUs, without the need for gradients to flow back
to the CPU. cDLRM capitalizes on the insight that only a subset of embedding table rows are required for
training a specific set of samples.
We demonstrate the effectiveness of cDLRM by training large models using just a single GPU. Furthermore, we showcase how cDLRM can scale out in a purely data parallel manner when multiple GPUs are
available. To the best of our knowledge, cDLRM is the first system to demonstrate distributed data parallel
training of large recommendation models by employing embedding table caching as its foundation. Additionally, we assess the impact of cDLRM on model accuracy by training on two publicly available datasets.
Our results show that we can train the DLRM in a highly cost-efficient manner, with minimal (< 0.02%) to
no loss in accuracy.
To tackle the second challenge, we first present Graph Traversal with Tensor Functionals (GTTF) [52], published in the proceedings of ICLR 2021,
a meta-algorithm and framework that facilitates rapid, efficient, and stochastic implementations of graph
representation learning algorithms. GTTF ensures not only speed but also unbiased learning. We then
view the task of recommendation as one of learning on a user-item interaction bipartite graph and present
a model to alleviate the long-tail item problem called Biased User History Synthesis (BUHS). Since GTTF is
a framework that enables easy implementation of graph traversal and unbiased stochastic graph learning
models, we leverage its simple API to build BUHS as a single layer Graph Convolutional Network.
BUHS is designed to achieve two main objectives: (1) Alleviating the long-tail item problem, as discussed earlier, and (2) Enhancing personalization of recommendation. Although these objectives may
appear distinct, we demonstrate that they are complementary. Our approach utilizes an effective sampling
strategy that addresses both challenges simultaneously. The core idea is to augment user representations
with a learnt representation of a sample of the user’s interaction history. However, we tackle the issue of
sample bias towards head items by generating samples with a bias towards tail items in the user’s history.
This technique serves two purposes in the context of our objectives. First, it increases the visibility of
tail items, mitigating the model’s tendency to overfit on head items resulting in better representations for
tail items. Second, it results in more unique and personalized user representations, as tail items provide
more information about a user’s specific interests compared to head items which demonstrate popular,
commonly held interests. As a result, our approach not only addresses the long-tail item problem but also
concurrently improves the recommendation of both tail and head items through enhanced personalization.
To the best of our knowledge, our approach is the first to highlight the intrinsic relationship between tail
items and improved personalization, while proposing a technique that directly addresses both the long-tail
item and personalization challenges in recommendation models.
The following chapter provides background essential to the rest of the dissertation. Subsequently, we
will delve into the specifics of cDLRM, GTTF, and BUHS, exploring their intricacies and contributions in
greater detail.
Chapter 2
Background
This chapter introduces some background and notation central to understanding the rest of the dissertation. Table 2.1 summarizes the important notation.
2.1 Recommendation Models
The fundamental objective of a recommendation model is to generate a personalized list of items, denoted as $I^u_{\mathrm{rec}}$, for each user $u$ from a global set of users $\mathcal{U}$ and a global set of items $\mathcal{I}$. This list should comprise items that user $u$ is most likely to interact with, while ensuring that these recommended items do not overlap with the user's prior interactions $\mathcal{H}^u$. To formalize this, we define $I^u_{\mathrm{rec}}$ as a subset of $\mathcal{I}$ such that $I^u_{\mathrm{rec}} \cap \mathcal{H}^u = \emptyset$.

The interactions of all users, represented as $\{\mathcal{H}^u \mid u \in \mathcal{U}\}$, can be organized into an interaction matrix $A^H \in \mathbb{Z}_{\geq 0}^{|\mathcal{U}| \times |\mathcal{I}|}$. Each entry $(u, i)$ of this matrix corresponds to a non-negative integer rating $r_{u,i}$, where $0 \leq r_{u,i} \leq R$. In systems where users explicitly rate their interactions with items on a numerical scale, the value of $R$ represents the maximum rating value. For systems where explicit ratings are not available, a binary representation is used. In this case, $R = 1$, where $r_{u,i} = 0$ if user $u$ did not interact with item $i$, and $r_{u,i} = 1$ if user $u$ did interact with item $i$. This binary rating scheme is known as an implicit rating.
Table 2.1: Summary of important notation

Symbol | Description
$\mathcal{U}$ | Global set of Users
$\mathcal{I}$ | Global set of Items
$\mathcal{H}^u$ | Interaction history of user $u$: set of items that user $u$ has already interacted with
$I^u_{\mathrm{rec}}$ | Recommendation list for user $u$ upon a single query
$A^H$ | Interaction matrix induced by the interaction histories of all $u \in \mathcal{U}$
$\mathbb{Z}_{\geq 0}$ | Set of non-negative integers
$\mathbb{R}$ | Set of Real Numbers
$M_{i,:}$ | Elements of row $i$ in matrix $M$
$M_{:,j}$ | Elements of column $j$ in matrix $M$
$\|\cdot\|_F$ | Frobenius Norm
$\|\cdot\|$ | L2 Matrix or Vector Norm
$x$ | Neural network input or output matrix/vector
$x_u$ | User-tower output
$x_i$ | Item-tower output
$e_1, \dots, e_n$ | $n$ embedding tables in a DLRM content modelling block
$\|$ | Concatenation operation
In order to generate a personalized list of items that user $u$ is most likely to interact with, recommendation models follow a two-step process. First, they predict the potential interaction ratings for user $u$ across all items not present in their interaction history. This prediction is denoted as $r^{\mathrm{pred}}_u = \{r_{u,i'}, \forall i' \in \mathcal{I} \setminus \mathcal{H}^u\}$. Next, the final recommendation list for user $u$ is constructed by selecting the top $K$ items with the highest predicted ratings. Mathematically, the recommendation list is determined as:

$$\underset{i'}{\arg\mathrm{TopK}} \; r^{\mathrm{pred}}_u \qquad (2.1)$$

where $K = |I^u_{\mathrm{rec}}|$ represents the number of items to be presented to the user.

The core aspect of recommendation model design lies in the computation of $r^{\mathrm{pred}}_u$. Below, we provide an overview of the most common paradigms to do so.
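Before turning to those paradigms, the top-K selection step of Equation 2.1 can be sketched in a few lines. This is a minimal NumPy illustration rather than code from the dissertation; `pred_scores` and `history` are hypothetical inputs holding $r^{\mathrm{pred}}_u$ and $\mathcal{H}^u$ for a single user.

```python
# Minimal sketch of Eq. 2.1: rank all items the user has not interacted with by
# predicted score and keep the K highest.
import numpy as np

def recommend_top_k(pred_scores, history, K):
    """pred_scores: (|I|,) predicted ratings for one user; history: set of item ids."""
    scores = pred_scores.astype(float)
    scores[list(history)] = -np.inf      # exclude items already in the user's history
    return np.argsort(-scores)[:K]       # ids of the K items with the highest predictions
```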
2.1.1 Collaborative Filtering
The earliest recommendation models were based on User-based Collaborative Filtering (CF) [27, 69]. The fundamental assumption behind user-based collaborative filtering is that users are inclined to interact with items that similar users have also engaged with. The similarity between two users is computed using a similarity scoring function applied to their corresponding rows in the interaction matrix $A^H$. Specifically, for two users $u_1$ and $u_2$, the similarity score $s_{u_1,u_2}$ is calculated as $\mathrm{Sim}(A^H_{u_1,:}, A^H_{u_2,:})$, where $\mathrm{Sim} : \mathbb{Z}_{\geq 0}^{|\mathcal{I}|} \times \mathbb{Z}_{\geq 0}^{|\mathcal{I}|} \rightarrow \mathbb{R}$ represents a real-valued similarity scoring function. Once the similarity scores between all pairs of users are computed, the final predicted rating (in the case of explicit rating) or probability of interaction (in the case of implicit rating) between user $u$ and item $i$ is calculated as the following weighted average, with a normalization factor $\kappa$:

$$r^{\mathrm{pred}}_{u,i} = \frac{\sum_{u' \in \mathcal{U} \setminus u} s_{u,u'} \, A^H_{u',i}}{\kappa} \qquad (2.2)$$

While the earliest designs of recommendation models took such a user-centric view, subsequent variations to collaborative filtering took an item-centric view, called Item-based Collaborative Filtering [72, 47]. In this approach, the predicted rating or probability of interaction between user $u$ and item $i$ is computed as the following weighted average (with normalizing factor $\kappa$):

$$r^{\mathrm{pred}}_{u,i} = \frac{\sum_{i' \in \mathcal{H}^u} s_{i,i'} \, A^H_{u,i'}}{\kappa} \qquad (2.3)$$

Here, the similarity between two items $i_1$ and $i_2$ is computed in a similar fashion to the user-centric approach. The formula considers the interactions of user $u$ with items in their history $\mathcal{H}^u$ and calculates a weighted average based on the similarity scores $s_{i,i'}$ between item $i$ and other items in the history. The result is then normalized by the factor $\kappa$.
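As a concrete illustration of Equation 2.3, the sketch below computes an item-based CF prediction for a single (user, item) pair. It is not code from the dissertation: the precomputed item-item similarity matrix `sim` and dense interaction matrix `A` are hypothetical inputs, and taking $\kappa$ to be the sum of absolute similarity scores is an assumption, since the text leaves the exact normalization unspecified.

```python
# Minimal sketch of item-based collaborative filtering (Eq. 2.3).
import numpy as np

def predict_item_cf(A, sim, u, i, history_u):
    """Weighted average of user u's ratings of items in their history H_u,
    weighted by each item's similarity to the target item i."""
    items = list(history_u)
    weights = sim[i, items]                    # s_{i,i'} for every i' in H_u
    kappa = np.abs(weights).sum() + 1e-12      # assumed normalization factor
    return float(weights @ A[u, items]) / kappa
```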
2.1.2 Matrix Factorization
The basic form of Collaborative Filtering described earlier is a memory-based heuristic that lacks personalization in recommendations [41]. Matrix Factorization (MF) models address this limitation by representing users and items in a low-dimensional vector space. The main objective of matrix factorization is to decompose the interaction matrix $A^H$ into the product of two learnable rectangular matrices: the user latent factor matrix $W_u \in \mathbb{R}^{|\mathcal{U}| \times L}$ and the item latent factor matrix $W_i \in \mathbb{R}^{|\mathcal{I}| \times L}$.

In a basic matrix factorization model, $W_u$ and $W_i$ are learned by aiming to reconstruct the known values in $A^H$. The objective function used for optimizing $W_u$ and $W_i$ is as follows [41]:

$$\underset{W_u, W_i}{\arg\min} \; \|A^H - W_u W_i^T\|_F + \lambda\left(\|W_u\| + \|W_i\|\right) \qquad (2.4)$$

Here, $\|\cdot\|_F$ represents the Frobenius Norm, $\|\cdot\|$ denotes the matrix L2-norm, and $W_u W_i^T$ is the predicted interaction matrix. The optimization process aims to minimize the difference between the reconstructed matrix and the actual interaction matrix, with a regularization term to prevent overfitting. Once the optimization is completed, the predicted rating or probability of interaction between user $u$ and item $i$ can be obtained by looking up the corresponding entry in the predicted interaction matrix $W_u W_i^T$.
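For illustration, a minimal PyTorch sketch of optimizing Equation 2.4 is given below. It restricts the reconstruction error to the known entries of $A^H$ through a hypothetical `mask` argument, and the choice of plain SGD, the learning rate, and the latent dimension are assumptions; the dissertation does not prescribe a particular optimizer.

```python
# Minimal sketch of matrix factorization (Eq. 2.4) fit to the observed entries of A^H.
import torch

def factorize(A, mask, L=32, lam=1e-3, lr=0.05, steps=500):
    # A: (|U|, |I|) interaction matrix; mask: 1.0 where a rating is observed, else 0.0.
    Wu = torch.randn(A.shape[0], L, requires_grad=True)   # user latent factors
    Wi = torch.randn(A.shape[1], L, requires_grad=True)   # item latent factors
    opt = torch.optim.SGD([Wu, Wi], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.norm((A - Wu @ Wi.T) * mask) + lam * (torch.norm(Wu) + torch.norm(Wi))
        loss.backward()
        opt.step()
    return Wu.detach(), Wi.detach()          # the predicted interaction matrix is Wu @ Wi.T
```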
2.2 Deep Learning Recommendation Models
Collaborative Filtering (CF) and Matrix Factorization (MF) models suffer from a significant drawback when
dealing with highly sparse interaction matrices, which is often the case in real-world internet-scale recommendation systems [58, 64, 71]. There are two main reasons for this.
First, the objective function defined in Equation 2.4 is computed solely on the known values in the interaction matrix $A^H$. Consequently, if $A^H$ is highly sparse, the available signal for model training becomes
extremely weak. The limited amount of observed interactions hinders the model’s ability to accurately
capture user preferences and item relevance.
Second, both CF and MF models do not consider user and item content information, disregarding an
additional potential source of signal regarding user preferences. By ignoring the content associated with
users and items, such as textual descriptions, attributes, or metadata, these models fail to utilize valuable
information that could enhance recommendation accuracy and personalization.
Recognizing the achievements of neural networks in various machine learning domains, researchers
have introduced the integration of Deep Neural Networks (DNNs) to address the limitations of traditional
CF and MF models. These advancements involve incorporating DNNs to model content information and/or
the collaborative filtering objective [33, 64, 82, 74]. However, these initial models treated content and collaborative information separately, which can present computational challenges when dealing with large-scale recommendation systems.
Deep Learning Recommendation Models (DLRMs) offer a solution to this challenge by jointly modeling
both content and collaborative information, allowing the vast number of model parameters to implicitly
capture both aspects [62, 18]. By leveraging DNN architectures to transform user and item content features
and training on extensive datasets consisting of billions of user-item pairs, DLRMs provide an efficient
and robust approach to estimating user preferences. Thus, DLRMs enable enhanced recommendation
capabilities by effectively integrating content and collaborative information within a unified framework.
2.2.1 DLRM Architecture
In this section, we introduce the fundamental architecture of a Deep Learning Recommendation Model
(DLRM). The primary objective of the DLRM is to transform the content features of entities (users or
items), into dense vector representations. These representations are then utilized to make predictions
regarding the rating or probability of interaction for recommendation purposes.
Figure 2.1: A DLRM Content Modelling Block. $x_e$ is the dense-sparse fused representation of entity $e$. (The block diagram shows the dense feature vector $F_d$ passing through a dense neural network $f_d$ to produce $x_d$, the sparse feature vector $F_s$ passing through embedding tables $e_1, \dots, e_{n_s}$ to produce $x_s$, and an interaction layer that either applies a symmetric function to $x_d$ and a flattened, transformed $x_s$, or feeds the concatenation of $x_d$ and the flattened $x_s$ to a dense neural network $f_o$, yielding $x_e$.)
2.2.1.1 Modelling Content Information
As mentioned in Chapter 1, both users and items possess a combination of dense and sparse features. Given an entity with a dense feature vector $F_d \in \mathbb{R}^{n_d}$ and a sparse feature vector $F_s \in \mathbb{Z}_{\geq 0}^{n_s}$, the DLRM performs two transformations.

Firstly, the dense feature vector $F_d$ is passed through a dense neural network architecture $f_d : \mathbb{R}^{n_d} \rightarrow \mathbb{R}^{m}$, resulting in another dense representation denoted as $x_d$. This transformation captures the essence of the dense features learnt through the neural network $f_d$.

Secondly, the sparse feature vector $F_s$ is transformed into a dense matrix through a sparse neural network architecture $f_s : \mathbb{Z}_{\geq 0}^{n_s} \rightarrow \mathbb{R}^{n_s \times n}$. The sparse neural network $f_s$ is composed of $n_s$ embedding tables $e_1, \dots, e_{n_s}$, each with a dimensionality of $n$. Each integer value $F_s^i$ in $F_s$ corresponds to an index to be looked up in the embedding table $e_i$. The output of the sparse neural network, $x_s$, is an $n_s \times n$ matrix of embedding lookups.

To obtain the final representation for the entity, a variable interaction layer denoted as $f_i$ is applied. This layer introduces flexibility by allowing different types of interaction functions. One option is to use an unlearnable symmetric function, such as a sum or mean operation. In this case, the sparse representation $x_s$ is first flattened and transformed through a neural network to match the dimensionality $m$. Alternatively, a learnable neural network can operate on the concatenation of $x_d$ and all the rows in $x_s$.

The variable interaction layer $f_i$ can be mathematically expressed in two different forms, depending on whether it is an unlearnable symmetric function or a learnable neural network.

If $f_i$ is an unlearnable symmetric function, the expression is as follows:

$$f_i(x_d, x_s) = \mathrm{SYMMETRIC\_FUNCTION}\left(x_d, f_o(\mathrm{FLATTEN}(x_s))\right) \qquad (2.5)$$

In this case, $\mathrm{SYMMETRIC\_FUNCTION}$ is a predefined symmetric function, and $f_o : \mathbb{R}^{n_s \cdot n} \rightarrow \mathbb{R}^{m}$ is a neural network transforming the flattened $x_s$ into a dense representation with dimensionality $m$. The output dimensionality of $f_i$ in this case is $m$.

If $f_i$ is a learnable neural network, the expression is as follows:

$$f_i(x_d, x_s) = f_o\left(x_d \,\|\, \mathrm{FLATTEN}(x_s)\right) \qquad (2.6)$$

Here, $\|$ represents the concatenation operation, where the dense representation $x_d$ and the flattened $x_s$ are concatenated. The concatenated vector is then passed through a neural network $f_o : \mathbb{R}^{m + n_s \cdot n} \rightarrow \mathbb{R}^{l}$, which outputs a dense representation of dimensionality $l$, a hyperparameter determining the output dimension of $f_i$.

In both cases, the variable interaction layer $f_i$ combines the information from the dense representation $x_d$ and the sparse representation $x_s$ using either a predefined symmetric function or a learnable neural network. This layer aims to capture the complex interactions between the dense and sparse features, resulting in a final representation that effectively encodes the fused information. Figure 2.1 illustrates a basic DLRM content modelling block.
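A minimal PyTorch sketch of a content modelling block with a learnable interaction layer (the Equation 2.6 variant) is shown below. The class name, layer widths, and the use of single-layer MLPs are illustrative assumptions rather than the dissertation's implementation.

```python
# Minimal sketch of a DLRM content modelling block (Eq. 2.6 variant).
import torch
import torch.nn as nn

class ContentBlock(nn.Module):
    def __init__(self, n_d, table_rows, n, m, l):
        super().__init__()
        self.f_d = nn.Sequential(nn.Linear(n_d, m), nn.ReLU())                   # dense network f_d
        self.tables = nn.ModuleList([nn.Embedding(r, n) for r in table_rows])    # sparse network f_s
        self.f_o = nn.Sequential(nn.Linear(m + len(table_rows) * n, l), nn.ReLU())

    def forward(self, F_d, F_s):
        # F_d: (batch, n_d) floats; F_s: (batch, n_s) integer indices, one per table.
        x_d = self.f_d(F_d)                                                            # (batch, m)
        x_s = torch.stack([t(F_s[:, j]) for j, t in enumerate(self.tables)], dim=1)    # (batch, n_s, n)
        return self.f_o(torch.cat([x_d, x_s.flatten(start_dim=1)], dim=1))             # x_e: (batch, l)
```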
Figure 2.2: Basic Two-Tower Neural Network Architecture to explicitly model content information and implicitly model collaborative information concurrently. The user and item content modelling blocks follow the architecture in Figure 2.1. (The diagram shows user $u$ and item $i$ features entering user-content and item-content modelling blocks that produce $x_u$ and $x_i$, which a dense neural network $f_T$ maps to the predicted interaction probability / rating of $(u, i)$.)
For clarity and consistency, we will use the notation $x_u$ to represent the output of $f_i$ when the entity is a user, and $x_i$ when the entity is an item. $x_u$ and $x_i$ will serve as the transformed and integrated representations of the user and item, respectively, which can then be used for further processing within the recommendation model.
2.2.1.2 Modelling Collaborative Information
To predict the rating or probability of interaction between a user $u$ and an item $i$, the DLRM architecture, as described earlier, is commonly implemented using a Two-Tower Neural Network (TTNN) design [93]. This design consists of two separate towers: the user-tower and the item-tower. The user-tower contains the parameters responsible for transforming the user's content features into the representation $x_u$, while the item-tower contains the parameters responsible for transforming the item's content features into the representation $x_i$.

Once the representations $x_u$ and $x_i$ are computed, the subsequent step is to compute the predicted rating or probability of interaction. This is accomplished by passing $x_u$ and $x_i$ through a user-item interaction neural network denoted as $f_T$. The user-item interaction neural network takes these representations as input and produces a single scalar value, which represents the predicted rating or probability of interaction between the user $u$ and the item $i$.
In the process of training on vast datasets consisting of user-item interaction pairs (u, i), Two-Tower
Neural Networks (TTNNs) effectively encode both content and collaborative information within their numerous model parameters. As a result, Deep Learning Recommendation Models (DLRMs) are inherently
able to incorporate collaborative information in an implicit manner, in contrast to CF and MF models that
require explicit modelling of this information.
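The two-tower structure and the interaction network $f_T$ can be sketched as follows, reusing the hypothetical `ContentBlock` from the previous sketch for both towers and assuming each tower outputs an $l$-dimensional vector; the sigmoid output corresponds to an implicit-rating (interaction probability) setting.

```python
# Minimal sketch of a Two-Tower Neural Network (TTNN) scoring a (user, item) pair.
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_block, item_block, l):
        super().__init__()
        self.user_block, self.item_block = user_block, item_block    # e.g. two ContentBlock instances
        self.f_T = nn.Sequential(nn.Linear(2 * l, l), nn.ReLU(), nn.Linear(l, 1))

    def forward(self, user_dense, user_sparse, item_dense, item_sparse):
        x_u = self.user_block(user_dense, user_sparse)                # user-tower output
        x_i = self.item_block(item_dense, item_sparse)                # item-tower output
        return torch.sigmoid(self.f_T(torch.cat([x_u, x_i], dim=1))).squeeze(-1)
```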
2.3 Interactions as a Bipartite Graph
User-item interactions can be viewed as a bipartite graph. Concretely, given the user set $\mathcal{U}$ and item set $\mathcal{I}$, we define the bipartite graph of interactions $G(V, E)$ with $V = \mathcal{U} \cup \mathcal{I}$ and $E = \{(u, i) \mid u \in \mathcal{U},\, i \in \mathcal{I},\, \text{and user } u \text{ interacted with item } i\}$. If a specific interaction $(u, i)$ is associated with an explicit feedback rating $r$, then the weight of edge $(u, i)$ in $G$ is $r$. If only implicit feedback is available, the weights of all edges are 1.
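For reference, a minimal sketch of materializing this bipartite graph as a sparse weighted biadjacency matrix is given below; the function name and the use of SciPy are illustrative choices, not part of the dissertation.

```python
# Minimal sketch: store the user-item bipartite graph as a sparse matrix whose entry
# (u, i) is the explicit rating r, or 1 when only implicit feedback is available.
import scipy.sparse as sp

def build_interaction_graph(interactions, num_users, num_items, implicit=True):
    """interactions: iterable of (u, i, r) triples with 0 <= u < num_users, 0 <= i < num_items."""
    rows, cols, vals = zip(*[(u, i, 1.0 if implicit else float(r)) for u, i, r in interactions])
    return sp.coo_matrix((vals, (rows, cols)), shape=(num_users, num_items))
```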
Chapter 3
cDLRM: Look Ahead Caching for Scalable Training of Recommendation
Models
In this chapter we present a solution to the first challenge we seek to address, which is to avoid using
GPUs as merely a model storage device in large recommendation models and to use only as many GPUs as are needed to achieve optimal compute parallelism.
3.1 Introduction
As indicated in Chapter 1, internet-scale Deep Learning Recommendation Models (DLRMs) possess the
capability to expand significantly, reaching a scale of several hundreds of billions to trillions of parameters. These parameters contribute to the creation of terabyte-sized models, primarily due to the presence
of numerous large embedding tables. Furthermore, the computational demands of these models are substantial due to the integration of multiple layers of dense neural networks. Presently, training such models
necessitates a hybrid approach involving both data and model parallelism. However, this approach leads
to suboptimal utilization of costly GPU resources, often exceeding the compute saturation point where additional compute parallelism proves unnecessary. Our investigation of the access patterns for embedding
tables reveals that the storage of complete embedding tables in GPU memory is redundant. In particular,
we observe that access to these tables occurs in an exceedingly sparse manner, with each batch of training
data accessing only a small fraction of each table.
Inspired by these observations, in this work we present cDLRM, a novel approach to decouple the memory demands of recommendation models from the computational demands. cDLRM places the embedding
tables on the CPU while allowing the training process to run on a GPU. However, instead of slowing down
the training to access the CPU-resident embedding tables, cDLRM proactively analyzes training batches
ahead of time to precisely identify the necessary subset of embedding table entries that are needed for a
batch. Based on this analysis, it caches the embedding table entries that are necessary to train an upcoming batch into the GPU memory just-in-time before the start of that training batch, thereby avoiding the
need to access embedding tables on the CPU. cDLRM relies on CPU threads to identify the embedding
entries for caching while enabling training of large recommendation models using even just a single GPU.
By enabling training on a single GPU our approach makes recommendation model training affordable to
even small businesses. We then show how cDLRM can scale the training speed with the availability of additional GPUs by supporting data parallel training. By leveraging model caching and data parallel training
we provide a resource efficient solution that gracefully scales performance and cost.
We summarize the primary contributions of this system as follows:
1. We propose a system, that we call cDLRM, in which all embedding tables are kept in CPU DRAM
while only a small cache of each table is stored in GPU memory. cDLRM is built on the concept
of lookahead caching where a CPU thread pre-processes training batches and caches the necessary
embedding table entries to GPU memory. Training is contained entirely on the GPUs without ever
letting gradients flow back to the CPU. cDLRM is based on the insight that only a subset of embedding
table rows are required to train for a set of samples.
2. We demonstrate the effectiveness of cDLRM by training large models on just a single GPU and also
demonstrate how cDLRM can scale out in a purely data parallel manner when additional GPUs are
available. To the best of our knowledge this is the first work to demonstrate distributed data parallel
training of large recommendation models using table embedding caching as the foundation.
3. We quantify the impact of cDLRM by training on 2 publicly available datasets and demonstrate that
with little (< 0.02%) to no loss in accuracy, we can train the DLRM in a highly cost efficient manner.
3.2 Basic Approaches to addressing memory footprint and their shortcomings
We adopt the high-level recommendation model architecture depicted in Figures 2.1 and 2.2, while employing the specific components of the model as defined in the Facebook open-source DLRM [62]. Particularly,
each dense network in the architecture is a Multilayer Perceptron (MLP), with $f_d$ being referred to as the bottom-mlp and $f_o$ being referred to as the top-mlp. Prior to delving into the specifics of cDLRM, we acknowledge several straightforward strategies aimed at addressing the substantial memory requirements
of embedding tables, along with their respective limitations.
• CPU-only training: The simplest approach is to train the entire model on the CPU. The benefits of
CPU-only training are that CPU DRAM is much cheaper and all communication overhead is eliminated. The downside to this approach is that computation now becomes the bottleneck since the
MLPs and second order interactions are much slower on the CPU than they are on GPUs. It is also
possible to distribute training across multiple CPU sockets, but that potentially worsens the communication bottleneck since inter-CPU bandwidth is typically lower than inter-GPU bandwidth.
• Naive CPU + GPU training: A strategy that improves upon CPU-only training is to use GPUs
for MLP and second order interaction computation while still keeping all embedding tables in CPU
DRAM. This involves frequent communication between the CPU and the GPU, with two major communication points - (1) during forward propagation, embeddings are fetched from the CPU and
transferred to the GPU(s) and (2) during back propagation, gradients flow back into the embedding tables on the CPU.

Figure 3.1: Performance comparison of simple approaches to reducing memory. DLRM Baseline refers to the hybrid data and model parallel system. (The bars report per batch iteration time in ms for three systems: DLRM Baseline on 2 GPUs, Naive CPU + 1 GPU, and CPU Only.)

While this system speeds up MLP computations and second order interactions by
performing them on the GPU, CPU to GPU data transfer is an expensive overhead. Thus, the major
drawback here is the frequency of data transfer between the CPU and GPU. Figure 3.1 shows the
comparison between training using 2 GPUs with the default DLRM based hybrid parallel training
(first bar), Naive CPU + GPU training (second bar), and CPU-only training (third bar). There is nearly a 3X training delay when using the CPU-only or naive CPU+GPU training approaches.
• Hashing: Another approach that seeks to keep all the embedding tables on the GPU while reducing
the amount of memory required by each table is based on hashing. This is a prevalent model size
reduction strategy. The idea is to reduce the number of rows in each embedding table and map all
the indices into the reduced table using a hash function. Naturally, this leads to entanglement of
embeddings and our experiments show that the final model accuracy can decrease substantially as
a result. This phenomenon has also been observed in prior work [97]. This solution is often the
least used in industry since even a 0.1% decrease in accuracy can lead to a significant reduction in
revenue [99]. A minimal sketch of this hashing scheme is shown after this list.
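The sketch below shows the hashing idea in its simplest form: a hypothetical wrapper that folds all original indices into a smaller table with a modulo hash. It is illustrative only; production systems use more elaborate hashing and compositional schemes.

```python
# Minimal sketch of the hashing approach: shrink a table to `reduced_rows` rows and map
# every original index into it, at the cost of distinct indices sharing ("entangling") rows.
import torch
import torch.nn as nn

class HashedEmbedding(nn.Module):
    def __init__(self, reduced_rows, emb_dim):
        super().__init__()
        self.table = nn.Embedding(reduced_rows, emb_dim)
        self.reduced_rows = reduced_rows

    def forward(self, idx):
        return self.table(idx % self.reduced_rows)   # colliding ids receive the same embedding
```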
3.3 cDLRM Preliminaries, Terminology and Notation
3.3.1 Caching Preliminaries
In computer systems, caching is used to reduce the latency of accessing main memory. Caches are small amounts of on-chip memory used for storing data that is frequently accessed. A copy of this frequently accessed data is fetched from main memory and kept resident in the cache to speed up future accesses to the same data. Caches are by design much smaller in capacity than main memory, and hence the cache management system has to decide what parts of the main memory data to store in the cache and which data to remove from the cache.

Modern processors use a set-associative cache. Set-associative caches are organized into bins, which are called sets. A memory address is first mapped to a given set. For instance, with 1K sets in a cache, a 64-bit memory address is hashed to generate a 10-bit set index. Each set is composed of multiple ways. Ways allow multiple memory locations to be cached in the same set. Thus, after a memory location is mapped to a particular cache set, it is cached in a way in the selected set based on a way-selection algorithm. Since main memory is much larger than the cache, the hashing process may map many memory locations to the same set. When a set is fully occupied, a replacement algorithm may find an existing cache way and replace its contents with the new data. Some example selection and replacement heuristics are the first available way, a random way, or the least recently used way.

Since multiple memory locations can map to the same set, there is a need to identify which memory locations are currently stored in that set. For this purpose caches use a structure called cache tags. To access the cache, each memory location is first hashed to find the set index. Then the cache access logic searches the cache tags of all the ways in that set to determine if that memory location is in fact present in that set. If it is present it is called a cache hit; otherwise it is treated as a miss and that location is fetched from its original memory location.
3.3.2 Embedding table cache structure
To reduce the overall memory footprint of the embedding tables in the GPU, cDLRM uses an embedding
table cache structure that occupies a much smaller memory footprint in GPU DRAM. The embedding table
cache structure in the GPU is organized much like the CPU cache described above. For a recommendation
model with embedding table set E = {e_1, ..., e_m}, in which table e_i ∈ R^{r_i×d}, where r_i is the number of
rows in table e_i and d is the dimensionality of the embeddings, the embedding cache is the set of smaller
embedding tables C = {c_1, ..., c_m}, with c_i ∈ R^{s_i×w×d}. Here s_i is the number of sets and w the number of ways
in cache c_i, in line with standard terminology from memory caches. The number of sets is chosen based
on a hyperparameter L. If r_i ≤ L then the entire table is stored on the GPU, i.e. s_i = r_i; otherwise we use a
simple hashing function to map each embedding table entry to a cache set, with s_i = NextPrime(L)∗.
Note that the reason for using the NextPrime hashing function is to reduce potential collisions when
two embedding table entries map to the same set. The number of ways w is also a hyperparameter
that is empirically selected. E resides in CPU DRAM, while C resides in GPU DRAM.
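As a concrete illustration, the following is a minimal PyTorch-style sketch, not the actual cDLRM implementation, of how one such cache c_i and its set mapping could be laid out; NextPrime is implemented here with simple trial division, and all names are illustrative.

import torch

def next_prime(x):
    # Smallest prime >= x (simple trial division; adequate for the table sizes discussed here).
    def is_prime(k):
        return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(x):
        x += 1
    return x

def make_cache(num_rows, L, num_ways, emb_dim, device="cuda"):
    # If the table is small enough, keep it entirely on the GPU (one way per row suffices).
    if num_rows <= L:
        num_sets, num_ways = num_rows, 1
    else:
        num_sets = next_prime(L)
    cache = torch.zeros(num_sets, num_ways, emb_dim, device=device)   # c_i in R^{s_i x w x d}
    return cache, num_sets

def set_index(row_index, num_sets):
    # Modulo hashing of an embedding-table row to a cache set.
    return row_index % num_sets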
3.3.3 Lookahead Windows
A lookahead window of size n is a set of n consecutive batches that appear in the training set. For training
set Dtrain and batch size b, the number of lookahead windows of size n that Dtrain can be split into is
⌈|Dtrain| / (bn)⌉. Each lookahead window is made up of bn training examples. We refer to this as the span of the
lookahead window. The span is an upper bound on the number of unique indices that are looked up
in any table in any given lookahead window, and is much smaller than the number of rows in the table.
This observation is especially true when embedding tables are large, which is the case for many of the
production recommendation models. In the training datasets that we have studied, we find in practice that
the number of unique indices accessed in a lookahead window tends to be even smaller than the span.
∗NextPrime(x) computes the smallest prime number ≥ x efficiently by leveraging Bertrand's postulate.
[Figure 3.2: cDLRM block diagram, showing the reader (Preader) and prefetching (Pprefetching) processes feeding Qlookahead from the CPU-resident embedding tables e_1..e_m, the caching+training process (Pcache+train) operating on the per-table caches c_1..c_m, victim tables, and MLPs in GPU GDDR with the help of the CPU-resident occupancy tables, and the eviction process (Peviction) draining Qwriteback back to CPU DRAM.]
Based on this observation, we conclude that maintaining an entire table on the GPU, when only a part
of it is being accessed and updated over the next n batches, is wasteful. If we can find a way to fetch
the embeddings for the indices that are going to be looked up over the next n batches from E and move
these embeddings into C just-in-time as training on the next n batches starts, we can significantly reduce
the memory usage on the GPU. The prefetching, caching+training, and eviction processes, described next,
work in unison to achieve this goal and enable training of a recommendation model with minimal overhead.
3.4 cDLRM Overview
The block diagram in figure 3.2 illustrates the different components of the cDLRM working together.
cDLRM consists of three processes: prefetching, caching+training, and eviction. Each process relies on
a set of queues to move embeddings between the CPU and GPU. At the start of training, the prefetching
process reads a set of n training batches, where n is the size of the lookahead window. The selection of this
lookahead window parameter is based on multiple factors as will be discussed. The goal of the prefetching
process is to extract all the unique indices accessed in the lookahead window for each table. It then fetches
the embeddings associated with these unique indices and pushes them into a queue labeled Qlookahead.
Concurrently, the caching+training process reads the embedding entries stored in the Qlookahead queue
and uses a hash algorithm to place each of these embeddings into the table caches resident in GPU DRAM.
Algorithm 1: Prefetching
Input: training set: Dtrain, lookahead window size: n, training batch size: b, embedding tables:
E = {e_1, ..., e_m}, shared lookahead window queue: Qlookahead
1: while LW ← NextLookaheadWindow(Dtrain, b, n) do
2:   unique_embeds ← []
3:   for i ← 1 to m do
4:     U_i ← ComputeUniqueIndices(LW, i)
5:     P_i ← e_i[U_i]
6:     unique_embeds.append(P_i)
7:   end for
8:   Qlookahead.push(unique_embeds)
9: end while
This process handles the complexities associated with cache management. After placing the entries from
Qlookahead into the table caches, the caching+training process performs forward and backward propagation. This training is completely contained in the GPU, with gradients never flowing back into the primary
embedding tables on the CPU. The last process, the eviction process, is responsible for writing the updated
embedding vectors back to the CPU embedding tables.
The prefetching process must stay ahead of the caching+training process so that the caching+training
process is not stalled waiting for embeddings. The selection of the lookahead parameter is primarily determined by this consideration. Since the speed of the prefetching and caching+training processes can vary
based on the underlying hardware configuration, we use a profiling tool to measure the training speed
for a given model and batch size on the selected GPU configuration. We also measure the speed of the
prefetching process in identifying and extracting unique indices of a single batch from the CPU-resident
embedding tables. The ratio of the training speed to the prefetching speed guides the choice of lookahead
window size. If these profiled measurements change dynamically due to bandwidth and other runtime
bottlenecks, training may occasionally stall. If such a stall occurs, our approach can be adapted to dynamically increase the lookahead window. In our experiments we did not have to change the lookahead
window since the speed ratios are quite stable.
3.5 Training cDLRM on a single GPU
A unique feature of cDLRM is that it only ever needs a single GPU to accommodate the model. Since
only caches of the embedding tables are stored in GPU memory, we can always choose a cache size to fit
within the limits of a single GPU’s memory regardless of the actual size of the whole embedding tables. We
first describe the details of the various processes and how they cooperate to achieve training cDLRM on a
single GPU before presenting how cDLRM scales out to multiple GPUs in a purely data parallel fashion.
3.5.1 Prefetching Process
The fundamental role of the prefetching process is to identify the embedding vectors needed to train on
the next window of batches and feed these embeddings to the caching+training process. It does so by
pre-processing a set of training batches that are read from a training dataset. The training dataset may be
present in a remote storage, and a separate reader process may fetch these batches into a buffer and forward
them to the prefetching process (figure 3.2 shows the buffering of the reader process). For every embedding
table, the prefetching process computes the unique indices in a lookahead window of batches and fetches
the embeddings corresponding to these unique indices from E. The fetched embeddings are buffered in
Qlookahead, which the caching+training process will pull from. Algorithm 1 outlines this procedure.
In practice, the outer while loop is parallelized using a number of processes that are spawned from
a process pool by the prefetching process. Each of the processes spawned from the pool is responsible
for executing the body of the loop on a single lookahead window. This ensures that the caching+training
process, which consumes the entries from Qlookahead, is never stalled as a result of a prefetching bottleneck.
Popular machine learning frameworks such as PyTorch [65] and TensorFlow [1] use a similar approach in
their dataloaders.
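For concreteness, here is a minimal Python sketch of Algorithm 1 under some simplifying assumptions (embedding tables held as NumPy arrays, lookahead windows yielded by an iterator, and a multiprocessing queue standing in for Qlookahead); the function and variable names are illustrative rather than cDLRM's actual API, and in practice the loop body would be farmed out to a process pool as described above.

import numpy as np
from multiprocessing import Queue

def prefetch(lookahead_windows, tables, q_lookahead: Queue):
    """lookahead_windows yields arrays of shape (span, m) of categorical indices;
    tables is the list [e_1, ..., e_m] of CPU-resident embedding matrices."""
    for window in lookahead_windows:
        unique_embeds = []
        for i, table in enumerate(tables):
            # Unique indices accessed in this window for table i, and their embeddings.
            unique_idx = np.unique(window[:, i])
            unique_embeds.append((unique_idx, table[unique_idx]))
        q_lookahead.put(unique_embeds)   # consumed by the caching+training process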
Note that it is possible that some embedding vectors corresponding to a subset of unique indices may
already have been transferred to GPU cache in a prior training window. But the prefetching process does
not try to prevent loading such vectors into the Qlookahead queue. Instead, we leave this responsibility to
the caching+training process to provide a consistent view as we describe next.
3.5.2 Caching+Training Process
The computation performed by the caching+training process is split up into three phases†: (1) preloading/caching, (2) forward propagation, and (3) backpropagation and parameter update.
3.5.2.1 Preloading
In the preloading phase, the process pulls the embeddings corresponding to the unique indices in the next
n batches from the Qlookahead queue and caches them in C. The set index for original index j in table e_i
is computed using the modulo operator (specifically, j_set = j % s_i).
Handling coherence. Before placing an embedding vector into GPU cache c_i, the caching+training
process must make sure that this index was not already cached in a prior caching step. If the index is
already in c_i, then the prefetched embedding vector from CPU DRAM is stale. Rather than search c_i for
a cache hit, which takes up GDDR bandwidth, we maintain a separate data structure on the CPU called
the occupancy table. There are m occupancy tables, each one corresponding to one c_i. An occupancy
table stores which embedding table entries are currently resident on the GPU. As such, the occupancy table is
essentially a tag structure for the embedding table caches and is stored in CPU DRAM to enable the CPU
to manage the caching process.
To find out if the embedding corresponding to index j in table e_i is resident in c_i, the caching+training
process needs to compute the set index of j and look up the occupancy table for c_i. Thus, the set of unique
indices for table e_i that has been pulled from Qlookahead can be split up into two disjoint subsets: J^i_hit, the
set of indices whose embeddings are already in c_i, and J^i_miss, the set of indices whose embeddings need
to be moved into c_i.
†For simplicity, we describe the phases by focusing on a single embedding table e_i. The description generalizes to all
embedding tables.
Any set+way that stores an index ∈ J^i_hit is considered a pinned entry, so it will not be
evicted until it is used by the training pipeline.
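The hit/miss split can be expressed compactly; the following is an illustrative sketch (not the actual cDLRM code) that assumes a per-set record of which original index occupies each way; all names and the data layout are assumptions for illustration.

import numpy as np

def split_hit_miss(unique_idx, occupant, num_sets):
    """occupant[s, w] holds the original table index cached in set s, way w (-1 if empty).
    Returns (hit_indices, miss_indices) for the unique indices of one table."""
    sets = unique_idx % num_sets                       # set index for every unique index
    # An index hits if it is already recorded in any way of its set.
    hit_mask = (occupant[sets] == unique_idx[:, None]).any(axis=1)
    return unique_idx[hit_mask], unique_idx[~hit_mask]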
Caching missing indices. The second step is to cache embeddings for indices in J^i_miss. For each
j ∈ J^i_miss we compute its set index S as described above. The way within set S into which its
embedding will be moved is chosen uniformly at random from the unpinned entries in that set. The
reason for choosing ways at random is performance. Our implementation is heavily parallelized; if we
were to choose an alternate policy such as the next available way, the caching procedure would have to be
serial, making the entire process slower. Random way selection thus allows multiple indices to randomly
and concurrently pick an available way in a set. Multiple original indices that map to the same set could
end up choosing the same way to cache their embedding in. By using different random seeds one can
minimize such overlap. But in the worst case, to resolve a conflict when two indices map to the same
way within the set, we once again choose one of the colliding indices at random to occupy the way. The
embeddings corresponding to all the other colliding indices are considered conflict victims of the caching
procedure. These conflict victims are unable to stay within a given set and are handled differently.
In practice, such a collision is extremely rare. This is because two unique indices from J^i_miss can only
have a potential collision if they map to the same set. We evaluated different policies for handling conflict
victims, but given the rarity of this scenario we simply do not place a conflict victim into the GPU cache.
Instead, during training we re-fetch the missing embedding vector (conflict victim) from the CPU directly
into a victim buffer, as we describe in section 3.5.2.2.
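As an illustration of the concurrent, random way selection described above, here is a small vectorized sketch (an assumption about the implementation style, not the actual code); it picks a random way per missing index and marks indices that drew the same (set, way) pair, beyond a randomly kept winner, as conflict victims. Pinned and already-occupied ways are ignored here for brevity.

import numpy as np

def assign_ways(miss_idx, num_sets, num_ways, rng=np.random.default_rng()):
    """Assign a random way to each missing index; colliders on the same (set, way)
    pair, other than the randomly chosen winner, become conflict victims."""
    sets = miss_idx % num_sets
    ways = rng.integers(0, num_ways, size=len(miss_idx))       # random way per index
    order = rng.permutation(len(miss_idx))                     # random winner among colliders
    slot = sets[order] * num_ways + ways[order]                # flattened (set, way) slot ids
    keep = np.zeros(len(miss_idx), dtype=bool)
    keep[order[np.unique(slot, return_index=True)[1]]] = True  # first claimant of each slot wins
    winners, victims = miss_idx[keep], miss_idx[~keep]
    return winners, sets[keep], ways[keep], victims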
Evicted Embeddings. Any way in a set that is evicted to make room for embeddings of indices in
J^i_miss needs to be written back to the CPU to update e_i. The caching+training process works with the
eviction process (section 3.5.3) to orchestrate this writeback. The caching+training process simply pushes
the evicted embeddings into an eviction queue, Qwriteback, from which the eviction process will pull the
evicted embeddings and write them to e_i.
Algorithm 2: Single GPU Training
Input: training set: Dtrain, lookahead window size: n, training batch size: b, embedding table
caches: C = {c_1, ..., c_m}, victim tables: V = {v_1, ..., v_m}, DLRM MLPs: Mbot, Mtop, shared
lookahead window queue: Qlookahead, eviction queue: Qwriteback, number of epochs: num_epochs
1: for epochs ← 1 to num_epochs do
2:   while b_k ← NextBatch(Dtrain, b) do
3:     if k % n == 0 then
4:       unique_embeds ← Qlookahead.pull()
5:       CacheEmbeds(C, unique_embeds, Qwriteback)
6:     end if
7:     Forward(Mbot, Mtop, C, V, b_k)
8:     Backward(Mbot, Mtop, C, V)
9:     Update(Mbot, Mtop, C, V)
10:    FlushVictimTables(V, Qwriteback)
11:  end while
12: end for
3.5.2.2 Forward Propagation
The phase that follows the caching phase is the forward propagation phase. In this phase, the caching+training
process performs a forward propagation on a single batch of training examples. The continuous features
in the batch are transformed through the MLPs, while the categorical feature vector consists of indices for
which embeddings need to be obtained.
Indices that hit in c_i. After the caching phase, the embeddings for most indices needed over the
current lookahead window of n batches will be resident in c_i. The reason we cannot guarantee that the
embedding for every index in every batch in the lookahead window will be in c_i is because some of them
might be conflict victims, as explained in section 3.5.2.1. The hitting indices can simply look up their embeddings in c_i by computing the set and the way.
Indices that miss in c_i. While indices that miss in a cache are rare, we still require the conflict victim
embeddings on the GPU to enable GPU-contained training. To address this issue, we maintain a victim
table v_i ∈ R^{b×d}, similar to a victim cache [39], in GPU DRAM, into which the conflict victim embeddings
will be cached on demand during forward propagation. This involves a CPU to GPU data copy step that
adds overhead. An important note about the size of the victim table is that the number of rows need only
be as large as b, the size of the batch. This is because in a batch of size b, there can only be at most b indices
that miss in c_i. In practice we find that hit rates in the primary cache are very high and that we rarely use
more than a few rows in the victim table to handle conflicts. Once the conflict victims have been placed
into the victim table, all of the required embeddings for the current batch are in GPU memory, and the
GPU-contained forward propagation can proceed.
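To illustrate how a batch's embedding lookups combine the primary cache and the victim table during the forward pass, here is a hedged PyTorch-style sketch; the occupant bookkeeping, victim_map, and fetch_from_cpu helper are illustrative assumptions, not cDLRM's actual interfaces.

import torch

def lookup_batch(batch_idx, cache, occupant, num_sets, victim_table, victim_map, fetch_from_cpu):
    """cache: (num_sets, num_ways, d) GPU tensor; occupant: (num_sets, num_ways) tensor of
    original indices; victim_table: (b, d) GPU tensor; victim_map: original index -> victim row."""
    out = torch.empty(len(batch_idx), cache.shape[-1], device=cache.device)
    next_victim_row = len(victim_map)
    for pos, j in enumerate(batch_idx.tolist()):
        s = j % num_sets
        ways = (occupant[s] == j).nonzero(as_tuple=True)[0]
        if len(ways) > 0:                                   # hit in the primary cache
            out[pos] = cache[s, ways[0]]
        else:                                               # conflict victim: fetch on demand
            if j not in victim_map:
                victim_table[next_victim_row] = fetch_from_cpu(j).to(victim_table.device)  # CPU -> GPU copy
                victim_map[j] = next_victim_row
                next_victim_row += 1
            out[pos] = victim_table[victim_map[j]]
    return out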
3.5.2.3 Backpropagation and Parameter Update
Backpropagation, gradient computation, and parameter updates happen entirely in GPU memory. Embeddings of hitting indices are updated in c_i, and embeddings of conflict victims are updated in v_i. All
MLPs are updated in GPU memory as well. After updating all the parameters, the conflict victims that are
buffered in vi are explicitly flushed into the eviction queue while the cache resident embeddings are left
as is where they may be evicted later. Algorithm 2 outlines the entire training procedure.
3.5.3 Eviction Process
The last of the three processes is the eviction process. Its role is simply to pull embeddings that have been
pushed into the eviction queue by the training process and write them back to the original embedding
tables in CPU DRAM, namely, E. It is computationally the least intensive process of the three.
Several of the cDLRM design choices were made in favor of high performance rather than strictly enforcing behavior identical to a baseline uncached system, which performs forward passes on embeddings
that have been updated only after the previous backward pass. For example, consider the following illustrative scenario: let j be a unique index that is needed during lookahead windows w and w + 2, but not in
w + 1. If the prefetching process is prefetching data for window w + 2 it may fetch embedding vector for
j from the CPU and place it in Qlookahead. But during window w + 1, j may be evicted and pushed back to
Algorithm 3: Multi GPU Training (process perspective)
Input: training set: Dtrain, lookahead window size: n, training batch size: b, embedding table
caches: C^r = {c^r_1, ..., c^r_m}, victim tables: V^r = {v^r_1, ..., v^r_m}, DLRM MLPs: Mbot^r, Mtop^r,
lookahead window queue: Qlookahead, eviction queue: Qwriteback, number of epochs: num_epochs,
process rank: r, communication world: W, cache aggregation granularity: λ
1: for epochs ← 1 to num_epochs do
2:   while b^r_k ← NextBatch(Dtrain, b) do
3:     if k % n == 0 and r == 0 then
4:       unique_embeds ← Qlookahead.pull()
5:       CacheEmbeds(C^0, unique_embeds, Qwriteback)
6:       BroadcastCacheState(C^0, W)
7:     end if
8:     Forward(Mbot^r, Mtop^r, C^r, V^r, b^r_k)
9:     Backward(Mbot^r, Mtop^r, C^r, V^r)
10:    AggregateMLPs(Mbot^r, Mtop^r, W)
11:    Update(Mbot^r, Mtop^r, C^r, V^r)
12:    if k % λ == 0 then
13:      AggregateCaches(C^r, W)
14:    end if
15:    FlushVictimTables(V^r, Qwriteback)
16:  end while
17: end for
CPU. Finally, during window w + 2, the caching+training process may simply read the data from Qlookahead,
which may be a stale embedding. While such cases occur rarely, it is important to note them to understand
their impact on accuracy. The only way to avoid this race condition is to use locks, which would make the
whole system slower. Hence, for performance reasons, we opt to allow rare stale embeddings. We show
empirically that the allowance of this staleness does not hurt model accuracy.
3.6 Data Parallel Training with Caching
While cDLRM enables the training of DLRM on a single GPU regardless of embedding table sizes, when
multiple GPUs are available it is possible to improve the training speed by using data parallelism in MLPs
to speed up the overall computation. In such cases, cDLRM can easily scale out to multiple GPUs in a
purely data parallel fashion by replicating the embedding table caches as well as MLPs on all participating
GPUs. While prior approaches used model replication for data parallel training, our approach is unique in
that we cache the necessary model parameters across multiple GPUs thereby emulating data parallelism
on top of model caching.
3.6.1 Maintaining Cache Coherency in Data Parallel Training
When training in a data parallel fashion with replicated caches on all GPUs, cache coherency needs to
be maintained. Coherency issues arise when two GPUs need to access the same embedding table index
in different training samples. cDLRM enforces coherency at two places in the training pipeline: at the
beginning of a lookahead window when caches are loaded with the contents of the upcoming batches; and
after each GPU individually updates its replicas of the embedding tables and MLPs.
3.6.1.1 Coherency at the beginning of a lookahead window
Recall from section 3.5.2.1 that caches are preloaded/warmed up at the beginning of every lookahead window. In the case of single-GPU training, the cached table entries reside on the single GPU used to train
and hence there are no consistency issues across different caches. However, when training in a data parallel
fashion on multiple GPUs, the caches on all GPUs need to have the same state after caching. We facilitate
this through a broadcast of cache state. More specifically, the process with rank 0 is the only process that
interfaces with the prefetching process. It caches the embeddings for the upcoming lookahead window on
GPU0 as described in section 3.5.2.1 and broadcasts the cache state to all other GPUs over a high bandwidth NVLink interconnect. For aggregation simplicity the broadcast process replicates the cache across
all GPUs, even though some of the embedding table entries may only be needed on a single GPU. Through
this broadcast we guarantee that all GPUs see a consistent copy of any shared embedding entry before
they start the next training batch.
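A minimal sketch of this broadcast step using torch.distributed (an assumption about the implementation; the actual cDLRM code may differ) could look as follows: rank 0 fills its caches and every rank then receives the same tensors, assuming an already initialized process group.

import torch.distributed as dist

def broadcast_cache_state(caches, occupancy, src_rank=0):
    """caches: list of (num_sets, num_ways, d) GPU tensors (one per table);
    occupancy: list of (num_sets, num_ways) tensors describing which index occupies each way.
    After this call, every rank holds the same cache state as rank src_rank."""
    for cache in caches:
        dist.broadcast(cache, src=src_rank)
    for occ in occupancy:
        dist.broadcast(occ, src=src_rank)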
3.6.1.2 Coherency after individual parameter updates
Since each process involved in data parallelism is training on its own batches, the individual updates
to the parameters (both MLPs and embedding tables) in each process will once again lead to a lack of
consistency between the parameters on each GPU. Note that this inconsistency is similar to traditional
data parallel training where each GPU may have its own model updates locally first. These updates must
be aggregated for a global model. cDLRM employs the following practices to synchronize parameters:
MLP aggregation. MLPs are synchronized by the standard practice of gradient averaging. At the
end of every minibatch, the gradients for the MLPs across all GPUs are averaged via an all-reduce. The
resulting gradient is used to update parameters on all GPUs.
Embedding Table Cache aggregation. Synchronizing embedding table caches in a similar manner
is infeasible. Despite being much smaller than whole embedding tables, accesses to the caches at a per-batch
granularity are still sparse, thereby making the gradients sparse. Performing an all-reduce on sparse
gradients requires at least two communication calls: one to communicate the locations of the non-zero
elements and another to communicate their values.
To avoid this overhead, we choose to average the caches post parameter update via a single all-reduce.
Recall that GPU0 broadcasts the entire cache to all GPUs, even though only some of these embedding
indices are actually shared. But by sharing the entire cache the aggregation process is simplified. Gradients
of shared embedding table entries will be produced by multiple GPUs and these are aggregated to create
a single weight update for all the shared entries. On the other hand, if an embedding table entry is only
used on a single GPU only that GPU produces the update for that model parameter and all other GPUs
produce a zero gradient update. Hence, our approach lets each GPU first update its own cache with the
gradients it computed. Once the local caches are updated these caches are then averaged to create a global
model update.
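The post-update averaging can be expressed with a single all-reduce per cache; the following is an illustrative torch.distributed sketch (the function names are assumptions), showing both the per-step MLP gradient averaging and the less frequent cache averaging.

import torch.distributed as dist

def aggregate_mlps(model):
    # Standard data-parallel gradient averaging for the dense MLP parameters.
    world = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world

def aggregate_caches(caches):
    # Average the (already locally updated) embedding table caches across GPUs.
    world = dist.get_world_size()
    for cache in caches:
        dist.all_reduce(cache, op=dist.ReduceOp.SUM)
        cache /= world

In the training loop, aggregate_mlps would run every mini-batch while aggregate_caches would run only when k % λ == 0, mirroring lines 10 and 12-14 of Algorithm 3.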
Dataset C M Training/Test Examples
Kaggle 13 26 39,291,958 / 3,274,330
Terabyte 13 26 645,753,353 / 13,760,492
(a) Dataset dimensions and split
Dataset DLRM cDLRM
Kaggle 78.967±0.01 78.971±0.01
Terabyte 81.07±0.01 81.06±0.01
(b) Accuracies at convergence
Table 3.1: Dataset Details
Interestingly, since only a few cached entries are truly shared across GPUs the sparse model update
of each GPU is sufficient to update its local cache. Hence, averaging the caches across multiple GPUs can
be done much less frequently than the MLPs. cDLRM thus tracks the fraction of sparse updates across
caches and waits until the fraction of cache updates across all GPUs exceeds a threshold. The granularity
at which we average caches is a hyperparameter of the optimization process. Algorithm 3 illustrates how
Algorithm 2 changes in the multi-GPU setting from the perspective of each process. AggregateMLPs and
AggregateCaches are collective communication calls in which all ranks participate.
3.7 Experimental Evaluation
In this section we experimentally evaluate the effect of the proposed caching strategy on DLRMs both in
terms of model accuracy as well as computational performance.
3.7.1 Datasets
We use two publicly available datasets to evaluate cDLRM relative to the baseline open source DLRM: the
Criteo AI Labs Ad Kaggle and Terabyte datasets. Both datasets consist of click logs for ad click-through-rate
prediction. To our knowledge, these are the only publicly available datasets in this domain, a sentiment
reiterated by [62]. The dataset details are outlined in Table 3.1(a). Each datapoint in both datasets contains
13 dense features (C) and 26 sparse features (M). The model architecture used to train on each dataset is
also listed in the table. All performance results reported are on the Terabyte dataset.
3.7.2 Experimental Setup
All our experiments are run on a server machine with 4 Intel(R) Xeon(R) Gold 5220R CPUs, each with 24
cores, and 8 Nvidia Quadro RTX 5000 GPUs, each equipped with 16GB of GDDR6 high bandwidth memory.
3.7.3 Experimental Results
We evaluate cDLRM on the following criteria: (1) model accuracy on a single GPU; (2) sensitivity of computational performance to cache parameters when training on a single GPU; and (3) accuracy and scalability
when using multiple GPUs. In all cases, we use the original DLRM [62] as the baseline for comparison,
using the same hyperparameters and no hyperparameter tuning or optimization. We use a bottom MLP
arch of 13-512-256-64, a top MLP arch of 512-512-256-1, an embedding table dimension of 64 and a 16-way
associative cache to evaluate (1) and (2). For (3) we use the model architecture used in the MLPerf [54]
benchmark in which the only difference is a bottom MLP arch of 13-512-256-128 and an embedding table
dimension of 128.
3.7.3.1 Model Accuracy on a single GPU
As mentioned earlier, there are rare cases when the embedding vectors may be stale. The comparison of
test accuracy between the final DLRM model and the cDLRM model obtained at convergence shows that
the accuracy obtained by cDLRM is within 0.02%. These results are shown in Table 3.1(b). Both cDLRM and
DLRM were trained with the same batch size of 2048 for 2 epochs.
3.7.3.2 Caching overheads and training speed on a single GPU
To analyze the impact of caching on the training speed of cDLRM, we split the training time into two
components: (1) the time it takes to execute lines 7, 8, 9 and 10 in Algorithm 2 - batch computation time. (2)
The time it takes to execute lines 4 and 5 - the caching overhead. The amortized caching overhead is the average
[Figure 3.3: (a) Single GPU cDLRM performance: total per-batch execution time (ms) vs. lookahead window size for cache sizes of 1,000, 10,000, and 50,000 sets; (b) caching cost: amortized caching overhead (ms) vs. lookahead window size for the same cache sizes; (c) cache performance: cache hit rate vs. cache size for lookahead windows of 10 to 400. All data from the Terabyte dataset; Bot MLP: 13-512-256-64, Top MLP: 512-512-256-1, Embedding Dim: 64.]
per batch overhead due to caching incurred by every batch in the lookahead window. It is computed by
dividing the caching overhead by the lookahead window size. The total per batch execution time is the sum
of the batch computation time and the amortized caching overhead.
cDLRM is heavily parallelized with vector operations. Hence, caching overhead costs can be amortized when using large lookahead windows. By using a larger lookahead window there are more training
examples to pre-process concurrently which benefit from vectorization. On the other hand, using a larger
lookahead window requires more memory for Qlookahead, and larger embedding table cache sizes in GPU
DRAM to hold more embeddings. Hence, there is a tradeoff in amortizing caching overhead vs. reducing
cache size. As shown in Figure 3.3a, with a large cache size of up to 50,000 sets per table, the total per
batch computation time decreases with increase in the lookahead window size. In general, the larger the
lookahead window, the smaller the per-batch amortized caching overhead. This rule holds up to the maximum available hardware parallelism to execute vectorized code. Figure 3.3b shows the amortized caching
overhead gradually decreasing with increasing lookahead window size. In fact, the amortized caching
overhead decreases independent of the cache size on GPU DRAM.
However, as mentioned above there is a tradeoff between lookahead window size and cache size, even
when sufficient hardware parallelism is available. Figure 3.3a also demonstrates the detrimental effects
of smaller cache sizes when using larger lookahead windows. As explained before, when using a larger
[Figure 3.4: (a) cDLRM multi-GPU scaling: per-batch computation time (ms/it) vs. batch size for 2, 4, and 8 GPUs; (b) caching overhead as a percentage of overall compute with a 32K batch size; (c) sensitivity of convergence accuracy to the cache aggregation frequency λ for batch sizes of 8192, 16384, and 32768. Cache size = 150,000 sets / 16 ways; lookahead size = 500.]
lookahead window more embedding vectors need to be brought into the GPU cache. This results in a higher
probability of there being more unique indices in the lookahead window than there are empty cache slots
available. This leads to more conflict victims. Thus, the forward propagation becomes the bottleneck as a
result of having to fetch these conflict victims into the victim tables. This fact is corroborated by Figure
3.3c which shows the cache hit rate for different cache sizes and lookahead windows. The cache hit rate
measures the number of embedding vectors that are already cached in GPU DRAM and those that miss in
the cache are conflict victims. When the cache size is small increasing the lookahead window causes more
conflict victims. Hence, the total per batch execution time increases with lookahead window size when
the cache size is not properly configured to account for the lookahead window size. This result shows that
the cache size and lookahead window size selection must be coordinated for optimal performance.
3.7.3.3 Accuracy and Scalability of cDLRM on multiple GPUs
As explained in section 3.6 cDLRM can leverage multiple GPUs when they are available by training in a
purely data parallel fashion. As a result, cDLRM shows strong scaling when using multiple GPUs. Figure
3.4(a) illustrates the benefit of using multiple GPUs when they are available. The model architecture used
is the same model architecture used in the MLPerf training benchmark. We use a cache size of 150,000 sets
with 16-way set associativity. We use a lookahead window size of 3000 for batch sizes up to 8192. For a batch
size of 16384 we use a lookahead window size of 1500 and for a batch size of 32768 we use 500. These values
give us the best performance for the respective batch sizes. As we can see, larger batch sizes warrant the use
of additional GPUs much more so than smaller batch sizes. Unlike the DLRM baseline, cDLRM can select
the number of GPUs based on optimal batch sizes as opposed to being constrained by embedding table
memory requirements. Another effect of more data parallelism is that caching overhead starts to become
a larger percentage of the total computation time. Figure 3.4(b) illustrates this phenomenon. Caching can
add over 25% overhead to the total computation time. We believe that this overhead can be reduced or
even eliminated with some additional software engineering, but we leave this to future work. Recall that,
for performance reasons, cDLRM aggregates caches at a granularity specified by the hyperparameter
λ due to their sparse nature. Figure 3.4c illustrates the effect of λ on convergence accuracy for different
batch sizes. The takeaway is that for a fixed cache and lookahead size, λ can be larger for smaller batch
sizes without incurring significant penalty in convergence accuracy.
3.8 Related Work
The problem of large recommendation systems has been recognized in previous work. [98] proposes a
hierarchical parameter server distributed between NVMe, CPU and GPU memory to alleviate the demands
on GPU memory for storing model parameters. Their approach is centered around a distributed hash
table, split between multiple GPUs, as well as an execution pipeline involving the SSD, CPU and GPU. Our
work takes the orthogonal approach of reducing the need for many GPUs in the first place by using caching.
However, when a cached system still needs to be distributed across many GPUs we believe [98] can be
applied on top of cDLRM for cached parameter aggregations. AIBox [99] takes a similar approach to [98].
They seek to keep model parameters in NVMe and train on GPUs while using CPU memory as a cache
for frequently accessed parameters. The key difference between cDLRM and AIBox is that AIBox doesn’t
proactively identify the necessary model parameters for a given window of batches and instead caches
parameters in CPU memory. They use various storage access optimizations to reduce I/O overhead. As we
described earlier the locality of embedding table accesses is very poor in DLRM like models. Hence, preprocessing batches and prefetching the unique indices is an efficient way to get near 100% cache hit rate.
We also address how we deal with various consistency issues with caching embeddings. Other approaches
to reducing GPU memory usage such as [70] are tailored for models whose intermediate activations require
large amounts of GPU memory. DLRM intermediate activations are fairly small compared to embedding
table size. Hence such approaches are ill-suited to DLRM like models. Other pipelined approaches such as
[38] are designed for models in which sparse parameters and dense parameters are never part of the same
layer of computation. These are models that lend themselves to efficient pipelining. DLRM on the other
hand contains layers with both sparse and dense parameters, hence designing pipelined systems that do not
degrade model performance can be very challenging. We would like to stress that the objective of cDLRM
is not to be the fastest recommendation model training system, but to democratize recommendation model
training to the point where it does not require hundreds of thousands of dollars of hardware and GPUs to
be able to just fit the model.
Chapter 4
Graph Traversal with Tensor Functionals: A Meta-algorithm for
Scalable Learning
In the preceding chapter, we introduced a hardware-efficient implementation of deep learning recommendation models, which effectively tackled the first challenge we aimed to address. Our focus now shifts
towards resolving the long-tail item problem. However, before delving into the specifics of this issue, we
will take a brief detour to present a graph representation learning (GRL) framework. This framework
empowers rapid, stochastic, and unbiased learning of diverse GRL algorithms, serving as a fundamental
component of our proposed solution to the long-tail item problem. In this context, we view the recommendation problem as a learning task involving the user-item interaction graph, and our approach leverages
the capabilities of the GRL framework to achieve an efficient and effective implementation.
4.1 Introduction
Graph representation learning (GRL) has become an invaluable approach for a variety of tasks, such as
node classification (e.g., in biological and citation networks; [79, 40, 30, 90]), edge classification (e.g., link
prediction for social and protein networks; [67, 28]), entire graph classification (e.g., for chemistry and
drug discovery [26, 11]), etc.
In this work, we propose an algorithmic unification of various GRL methods that allows us to reimplement existing GRL methods and introduce new ones, in merely a handful of code lines per method.
Our algorithm (abbreviated GTTF, Section 4.3.2) receives graphs as input, traverses them using efficient
tensor∗ operations, and invokes specializable functions during the traversal. We show function specializations for recovering popular GRL methods (Section 4.3.3). Moreover, since GTTF is stochastic, these
specializations automatically scale to arbitrarily large graphs, without careful derivation per method. Importantly, such specializations, in expectation, recover unbiased gradient estimates of the objective w.r.t.
model parameters.
GTTF uses a data structure Ab (Compact Adjacency, Section 4.3.1): a sparse encoding of the adjacency
matrix. Node v contains its neighbors in row Ab[v] ≜ Abv, notably, in the first degree(v) columns of Ab[v].
This encoding allows stochastic graph traversals using standard tensor operations. GTTF is a functional,
as it accepts functions AccumulateFn and BiasFn, respectively, to be provided by each GRL specialization
to accumulate necessary information for computing the objective, and optionally to parametrize sampling
procedure p(v’s neighbors | v). The traversal internally constructs a walk forest as part of the computation
graph. Figure 4.1 depicts the data structure and the computation. From a generalization perspective, GTTF
shares similarities with Dropout [76].
Our contributions are:
1. A stochastic graph traversal algorithm (GTTF) based on tensor operations that inherits the benefits
of vectorized computation and libraries such as PyTorch and Tensorflow.
2. We list specialization functions, allowing GTTF to approximately recover the learning of a broad
class of popular GRL methods.
3. We prove that this learning is unbiased, with controllable variance, for this class of methods.
∗To disambiguate: by tensors, we refer to multi-dimensional arrays, as used in Deep Learning literature; and by operations,
we refer to routines such as matrix multiplication, advanced indexing, etc.
[Figure 4.1: (a) Example graph G; (b) adjacency matrix for graph G; (c) CompactAdj for G, with sparse Ab ∈ Z^{n×n} and dense δ ∈ Z^n, storing the IDs of adjacent nodes in Ab; (d) walk forest, where GTTF invokes AccumulateFn once per instance. Panels (c) & (d) depict our data structure & traversal algorithm on the toy graph in (a) & (b).]
4. We show that GTTF can scale previously-unscalable GRL algorithms, setting the state-of-the-art on
a range of datasets.
5. We open-source GTTF along with new stochastic traversal versions of several algorithms, to aid
practitioners from various fields in applying and designing state-of-the-art GRL methods for large
graphs.
4.2 Related Work
We take a broad standpoint in summarizing related work to motivate our contribution.
Models for GRL have been proposed, including message passing (MP) algorithms, such as Graph Convolutional Network (GCN) [40], Graph Attention (GAT) [79]; as well as node embedding (NE) algorithms,
including node2vec [28], WYS [3]; among many others [90, 89, 67]. The full-batch GCN of [40], which
drew recent attention and has motivated many MP algorithms, was not initially scalable to large graphs,
as it processes all graph nodes at every training step. To scale MP methods to large graphs, researchers
proposed Stochastic Sampling Methods that, at each training step, assemble a batch constituting subgraph(s)
of the (large) input graph. Some of these sampling methods yield unbiased gradient estimates (with some
variance) including SAGE [30], FastGCN [12], LADIES [102], and GraphSAINT [95]. On the other hand,
ClusterGCN [16] is a heuristic in the sense that, despite its good performance, it provides no guarantee of
unbiased gradient estimates of the full-batch learning. [26] and [10] generalized many GRL models into
Message Passing and Auto-Encoder frameworks. These frameworks prompt bundling of GRL methods
under Software Libraries, like PyG [23] and DGL [85], offering consistent interfaces on data formats.
Method | Family | Scale | Learning
Models:
  GCN, GAT | MP | ✗ | exact
  node2vec | NE | ✓ | approx
  WYS | NE | ✗ | exact
Stochastic Sampling Methods:
  SAGE | MP | ✓ | approx
  FastGCN | MP | ✓ | approx
  LADIES | MP | ✓ | approx
  GraphSAINT | MP | ✓ | approx
  ClusterGCN | MP | ✓ | heuristic
Software Frameworks:
  PyG | Both | inherits / re-implements
  DGL | Both | inherits / re-implements
Algorithmic Abstraction (ours):
  GTTF | Both | ✓ | approx
We now position our contribution relative to the above. Unlike generalized message passing [26], rather than abstracting the
model computation, we abstract the learning algorithm. As a result, GTTF can be specialized to recover the learning of MP as well
as NE methods. Moreover, unlike Software Frameworks, which are
re-implementations of many algorithms and therefore inherit the
scale and learning of the copied algorithms, we re-write the algorithms themselves, giving them new properties (memory and computation complexity), while maintaining (in expectation) the original algorithm outcomes. Further, while the listed Stochastic Sampling Methods target MP algorithms (such as GCN, GAT, and the like), as
their initial construction could not scale to large graphs, our learning algorithm applies to a wider class of GRL methods, additionally
encapsulating NE methods. Finally, while some NE methods such
as node2vec [28] and DeepWalk [67] are scalable in their original
form, their scalability stems from their multi-step process: sample
many (short) random walks, save them to disk, and then learn node
embeddings using positional embedding methods (e.g., word2vec,
[57]) – they are sub-optimal in the sense that their first step (walk
sampling) takes considerable time (before training even starts) and also places an artificial limit on the
number of training samples (number of simulated walks), whereas our algorithm conducts walks on-the-fly whilst training.
4.3 Graph Traversal via Tensor Functionals (GTTF)
At its core, GTTF is a stochastic algorithm that recursively conducts graph traversals to build representations of the graph. We describe the data structure and traversal algorithm below, using the following
notation. G = (V, E) is an unweighted graph with n = |V| nodes and m = |E| edges, described as a
sparse adjacency matrix A ∈ {0, 1}^{n×n}. Without loss of generality, let the nodes be zero-based numbered,
i.e. V = {0, . . . , n − 1}. We denote the out-degree vector δ ∈ Z^n; it can be calculated by summing over
rows of A as δ_u = Σ_{v∈V} A[u, v]. We assume δ_u > 0 for all u ∈ V: pre-processing can add self-connections
to orphan nodes. B denotes a batch of nodes.
4.3.1 Data Structure
Internally, GTTF relies on a reformulation of the adjacency matrix, which we term CompactAdj (for "Compact Adjacency", Figure 4.1c). It consists of two tensors:
1. δ ∈ Z^n, a dense out-degree vector (Figure 4.1c, right).
2. Ab ∈ Z^{n×n}, a sparse edge-list matrix in which row u contains left-aligned δ_u non-zero values.
The consecutive entries {Ab[u, 0], Ab[u, 1], . . . , Ab[u, δ_u − 1]} contain the IDs of nodes receiving an edge
from node u. The remaining |V| − δ_u entries are left unset; therefore, Ab only occupies O(m) memory when
stored as a sparse matrix (Figure 4.1c, left).
CompactAdj allows us to concisely describe stochastic traversals using standard tensor operations. To
uniformly sample a neighbor to node u ∈ V , one can draw r ∼ U[0..(δu − 1)], then get the neighbor ID
with Ab[u, r]. In vectorized form, given node batch B and access to continuous U[0, 1), we sample neighbors
for each node in B as: R ∼ U[0, 1)^b, where b = |B|; then B′ = Ab[B, ⌊R ◦ δ[B]⌋] is a b-sized vector, with
B′_u containing a neighbor of B_u, where the floor operation ⌊·⌋ is applied element-wise and ◦ is the Hadamard product.
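As a minimal sketch of this vectorized sampling (illustrative code, not the released GTTF implementation), the following NumPy snippet builds a dense stand-in for CompactAdj from an edge list and samples one neighbor per batch node; it assumes every batch node has at least one outgoing edge.

import numpy as np

def build_compact_adj(edges, n):
    """edges: list of (u, v) pairs. Returns (compact_adj, degree) where row u of
    compact_adj holds u's neighbors left-aligned (a dense stand-in for the sparse Ab)."""
    degree = np.zeros(n, dtype=np.int64)
    compact_adj = np.full((n, n), -1, dtype=np.int64)
    for u, v in edges:
        compact_adj[u, degree[u]] = v
        degree[u] += 1
    return compact_adj, degree

def sample_neighbors(compact_adj, degree, batch, rng=np.random.default_rng()):
    # B' = Ab[B, floor(R ◦ δ[B])] with R ~ U[0, 1)^b; batch is an integer array of node IDs.
    R = rng.random(len(batch))
    cols = np.floor(R * degree[batch]).astype(np.int64)
    return compact_adj[batch, cols]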
4.3.2 Stochastic Traversal Functional Algorithm
Our traversal algorithm starts from a batch of nodes. It expands from each into a tree, resulting in a walk
forest rooted at the nodes in the batch, as depicted in Figure 4.1d. In particular, given a node batch B,
the algorithm instantiates |B| seed walkers, placing one at every node in B. Iteratively, each walker first
replicates itself a fanout (f) number of times. Each replica then samples and transitions to a neighbor. This
process repeats a depth (h) number of times. Therefore, each seed walker becomes the ancestor of an f-ary
tree with height h. Setting f = 1 recovers a traditional random walk. In practice, we provide flexibility by
allowing a custom fanout value per depth.
Functional Traverse is listed in Algorithm 4. It accepts: a batch of nodes†
; a list of fanout values F
(e.g. F = [3, 5] samples 3 neighbors per u ∈ B, then 5 neighbors for each of those); and more notably,
two functions: AccumulateFn and BiasFn. These functions will be called by the functional on every node
visited along the traversal, and will be passed relevant information (e.g. the path taken from root seed
node). Custom settings of these functions allow recovering wide classes of graph learning methods. At a
high-level, our functional can be used in the following manner:
1. Construct model & initialize parameters (e.g. to random). Define AccumulateFn and BiasFn.
2. Repeat (many rounds):
i. Reset accumulation information (from previous round) and then sample batch B ⊂ V .
ii. Invoke Traverse on (B, AccumulateFn, BiasFn), which invokes the Fn’s, allowing the first to accumulate information sufficient for running the model and estimating an objective.
iii. Use accumulated information to: run model, estimate objective, apply learning rule (e.g. SGD).
†Our pseudo-code displays the traversal starting from one node rather than a batch only for clarity, as our actual implementation is vectorized: e.g., u would be a vector of nodes and T would be a 2D matrix with each row containing the transition path preceding the corresponding entry in u.
Algorithm 4: Stochastic Traverse Functional, parametrized by AccumulateFn and BiasFn
input: u (current node); T ← [] (path leading to u, starts empty); F (list of fanouts);
AccumulateFn (function with side-effects and no return; it is model-specific and
records information for computing the model and/or objective, see text);
BiasFn ← U (function mapping u to a distribution on u's neighbors, defaults to uniform)
1  def Traverse(T, u, F, AccumulateFn, BiasFn):
2    if F.size() = 0 then return  # Base case. Traversed up to requested depth
3    f ← F.pop()  # fanout duplication factor (i.e. breadth) at this depth
4    sample_bias ← BiasFn(T, u)
5    if sample_bias.sum() = 0 then return  # Special case. No sampling from zero mass
6    sample_bias ← sample_bias / sample_bias.sum()  # valid distribution
7    K ← Sample(Ab[u, :δu], sample_bias, f)  # Sample f nodes from u's neighbors
8    for k ← 0 to f − 1 do
9      Tnext ← concatenate(T, [u])
10     AccumulateFn(Tnext, K[k], f)
11     Traverse(Tnext, K[k], F, AccumulateFn, BiasFn)  # Recursion on remaining fanouts
12
13 def Sample(N, W, f):
14   C ← tf.cumsum(W)  # Cumulative sum. Last entry must = 1.
15   coin_flips ← tf.random.uniform((f,), 0, 1)
16   indices ← tf.searchsorted(C, coin_flips)
17   return N[indices]
AccumulateFn is a function that is used to track necessary information for computing the model
and the objective function. For instance, an implementation of DeepWalk [67] on top of GTTF specializes
AccumulateFn to measure an estimate of the sampled softmax likelihood of the nodes' positional distribution,
modeled as a dot-product of node embeddings. On the other hand, GCN [40] on top of GTTF uses it to
accumulate a sampled adjacency matrix, which it passes to the underlying model (e.g. 2-layer GCN) as if
this were the full adjacency matrix.
BiasFn is a function that customizes the sampling procedure for the stochastic transitions. If provided,
it must yield a probability distribution over nodes, given the current node and the path that led to it. If not
provided, it defaults to U, transitioning to any neighbor with equal probability. It can be defined to read
edge weights, if they denote importance, or, more intricately, to parameterize a second-order Markov
chain [28] or to use neighborhood attention to guide sampling [79].
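To give a feel for the functional interface, here is a small illustrative Python sketch, loosely following Algorithm 4 rather than the released GTTF API, that runs the traversal with a trivial AccumulateFn recording visited (parent, child) pairs and the default uniform BiasFn; all names are assumptions.

import numpy as np

def traverse(compact_adj, degree, u, F, accumulate_fn, bias_fn=None, T=()):
    if not F:
        return                                         # reached requested depth
    f, rest = F[0], F[1:]
    neighbors = compact_adj[u, :degree[u]]
    bias = np.ones(len(neighbors)) if bias_fn is None else bias_fn(T, u)
    if bias.sum() == 0:
        return                                         # no sampling from zero mass
    bias = bias / bias.sum()
    K = np.random.choice(neighbors, size=f, p=bias)    # sample f neighbors of u
    for k in K:
        T_next = T + (u,)
        accumulate_fn(T_next, k, f)
        traverse(compact_adj, degree, k, rest, accumulate_fn, bias_fn, T_next)

# Example AccumulateFn: record every (parent, child) edge visited by the walk forest.
visited_edges = []
def record_edge(T, u, f):
    visited_edges.append((T[-1], u))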
4.3.3 Some Specializations of AccumulateFn & BiasFn
4.3.3.1 Message Passing: Graph Convolutional variants
These methods, including [40, 30, 89, 2, 90], can be approximated by initializing Ã to an empty sparse
n × n matrix, then invoking Traverse (Algorithm 4) with u = B and F set to a list of fanouts of size h. Thus
AccumulateFn and BiasFn become:

def RootedAdjAcc(T, u, f): Ã[u, T_{−1}] ← 1;    (4.1)
def NoRevisitBias(T, u): return 1[Ã[u].sum() = 0] · (1⃗_{δ_u} / δ_u);    (4.2)

where 1⃗_n is an n-dimensional all-ones vector, and negative indexing T_{−k} denotes the k-th last entry of T. If
a node has been visited through the stochastic traversal, then it already has a fanout number of neighbors,
and NoRevisitBias ensures it does not get revisited, for efficiency, per line 5 of Algorithm 4. Afterwards,
the accumulated stochastic Ã will be fed‡ into the underlying model, e.g. for the 2-layer GCN of [40]:

GCN(Ã, X; W1, W2) = softmax(Å ReLU(Å X W1) W2);    (4.3)

with Å = D′^{1/2} D̃′^{−1} Ã′ D′^{−1/2};  D′ = diag(δ′);  δ′ = 1⃗_n^⊤ A′;  and Ã′ = I_{n×n} + Ã (the renorm trick).

Lastly, h should be set to the receptive field required by the model for obtaining output d_L-dimensional
features at the labeled node batch; in particular, to the number of GC layers multiplied by the number of
hops each layer accesses (e.g., hops = 1 for GCN, but customizable for MixHop and SimpleGCN).
‡Before feeding the batch to the model, in practice, we find nodes not reached by the traversal and remove their corresponding rows
(and also columns) from X (and A).
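As an illustrative sketch (not the official GTTF code) of these two specializations, reusing the toy traverse helper above with NumPy; the graph size, array names, and closure layout are assumptions.

import numpy as np

n = 5                                   # number of nodes in a toy graph
A_tilde = np.zeros((n, n))              # accumulated rooted adjacency (Eq. 4.1)

def rooted_adj_acc(T, u, f):
    # Mark Ã[u, T_{-1}] = 1: an entry in the arrived node's row for the node it came from.
    A_tilde[u, T[-1]] = 1

def make_no_revisit_bias(degree):
    def no_revisit_bias(T, u):
        # Uniform over u's neighbors, but zero mass if u was already visited (Eq. 4.2).
        if A_tilde[u].sum() > 0:
            return np.zeros(degree[u])
        return np.ones(degree[u]) / degree[u]
    return no_revisit_bias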
4.3.3.2 Node Embeddings
Given a batch of nodes B ⊆ V, DeepWalk can be implemented in GTTF by first initializing the loss L to the
contrastive term estimating the partition function of the log-softmax:

L ← Σ_{u∈B} log E_{v∼Pn(V)} [exp(⟨Z_u, Z_v⟩)],    (4.4)

where ⟨·, ·⟩ is dot-product notation and Z ∈ R^{n×d} is the trainable embedding matrix, with Z_u ∈ R^d the d-dimensional embedding of node u ∈ V. In our experiments, we estimate the expectation by taking 5
samples, and we set the negative distribution Pn(V = v) ∝ δ_v^{3/4}, following [57].
The functional is invoked with no BiasFn and AccumulateFn =

def DeepWalkAcc(T, u, f): L ← L − ⟨Z_u, Σ_{k=1}^{C_T} η[T_{−k}] · ((C − k + 1)/C) · Z_{T_{−k}}⟩;  η[u] ← η[T_{−1}] / f;    (4.5)

where the hyperparameter C indicates the maximum window size [inherited from word2vec, 57], the summation over k does not access invalid entries of T since C_T ≜ min(C, T.size), the scalar fraction (C − k + 1)/C is
inherited from the context sampling of word2vec [Section 3.1 in 45] and rederived for graph contexts by [3],
and η[u] stores a scalar per node on the traversal walk forest, which defaults to 1 for non-initialized entries
and is used as a correction term. DeepWalk conducts random walks (visualized as a straight line) whereas
our walk tree has a branching factor of f. Setting fanout f = 1 recovers DeepWalk's simulation, though
we found that f > 1 outperforms within fewer iterations: e.g. f = 5, within 1 epoch, outperforms DeepWalk's
published implementation. Learning can be performed using the accumulated L as: Z ← Z − ϵ∇_Z L.
4.4 Theoretical Analysis
This section presents several theoretical results pertaining to the unbiased nature of algorithms implemented using GTTF, as well as an analysis of their run-time and space complexity.
4.4.1 Estimating the k-th power of the transition matrix
We show that it is possible with GTTF to accumulate an estimate of the transition matrix T raised to the power k. Let Ω
denote the walk forest generated by GTTF, Ω(u, k, i) the i-th node in the vector of nodes at depth k of the
walk tree rooted at u ∈ B, and t^{u,v,k}_i the indicator random variable 1[Ω(u, k, i) = v]. Let the estimate
of the k-th power of the transition matrix be denoted T̂^k. Entry T̂^k_{u,v} should be an unbiased estimate of T^k_{u,v}
for u ∈ B, with controllable variance. We define:

T̂^k_{u,v} = (Σ_{i=1}^{f^k} t^{u,v,k}_i) / f^k    (4.6)

The fraction in Equation 4.6 counts the number of times a walker starting at u visits v in Ω, divided
by the total number of nodes visited at the k-th step from u.
Proposition 1. (UnbiasedTk) T̂^k_{u,v}, as defined in Equation 4.6, is an unbiased estimator of T^k_{u,v}.

Proof. E[T̂^k_{u,v}] = E[(Σ_{i=1}^{f^k} t^{u,v,k}_i) / f^k] = (Σ_{i=1}^{f^k} E[t^{u,v,k}_i]) / f^k = (Σ_{i=1}^{f^k} P[t^{u,v,k}_i = 1]) / f^k = (Σ_{i=1}^{f^k} T^k_{u,v}) / f^k = T^k_{u,v}.
Proposition 2. (VarianceTk) The variance of our estimate is upper-bounded: Var[T̂^k_{u,v}] ≤ 1 / (4 f^k).

Proof. Var[T̂^k_{u,v}] = (Σ_{i=1}^{f^k} Var[t^{u,v,k}_i]) / f^{2k} = f^k T^k_{u,v} (1 − T^k_{u,v}) / f^{2k} = T^k_{u,v} (1 − T^k_{u,v}) / f^k. Since 0 ≤ T^k_{u,v} ≤ 1, the quantity T^k_{u,v}(1 − T^k_{u,v}) is maximized at T^k_{u,v} = 1/2. Hence Var[T̂^k_{u,v}] ≤ 1 / (4 f^k).
The k-th power of the transition matrix can be computed naively via repeated sparse matrix-vector multiplication. Specifically, each column of T^k can be computed in O(mk), where m is the number of edges in the graph. Thus, computing T^k in its entirety can be accomplished in O(nmk). However, this can still become prohibitively expensive if the graph grows beyond a certain size. GTTF, on the other hand, can estimate T^k in time complexity independent of the size of the graph (Prop. 8), with low variance. Transition matrix powers are useful for many GRL methods [68].
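To make Equation 4.6 concrete, the following hedged NumPy sketch estimates T̂^k for the batch nodes by counting where walkers land in a walk forest built with the toy helpers sketched earlier; it is illustrative code, not the GTTF release, and assumes uniform transitions.

import numpy as np

def estimate_transition_power(compact_adj, degree, batch, k, f, n, rng=np.random.default_rng()):
    """Monte-Carlo estimate of T^k restricted to the rows in `batch` (Eq. 4.6):
    grow an f-ary walk tree of depth k from each seed and count where the walkers land."""
    T_hat = np.zeros((len(batch), n))
    for row, u in enumerate(batch):
        frontier = np.array([u])
        for _ in range(k):
            # Each walker replicates f times and steps to a uniformly sampled neighbor.
            frontier = np.repeat(frontier, f)
            R = rng.random(len(frontier))
            cols = np.floor(R * degree[frontier]).astype(np.int64)
            frontier = compact_adj[frontier, cols]
        counts = np.bincount(frontier, minlength=n)
        T_hat[row] = counts / (f ** k)      # f^k walkers reach depth k
    return T_hat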
4.4.2 Unbiased Learning
As a consequence of Propositions 1 and 2, GTTF enables unbiased learning with variance control for classes
of node embedding methods, and provides a convergence guarantee for graph convolution models under
certain simplifying assumptions.
We start by analyzing node embedding methods. Specifically, we cover two general types. The first is
based on matrix factorization of the power-series of the graph transition matrix and the second is based
on cross-entropy objectives, like DeepWalk [67], node2vec [28] and their variants.
Proposition 3. (UnbiasedTFactorization) Suppose ℒ = (1/2) ||LR − Σ_k c_k T^k||²_F, i.e. a factorization objective that can be optimized by gradient descent by calculating ∇_{L,R} ℒ, where the c_k's are scalar coefficients. Let its estimate be ℒ̂ = (1/2) ||LR − Σ_k c_k T̂^k||²_F, where T̂ is obtained by GTTF according to Equation 4.6. Then E[∇_{L,R} ℒ̂] = ∇_{L,R} ℒ.
Proof. Consider a d-dimensional factorization of Σ_k c_k T^k, where the c_k's are scalar coefficients:

ℒ = (1/2) ||LR − Σ_k c_k T^k||²_F,    (4.7)

parametrized by L, R^⊤ ∈ R^{n×d}. The gradients of ℒ w.r.t. the parameters are:

∇_L ℒ = (LR^⊤ − Σ_k c_k T^k) R^⊤  and  ∇_R ℒ = L^⊤ (LR^⊤ − Σ_k c_k T^k).    (4.8)

Given the estimated objective ℒ̂ (replacing T^k with the GTTF-estimated T̂^k):

ℒ̂ = (1/2) ||LR − Σ_k c_k T̂^k||²_F.    (4.9)

It follows that:

E[∇_L ℒ̂] = E[(LR^⊤ − Σ_k c_k T̂^k) R^⊤]
= E[LR^⊤ − Σ_k c_k T̂^k] R^⊤    (scaling property of expectation)
= (LR^⊤ − Σ_k c_k E[T̂^k]) R^⊤    (linearity of expectation)
= (LR^⊤ − Σ_k c_k T^k) R^⊤    (proof of Proposition 1)
= ∇_L ℒ

The above steps can similarly be used to show E[∇_R ℒ̂] = ∇_R ℒ.
Proposition 4. (UnbiasedLearnNE) When learning node embeddings Z ∈ R^{n×d} with an objective function L decomposable as L(Z) = Σ_{u∈V} L1(Z, u) − Σ_{u,v∈V} Σ_k L2(T^k, u, v) L3(Z, u, v), where L2 is linear over T^k, using T̂^k yields an unbiased estimate of ∇_Z L.

Proof. We want to show that E[∇_Z L(T̂^k, Z)] = ∇_Z L(T^k, Z). Since the terms of L1 are unaffected by T̂, they are excluded w.l.o.g. from L in the proof.

E[∇_Z L(T̂^k, Z)] = E[∇_Z (− Σ_{u,v∈V} Σ_{k∈{1..C}} L2(T̂^k, u, v) L3(Z, u, v))]
(by linearity of expectation) = −∇_Z Σ_{u,v∈V} Σ_{k∈{1..C}} L2(E[T̂^k], u, v) L3(Z, u, v)
(by Prop. 1) = −∇_Z Σ_{u,v∈V} Σ_{k∈{1..C}} L2(T^k, u, v) L3(Z, u, v) = ∇_Z L(T^k, Z)

Generally, L1 (and L3) score the similarity between disconnected (and connected) nodes u and v. The above form of L covers a family of contrastive learning objectives that use a cross-entropy loss and assume logistic or (sampled-)softmax distributions.
Proposition 5. (UnbiasedMP) Given input activations H^{(l−1)}, graph conv layer l can use the rooted adjacency Ã accumulated by RootedAdjAcc (4.1) to provide an unbiased pre-activation output, i.e. E[Å^k H^{(l−1)} W^{(l)}] = (D′^{−1/2} A′ D′^{−1/2})^k H^{(l−1)} W^{(l)}, with A′ and D′ defined in (4.3).

Proof. Let Ã be the neighborhood patch returned by GTTF, and let a tilde indicate a measurement based on the sampled graph Ã, such as the degree vector δ̃ or the diagonal degree matrix D̃. For the remainder of this proof, let all notation for adjacency matrices, A or Ã, diagonal degree matrices, D or D̃, and the degree vector δ refer to the corresponding measure on the graph with self loops, e.g. A ← A + I_{n×n}. We now show that the expectation of the layer output is unbiased.

E[Å^k H^{(l−1)} W^{(l)}] = E[Å^k] H^{(l−1)} W^{(l)} implies that E[Å^k H^{(l−1)} W^{(l)}] is unbiased if E[Å^k] = (D^{−1/2} A D^{−1/2})^k.

E[Å^k] = E[D^{1/2} (D̃^{−1} Ã)^k D^{−1/2}] = D^{1/2} E[(D̃^{−1} Ã)^k] D^{−1/2}

Let P^{u,v,k} be the set of all walks {p = (u, v_1, ..., v_{k−1}, v) | v_i ∈ V}, and let p∃Ã indicate that the path p exists in the graph given by Ã. Let t^{u,v,k} be the transition probability from u to v in k steps, and let t^p be the probability of a random walker traversing the graph along path p.

E[(D̃^{−1} Ã)^k]_{u,v} = E[T̃^k_{u,v}] = Pr[t̃^{u,v,k} = 1] = Σ_{p∈P^{u,v,k}} Pr[p∃Ã] Pr[t̃^p = 1 | p∃Ã]
= Σ_{p∈P^{u,v,k}} (Π_{i=1}^{k} 1[A[p_i, p_{i+1}] = 1] (f + 1) / δ[p_i]) (Π_{i=1}^{k} (f + 1)^{−1})
= Σ_{p∈P^{u,v,k}} Π_{i=1}^{k} 1[A[p_i, p_{i+1}] = 1] δ[p_i]^{−1}
= Σ_{p∈P^{u,v,k}} Pr[t^p = 1] = Pr[t^{u,v,k}] = (T^k)_{u,v} = (D^{−1} A)^k_{u,v}

Thus, E[Å^k] = (D^{−1/2} A D^{−1/2})^k and E[Å^k H^{(l−1)} W^{(l)}] = (D^{−1/2} A D^{−1/2})^k H^{(l−1)} W^{(l)}.

For this writing, we assumed nodes have degree δ_u ≥ f, though the proof still holds if that is not the case, as the probability of an outgoing edge being present from u becomes 1 and the transition probability becomes δ_u^{−1}, i.e. the same as no estimate at all.
Proposition 6. (UnbiasedLearnMP) If the objective of a graph convolution model is convex and Lipschitz continuous, with minimizer θ*, then training the graph convolution with GTTF converges to θ*.
Proof. GTTF can be seen as a way of applying dropout [76], and our proof is contingent on the convergence of dropout, which is shown in [7]. Our dropout is on the adjacency rather than on the features. Denote the output of a graph convolution network§ with H:

\[ H = \mathrm{GCN}_X(A; W) = T X W \]

We restrict our analysis to GCNs with linear activations. We are interested in quantifying the change of H as A changes, and therefore the fixed (always visible) feature matrix X is placed on the subscript. Let Ã denote the adjacency accumulated by GTTF's RootedAdjAcc (Eq. 4.1), and let

\[ \tilde{H}_c = \mathrm{GCN}_X(\tilde{A}_c). \]
§The following definition averages the node features (uses non-symmetric normalization) and appears in multiple GCNs, including [30].
Let A = {Ã_c}_{c=1}^{|A|} denote the (countable) set of all adjacency matrices realizable by GTTF. For the analysis, assume the graph is α-regular; the assumption eases the notation though it is not needed. Therefore, the degree δ_u = α for all u ∈ V. Our analysis depends¶ on

\[ \frac{1}{|A|} \sum_{\tilde{A} \in A} \tilde{A} \;\propto\; A, \]

i.e. the average matrix realizable by GTTF is proportional (entry-wise) to the full adjacency. This can be shown by considering one row at a time: given node u with δ_u = α outgoing neighbors, each of its neighbors has the same appearance probability 1/δ_u. Summing over all \binom{δ_u}{f} combinations makes each edge appear with the same frequency |A|/δ_u, noting that \binom{δ_u}{f} evenly divides |A| for all u ∈ V.
We define a dropout module:

\[ \hat{A} = \sum_{c}^{|A|} z_c \tilde{A}_c \quad\text{with}\quad z \sim \mathrm{Categorical}\Big(\underbrace{\tfrac{1}{|A|}, \tfrac{1}{|A|}, \dots, \tfrac{1}{|A|}}_{|A|\ \text{of them}}\Big), \tag{4.10} \]

where z_c acts as a Multinoulli selector over the elements of A, with one of its entries set to 1 and all others to zero. With this definition, GCNs can be seen in the dropout framework as H̃ = GCN_X(Â). Nonetheless, in order to inherit the analysis of [7, see their equations 140 & 141], we need to satisfy the two conditions upon which their analysis is founded:

(i) E[GCN_X(Â)] = GCN_X(A): in the usual (feature-wise) dropout, such a condition is easily verified.

(ii) The backpropagated error signal does not vary too much around the mean, across all realizations of Â.
¶If not α-regular, it would be (1/|A|) Σ_{Ã∈A} Ã ∝ D^{−1}A.
Condition (i) is satisfied due to the proof of Proposition 5. To analyze the error signal, i.e. the gradient of the error w.r.t. the network, assume the loss function L(H), which outputs a scalar loss, is λ-Lipschitz continuous. The Lipschitz continuity allows us to bound the difference in error signal between L(H) and L(H̃):

\[
\begin{aligned}
\|\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\|_2^2
  &\overset{(a)}{\leq} \lambda\, \big(\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\big)^\top (H - \tilde{H}) && (4.11) \\
  &\overset{(b)}{\leq} \lambda\, \|\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\|_2\, \|H - \tilde{H}\|_2 && (4.12) \\
  &\overset{\text{w.p.} \geq 1 - \frac{1}{Q^2}}{\leq} \lambda\, \|\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\|_2\, W^\top X^\top Q \sqrt{\mathrm{Var}[T]}\, X W && (4.13) \\
  &= \frac{\lambda Q}{2\sqrt{f}}\, \|\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\|_2\, \|W\|_1^2\, \|X\|_1^2 && (4.14) \\
\|\nabla_H \mathcal{L}(H) - \nabla_H \mathcal{L}(\tilde{H})\|_2
  &\leq \frac{\lambda Q}{2\sqrt{f}}\, \|W\|_1^2\, \|X\|_1^2 && (4.15)
\end{aligned}
\]

where (a) is by Lipschitz continuity, (b) is by the Cauchy–Schwarz inequality, "w.p." means with probability and uses Chebyshev's inequality, and the following equality holds because the variance of T is given element-wise in the proof of Prop. 2. Finally, we get the last line by dividing both sides by the common term. This shows that one can make the error signal for the different realizations arbitrarily small, for example by choosing a larger fanout value or by putting (convex) norm constraints on W and X, e.g. through batchnorm and/or weightnorm. Since we can then have ∇_H L(H) ≈ ∇_H L(H̃_1) ≈ ∇_H L(H̃_2) ≈ · · · ≈ ∇_H L(H̃_{|A|}), condition (ii) is satisfied and the convergence analysis of [7] carries over.
4.4.3 Complexity Analysis
Proposition 7. Storage complexity of GTTF is O(m + n).
Proof. The storage complexity of CompactAdj is O(sizeof(δ) + sizeof(Â)) = O(n + m).
@gttf.bias()
def uniform(adj_matrix: torch.Tensor, **kwargs) -> torch.Tensor:
    return gttf.make_uniform_dist(adj_matrix)

@gttf.accumulate()
def simple_graph_convolution(wf: WalkForest, **kwargs) -> torch.Tensor:
    node_features = kwargs["node_features"]
    num_layers = wf.depth()
    wf.initialize_embeddings(node_features)
    for wt in wf:
        for l in range(num_layers, 0, -1):
            wt[l - 1].embeddings = torch.mean(wt[l].embeddings, dim=0)
    return wt[0].embeddings

simpleGCN = gttf.run(G, fanout=k, depth=d)

Listing 4.1: Implementation of SimpleGCN using GTTF
Moreover, for extremely large graphs, the adjacency can be row-wise partitioned across multiple machines, thereby admitting linear scaling. However, we acknowledge that choosing which rows to assign to which machines can drastically affect performance. Balanced partitioning is ideal; it is an NP-hard problem, but many approximations have been proposed. Nonetheless, reducing inter-machine communication when distributing the data structure across machines is outside our scope.
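The O(n + m) bound is easy to see from a CSR-style layout. The sketch below is a hypothetical illustration of such a compact adjacency (the names CompactAdjSketch, degrees, and neighbors are mine, not the library's): a degree vector of length n plus a flattened neighbor array of length m, which is also the natural unit for the row-wise partitioning discussed above.

import torch

class CompactAdjSketch:
    """Hypothetical CSR-style compact adjacency: O(n) degrees + O(m) flattened neighbors."""

    def __init__(self, edge_list, num_nodes):
        # edge_list: iterable of (src, dst) pairs of a directed graph.
        buckets = [[] for _ in range(num_nodes)]
        for src, dst in edge_list:
            buckets[src].append(dst)
        self.degrees = torch.tensor([len(b) for b in buckets])           # delta, length n
        self.offsets = torch.cat([torch.zeros(1, dtype=torch.long),
                                  self.degrees.cumsum(0)])               # row pointers, length n+1
        self.neighbors = torch.tensor([v for b in buckets for v in b])   # flattened, length m

    def row(self, u):
        # Neighbors of u; a row-wise machine partition would own a contiguous slice of `neighbors`.
        return self.neighbors[self.offsets[u]:self.offsets[u + 1]]

# Usage on a toy graph:
adj = CompactAdjSketch([(0, 1), (0, 2), (1, 2), (2, 0)], num_nodes=3)
print(adj.degrees, adj.row(0))   # tensor([2, 1, 1]) tensor([1, 2])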
Proposition 8. The time complexity of GTTF is O(b f^h) for batch size b, fanout f, and depth h.

Proof. For each step of GTTF, the computational complexity is O(b f^h). This follows directly from the GTTF functional: each node in the batch (b of them) builds a tree with depth h and fanout f, i.e. with O(f^h) tree nodes. This calculation assumes random number generation, AccumulateFn, and BiasFn take constant time. The searchsorted function is inexpensive, as it performs a binary search on a sorted list: the cumulative sum of probabilities.
Proposition 8 implies that the speed of computation is independent of the graph size. Methods implemented in GTTF inherit this advantage. For instance, the node embedding algorithm WYS [3] is O(n^3); however, we apply its GTTF implementation on large graphs.
4.5 Example code for instantiating GTTF
Before we move on to the experimental evaluation of GTTF, we present a sample code snippet showing how to specify a Bias and an Accumulation function to instantiate GTTF as Simple Graph Convolution [89]. Our reasons for doing so are twofold: (1) the code snippet illustrates the simplicity of GTTF by showing that it only takes a few lines of code to completely instantiate a graph convolution model; (2) we will refer to this snippet in the implementation details in Chapter 5 when choosing a Bias and Accumulation function for Biased User History Synthesis. The snippet is shown in Listing 4.1, using the PyTorch [65] GTTF backend∥. The bias function, annotated with @gttf.bias, accepts the graph adjacency matrix as its only mandatory input and outputs a probability distribution matrix for each node over its neighborhood. The accumulation function, annotated with @gttf.accumulate, accepts the WalkForest generated through the procedure outlined in section 4.3.2 and returns a value of any type. In the snippet shown to instantiate GTTF as SimpleGCN, the output of the accumulation function is a tensor containing the averaged, sampled neighborhood embeddings for every node in the batch. Once the bias and accumulation functions have been implemented and bound to gttf through the decorators shown, gttf.run starts the process of sampling and running the accumulation function on the sampled WalkForests.
4.6 Experiments
We conduct experiments on 10 different graph datasets, listed in Table 4.1. We experimentally demonstrate the following. (1) Re-implementing baseline methods using GTTF maintains their performance. (2) Previously-unscalable methods can be made scalable when implemented in GTTF. (3) GTTF achieves good empirical performance when compared to other sampling-based approaches hand-designed for Message Passing. (4) GTTF consumes less memory and trains faster than other popular software frameworks for GRL. To replicate our experimental results, for each cell of the tables we provide one shell script in our code repository to produce the metric, except when we indicate that the metric is copied from another paper. Unless otherwise stated, we used a fanout factor of 3 for GTTF implementations.

∥GTTF also has a TensorFlow backend.
4.6.1 Node Embeddings for Link Prediction
In link prediction tasks, a graph is partially obstructed by hiding a portion of its edges. The task is to recover the hidden edges. We follow a popular approach to tackle this task: first learn node embeddings Z ∈ R^{n×d} from the observed graph, then predict the link between nodes u and v with a score ∝ Z_u^⊤ Z_v. We use two ranking metrics for evaluation: ROC-AUC, which measures how well methods rank the hidden edges above randomly-sampled negative edges, and Mean Rank.
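As a concrete illustration of this evaluation protocol, the minimal sketch below (hypothetical; it assumes learned embeddings Z and held-out positive/negative node pairs are already available, and it uses roc_auc_score from scikit-learn) scores candidate edges with the dot product Z_u^⊤ Z_v and computes ROC-AUC. With random embeddings, as here, the score is near 0.5; a trained model should score much higher.

import torch
from sklearn.metrics import roc_auc_score

def link_scores(Z, pairs):
    # Dot-product link score for each (u, v) pair: Z_u^T Z_v.
    u, v = pairs[:, 0], pairs[:, 1]
    return (Z[u] * Z[v]).sum(dim=1)

# Toy setup (illustrative only): random embeddings and random held-out edge splits.
torch.manual_seed(0)
Z = torch.randn(100, 16)                                  # n = 100 nodes, d = 16
pos_pairs = torch.randint(0, 100, (500, 2))               # hidden (positive) edges
neg_pairs = torch.randint(0, 100, (500, 2))               # randomly-sampled negative edges

scores = torch.cat([link_scores(Z, pos_pairs), link_scores(Z, neg_pairs)])
labels = torch.cat([torch.ones(500), torch.zeros(500)])
print("ROC-AUC:", roc_auc_score(labels.numpy(), scores.numpy()))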
We re-implement Node Embedding methods, DeepWalk [67] and WYS [3], into GTTF (abbreviated F).
Table 4.2 summarizes link prediction test performance.
LiveJournal and Reddit are large datasets to which the original implementation of WYS is unable to scale. However, the scalable F(WYS) sets a new state-of-the-art on these datasets. For the PPI and HepTh datasets, we copy accuracy numbers for DeepWalk and WYS from [3]. For LiveJournal, we copy accuracy numbers for DeepWalk and PBG from [43] – note that a well-engineered approach (PBG, [43]), using a mapreduce-like framework, underperforms compared to F(WYS), which is a few-line specialization of GTTF.
4.6.2 Message Passing for Node Classification
We implement in GTTF the message passing models: GCN [40], GraphSAGE [30], MixHop [2], SimpleGCN
[89], as their computation is straight-forward. For GAT [79] and GCNII [13], as they are more intricate,
we download the authors’ codes, and wrap them as-is with our functional.
We show that we are able to run these models in Table 4.3 (left and middle), and that the GTTF implementations match the baselines' performance. For the left table, we copy numbers from the published papers. However, we update GAT to work with TensorFlow 2.0 and use our updated code (GAT*).
Table 4.1: Dataset summary. Tasks are LP, SSC, FSC, for link prediction, semi- and fully-supervised classification. Split indicates the train/validate/test partitioning, with (a) = [3], (b) = to be released, (c) = [30], (d) = [92], (e) = [37].
Dataset Split # Nodes # Edges # Classes Nodes Edges Tasks
PPI (a) 3,852 20,881 N/A proteins interaction LP
ca-HepTh (a) 80,638 24,827 N/A researchers co-authorship LP
ca-AstroPh (a) 17,903 197,031 N/A researchers co-authorship LP
LiveJournal (b) 4.85M 68.99M N/A users friendship LP
Reddit (c) 233,965 11.60M 41 posts user co-comment LP/FSC
Amazon (b) 2.6M 48.33M 31 products co-purchased FSC
Cora (d) 2,708 5,429 7 articles citation SSC
CiteSeer (d) 3,327 4,732 6 articles citation SSC
PubMed (d) 19,717 44,338 3 articles citation SSC
Products (e) 2.45M 61.86M 47 products co-purchased SSC
Table 4.2: Results of node embeddings on Link Prediction. Left: Test ROC-AUC scores. Right: Mean Rank, for consistency with PyTorch-BigGraph. *OOM = Out of Memory.
PPI HepTh Reddit
DeepWalk 70.6 91.8 93.5
F(DeepWalk) 87.9 89.9 95.5
WYS 89.8 93.6 OOM
F(WYS) 90.5 93.5 98.6
LiveJournal
DeepWalk 234.6
PBG 245.9
WYS OOM*
F(WYS) 185.6
Table 4.3: Node classification tasks. Left: test accuracy scores on semi-supervised classification (SSC)
of citation networks. Middle: test micro-F1 scores for large fully-supervised classification. Right: test
accuracy on an SSC task, showing only scalable baselines. We bold the highest value per column.
Cora Citeseer Pubmed
GCN 81.5 70.3 79.0
F(GCN) 81.9 69.8 79.4
MixHop 81.9 71.4 80.8
F(MixHop) 83.1 71.8 80.9
GAT* 83.2 72.4 77.7
F(GAT) 83.3 72.5 77.8
GCNII 85.5 73.4 80.3
F(GCNII) 85.3 74.4 80.2
Reddit Amazon
SAGE 95.0 88.3
F(SAGE) 95.9 88.5
SimpGCN 94.9 83.4
F(SimpGCN) 94.8 83.8
Products
node2vec 72.1
ClusterGCN 75.2
GraphSAINT 77.3
F(SAGE) 77.0
4.6.3 Experiments comparing against Sampling methods for Message Passing
We now compare models trained with GTTF (where samples are walk forests) against sampling methods that are specifically designed for Message Passing algorithms (GraphSAINT and ClusterGCN), particularly since their sampling strategies do not match ours.
Table 4.4: Performance of GTTF against frameworks DGL and PyG. Left: Speed is the per epoch time in
seconds when training GraphSAGE. Memory is the memory in GB used when training GCN. All experiments conducted using an AMD Ryzen 3 1200 Quad-Core CPU and an Nvidia GTX 1080Ti GPU. Right:
Training curve for GTTF and PyG implementations of Node2Vec.
Speed (s) Memory (GB)
Reddit Products Reddit Cora Citeseer Pubmed
DGL 17.3 13.4 OOM 1.1 1.1 1.1
PyG 5.8 9.2 OOM 1.2 1.3 1.6
GTTF 1.8 1.4 2.44 0.32 0.40 0.43
[Training-curve plot: link-prediction ROC-AUC versus training time (s) for F(node2vec) and PyG(node2vec).]
Table 4.3 (right) shows test node classification accuracy on a large dataset: Products. We calculate the accuracy for F(SAGE), but copy from [37] the accuracy for the baselines: GraphSAINT [95] and ClusterGCN [16] (both message passing methods), and also node2vec [28] (a node embedding method).
4.6.4 Runtime and Memory comparison against optimized Software Frameworks
In addition to the accuracy metrics discussed above, we also care about computational performance. We
compare against software frameworks DGL [85] and PyG [23]. These software frameworks offer implementations of many methods. Table 4.4 summarizes the following. First (left), we show time-per-epoch
on large graphs of their implementation of GraphSAGE, compared with GTTF's, where we set all hyperparameters to be the same (model architecture and number of neighbors at the message passing layers).
Second (middle), we run their GCN implementation on small datasets (Cora, Citeseer, Pubmed) to show
peak memory usage. The run times between GTTF, PyG and DGL are similar for these datasets. The
comparison can be found in the Appendix. While the aforementioned two comparisons are on popular
message passing methods, the third (right) chart shows a popular node embedding method: node2vec’s
link prediction test ROC-AUC in relation to its training runtime.
4.7 Discussion
Transductive → Inductive Learning: From a practical perspective, GTTF can convert Graph Neural Network models trained on top of the (full) graph (i.e., transductive) into models that are trained on
many sampled subgraphs (i.e., walk forests). This conversion provides an implementation (programming)
benefit, but more importantly, a generalization benefit. These two benefits allow the model to operate on
test graphs differing from the training graph, as the training itself was conducted on (exponentially) many
subgraphs.
Data augmentation has been shown to improve (test) generalization performance in various domains
including Computer Vision and Natural Language Processing [75]. Analogously, training on many subgraphs (sampled as walk forests) is a form of data augmentation that could favor the training algorithm of
GTTF over the vanilla (entire-graph) alternative, in regimes with few labeled data samples.
Finally, our analysis on unbiased learning for message passing models (Propositions 5 & 6) was
limited to linear graph neural networks with convex objectives. While this setting is mathematically easier
to analyze, it may not hold in practice. For instance, the strong model of GCNII is composed of a deep
network (e.g., 32+ layers), yielding a non-convex objective. Nonetheless, our experimental results show
that wrapping deep GCNII models with the GTTF functional maintains performance on smaller graphs yet scales up GCNII to arbitrarily large graphs.
Chapter 5
Biased User History Synthesis for Personalized Long-tail Item
Recommendation
Having established the groundwork for efficient stochastic learning on graphs with the introduction of
GTTF in the preceding chapter, we are now poised to unveil our solution aimed at addressing the long-tail
item recommendation problem in internet-scale recommendation systems.
5.1 Introduction
In this chapter we present a novel training algorithm called Biased User History Synthesis to achieve two
main objectives: (1) Alleviating the long-tail item problem, as discussed earlier, and (2) Enhancing personalization. Although these objectives may appear distinct, we demonstrate that they are complementary.
Our approach utilizes an effective sampling strategy that addresses both challenges simultaneously. The
core idea is to augment user representations with a learnt representation of a sample of the user’s interaction history. However, we tackle the issue of sample bias towards head items by generating samples with a
bias towards tail items in the user’s history. This technique serves two purposes in the context of our objectives. First, it increases the visibility of tail items, mitigating the model’s tendency to overfit on head items
resulting in better representations for tail items. Second, it results in more unique and personalized user
representations, as tail items provide more information about a user’s specific interests compared to head
Figure 5.1: Long Tail Distribution of MovieLens-1M
items which demonstrate popular, commonly held interests. As a result, our approach not only addresses
the long-tail item problem but also concurrently improves the recommendation of both tail and head items
through enhanced personalization. To the best of our knowledge, our approach is the first to highlight the
intrinsic relationship between tail items and improved personalization, while proposing a technique that
directly addresses both the long-tail item and personalization challenges in recommendation systems.
The contributions of this work are as follows:
• We propose a novel training algorithm, biased user history synthesis, to address the long-tail item
problem and achieve better personalization by augmenting user representations in recommendation systems. Our approach is built on (i) a tail item biased User Interaction History (UIH) sampling
strategy; and (ii) a synthesis model that produces the augmented user representation from the sampled user-item interaction history.
• In addition to motivating our idea intuitively, we use information theory to provide a theoretical
argument and justification for our proposed approach.
• Through extensive experimentation, we demonstrate that our model significantly outperforms state-of-the-art baselines on tail, head, and overall recommendation performance.
5.2 Related Work
In this section, we review literature on two important aspects of recommendation systems - improving
long-tail item recommendation and making recommendations more personalized.
5.2.1 Long-tail Recommendation
Attempts to address the long-tail item recommendation problem fall broadly into two categories: model-agnostic and model-specific.
5.2.1.1 Model-Agnostic
One promising approach to tackle the tail item recommendation problem is meta-learning, which is a
model-agnostic paradigm that has been extensively researched in this domain. One particular meta-learning algorithm that has shown great potential is Model-Agnostic Meta-Learning (MAML), as proposed
in [24]. Previous studies [24, 51] have adapted MAML to learn a set of global parameters and various personalized user parameters. For instance, in [42], personalized parameters are learned using a support set
of items, while global parameters are learned using the personalized parameters and optimized on a query
set of items. Nevertheless, MAML derivatives have been found to settle on local optima for some users
because of the use of a global set of parameters for task-specific parameter initialization [20]. To overcome
this issue, [20] introduced feature-specific memory to enable personalized parameter initialization.
The use of meta-learning in conjunction with curriculum learning to transfer knowledge has been
explored in several studies. For example, in [96], two knowledge transfer mechanisms were employed:
1) a meta-learning mechanism that transfers knowledge from a few-shot to a many-shot model, and 2) a
curriculum learning mechanism to smoothly transfer the mapping between head and tail items. However,
while [96] focuses on data augmentation at the model level, our work takes a different approach by incorporating data augmentation at the user level. This approach enables us to capture each user’s unique
interests, resulting in a more personalized recommendation system.
The tail item recommendation problem has also been tackled through the use of heterogeneous information networks (HINs) with meta-learning in [50]. This approach addresses the issue at both the model
and data levels. However, their focus is on the impact of using HINs to capture more diverse and complex
information, while our approach aims to enhance user representations through data augmentation. By
augmenting user-level data, we are able to capture unique user interests.
Several other strategies have been proposed to address the long-tail item recommendation problem.
For example, the techniques in [19, 56] adopt loss function correction policies, while [8] uses a non-meta-learning-based curriculum transfer learning strategy. The work in [101, 100] focuses on warming up the
ID embeddings. Specifically, [101] uses a scaling network to transform tail item embeddings into head
item feature space, while [100] relies on a conditional decoder to model the relationship between item ID
embedding and side information. Dropout [76] has been applied in [81] to condition the model for missing
input, which is equivalent to the tail item recommendation problem.
5.2.1.2 Model-Specific
In contrast, model-specific approaches, such as NeuHash [32], ncf [33], and ngcf [86], introduce new model
architectures that cannot be applied to any base recommendation model. For example, NeuHash generates
binary hash codes for users and items, while ngcf utilizes user-item graph connectivity and high-order
connectivity through embedding propagation layers. Traditional approaches, like ncf, tackle the tail item
recommendation problem by incorporating more content features and using an MLP to augment the final
layer.
5.2.2 Personalization in Recommendation Systems
Recommendation systems aim to personalize item recommendations to user interests. However, most
recommendation models implicitly learn user interests through fitting to user-item interaction data, and
additional effort is needed to ensure adequate personalization and generalization. Explicit modeling for
personalization can once again be classified into model-specific and model-agnostic approaches. Model-agnostic approaches seek to improve the personalization of a base recommendation model. For example, [66] views personalization as a learnable post-processing operation. Their model utilizes an attention-based re-ranking mechanism to personalize recommendation lists produced by any globally optimized,
upstream recommendation model. [46] proposes a personalized relevance-unexpectedness hybrid utility
function that can be incorporated into a base recommendation system.
Model-specific approaches propose entirely new models. For example, [48] proposes a deep generative model based on Wasserstein Auto-Encoders that learns to produce both point-wise feedback data as
well as pair-wise ranking lists that maximize the margin between interacted (positive) and non-interacted
(negative) items for each user. They theoretically show that their model achieves both additional personalization and generalizability. Following a different paradigm, [63] views the task of recommendation
through a stochastic lens, as random walks on an item model graph. They account for personalization by
assigning item-to-item transition probabilities uniquely for each user.
5.3 Base Recommendation Systems
Our strategy is model-agnostic and can be used to improve the personalization and long tail item recommendation of any base recommendation system. But for concreteness, we use the Two Tower Neural
Network architecture described in section 2.2.1.2 in chapter 2 as the base recommendation system on which
we demonstrate our approach. TTNNs have proven to be very effective at large-scale recommendation [93, 62].
Figure 5.2: Biased User History Synthesis built on top of a Two Tower Neural Network architecture. A user is usually connected to a distribution of head and tail items (blue box on the left). p1 = p2 = ... = pn if sampling without bias (uniform); pi is characterized by eq. 5.1 if sampling with bias.
Further, prior model-agnostic strategies use TTNNs as their base recommendation system [96, 42, 24]. Nonetheless, in addition to our main results on TTNNs, we include case studies on two additional models, DeepFM [29] and Wide and Deep [14], to demonstrate that our technique is indeed model-agnostic.
5.4 Biased User History Synthesis
In this section, we describe the details of our technique to enhance personalization and improve long tail
item recommendation. As mentioned in section 5.1, there are two critical components to our technique:
a tail item biased user-interaction history sampling strategy and a synthesis model that produces an augmented user representation using the sampled user history. Figure 5.2 illustrates our approach built on top
of a TTNN.
Algorithm 5: User History Synthesis (single user perspective)
Require: user representation x_u, sample of UIH H_u
Ensure: synthesized user history representation x_u^AUG
1: Y ← []
2: for item i_k in H_u do
3:     y^{i_k} ← representation from the item tower for item i_k
4:     Y.append(y^{i_k})
5: end for
6: x_u^AUG ← SYNTHESIS(x_u, Y)
7: return x_u^AUG
5.4.1 Tail item biased Sampling
We propose augmenting the user representation x_u with a representation of the user's interaction history before feeding it into the scoring function S. The hope here is that, by introducing a direct signal representing the user history, the augmented representation is more personal to the user than the representation x_u is, for the simple reason that the parameters in the user tower are typically globally optimized on all interactions. However, including entire histories is infeasible since the interaction histories of some users could be very long and could lead to very large computational and memory overhead [94]. Hence, we sample a fixed number of items for every user, with the size being a hyperparameter of the algorithm. Unfortunately, uniform sampling from a user's history will likely lead to the sample being composed of mostly head items and reinforce the already existent bias towards head items. Additionally, as we show in section 5.7, tail items in a user's history are more informative about the unique interests of that user than head items are. Thus, we bias our sampling procedure towards tail items (illustrated in Figure 5.2). For user u, the probability distribution over each item i in u's interaction history is characterized by the following softmax distribution:

\[ p_i = \mathrm{softmax}\Big( \frac{1}{T \cdot d_i} \Big) \tag{5.1} \]
where d_i is the degree of item i in G and T is the temperature parameter that controls the softmax distribution: the lower the value of T, the greater the bias toward tail items in the generated sample. To generate samples from this distribution, we run an instance of the GTTF meta-algorithm [52] with equation 5.1 as the bias function. We denote the sampled history for a user u by H_u = {i_0, i_1, . . . , i_n} ∼ p_i, where n is the fixed number of items being sampled.
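To make the effect of the temperature concrete, the sketch below (a hypothetical toy example; the degree values are made up) computes the distribution of eq. 5.1 over one user's history and draws a biased sample with torch.multinomial, which is one simple stand-in for the GTTF bias function used in our implementation.

import torch

# Toy interaction history: item degrees in G (one head item, three tail items).
degrees = torch.tensor([500.0, 40.0, 5.0, 2.0])

def tail_biased_probs(degrees, T):
    # Eq. 5.1: softmax over inverse degrees, scaled by temperature T.
    return torch.softmax(1.0 / (T * degrees), dim=0)

for T in (1.0, 0.1, 0.01):
    print(T, tail_biased_probs(degrees, T))
# As T decreases, probability mass shifts sharply toward the lowest-degree (tail) items.

# Draw a fixed-size biased sample H_u (with replacement).
probs = tail_biased_probs(degrees, T=0.1)
H_u = torch.multinomial(probs, num_samples=3, replacement=True)
print("sampled item indices:", H_u)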
5.4.2 Synthesis Models
Once items have been sampled from a user's interaction history (blue box in figure 5.2), they are used to augment the user representation x_u through a procedure we call user history synthesis (yellow box in figure 5.2). The user history synthesis procedure takes x_u and H_u as inputs and produces an augmented user representation x_u^AUG. Algorithm 5 outlines the procedure. The most important part of the procedure is the SYNTHESIS function, which is a learnable function. We examine three candidate choices for the SYNTHESIS function:
5.4.2.1 Mean Synthesis
The first candidate synthesis function is a simple element-wise mean of the vectors in Y. Specifically, the augmented representation is computed as follows using Mean Synthesis:

\[ x_u^{AUG} = \sigma\big(W \cdot \mathrm{CONCAT}\big(x_u,\ \mathrm{MEAN}(\{y^{i_k},\ \forall y^{i_k} \in Y\})\big) + b\big) \tag{5.2} \]

where CONCAT represents the row-wise concatenation operation, MEAN represents the element-wise mean operation, W and b are the learnable transformation matrix and bias vector of a fully-connected layer respectively, and σ is a non-linear activation function. As we show in section 5.9, even such a simple synthesis architecture can lead to out-sized gains.
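A minimal PyTorch sketch of this synthesis function is shown below (hypothetical module and dimension names; it is not the production implementation): it mean-pools the sampled item representations, concatenates the result with x_u, and applies a single fully-connected layer with a non-linearity, as in eq. 5.2.

import torch
import torch.nn as nn

class MeanSynthesis(nn.Module):
    def __init__(self, user_dim, item_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(user_dim + item_dim, out_dim)  # W and b of eq. 5.2
        self.act = nn.ReLU()                               # sigma

    def forward(self, x_u, Y):
        # x_u: (batch, user_dim); Y: (batch, n_sampled, item_dim) item-tower outputs.
        pooled = Y.mean(dim=1)                             # element-wise MEAN over the sample
        return self.act(self.fc(torch.cat([x_u, pooled], dim=-1)))

# Usage on random tensors:
synth = MeanSynthesis(user_dim=32, item_dim=32, out_dim=32)
x_aug = synth(torch.randn(4, 32), torch.randn(4, 10, 32))  # -> (4, 32) augmented users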
67
5.4.2.2 User-Attn Synthesis
Mean synthesis treats every sampled item as equally informative about user u. However, this is unlikely to always be the case. Thus, our next candidate synthesis architecture is based on a user-attention scheme in which the user attends to the sampled items and computes a weighted sum based on attention scores, instead of an unweighted mean:

\[
\begin{aligned}
a_k &= \mathrm{ATTENTION}(x_u, y^{i_k}),\ \forall y^{i_k} \in Y \\
x_u^{AUG} &= \sigma\big(W \cdot \mathrm{CONCAT}\big(x_u,\ \mathrm{SUM}(\{a_k\, y^{i_k},\ \forall y^{i_k} \in Y\})\big) + b\big)
\end{aligned}
\tag{5.3}
\]

The ATTENTION mechanism here could be any learnable function A : R^m × R^n → R. Some examples include [80] and [78]. For our work, we chose the scaled dot-product attention mechanism used in [78].
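The following sketch (again with hypothetical names, and assuming matching user and item dimensions) implements eq. 5.3 with scaled dot-product attention: the user representation acts as the query, each sampled item representation as a key, and the attention-weighted sum of item representations replaces the unweighted mean.

import math
import torch
import torch.nn as nn

class UserAttnSynthesis(nn.Module):
    def __init__(self, dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x_u, Y):
        # x_u: (batch, dim) query; Y: (batch, n_sampled, dim) keys/values.
        scores = (Y @ x_u.unsqueeze(-1)).squeeze(-1) / math.sqrt(Y.size(-1))
        a = torch.softmax(scores, dim=-1)                  # a_k: one weight per sampled item
        weighted = (a.unsqueeze(-1) * Y).sum(dim=1)        # SUM({a_k * y^{i_k}})
        return self.act(self.fc(torch.cat([x_u, weighted], dim=-1)))

synth = UserAttnSynthesis(dim=32, out_dim=32)
x_aug = synth(torch.randn(4, 32), torch.randn(4, 10, 32))  # -> (4, 32)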
5.4.2.3 GRU Synthesis
Both the Mean and User-Attn synthesis architectures are permutation invariant, i.e. they assume no ordering to the sampled user history. However, interactions are often associated with timestamps and present an opportunity to exploit the temporal nature of the data by treating interactions as an ordered sequence. Thus, we examine the use of the Gated Recurrent Unit (GRU) [17] as a potential synthesis model. It is a more computationally efficient sequence model than an LSTM [36] and has been shown to be effective at sequential recommendation tasks [35]. We formulate the GRU synthesis model as follows:

\[
\begin{aligned}
Y_{SORTED} &= \mathrm{SORT}(Y) \\
x_u^{AUG} &= \sigma\big(W \cdot \mathrm{CONCAT}\big(x_u,\ \mathrm{GRU}(\{y^{i_k},\ \forall y^{i_k} \in Y_{SORTED}\})\big) + b\big)
\end{aligned}
\tag{5.4}
\]

The SORT function sorts the item representations y^{i_k} ∈ Y according to the timestamp of the corresponding interaction (u, i_k). This ensures the sequence being fed into the GRU is ordered. We present the relative performance of each of the three choices of the synthesis model in section 5.9.
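A corresponding sketch for eq. 5.4 is below (hypothetical; it assumes Y is already sorted by interaction timestamp before being passed in): a single-layer GRU consumes the ordered item representations, and its final hidden state is concatenated with x_u.

import torch
import torch.nn as nn

class GRUSynthesis(nn.Module):
    def __init__(self, dim, out_dim):
        super().__init__()
        self.gru = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
        self.fc = nn.Linear(2 * dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x_u, Y_sorted):
        # Y_sorted: (batch, n_sampled, dim), ordered by interaction timestamp.
        _, h_n = self.gru(Y_sorted)                # h_n: (1, batch, dim), final hidden state
        summary = h_n.squeeze(0)
        return self.act(self.fc(torch.cat([x_u, summary], dim=-1)))

synth = GRUSynthesis(dim=32, out_dim=32)
x_aug = synth(torch.randn(4, 32), torch.randn(4, 10, 32))  # -> (4, 32)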
5.5 Training
In this section, we describe some important details of how both the base recommendation system and the model augmented with biased user history synthesis are trained.
5.5.1 Objective Function
The parameters of the base model, as well as of all variants of our user history synthesis model, are trained using a max-margin-based ranking loss function as in [94]. The objective is to maximize the score of positive examples (u, i) ∈ G(E) and at the same time minimize the score of negative examples, i.e. (u, i) ∈ G^C(E), where G^C(E) is the complement graph of the interaction bipartite graph. This is typically done by ensuring that the score of positives is larger than the score of negatives by a fixed margin. Thus, given a positive example (u, i+) with user representation x_u and item representation y_{i+}, the loss function is computed as:

\[ \mathcal{L}(x_u, y_{i^+}) = \mathbb{E}_{y_{i^-} \sim p_n(u)}\, \max\big(0,\ S(x_u, y_{i^-}) - S(x_u, y_{i^+}) + \Delta\big) \tag{5.5} \]
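A compact sketch of this loss is shown below (hypothetical tensor names; S is taken to be the dot product, and the expectation over negatives is approximated by an average over the sampled negatives per positive).

import torch

def max_margin_loss(x_u, y_pos, y_negs, margin=0.1):
    # x_u: (batch, d); y_pos: (batch, d); y_negs: (batch, num_neg, d).
    pos_score = (x_u * y_pos).sum(dim=-1, keepdim=True)          # S(x_u, y_{i+})
    neg_score = torch.einsum("bd,bnd->bn", x_u, y_negs)          # S(x_u, y_{i-}) per negative
    # Hinge: positives must beat each negative by at least `margin` (Delta in eq. 5.5).
    return torch.clamp(neg_score - pos_score + margin, min=0).mean()

loss = max_margin_loss(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 5, 32))
print(loss)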
5.5.2 Negative Sampling
Negative sampling has been shown to greatly affect recommendation model performance [33, 91]. Sampling out-of-batch negatives uniformly at random from the complement graph G^C can exacerbate the long-tail item problem, as tail items are likely to appear in negative examples disproportionately more than they are in positive examples. On the other hand, only performing in-batch sampling, while popular, can lead to head items appearing in disproportionately more negative examples than positive examples, especially when sampling multiple negatives per positive. Thus, we resort to mixed negative sampling [91], where an equal number of in-batch and out-of-batch negatives are sampled for every user u in a training batch.
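The sketch below illustrates mixed negative sampling in this spirit (hypothetical helper; it assumes item IDs are integers, in-batch negatives are taken from the other positives in the same mini-batch, and out-of-batch negatives are drawn uniformly from the full catalogue).

import torch

def mixed_negatives(batch_pos_items, num_items, num_neg_per_user):
    """Return (batch, 2 * num_neg_per_user) negatives: half in-batch, half out-of-batch."""
    b = batch_pos_items.size(0)
    # In-batch: sample other users' positive items from the current mini-batch
    # (a real implementation would also exclude each user's own positives).
    idx = torch.randint(0, b, (b, num_neg_per_user))
    in_batch = batch_pos_items[idx]
    # Out-of-batch: uniform over the whole item catalogue.
    out_batch = torch.randint(0, num_items, (b, num_neg_per_user))
    return torch.cat([in_batch, out_batch], dim=1)

negs = mixed_negatives(torch.tensor([10, 42, 7, 99]), num_items=1000, num_neg_per_user=2)
print(negs.shape)  # torch.Size([4, 4])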
5.6 Efficient Inference Through Decoupling Towers
Our model is designed to fit into the ranking stage of the two-stage recommendation model pipeline [18, 93, 9, 21]. Inference scalability is a crucial aspect of ranking model design: if inference is slow, the serving latency becomes high. One of the properties that make models like TTNNs attractive is that they are easy to serve, since user and item representations can be pre-computed and stored, leaving only relevance scores to be computed when serving queries. Our synthesis model benefits from the same property, as x_u^AUG can also be pre-computed and stored just as x_u can. Thus, our approach directly inherits the scalability of the base recommendation model on top of which it is built.
5.7 Theoretical Motivation
In this section, we make a case for why our proposed sampling with a bias towards tail items not only improves tail item recommendation performance but also improves head item recommendation performance, thereby motivating our approach with a theoretical argument.
5.7.1 Information Theory Interpretation
Proposition 9. Let U and I be random variables that represent the events that a specific user u and (respectively) a specific item i appear in a randomly drawn edge from the set of edges E in the bipartite interaction graph G(V, E). If their probabilities are characterized by the distributions p(u) and p(i) respectively, and their joint probability by p(u, i), then the pointwise mutual information contributed by tail items in a user's history is larger than the information contributed by head items.
Proof. From the definition of mutual information:

\[ I(U, I) = \mathbb{E}_{p(u,i)}\Big[ \log \frac{p(u, i)}{p(u)\,p(i)} \Big] = \mathbb{E}_{p(u,i)}\Big[ \log \frac{p(i \mid u)}{p(i)} \Big] \]

Mutual information can be viewed as the expectation over pointwise mutual information, defined as log [p(u, i) / (p(u)p(i))]. This definition is commonly used in NLP [44]. Pointwise mutual information helps us decompose the mutual information in terms of contributions from individual items. In the absence of other knowledge, we use the observed distribution p̂(i|u) in place of p(i|u), where p̂(i|u) is the observed probability of sampling an item i from the interaction history of user u. This is 0 for all items not present in the user history and 1/M for items in the user history, where M is the number of items in the history H_u. Thus, we can estimate I(U, I) as:

\[ \hat{I}(U, I) = \mathbb{E}_{\hat{p}(u,i)}\Big[ \log \frac{\hat{p}(i \mid u)}{p(i)} \Big] = \frac{1}{M} \sum_{j \in H_u} \Big( \log \frac{1}{p(I = j)} + \log \frac{1}{M} \Big) \]

Here p(I = j) is the probability of sampling item j in a random draw of an edge, and log(1/p(I = j)) + log(1/M) is the pointwise mutual information. Clearly, the pointwise information monotonically increases as items become more rare, since p(I = j) is small for tail items.
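For a concrete (hypothetical) numerical illustration, suppose a user's sampled history contains M = 5 items, among them a head item with p(I = j) = 0.05 and a tail item with p(I = j) = 0.0005. Their pointwise contributions are then

\[ \log\frac{1}{0.05} + \log\frac{1}{5} \approx 3.00 - 1.61 = 1.39 \quad\text{versus}\quad \log\frac{1}{0.0005} + \log\frac{1}{5} \approx 7.60 - 1.61 = 5.99 \ \text{nats}, \]

so the tail item contributes roughly four times as much information about the user's identity as the head item.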
We can build intuition by using an operational interpretation of mutual information as characterizing a noisy channel where we would like to reconstruct the user identity from observing a sample of the user's associated items. Mutual information tells us the expected amount of information that observing an item I = i gives us about the user U = u (or vice versa). In the extreme, if a rare item is interacted with by only a single user, then we can immediately and perfectly identify the user if we observe an interaction with this item. The items that are most informative about a user are the tail items, which motivates and justifies our approach to bias sampling towards tail items for both tail item recommendation as well as enhanced personalization.

@gttf.bias()
def inverse_softmax(adj_matrix: torch.Tensor, **kwargs) -> torch.Tensor:
    probability_matrix = torch.zeros_like(adj_matrix)
    num_users, softmax_temp = kwargs['num_users'], kwargs['softmax_temp']
    for user in range(num_users):
        inverse_neighbor_degrees = 1 / kwargs['neighbor_degrees'][user]
        probs = torch.softmax(inverse_neighbor_degrees / softmax_temp, dim=0)
        probability_matrix[user] = probs
    return probability_matrix

@gttf.accumulate()
def augment_user_representation(wf: WalkForest, **kwargs) -> torch.Tensor:
    node_features = kwargs["ttnn_representations"]
    synthesis_function = kwargs["synthesis_func"]
    wf.initialize_embeddings(node_features)
    for wt in wf:
        wt[0].embeddings = synthesis_function(wt[0], wt[1])
    return torch.stack([wt[0].embeddings for wt in wf])

buhs = gttf.run(interaction_graph, fanout=k, depth=1)

Listing 5.1: Implementation of BUHS using GTTF
5.8 Graph Convolution Interpretation and Implementation with GTTF
Biased User History Synthesis can be conceptually understood as a form of single-layer stochastic graph convolution ([31, 94, 95, 12]) on the user-item bipartite interaction graph. This operational interpretation becomes apparent when considering biased user history synthesis as a specialized form of the Message-Passing-Neural-Network (MPNN) abstraction of graph convolution outlined in [26]. Each variant of the synthesis function (Mean, User-Attn, and GRU) can be seen as a distinct message-generating function, responsible for conveying messages from the sampled neighborhood of a given node u to node u itself. Subsequently, the update function constitutes the final layer responsible for generating x_u^AUG, the augmented representation of node u. Embracing this interpretation, and given the fundamental premise of utilizing biased neighborhood samples during the learning process, we are motivated to adopt GTTF for easy, efficient, and unbiased learning. Thus, it suffices to merely define appropriate bias and accumulation functions on the interaction bipartite graph, as defined in section 2.3, to instantiate GTTF as BUHS. The code snippet in Listing 5.1 shows the definition of the bias and accumulation functions that instantiate GTTF as BUHS. A few noteworthy points: (1) the adj_matrix here is the adjacency matrix of the interaction bipartite graph; (2) we only populate the probability_matrix in the bias function for users, since we only sample user histories; (3) the accumulation function outputs x_u^AUG for every user in a batch; (4) the depth is set to 1, since BUHS is equivalent to a single-layer stochastic graph convolution.
5.9 Experiments
Our experimental section is divided into two parts where we test the performance of our technique and
assess the impact of implementing our proposed strategy on top of a base recommendation system. We
start with a brief description of the datasets, baselines, and evaluation criteria we used in our experiments,
followed by the main results section and an ablation and case study section.
5.9.1 Datasets
Our evaluation is based on two benchmark datasets: MovieLens-1M∗
and BookCrossing†
. Both of these
datasets are highly sparse and well-suited for the long-tail recommendation scenario, as mentioned in
[96]. In the following section, we describe the pre-processing and cleanup steps we carried out on each
dataset. Table 5.1 provides an overview of the processed datasets.
∗https://grouplens.org/datasets/movielens/1m/
†http://www2.informatik.uni-freiburg.de/~cziegler/BX/
MovieLens-1M BookCrossing
Number of Users 6040 52,019
Number of Items 3883 225,794
Number of Head Items 777 2258
Number of Tail Items 3106 223,536
Number of Interactions 1,000,209 733,257
Sparsity 95.7% 99.993%
Average User Degree 163.6 13.3
Average Item Degree 245.4 3.07
Table 5.1: Statistics of the datasets used in the experiments.
5.9.1.1 MovieLens-1M
The MovieLens-1M dataset consists of explicit user feedback for movies. User features are user ID, gender,
occupation, age, and zip code. Item features are item ID, title, genres, and year. We apply a pretrained
sentence transformer from HuggingFace [88] on the item title to obtain a dense representation. As in [83,
96], we convert explicit feedback into implicit feedback. As in [33, 96, 84, 77, 15], we divide the data into
training, validation and test datasets according to the leave-one-out setting. Interactions are also associated
with timestamps, which we use to order the samples fed into the GRU-based synthesis model. For each user,
we use the latest interaction for testing, the penultimate interaction for validation and the rest for training.
As per [96], we consider the top 20% of items by degree as head items and the bottom 80% as tail items.
5.9.1.2 BookCrossing
The BookCrossing dataset includes both explicit and implicit user feedback for books. User features include
ID, location and age, while item features include ID, title, author, year, and publisher. We follow the lead of
[20, 96] and remove any users or items with missing features or no interactions to clean up the otherwise
dirty dataset as noted in [20]. Like MovieLens, we apply a pretrained sentence transformer to obtain a
dense representation for item titles. Unlike MovieLens, BookCrossing lacks timestamps for interactions,
so for users with a degree of at least three, we randomly select a validation edge and a test edge. The
remaining users, including those with fewer than three interactions, are used for training. We consider
the top 0.01% of items by degree as head items and the remaining items as tail items, following the approach
of [96].
5.9.2 Baselines
To present our main findings, we compare a TTNN that incorporates Biased User History Synthesis
(MBUHS) to both model-agnostic and model-specific techniques. To maintain simplicity, we refer to
the base recommendation model as Mbase. To ensure a fair comparison, all model-agnostic baselines are
implemented on top of Mbase. The following baselines are employed:
Model-Specific
• NeuMF [33] - This approach combines a Generalized Matrix Factorization Model with an MLP to learn
latent feature interactions through both linear and non-linear kernels. According to the authors, incorporating user and item features instead of solely relying on IDs can mitigate the long-tail item problem. In
accordance with their findings, we adopt this approach in our own work.
• NGCF [86] - Although not designed to solve the long-tail item problem, the authors contend that the
collaborative filtering signal is not fully captured by solely relying on user-item interaction pairs. They
propose explicitly injecting the signal by leveraging higher-order graph connectivity. Our user interaction
history sampling technique can also be seen as utilizing graph connectivity, and as a result, we compare
our approach against NGCF.
Model-Agnostic
• DropoutNet [81] - By creatively using dropout, the model is trained to improve its performance on tail
items. It can be applied on top of any latent model, including Mbase.
• LogQ [56] - LogQ is a corrective method that addresses long-tail distributions. It calculates a correction
term based on item frequency and applies it to the logits before computing the final loss.
• MeLU [42] - A recent meta-learning technique that applies MAML [24] to address the long-tail item
problem.
• MWUF [101] - This method enhances the recommendation of tail items by employing two learnable
parameterized functions, namely the scaling and shifting functions. The scaling function embeds items
with no interactions into the head item embeddings subspace, while the shifting function is designed to
denoise tail item embeddings that were learned with limited interactions.
• CVAR [100] - To improve tail item recommendation, this method employs a Variational Autoencoder
to learn a latent distribution over items. The embeddings are then generated with a conditional decoder.
• MIRec [96] - This recent state-of-the-art meta-learning technique utilizes a curriculum transfer learning
protocol to transfer knowledge from head items to tail items.
5.9.3 Evaluation Criteria
To evaluate the model’s performance, we employ two metrics: Hit Rate @ k (HR@k) and Normalized Discounted Cumulative Gain @ k (NDCG@k)‡
. The former determines if a candidate item is among the top k
recommendations, while the latter measures how closely the recommended item is placed to the top of the
list. Together, these metrics provide a comprehensive evaluation of the recommendation model’s performance. As previously stated in Section 5.1, our goal is to enhance overall performance by simultaneously
improving recommendations for both tail and head items. For this reason, we report the performance of
each category separately in addition to overall ranking performance.
5.9.4 Main Results
Our results for Top-K recommendation are summarized in Tables 5.2 and 5.3 for MovieLens-1M and
BookCrossing, respectively. We notice the following from our main results:
‡We present both HR@k and NDCG@k as a percentage and hence values range from 0-100 as opposed to 0-1
• The tail, head, and overall recommendation are all significantly enhanced by all versions of MBUHS
(Mean, User-Attn, and GRU) when compared to Mbase. The success can be attributed to the improved
representation of tail items and the generation of personalized user representations. This underscores the
effectiveness of biased user history synthesis in addressing the tail item recommendation problem while
simultaneously making recommendations more personal.
• As shown in both tables 5.2 and 5.3, MBUHS generally outperforms both model-agnostic as well as
model-specific baselines in tail item recommendation, with at least one variant significantly outperforming
all baselines on both datasets. Additionally, we exceed most baselines in head item recommendation, which
aligns with the theoretical result of Prop 9, as enhanced personalization leads to better recommendation
for both head and tail items.
• On the Movielens-1M dataset, we notice that the User-Attn synthesis model outperforms the Mean
synthesis model in tail and overall recommendation but not head recommendation. We hypothesize that
this is due to the model assigning larger attention weights to tail items in the sample as a means to achieve
better overall recommendation performance (validating Prop 9 empirically). Nonetheless, even the simple
Mean synthesis model achieves out-sized gains over Mbase. We also see that the GRU synthesis model
performs the best in head and overall recommendation. This demonstrates the power of sequentially
modelling the sampled history in the synthesis model.
• For the BookCrossing dataset, our results show that the Mean synthesis model performs the best in
head and overall recommendation, while the User-Attn synthesis model outperforms others in tail recommendation. We hypothesize that the superior performance of User-Attn in tail recommendation is due to
the larger attention weights assigned to tail items in the sampled history. However, it’s worth noting that
the average number of interactions per user in the BookCrossing dataset is much smaller compared to the
MovieLens-1M dataset, with only 25% of users having more than three interactions. As a result, for any
reasonably sized sample, most users are likely to have duplicate items in their sampled history, as sampling
Overall Head Tail
Model HR@10 NDCG@10 HR@10 NDCG@10 HR@10 NDCG@10
Mbase 6.07 2.98 8.22 4.08 2.51 1.16
NeuMF 6.21 3.04 8.12 4.03 3.04 1.39
NGCF 6.03 3.06 9.18 4.7 0.79 0.33
DropoutNet 6.55 3.25 8.85 4.49 2.73 1.18
LogQ 5.21 2.69 6.39 3.26 3.25 1.51
MeLU 4.71 2.22 6.79 3.23 1.27 0.54
MWUF 6.24 3.11 8.86 4.49 1.89 0.83
CVAR 6.23 3.08 8.25 4.14 2.86 1.31
MIRec 6.24 3.04 8.51 4.15 2.46 1.2
MBUHS − Mean 7.51 3.31 10.26 5.24 2.95 1.29
MBUHS − Attn 7.63 3.73 10.05 5.01 3.61 1.60
MBUHS − GRU 7.75 3.87 10.35 5.20 3.44 1.42
Table 5.2: Comparison against baselines on MovieLens-1M by HR@10 and NDCG@10
Overall Head Tail
Model HR@100 NDCG@100 HR@100 NDCG@100 HR@100 NDCG@100
Mbase 5.92 1.37 13.13 3.23 5.21 1.19
NeuMF 6.74 1.50 9.2 2.14 6.5 1.44
NGCF 5.13 1.27 30.89 8.53 2.58 0.55
DropoutNet 5.93 1.26 12.11 2.52 5.32 1.14
LogQ 7.12 1.63 15.46 3.48 6.3 1.45
MeLU 3.37 0.7 7.12 1.46 3.01 0.63
MWUF 6.42 1.41 15.63 3.58 5.50 1.19
CVAR 7.22 1.66 15.38 3.31 6.4 1.5
MIRec 6.3 1.45 14.57 3.54 5.48 1.24
MBUHS − Mean 9.46 2.65 16.93 5.23 8.72 2.40
MBUHS − Attn 9.23 2.53 13.78 4.57 8.78 2.45
MBUHS − GRU 8.49 2.18 13.88 4.20 7.95 1.98
Table 5.3: Comparison against baselines on BookCrossing by HR@100 and NDCG@100
is done with replacement when the sample size exceeds the number of items in a user’s history. This can
result in noisy attention computation, where the attention weights of duplicate items are inflated, leading
to poorer overall performance compared to Mean synthesis. Additionally, recall from section 5.9.1.2, that
the interactions in BookCrossing do not contain timestamps. Thus, we randomly permute the sampled
history before feeding it into the GRU so as to not assume any ordering [31]. While this explains the relatively poor performance of GRU synthesis compared to Mean and User-Attn synthesis, it is worth noting
that the large expressive capability of GRUs still leads to significant improvement over Mbase.
Overall Head Tail
Mbase / MBUHS Base Mean Attn GRU Base Mean Attn GRU Base Mean MHA GRU
TTNN 6.07 7.51 7.63 7.75 8.22 10.26 10.05 10.35 2.51 2.95 3.61 3.44
WnD 6.29 7.37 7.57 7.81 8.46 10.08 10.29 10.58 2.69 2.86 3.04 3.22
DeepFM 6.37 7.28 7.33 7.45 8.59 9.89 9.92 9.92 2.69 2.95 3.04 3.34
Table 5.4: Ablation on base models on MovieLens-1M
• Loss correction techniques, such as LogQ, have been effective in improving tail item recommendation.
However, a significant drawback is their negative impact on head item performance, which ultimately
affects overall performance. The bias towards tail items in these techniques is different from our model
because they penalize missed recommendations of tail items more heavily than missed head items. In
contrast, our model aims to extract more information from tail items to provide better recommendations.
Similarly, DropoutNet is very effective at improving tail item recommendation and overall recommendation, but it views the problem solely through the lens of overfitting head items and thus doesn’t perform
as well as our model.
• Our models significantly outperform the meta-learning based baselines, MeLU and MIRec. It is worth
noting that applying such meta-learning techniques on top of MBUHS as opposed to Mbase could potentially lead to further improvement in top line metrics. However, this aspect is not covered in this paper
and is reserved for future research.
5.9.5 Ablation and Sensitivity
In this section, we analyze key parameters that impact our model’s performance through ablation and
sensitivity studies. We conducted an ablation study to investigate the impact of biased sampling on the
performance of our proposed model. Additionally, we performed a sensitivity analysis on the softmax temperature in equation 5.1 and evaluated the effect of sample size on model performance. These experiments
were conducted on both the User-Attn and GRU variants of our model for the MovieLens-1M dataset. We
chose the User-Attn and GRU variants to understand the dynamics of the synthesis model in both ordered and unordered user-history settings. It is worth noting that sample size and temperature are not
independent of each other, and their effects change as we vary one or the other. Therefore, the optimal
combination of sample size and temperature needs to be determined jointly. Nonetheless, we observe the
following general trends:
• Figures 5.3b and 5.3d indicate that incorporating bias in the sampling process leads to superior overall recommendation performance compared to unbiased sampling for both User-Attn and GRU synthesis.
However, in the case of User-Attn synthesis, this improvement is mainly driven by better tail item recommendation performance, while in the case of GRU synthesis, this gain seems to be predominantly derived
from better head item recommendation performance. Interestingly, for both variants, we also observe that
decreasing the sampling softmax temperature for a fixed sample size generally leads to better head item
performance, despite the fact that lower temperatures tend to increase the proportion of tail items in the
samples. Proposition 9 provides an explanation for this seemingly counter-intuitive finding. Moreover,
our findings suggest that the enhanced tail item recommendation performance is not solely attributable to
the increased tail item visibility introduced by our approach. Instead, it seems that greater personalization
also plays a role in this improvement.
• Figure 5.3a shows that increasing the sample size, for a fixed softmax temperature, has the effect of
improving head item recommendation and slightly degrading tail item recommendation for the User-Attn
Synthesis model. This results in a reasonable improvement in overall recommendation performance. Although sampling more items for a fixed temperature increases the expected proportion of head items in
the sample, the impact of user-attention helps mitigate the degradation to tail item recommendation while
simultaneously providing benefits to head item recommendation. On the other hand, as shown in Figure
5.3c, the GRU Synthesis model monotonically benefits from larger sample sizes. We conjecture that this is
due to a stronger signal obtained from a longer sequence in the sequential GRU model.
Figure 5.3: (a) Ablation study on sample size for User-Attn Synthesis (T=0.01) for MovieLens-1M. (b) Ablation study on softmax temperature for User-Attn Synthesis (15 neighbors) for MovieLens-1M. (c) Ablation study on sample size for GRU Synthesis (T=0.01) for MovieLens-1M. (d) Ablation study on softmax temperature for GRU Synthesis (15 neighbors) for MovieLens-1M. All panels report Hit Rate @ 10 for overall, head, and tail items.
Figure 5.4: (a) Batch energy consumption during training of a TTNN model for MovieLens-1M. (b) Batch energy consumption during training of a WnD model for MovieLens-1M. (c) Batch energy consumption during training of a DeepFM model for MovieLens-1M. Each panel compares the base model against its Mean, Attn, and GRU variants (energy in J).
Figure 5.5: (a) Energy consumed during training to achieve the best recommendation performance of a TTNN model for MovieLens-1M. (b) Energy consumed during training to achieve the best recommendation performance of a WnD model for MovieLens-1M. (c) Energy consumed during training to achieve the best recommendation performance of a DeepFM model for MovieLens-1M. Each panel compares the base model against its Mean, Attn, and GRU variants (energy in kJ).
Figure 5.6: Epoch training time (s) for Mbase (green), baseline models (red), and MBUHS variants (blue) for MovieLens-1M.
5.9.6 Case Studies
In addition to conducting ablation and sensitivity studies, we have undertaken four case studies to provide
further validation for our proposed approach. The first case study aims to demonstrate the efficacy of
biased user history synthesis in enhancing the performance of various base recommendation models, including TTNNs. The second case study involves an analysis of the time required to train a single epoch for
each baseline in section 5.9.2, comparing it to the training time of Mbase and all variants of MBUHS. Our
third case study investigates the impact of incorporating biased user history synthesis on the energy consumption during the training process of each base model. Lastly, the fourth and final case study delves into
examining the influence of our approach on the final item embeddings, thus providing valuable insights
into the underlying mechanisms governing our model’s performance.
5.9.6.1 Case Study 1: Biased User History Synthesis on other base recommendation models
Biased User History Synthesis is a model-agnostic technique that can be applied on top of various base
recommendation models. In this case study, we extended our approach to two popular base recommendation models, namely Wide and Deep (WnD) [14] and DeepFM [29], to investigate the effectiveness of our
Figure 5.7: Visualizations of final embeddings of items in the history of randomly selected users in MovieLens-1M (user IDs 5255, 745, 9, 5830, 5428, and 2747), comparing the base model with the User-Attn Synthesis and GRU Synthesis variants.
approach across different base recommendation model architectures. The results are summarized in Table
5.4. Our findings reveal that biased user history synthesis consistently improves the head, tail, and overall
recommendation performance for both DeepFM and Wide and Deep. Notably, the same trend holds across
all base recommendation models for overall recommendation, with GRU outperforming User-Attn, which
in turn outperforms Mean synthesis. However, it is important to note that the superiority of one base recommendation system over another does not necessarily imply the same relative performance when biased
user history synthesis is applied on top of the base models. The combination of the base recommendation
model and synthesis model needs to be chosen jointly for optimal results.
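To illustrate what model-agnosticism means in practice, the sketch below wraps an arbitrary base recommendation model with a synthesis module that augments the user representation before scoring. The attribute names (user_embedding, item_embedding, score) and the tensor shape conventions are hypothetical placeholders; the actual TTNN, WnD, and DeepFM integrations differ in their details.

import torch
import torch.nn as nn

class BUHSWrapper(nn.Module):
    # Minimal sketch: augment the base model's user embedding with a synthesized
    # summary of sampled history items, then score the (user, item) pair.
    # Assumes the base model exposes embedding tables and a scoring head under
    # the attribute names used below; these names are illustrative only.
    def __init__(self, base_model, synthesis_fn):
        super().__init__()
        self.base = base_model
        self.synthesis = synthesis_fn   # e.g., mean pooling, attention, or a GRU

    def forward(self, user_ids, item_ids, sampled_history):
        u = self.base.user_embedding(user_ids)            # (B, d)
        v = self.base.item_embedding(item_ids)            # (B, d)
        h = self.base.item_embedding(sampled_history)     # (B, k, d)
        u_aug = torch.cat([u, self.synthesis(h)], dim=-1) # (B, 2d)
        return self.base.score(u_aug, v)

For instance, the mean synthesis variant corresponds to passing synthesis_fn = lambda h: h.mean(dim=1), while the attention and GRU variants replace this with learned modules.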
5.9.6.2 Case Study 2: Training time analysis
Figure 5.6 presents a comparison of the per-epoch training time for Mbase, all variants of MBUHS, and
each baseline model on the MovieLens-1M dataset. A noteworthy observation is that, for at least one variant
of MBUHS, the training time per epoch is consistently lower than that of all baseline models except LogQ,
which is to be expected since LogQ is a corrective method that adds no trainable parameters (MWUF and
DropoutNet both take over 1500 seconds per epoch and have therefore been omitted from the figure for clarity). This finding
implies that MBUHS achieves superior recommendation quality over the baselines without sacrificing
training efficiency. Furthermore, it is evident that the mean variant of MBUHS demands a comparable
amount of time per epoch as Mbase. Consequently, the addition of biased user history synthesis to Mbase
holds great promise, as it adds minimal computational overhead, if any, to the training process.
5.9.6.3 Case Study 3: Energy consumption of Biased User History Synthesis
The added overhead of our model during inference can be eliminated by pre-computing the user and item
representations as mentioned in Section 5.6. During training, however, our model adds minimal overhead to a
base recommendation model to achieve the reported substantial gains in recommendation performance.
The added overhead in energy consumption, measured using hardware measurement tools, associated
with using our model is reported in Figure 5.4, indicating that our model may require additional energy
resources for its operation. However, Figure 5.5a demonstrates a noteworthy advantage of our model in
terms of long-term energy efficiency. A TTNN takes 98 training epochs to reach its best recommendation
performance, a hit rate of 6.07 on MovieLens-1M, while all of our model variants achieve the same
recommendation performance in fewer than 12 training epochs. Figure 5.5a reports the energy consumed
by our model variants and a TTNN to reach a hit rate of 6.07 on MovieLens-1M. Thus, despite the added
energy overhead per batch, our model achieves better energy utilization than a TTNN
over the long run. Figure 5.5b and Figure 5.5c show that model variants built on top of WnD and DeepFM
also achieve better energy utilization and follow the same trends as their TTNN counterparts.
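As context for how per-batch figures of this kind can be produced, GPU energy can be approximated in software by sampling the board power draw around each training step. The sketch below uses the NVML Python bindings (pynvml); it is a coarse software approximation offered for illustration and is not necessarily the measurement setup behind the numbers reported above.

import time
import pynvml

pynvml.nvmlInit()
_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def step_energy_joules(step_fn, *args, **kwargs):
    # Approximate the energy of one training step by averaging the power draw
    # sampled immediately before and after the step and multiplying by the
    # elapsed wall-clock time. Coarse, but sufficient for relative comparisons.
    p0 = pynvml.nvmlDeviceGetPowerUsage(_handle)   # milliwatts
    t0 = time.time()
    out = step_fn(*args, **kwargs)
    elapsed = time.time() - t0
    p1 = pynvml.nvmlDeviceGetPowerUsage(_handle)   # milliwatts
    avg_watts = (p0 + p1) / 2.0 / 1000.0
    return out, avg_watts * elapsed                # joules = watts x seconds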
5.9.6.4 Case Study 4: Effect of Biased User History Synthesis on final item embeddings
Our approach, though based on biased item sampling, is fundamentally user-centric as it aims to compute
an augmented user representation that is more personalized, potentially leading to improved head, tail, and
84
overall recommendation performance. However, in this case study, we are interested in understanding
how our strategy affects item representations. Specifically, we aim to investigate the effects of biased
user history sampling on head and tail item representations. To achieve this, we visualize the final high-dimensional item embeddings in a 2-dimensional space using UMAP [55] to reduce the dimensionality.
Figure 5.7 presents a visualization of the final embeddings of items in the history of six randomly selected
users from the MovieLens-1M dataset. In the visualization, blue points represent embeddings of tail items,
while red points represent embeddings of head items. The darkness of the shade of blue indicates the
lower number of interactions experienced by the item in the dataset, while the darkness of the shade of
red indicates the higher number of interactions experienced by the item in the dataset.
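The visualization pipeline itself is straightforward; a minimal sketch using umap-learn and matplotlib is shown below. The threshold separating head from tail items, the log-scaled shading, and the variable names are illustrative choices rather than the exact settings used to produce Figure 5.7.

import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_item_embeddings(item_embeddings, interaction_counts, tail_threshold):
    # Project final item embeddings to 2-D with UMAP, then color tail items in
    # blues (darker = fewer interactions) and head items in reds (darker = more
    # interactions), matching the scheme described in the text.
    coords = umap.UMAP(n_components=2).fit_transform(item_embeddings)
    shade = np.log1p(interaction_counts)
    shade = (shade - shade.min()) / (shade.max() - shade.min() + 1e-8)
    tail = interaction_counts < tail_threshold
    plt.scatter(coords[tail, 0], coords[tail, 1], c=1.0 - shade[tail],
                cmap="Blues", s=8, label="tail items")
    plt.scatter(coords[~tail, 0], coords[~tail, 1], c=shade[~tail],
                cmap="Reds", s=8, label="head items")
    plt.legend()
    plt.show()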
We notice the following from the visualization in Figure 5.7. First, there is a noticeable distinction in the
distribution of head and tail item representations between the embeddings produced by the base recommendation model and those augmented with biased user history synthesis. In the embeddings generated
by the base model, little to no clustering is observed, and head and tail item representations appear to be
intertwined. In contrast, the item representations produced by the User-Attn and GRU synthesis variants
exhibit stronger disentanglement between head and tail item representations, with the User-Attn variant
also showing some semblance of clustering into head and tail item clusters. Second, with biased user history synthesis, we notice greater variance in the item embeddings, particularly for tail items. In contrast,
the base recommendation model appears to produce more biased item embeddings. These findings indicate that biased user history synthesis not only has the potential to learn representations that better capture
the discriminative features between head and tail items, but also produces less biased tail item embeddings, thereby leading to improved recommendation performance from the perspective of users, as well
as enhanced recommendation of tail items.
5.9.7 Hyperparameters
We use a batch size of 1024, a learning rate of 0.0002 for all MovieLens-1M models, a learning rate of
0.0001 for all BookCrossing models, and an embedding table dimension of 96 for all models. We sample
20 negative edges for each positive edge in MovieLens-1M and 10 negative edges for each positive edge in
BookCrossing during training. For the User-Attn variant of MBUHS we use a query and key projection dimension
of 96. All MLPs use the GeLU activation function [34] and layer normalization between layers. MLP dimensions
can be found in our code. The results in Tables 5.2 and 5.3 are obtained with a sample size/sampling softmax
temperature of 15/0.01 and 10/1.0 for MovieLens-1M and BookCrossing respectively.
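As a concrete rendering of the MLP configuration described above, the following sketch builds a feed-forward block with GeLU activations and layer normalization between layers. The hidden widths and the ordering of normalization and activation shown here are placeholder choices; the exact dimensions are available in our code.

import torch.nn as nn

def make_mlp(dims=(96, 256, 128, 96)):
    # MLP with LayerNorm and GeLU between layers and no activation after the
    # final layer. The widths here are placeholders; see the released code for
    # the exact dimensions used in each model.
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(dims) - 2:   # hidden layers only
            layers.append(nn.LayerNorm(d_out))
            layers.append(nn.GELU())
    return nn.Sequential(*layers)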
Chapter 6
Future Work
The solutions proposed in this dissertation to address the efficient hardware usage problem and the long-tail item problem show significant promise and represent notable advancements in the state of the art.
However, they also open up avenues for future improvements and expansions.
In the case of cDLRM, the caching solution proves beneficial in enabling single-GPU training regardless of model size. However, the current implementation incurs a computational overhead because a single
process is responsible for both caching and training, forcing the two to execute serially. This setup
presents a computational bottleneck. A more sophisticated and efficient
system could be achieved by employing separate CPU processes for caching and training. However, this
introduces a challenge concerning the synchronization of the caching process with the training process
when updating caches with embeddings from the next lookahead window while the training process is still
using embeddings from the previous window. Asynchronicity between the two processes could potentially
lead to overwriting embeddings in use by the training process. Conversely, waiting for the training process
to complete a full lookahead window before caching becomes a serial and synchronous process, negating
the advantage of having separate processes for caching and training. A viable solution to this issue could
substantially alleviate the computational bottlenecks of cDLRM and potentially bring it closer to parity
with a baseline system that uses many more GPUs to train the model.
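One way such a design could avoid the overwrite hazard is a double-buffered cache guarded by lightweight synchronization: the caching side fills one buffer for the next lookahead window while the training side consumes the other. The sketch below illustrates the idea with Python threads and two hypothetical helpers, preprocess_and_prefetch and train_on_window; the actual proposal would use separate CPU processes with the buffers held in shared, pinned memory.

import threading

def run_pipelined(windows, preprocess_and_prefetch, train_on_window):
    # Double-buffered caching/training pipeline: the cacher may only overwrite a
    # buffer after the trainer has finished with it, and the trainer may only
    # read a buffer after the cacher has filled it with the next lookahead window.
    buffers = [None, None]
    ready = [threading.Event(), threading.Event()]
    consumed = [threading.Event(), threading.Event()]
    for e in consumed:
        e.set()                       # both buffers start out free

    def cacher():
        for i, window in enumerate(windows):
            b = i % 2
            consumed[b].wait()        # do not clobber embeddings still in use
            consumed[b].clear()
            buffers[b] = preprocess_and_prefetch(window)
            ready[b].set()

    def trainer():
        for i in range(len(windows)):
            b = i % 2
            ready[b].wait()           # block until fresh embeddings are cached
            ready[b].clear()
            train_on_window(buffers[b])
            consumed[b].set()

    threads = [threading.Thread(target=cacher), threading.Thread(target=trainer)]
    for t in threads: t.start()
    for t in threads: t.join()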
As for BUHS, we observe two main limitations that we identify as future directions for research. Firstly,
our approach currently does not account for the possibility of a long-tail distribution in user interaction history length. As evident from observations on the BookCrossing dataset, instances with small user degrees
can hinder effective model training, especially for more complex and potent synthesis models. Addressing
this limitation would be crucial in enhancing the robustness of our approach across diverse datasets. Secondly, another noteworthy limitation is the absence of modeling for essential non-relevance-based metrics,
such as diversity and novelty in recommendations. These metrics hold increasing significance in the comprehensive evaluation of recommendation systems. An obvious extension to our model, which already
achieves improved relevance and personalization, would involve incorporating mechanisms to enhance
diversity and novelty in the recommended items. This holistic approach to recommendation system evaluation would undoubtedly contribute to the overall quality and utility of our proposed solution.
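To make this direction concrete, two commonly used beyond-relevance measures can be computed directly from item popularities and item embeddings, as in the sketch below. The definitions shown, novelty as mean self-information and diversity as average pairwise cosine distance, are standard choices given only for illustration; the metrics we would ultimately adopt are left open.

import numpy as np

def novelty(recommended_ids, interaction_counts):
    # Mean self-information of the recommended items: -log2 of the empirical
    # interaction probability. Higher values mean less popular, more novel items.
    p = interaction_counts[recommended_ids] / interaction_counts.sum()
    return float(np.mean(-np.log2(p + 1e-12)))

def intra_list_diversity(recommended_ids, item_embeddings):
    # Average pairwise cosine distance between the recommended items' embeddings.
    e = item_embeddings[recommended_ids]
    e = e / (np.linalg.norm(e, axis=1, keepdims=True) + 1e-12)
    sims = e @ e.T
    mask = ~np.eye(len(recommended_ids), dtype=bool)
    return float(np.mean(1.0 - sims[mask]))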
Chapter 7
Conclusions
Deep Learning Recommendation Models (DLRMs) have become a fundamental pillar of the internet economy, and their significance is expected to grow further as their prevalence and the size of training datasets
continue to expand. However, this growth in scale poses distinct challenges both at the system and algorithmic levels.
The unique architecture of DLRMs, characterized by computationally complex dense networks and
memory-intensive large sparse embedding tables, typically necessitates a data and model parallel hybrid
strategy. In this approach, expensive hardware resources like GPUs or compute accelerators are often
provisioned in numbers that exceed what is needed to parallelize the computation effectively. Storing embedding tables across
multiple GPUs makes the training infrastructure inefficient and costly. To overcome this limitation, we
propose cDLRM, a recommendation model training system that ensures cost-efficient training by entirely
storing the large embedding tables on a CPU. The key insight behind cDLRM is that only a small subset of embedding table entries is updated by each training batch. Moreover, this subset of entries can be
identified by looking ahead in the training batches. cDLRM employs a CPU-based lookahead thread that
preprocesses multiple training batches ahead of their training and prefetches the necessary unique embedding vectors into GPU DRAM. By decoupling the memory demands of the recommendation model from its
computational demands, cDLRM enables training of large models on a single GPU. When multiple GPUs
are available, cDLRM exhibits robust scaling across GPUs and allows for pure data parallelism while maintaining model accuracy. By eliminating the need for excessive hardware and expensive GPUs for storing
recommendation models, cDLRM takes a significant step toward democratizing recommendation model
training.
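The heart of the lookahead mechanism can be summarized in a few lines: scan a window of upcoming batches, collect the unique embedding rows they reference, and move only those rows to the GPU. The sketch below is a simplification of that step; the batch field name and the cache layout are illustrative, and the actual system overlaps this work with training rather than performing it inline.

import torch

def prefetch_window(batches, embedding_table_cpu, device="cuda"):
    # Gather the unique sparse indices touched by a lookahead window of batches
    # and copy only those embedding rows to the GPU. Returns the cached rows and
    # a mapping from original table index to cache slot.
    all_indices = torch.cat([b["sparse_indices"].flatten() for b in batches])
    unique_indices = torch.unique(all_indices)            # small subset of the table
    cache = embedding_table_cpu[unique_indices].to(device, non_blocking=True)
    slot_of = {int(idx): slot for slot, idx in enumerate(unique_indices.tolist())}
    return cache, slot_of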
In addition to cDLRM, we present our model, Biased User History Synthesis (BUHS), designed to tackle
the long-tail item problem in large-scale recommendation models. This problem arises due to an inherent
popularity bias toward a small subset of items. BUHS, a model-agnostic DLRM training algorithm, not
only addresses the long-tail item problem but also enhances personalization in the base recommendation
system. As BUHS operates stochastically and can be viewed as a learning task on the bipartite user-item
interaction graph, we introduce GTTF, a stochastic graph meta-algorithm, to facilitate efficient and unbiased learning of various graph representation models, and then leverage GTTF to efficiently implement
BUHS. In addition to our state-of-the-art results on both tail and head-item recommendation performance,
we also offer an information-theoretic interpretation of BUHS to motivate our approach and reveal an
intrinsic relationship between improved personalization and enhanced long-tail item recommendation.
Bibliography
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,
Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. “TensorFlow: A
system for large-scale machine learning”. In: 12th USENIX Symposium on Operating Systems
Design and Implementation (OSDI 16). 2016, pp. 265–283. url:
https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf.
[2] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard,
Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. “MixHop: Higher-Order Graph
Convolutional Architectures via Sparsified Neighborhood Mixing”. In: International Conference on
Machine Learning. 2019.
[3] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. “Watch Your Step:
Learning Node Embeddings via Graph Attention”. In: Advances in Neural Information Processing
Systems. 2018.
[4] M. T. Ahamed and S. Afroge. “A Recommender System Based on Deep Neural Network and
Matrix Factorization for Collaborative Filtering”. In: 2019 International Conference on Electrical,
Computer and Communication Engineering (ECCE). 2019, pp. 1–5.
[5] Jonas Theon Anthony, Gerard Ezra Christian, Vincent Evanlim, Henry Lucky, and
Derwin Suhartono. “The Utilization of Content Based Filtering for Spotify Music
Recommendation”. In: 2022 International Conference on Informatics Electrical and Electronics
(ICIEE). 2022, pp. 1–4. doi: 10.1109/ICIEE55596.2022.10010097.
[6] Keshav Balasubramanian, Abdulla Alshabanah, Joshua D Choe, and Murali Annavaram. “CDLRM:
Look Ahead Caching for Scalable Training of Recommendation Models”. In: Proceedings of the
15th ACM Conference on Recommender Systems. RecSys ’21. Amsterdam, Netherlands: Association
for Computing Machinery, 2021, pp. 263–272. isbn: 9781450384582. doi: 10.1145/3460231.3474246.
[7] Pierre Baldi and Peter Sadowski. “The dropout learning algorithm”. In: Artificial Intelligence. 2014.
[8] Alex Beutel, Ed H. Chi, Zhiyuan Cheng, Hubert Pham, and John Anderson. “Beyond Globally
Optimal: Focused Learning for Improved Recommendations”. In: Proceedings of the 26th
International Conference on World Wide Web. WWW ’17. Perth, Australia: International World
Wide Web Conferences Steering Committee, 2017, pp. 203–212. isbn: 9781450349130. doi:
10.1145/3038912.3052713.
[9] Fedor Borisyuk, Krishnaram Kenthapadi, David Stein, and Bo Zhao. “CaSMoS: A Framework for
Learning Candidate Selection Models over Structured Queries and Documents”. In: Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD
’16. San Francisco, California, USA: Association for Computing Machinery, 2016, pp. 441–450.
isbn: 9781450342322. doi: 10.1145/2939672.2939718.
[10] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine
Learning on Graphs: A Model and Comprehensive Taxonomy. 2021. arXiv: 2005.03675.
[11] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. “The
rise of deep learning in drug discovery”. In: Drug discovery today. 2018.
[12] Jie Chen, Tengfei Ma, and Cao Xiao. “FastGCN: Fast Learning with Graph Convolutional
Networks via Importance Sampling”. In: International Conference on Learning Representation. 2018.
[13] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. “Simple and Deep Graph
Convolutional Networks”. In: International Conference on Machine Learning. 2020.
[14] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong,
Vihan Jain, Xiaobing Liu, and Hemal Shah. “Wide & Deep Learning for Recommender Systems”.
In: CoRR abs/1606.07792 (2016). arXiv: 1606.07792. url: http://arxiv.org/abs/1606.07792.
[15] Mingyue Cheng, Fajie Yuan, Qi Liu, Shenyang Ge, Zhi Li, Runlong Yu, Defu Lian, Senchao Yuan,
and Enhong Chen. “Learning Recommender Systems with Implicit Feedback via Soft Target
Enhancement”. In: Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval. SIGIR ’21. Virtual Event, Canada: Association for
Computing Machinery, 2021, pp. 575–584. isbn: 9781450380379. doi: 10.1145/3404835.3462863.
[16] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. “Cluster-GCN:
An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks”. In: ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019.
[17] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation”. In: CoRR abs/1406.1078 (2014). arXiv: 1406.1078. url:
http://arxiv.org/abs/1406.1078.
[18] Paul Covington, Jay Adams, and Emre Sargin. “Deep Neural Networks for YouTube
Recommendations”. In: Proceedings of the 10th ACM Conference on Recommender Systems. New
York, NY, USA, 2016.
[19] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. “Class-Balanced Loss Based
on Effective Number of Samples”. In: CoRR abs/1901.05555 (2019). arXiv: 1901.05555. url:
http://arxiv.org/abs/1901.05555.
[20] Manqing Dong, Feng Yuan, Lina Yao, Xiwei Xu, and Liming Zhu. “MAMO: Memory-Augmented
Meta-Optimization for Cold-start Recommendation”. In: CoRR abs/2007.03183 (2020). arXiv:
2007.03183. url: https://arxiv.org/abs/2007.03183.
[21] Chantat Eksombatchai, Pranav Jindal, Jerry Zitao Liu, Yuchen Liu, Rahul Sharma, Charles Sugnet,
Mark Ulrich, and Jure Leskovec. “Pixie: A System for Recommending 3+ Billion Items to 200+
Million Users in Real-Time”. In: CoRR abs/1711.07601 (2017). arXiv: 1711.07601. url:
http://arxiv.org/abs/1711.07601.
[22] Carsten Felden and Peter Chamoni. “Recommender Systems Based on an Active Data Warehouse
with Text Documents”. In: Proceedings of the 40th Annual Hawaii International Conference on
System Sciences. HICSS ’07. USA: IEEE Computer Society, 2007, 168a. isbn: 0769527558. doi:
10.1109/HICSS.2007.460.
[23] Matthias Fey and Jan E. Lenssen. “Fast Graph Representation Learning with PyTorch Geometric”.
In: ICLR Workshop on Representation Learning on Graphs and Manifolds. 2019.
[24] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks”. In: CoRR abs/1703.03400 (2017). arXiv: 1703.03400. url:
http://arxiv.org/abs/1703.03400.
[25] Sahin Cem Geyik, Qi Guo, Bo Hu, Cagri Ozcaglar, Ketan Thakkar, Xianren Wu, and
Krishnaram Kenthapadi. “Talent Search and Recommendation Systems at LinkedIn: Practical
Challenges and Lessons Learned”. In: The 41st International ACM SIGIR Conference on Research &
Development in Information Retrieval. SIGIR ’18. Ann Arbor, MI, USA: Association for Computing
Machinery, 2018, pp. 1353–1354. isbn: 9781450356572. doi: 10.1145/3209978.3210205.
[26] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. “Neural
Message Passing for Quantum Chemistry”. In: International Conference on Machine Learning. 2017.
[27] David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. “Using Collaborative Filtering
to Weave an Information Tapestry”. In: Commun. ACM 35.12 (Dec. 1992), pp. 61–70. issn:
0001-0782. doi: 10.1145/138859.138867.
[28] Aditya Grover and Jure Leskovec. “node2vec: Scalable feature learning for networks”. In: ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
[29] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. “DeepFM: A
Factorization-Machine based Neural Network for CTR Prediction”. In: CoRR abs/1703.04247
(2017). arXiv: 1703.04247. url: http://arxiv.org/abs/1703.04247.
[30] William Hamilton, Rex Ying, and Jure Leskovec. “Inductive Representation Learning on Large
Graphs”. In: Advances in Neural Information Processing Systems. 2017.
[31] William L. Hamilton, Rex Ying, and Jure Leskovec. “Inductive Representation Learning on Large
Graphs”. In: CoRR abs/1706.02216 (2017). arXiv: 1706.02216. url: http://arxiv.org/abs/1706.02216.
[32] Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, and Christina Lioma.
“Content-aware Neural Hashing for Cold-start Recommendation”. In: CoRR abs/2006.00617
(2020). arXiv: 2006.00617. url: https://arxiv.org/abs/2006.00617.
[33] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. “Neural
Collaborative Filtering”. In: CoRR abs/1708.05031 (2017). arXiv: 1708.05031. url:
http://arxiv.org/abs/1708.05031.
[34] Dan Hendrycks and Kevin Gimpel. “Bridging Nonlinearities and Stochastic Regularizers with
Gaussian Error Linear Units”. In: CoRR abs/1606.08415 (2016). arXiv: 1606.08415. url:
http://arxiv.org/abs/1606.08415.
[35] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. “Session-based
Recommendations with Recurrent Neural Networks”. In: (Nov. 2015).
[36] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In: Neural computation 9
(Dec. 1997), pp. 1735–80. doi: 10.1162/neco.1997.9.8.1735.
[37] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. “Open Graph Benchmark: Datasets for Machine Learning on
Graphs”. In: arXiv. 2020.
[38] Biye Jiang, Chao Deng, H. Yi, Zelin Hu, Guorui Zhou, Y. Zheng, Sui Huang, X. Guo, D. Wang,
Y. Song, Liqin Zhao, Z. Wang, P. Sun, Y. Zhang, Di Zhang, Jin-hui Li, Jian Xu, Xiaoqiang Zhu, and
Kun Gai. “XDL: an industrial deep learning framework for high-dimensional sparse data”. In:
Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional
Sparse Data (2019).
[39] Norman P. Jouppi. “Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers”. In: Proceedings of the 17th Annual International
Symposium on Computer Architecture. ISCA ’90. Seattle, Washington, USA: Association for
Computing Machinery, 1990, pp. 364–373. isbn: 0897913663. doi: 10.1145/325164.325162.
[40] Thomas Kipf and Max Welling. “Semi-Supervised Classification with Graph Convolutional
Networks”. In: International Conference on Learning Representations. 2017.
[41] Yehuda Koren, Robert Bell, and Chris Volinsky. “Matrix Factorization Techniques for
Recommender Systems”. In: Computer 42.8 (2009), pp. 30–37. doi: 10.1109/MC.2009.263.
[42] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. “MeLU: Meta-Learned
User Preference Estimator for Cold-Start Recommendation”. In: CoRR abs/1908.00413 (2019).
arXiv: 1908.00413. url: http://arxiv.org/abs/1908.00413.
[43] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and
Alex Peysakhovich. “PyTorch-BigGraph: A Large-scale Graph Embedding System”. In: The
Conference on Systems and Machine Learning. 2019.
[44] Omer Levy and Yoav Goldberg. “Neural Word Embedding as Implicit Matrix Factorization”. In:
Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume
2. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 2177–2185.
[45] Omer Levy, Yoav Goldberg, and Ido Dagan. “Improving Distributional Similarity with Lessons
Learned from Word Embeddings”. In: Transactions of the Association for Computational
Linguistics. 2015.
[46] Pan Li, Maofei Que, Zhichao Jiang, Yao Hu, and Alexander Tuzhilin. “PURS: Personalized
Unexpected Recommender System for Improving User Satisfaction”. In: CoRR abs/2106.02771
(2021). arXiv: 2106.02771. url: https://arxiv.org/abs/2106.02771.
[47] G. Linden, B. Smith, and J. York. “Amazon.com recommendations: item-to-item collaborative
filtering”. In: IEEE Internet Computing 7.1 (2003), pp. 76–80. doi: 10.1109/MIC.2003.1167344.
[48] Huafeng Liu, Jingxuan Wen, Liping Jing, and Jian Yu. “Deep Generative Ranking for Personalized
Recommendation”. In: Proceedings of the 13th ACM Conference on Recommender Systems. RecSys
’19. Copenhagen, Denmark: Association for Computing Machinery, 2019, pp. 34–42. isbn:
9781450362436. doi: 10.1145/3298689.3347012.
[49] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. “Personalized News Recommendation Based on
Click Behavior”. In: Proceedings of the 15th International Conference on Intelligent User Interfaces.
IUI ’10. Hong Kong, China: Association for Computing Machinery, 2010, pp. 31–40. isbn:
9781605585154. doi: 10.1145/1719970.1719976.
[50] Yuanfu Lu, Yuan Fang, and Chuan Shi. “Meta-Learning on Heterogeneous Information Networks
for Cold-Start Recommendation”. In: Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. KDD ’20. Virtual Event, CA, USA:
Association for Computing Machinery, 2020, pp. 1563–1573. isbn: 9781450379984. doi:
10.1145/3394486.3403207.
[51] Mi Luo, Fei Chen, Pengxiang Cheng, Zhenhua Dong, Xiuqiang He, Jiashi Feng, and Zhenguo Li.
“MetaSelector: Meta-Learning for Recommendation with User-Level Adaptive Model Selection”.
In: CoRR abs/2001.10378 (2020).
[52] Elan Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija,
Bryan Perozzi, Greg Ver Steeg, and Aram Galstyan. “Graph Traversal with Tensor Functionals: A
Meta-Algorithm for Scalable Learning”. In: CoRR abs/2102.04350 (2021). arXiv: 2102.04350. url:
https://arxiv.org/abs/2102.04350.
[53] M. Marović, M. Mihoković, M. Mikša, S. Pribil, and A. Tus. “Automatic movie ratings prediction
using machine learning”. In: 2011 Proceedings of the 34th International Convention MIPRO. 2011,
pp. 1640–1645.
[54] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius,
David A. Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks,
Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim M. Hazelwood, Andrew Hock, Xinyuan Huang,
Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan,
Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie,
Tom St. John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. “MLPerf Training
Benchmark”. In: CoRR abs/1910.01500 (2019). arXiv: 1910.01500. url:
http://arxiv.org/abs/1910.01500.
[55] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. 2020. arXiv: 1802.03426 [stat.ML].
[56] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit,
and Sanjiv Kumar. “Long-tail learning via logit adjustment”. In: CoRR abs/2007.07314 (2020).
arXiv: 2007.07314. url: https://arxiv.org/abs/2007.07314.
[57] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
representations of words and phrases and their compositionality”. In: Advances in Neural
Information Processing Systems. 2013.
[58] Andriy Mnih and Russ R Salakhutdinov. “Probabilistic Matrix Factorization”. In: Advances in
Neural Information Processing Systems. Ed. by J. Platt, D. Koller, Y. Singer, and S. Roweis. Vol. 20.
Curran Associates, Inc., 2007. url:
https://proceedings.neurips.cc/paper_files/paper/2007/file/d7322ed717dedf1eb4e6e52a37ea7bcd-Paper.pdf.
[59] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu,
Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, et al. “High-performance, Distributed Training
of Large-scale Deep Learning Recommendation Models”. In: arXiv preprint arXiv:2104.05158
(2021).
[60] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu,
Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko,
Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli,
Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang,
Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala,
K. R. Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi,
Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah,
Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Lin Qiao,
Mikhail Smelyanskiy, Bill Jia, and Vijay Rao. “High-performance, Distributed Training of
Large-scale Deep Learning Recommendation Models”. In: CoRR abs/2104.05158 (2021). arXiv:
2104.05158. url: https://arxiv.org/abs/2104.05158.
[61] Maxim Naumov. “On the Dimensionality of Embeddings for Sparse Features and Data”. In: CoRR
abs/1901.02103 (2019). arXiv: 1901.02103. url: http://arxiv.org/abs/1901.02103.
[62] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang,
Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu,
Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu,
Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira,
Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. “Deep
Learning Recommendation Model for Personalization and Recommendation Systems”. In: CoRR
abs/1906.00091 (2019). arXiv: 1906.00091. url: http://arxiv.org/abs/1906.00091.
[63] Athanasios N. Nikolakopoulos, Dimitris Berberidis, George Karypis, and Georgios B. Giannakis.
“Personalized Diffusions for Top-n Recommendation”. In: Proceedings of the 13th ACM Conference
on Recommender Systems. RecSys ’19. Copenhagen, Denmark: Association for Computing
Machinery, 2019, pp. 260–268. isbn: 9781450362436. doi: 10.1145/3298689.3346985.
[64] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. “Deep content-based music
recommendation”. In: Advances in Neural Information Processing Systems. Ed. by C.J. Burges,
L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger. Vol. 26. Curran Associates, Inc., 2013.
url:
https://proceedings.neurips.cc/paper_files/paper/2013/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf.
[65] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf,
Edward Z. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library”. In: CoRR abs/1912.01703 (2019). arXiv: 1912.01703.
url: http://arxiv.org/abs/1912.01703.
[66] Changhua Pei, Yi Zhang, Yongfeng Zhang, Fei Sun, Xiao Lin, Hanxiao Sun, Jian Wu, Peng Jiang,
Wenwu Ou, and Dan Pei. “Personalized Context-aware Re-ranking for E-commerce
Recommender Systems”. In: CoRR abs/1904.06813 (2019). arXiv: 1904.06813. url:
http://arxiv.org/abs/1904.06813.
[67] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. “DeepWalk: Online Learning of Social
Representations”. In: ACM SIGKDD international conference on Knowledge discovery & Data
Mining. 2014.
[68] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. “Network Embedding
as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec”. In: ACM International
Conference on Web Search and Data Mining. 2018.
[69] Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. “GroupLens:
an open architecture for collaborative filtering of netnews”. In: Conference on Computer Supported
Cooperative Work. 1994.
[70] Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler.
“Virtualizing Deep Neural Networks for Memory-Efficient Neural Network Design”. In: CoRR
abs/1602.08124 (2016). arXiv: 1602.08124. url: http://arxiv.org/abs/1602.08124.
[71] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. “Restricted Boltzmann Machines for
Collaborative Filtering”. In: Proceedings of the 24th International Conference on Machine Learning.
ICML ’07. Corvalis, Oregon, USA: Association for Computing Machinery, 2007, pp. 791–798. isbn:
9781595937933. doi: 10.1145/1273496.1273596.
[72] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. “Item-Based Collaborative
Filtering Recommendation Algorithms”. In: Proceedings of the 10th International Conference on
World Wide Web. WWW ’01. Hong Kong, Hong Kong: Association for Computing Machinery,
2001, pp. 285–295. isbn: 1581133480. doi: 10.1145/371920.372071.
[73] Sebastian Schelter, Christoph Boden, Martin Schenck, Alexander Alexandrov, and Volker Markl.
“Distributed Matrix Factorization with Mapreduce Using a Series of Broadcast-Joins”. In:
Proceedings of the 7th ACM Conference on Recommender Systems. RecSys ’13. Hong Kong, China:
Association for Computing Machinery, 2013, pp. 281–284. isbn: 9781450324090. doi:
10.1145/2507157.2507195.
[74] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. “AutoRec: Autoencoders
Meet Collaborative Filtering”. In: Proceedings of the 24th International Conference on World Wide
Web. WWW ’15 Companion. Florence, Italy: Association for Computing Machinery, 2015,
pp. 111–112. isbn: 9781450334730. doi: 10.1145/2740908.2742726.
[75] Connor Shorten and Taghi M. Khoshgoftaar. “A survey on Image Data Augmentation for Deep
Learning”. In: Journal of Big Data. 2019.
[76] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
“Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: Journal of Machine
Learning Research 15.56 (2014), pp. 1929–1958. url:
http://jmlr.org/papers/v15/srivastava14a.html.
[77] Xuehan Sun, Tianyao Shi, Xiaofeng Gao, Yanrong Kang, and Guihai Chen. “FORM: Follow the
Online Regularized Meta-Leader for Cold-Start Recommendation”. In: Proceedings of the 44th
International ACM SIGIR Conference on Research and Development in Information Retrieval. New
York, NY, USA: Association for Computing Machinery, 2021, pp. 1177–1186. isbn: 9781450380379.
url: https://doi.org/10.1145/3404835.3462831.
[78] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need”. In: CoRR abs/1706.03762 (2017).
arXiv: 1706.03762. url: http://arxiv.org/abs/1706.03762.
[79] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and
Yoshua Bengio. “Graph Attention Networks”. In: International Conference on Learning
Representations. 2018.
[80] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and
Yoshua Bengio. Graph Attention Networks. 2018. arXiv: 1710.10903 [stat.ML].
[81] Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. “DropoutNet: Addressing Cold Start in
Recommender Systems”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon,
U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30.
Curran Associates, Inc., 2017. url:
https://proceedings.neurips.cc/paper/2017/file/dbd22ba3bd0df8f385bdac3e9f8be207-Paper.pdf.
[82] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. “Collaborative Deep Learning for Recommender
Systems”. In: CoRR abs/1409.2944 (2014). arXiv: 1409.2944. url: http://arxiv.org/abs/1409.2944.
[83] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo.
“Ripple Network: Propagating User Preferences on the Knowledge Graph for Recommender
Systems”. In: CoRR abs/1803.03467 (2018). arXiv: 1803.03467. url:
http://arxiv.org/abs/1803.03467.
[84] Jianling Wang, Kaize Ding, and James Caverlee. “Sequential Recommendation for Cold-Start
Users with Meta Transitional Learning”. In: Proceedings of the 44th International ACM SIGIR
Conference on Research and Development in Information Retrieval. SIGIR ’21. Virtual Event,
Canada: Association for Computing Machinery, 2021, pp. 1783–1787. isbn: 9781450380379. doi:
10.1145/3404835.3463089.
[85] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma,
Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. “Deep
Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks”. In:
arXiv preprint arXiv:1909.01315 (2019).
[86] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. “Neural Graph
Collaborative Filtering”. In: CoRR abs/1905.08108 (2019). arXiv: 1905.08108. url:
http://arxiv.org/abs/1905.08108.
[87] Y. Wang, S. C. Chan, and G. Ngai. “Applicability of Demographic Recommender System to Tourist
Attractions: A Case Study on Trip Advisor”. In: 2012 IEEE/WIC/ACM International Conferences on
Web Intelligence and Intelligent Agent Technology. Vol. 3. 2012, pp. 97–101.
[88] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. “HuggingFace’s
Transformers: State-of-the-art Natural Language Processing”. In: CoRR abs/1910.03771 (2019).
arXiv: 1910.03771. url: http://arxiv.org/abs/1910.03771.
[89] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger.
“Simplifying Graph Convolutional Networks”. In: International Conference on Machine Learning.
2019.
[90] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and
Stefanie Jegelka. “Representation Learning on Graphs with Jumping Knowledge Networks”. In:
International Conference on Machine Learning. 2018.
[91] Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Wang, Taibai Xu, and
Ed H. Chi. “Mixed Negative Sampling for Learning Two-tower Neural Networks in
Recommendations”. In: 2020.
[92] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. “Revisiting semi-supervised learning
with graph embeddings”. In: International Conference on Machine Learning. 2016.
[93] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Ajit Kumthekar,
Zhe Zhao, Li Wei, and Ed Chi, eds. Sampling-Bias-Corrected Neural Modeling for Large Corpus Item
Recommendations. 2019.
[94] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and
Jure Leskovec. “Graph Convolutional Neural Networks for Web-Scale Recommender Systems”.
In: CoRR abs/1806.01973 (2018). arXiv: 1806.01973. url: http://arxiv.org/abs/1806.01973.
[95] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna.
“GraphSAINT: Graph Sampling Based Inductive Learning Method”. In: International Conference
on Learning Representations. 2020.
[96] Yin Zhang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Lichan Hong, and Ed H. Chi. “A
Model of Two Tales: Dual Transfer Learning Framework for Improved Long-tail Item
Recommendation”. In: CoRR abs/2010.15982 (2020). arXiv: 2010.15982. url:
https://arxiv.org/abs/2010.15982.
[97] Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li.
“Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems”.
In: Proceedings of Machine Learning and Systems. Ed. by I. Dhillon, D. Papailiopoulos, and V. Sze.
Vol. 2. 2020, pp. 412–428. url:
https://proceedings.mlsys.org/paper/2020/file/f7e6c85504ce6e82442c770f7c8606f0-Paper.pdf.
[98] Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li.
“Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems”.
In: Proceedings of Machine Learning and Systems. Ed. by I. Dhillon, D. Papailiopoulos, and V. Sze.
Vol. 2. 2020, pp. 412–428. url:
https://proceedings.mlsys.org/paper/2020/file/f7e6c85504ce6e82442c770f7c8606f0-Paper.pdf.
[99] Weijie Zhao, Jingyuan Zhang, Deping Xie, Yulei Qian, Ronglai Jia, and Ping Li. “AIBox: CTR
prediction model training on a single node”. In: Proceedings of the 28th ACM International
Conference on Information and Knowledge Management. 2019, pp. 319–328.
[100] Xu Zhao, Yi Ren, Ying Du, Shenzheng Zhang, and Nian Wang. “Improving Item Cold-start
Recommendation via Model-agnostic Conditional Variational Autoencoder”. In: Proceedings of the
45th International ACM SIGIR Conference on Research and Development in Information Retrieval.
ACM, July 2022. doi: 10.1145/3477495.3531902.
[101] Yongchun Zhu, Ruobing Xie, Fuzhen Zhuang, Kaikai Ge, Ying Sun, Xu Zhang, Leyu Lin, and
Juan Cao. “Learning to Warm Up Cold Item Embeddings for Cold-start Recommendation with
Meta Scaling and Shifting Networks”. In: CoRR abs/2105.04790 (2021). arXiv: 2105.04790. url:
https://arxiv.org/abs/2105.04790.
[102] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. “Few-Shot
Representation Learning for Out-Of-Vocabulary Words”. In: Advances in Neural Information
Processing Systems. 2019.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Federated and distributed machine learning at scale: from systems to algorithms to applications
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Taming heterogeneity, the ubiquitous beast in cloud computing and decentralized learning
Fast and label-efficient graph representation learning
Efficient graph processing with graph semantics aware intelligent storage
Algorithm and system co-optimization of graph and machine learning systems
Building straggler-resilient and private machine learning systems in the cloud
Learning controllable data generation for scalable model training
Striking the balance: optimizing privacy, utility, and complexity in private machine learning
Adapting pre-trained representation towards downstream tasks
Enhancing privacy, security, and efficiency in federated learning: theoretical advances and algorithmic developments
Distributed resource management for QoS-aware service provision
Edge-cloud collaboration for enhanced artificial intelligence
Physics-aware graph networks for spatiotemporal physical systems
Hardware and software techniques for irregular parallelism
Data-driven optimization for indoor localization
Robust causal inference with machine learning on observational data
On information captured by neural networks: connections with memorization and generalization
Transforming unstructured historical and geographic data into spatio-temporal knowledge graphs
Asset Metadata
Creator: Balasubramanian, Keshav (author)
Core Title: Scaling recommendation models with data-aware architectures and hardware efficient implementations
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-05
Publication Date: 02/06/2024
Defense Date: 01/26/2024
Publisher: Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag: deep learning, graph representation learning, machine learning, OAI-PMH Harvest, recommendation models, system design
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Annavaram, Murali (committee chair), Galstyan, Aram (committee member), Golubchik, Leana (committee member)
Creator Email: keshavb96@gmail.com, keshavba@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113825945
Unique identifier: UC113825945
Identifier: etd-Balasubram-12658.pdf (filename)
Legacy Identifier: etd-Balasubram-12658
Document Type: Dissertation
Rights: Balasubramanian, Keshav
Internet Media Type: application/pdf
Type: texts
Source: 20240208-usctheses-batch-1125 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu