PARAMETRIC AND SEMI-PARAMETRIC METHODS FOR KNOWLEDGE ACQUISITION
FROM TEXT
by
Yury Zemlyanskiy
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2022
Copyright 2022 Yury Zemlyanskiy
Dedication
To my wife, Dr. Natalia Zemlianskaia.
Acknowledgements
I want to thank Fei Sha, who gave me this opportunity in the first place, for advising and mentoring
me throughout these years. Thank you for showing me what scientific research is. I appreciate his setting
high standards and constantly pushing for clarity in thinking and communication. I still remember how Fei
completely rewrote our first publication, turning a muddled technical report into a beautiful story. That
and many other examples are my constant inspiration.
I’m incredibly grateful to Leana Golubchik, who stepped up as my advisor during my last year in school
and offered her support and kindness during thesis proposal and defense time.
I appreciate the work of my thesis defense members, Meisam Razaviyayn and Robin Jia, and my thesis
proposal and qualification exam members, Jonathan May, Shri Narayanan, and Xiang Ren. Thank you for
taking the time to listen to my research and for your insightful discussion and feedback!
I want to thank my dear friend and colleague, Michiel De Jong. Truly, my Ph.D. has changed after we
started collaborating. All the long research discussions gave me a second breath and immensely inspired
me. I wish I could thank you for all the book recommendations, but on the other hand, they took much
time I could have spent researching.
Throughout my study, I was always supported by the fantastic staff at USC Viterbi School of Engi-
neering. I was fortunate to have Lizsl De Leon as the Director of Student Affairs of the Computer Science
Department. I always felt welcomed at her office and could ask for help and advice on any topic. I want
to thank our lab manager, Nina Shilling, for creating a cozy and welcoming lab atmosphere.
I was fortunate to meet many fantastic lab mates and colleagues at USC. I have fond memories of adven-
tures with Seb Arnold. I would also like to thank Melissa Ailem, Soravit (Beer) Changpinyo, Aaron Chan,
Wei-Lun (Harry) Chao, Liyu Chen, Chao-Kai Chiang, John Paul Francis, Jeremy Hsu, Hexiang (Frank) Hu,
Shariq Iqbal, Zhiyun Lu, Ivy Xiao, Yiming Yan, Bowen Zhang, Ke Zhang and Bill Zhu
for the supportive and welcoming atmosphere, insightful research discussions and game/karaoke nights.
I am grateful to my Google colleagues for inviting me for internships. I want to commemorate Ilya
Eckstein, my first Google mentor and good friend, who passed away in 2021. I was fortunate to work with
an incredible group of researchers, including Joshua Ainslie, William Cohen, Nicholas FitzGerald, Ruining
He, Sudeep Gandhe, Juraj Gottweis, Bhargav Kanagal, Panupong (Ice) Pasupat, Philip Pham, Linlu Qiu,
Anirudh Ravula, Sumit Sanghai, Peter Shaw, and others.
I’m thankful to my other collaborators and co-authors, Marius Kloft, Lukas Ruff, Thomas Schnake, and
Robert Vandermeulen.
Finally, none of that would be possible without my beloved wife, Natalia Zemlianskaia. You are my
inspiration as a model scientist – brilliant, passionate, and very precise. With you and your love, I could
look beyond daily struggles and see a bigger, brighter picture. I often ask myself how come I’m that lucky
to have you.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
I Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Organization and Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Relationship to Published Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
II Parametric entity knowledge representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: DocEnt: Learning Self-Supervised Entity Representations from Large Document
Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Self-Supervised Entity Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 DocEnt-Dual, Known as RELIC . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 DocEnt-Full . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 DocEnt-Hybrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1 Supervised Task: Movielens Tag Prediction . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Few-Shot Task: Open Vocabulary Tag Prediction . . . . . . . . . . . . . . . . . . . 14
2.3.3 Zero-Shot Task: Reddit Movie Suggestions . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Amazon Movie Reviews Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Reviews2Movielens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Reddit Movie Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Tag Prediction: Fine-tuning Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.3 Entity-less Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.4.1 Movielens Tag Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.4.2 Reddit Movie Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7 Conclusion & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
III Semi-parametric entity knowledge representation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Chapter 3: ReadTwice: Reading Very Large Documents with Memories . . . . . . . . . . . . . . . 25
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 ReadTwice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.2 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Pre-training setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.2 Evaluation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.3 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.4 Ablation Analysis & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Conclusion & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Chapter 4: Mention Memory: incorporating textual knowledge into Transformers through entity
mention attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Constructing mention memory from corpus . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1.1 Mention Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1.2 Mention memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 tome model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2.1 Attention over memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2.2.2 Sparse large-scale retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2.3 Mention encoder pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 tome pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.3 Claim verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.5 Qualitative properties of tome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IV Semi-parametric task-specific knowledge representation . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 5: Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing . 51
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.3 Ablations and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.4 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
V Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Chapter 7: Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1 Dense memory for the structured prediction . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.1.1 Method Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.1.2 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.1.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2 Dense memory: future directions & challenges . . . . . . . . . . . . . . . . . . . . . . . . . 66
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
A Appendix for DocEnt: Learning Self-Supervised Entity Representations from Large
Document Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
B Appendix for ReadTwice: Reading Very Large Documents with Memories . . . . . . . . . 83
B.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.2 Pre-training details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
B.3 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.4 Extractive QA layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.4.1 HotpotQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.4.2 TriviaQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
B.4.3 NarrativeQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
C Appendix for Mention Memory: incorporating textual knowledge into Transformers
through entity mention attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
C.1 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
C.1.1 Mention Encoder data generation . . . . . . . . . . . . . . . . . . . . . . 88
C.1.2 Coreference resolution loss . . . . . . . . . . . . . . . . . . . . . . . . . . 88
C.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.2.1 Fine-tuning setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.2.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.2.3 Claim verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.2.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
C.2.5 Importance of pre-training objectives . . . . . . . . . . . . . . . . . . . . 91
C.2.6 tome initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
C.3 Nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
C.3.1 On-device nearest neighbor search . . . . . . . . . . . . . . . . . . . . . 92
C.4 Retrieval examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
D Appendix for Generate-and-Retrieve: use your predictions to improve retrieval for
semantic parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.4 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 Reviews2Movielens task, illustrated. Here are sample review snippets for a certain classic film
which is summarized using MovieLens tags. Notice that the tags may not appear in the input
verbatim and can be thought of as boolean questions about the film. Note also that Review 3 has
zero relevant signal—a common challenge of low SNR in this dataset. Bonus teaser: can you guess
the $movie from these snippets? This little quiz alludes to a key learning task in our approach. . . 9
2.2 Qualitative examples illustrating zero-shot movie ranking by DocEnt-Full, with natural language
queries crawled from Reddit. The bracketed greyed-out movie mentions are users’ examples of
desired recommendations, removed from the queries to probe the model in what resembles a movie
guessing game. Those obfuscated entities were correctly guessed by the model based on remaining
query terms, making it to the Top 5 in most cases. Other top matches appear to be equally relevant. 15
2.3 Evaluation datasets sizes for Tag Prediction tasks. Closed / Open stand for the closed and open
vocabulary tasks, respectively; M-T Pairs shows the number of corresponding movie-tags pairs.
The top two rows describe movie holdout sets used in our closed vocabulary experiments; bottom
two rows showing tag holdouts for open vocabulary experiments. . . . . . . . . . . . . . . . . . 18
2.4 Mean Average Precision and ROC-AUC results on the closed-vocabulary tag prediction task.
TagGenome is the original baseline from MovieLens creators [91], trained on multiple additional
features and considered SOTA. Despite using fewer features, DocEnt matches TagGenome
performance on AUC and outperforms it on precision (MAP). . . . . . . . . . . . . . . . . . . . 20
2.5 Zero-shot results for DocEnt models vs several baselines on Reddit Movie Suggestions. MRR
stands for Mean Reciprocal Rank. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 This text is not visible and used to put a footnote . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Results on the NarrativeQA’s development / test splits. . . . . . . . . . . . . . . . . . . . . 31
3.3 Ablation studies on variants of ReadTwice on the dev sets. We report F1 (answer only)
score for HQA, ROUGE-L and BLEU-1 for NQA (denoted -R and -B respectively) and F1
for TQA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Accuracy on claim verification datasets. #Encoded refers to the number of passages
encoded by a BERT reader to answer a single question. . . . . . . . . . . . . . . . . . . . . 46
4.2 Accuracy on open-domain QA datasets TriviaQA (TQA), ComplexWebQuestions (CWQ)
and EntityQuestion (EQ). #Encoded refers to the number of passages encoded by a BERT
reader to answer a question. TQA_e-dev corresponds to TQA with train and dev samples
limited to those with Wikipedia entity as an answer. See Appendix C.2.3 for full results. . 47
4.3 tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for
the first (→1) memory attention layer for two passage mentions. Memory mentions are
in brackets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 tome-2 retrievals for the first HoVer dev sample. We show top-1 retrieval results for the
first (→1) and the second (→2) memory attention layers for passage mentions “Life
Goes On” and “Hungry”¹. Memory mentions are in brackets. The first retrieval for the
“Life Goes On” is a different song with the same name and the first retrieval for “Hungry”
is related but not useful. However, the second retrieval for “Life Goes On” identifies
the correct song and describes its position on the album while the second retrieval for
“Hungry” captures its position relative to “Life Goes On”. . . . . . . . . . . . . . . . . . . 48
4.5 Accuracy on held-out subset of TriviaQA and ComplexWebQuestions (CWQ) questions.
tome-1-unseen was pre-trained and fine-tuned with memory without entities from
held-out set and evaluated with full memory. Note that performance is considerably lower
than on the full dev set as answers in the held-out set (which are in dev but not train) are
more likely to be rare entities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1 Results on semantic parsing benchmarks. We report the percentage exact match between
true and predicted labels as sequences. Results are on test set for all benchmarks except
MTOP_boot, where we report on dev to remain comparable with CASPER. . . . . . . . . . 55
5.2 Performance on high-resource settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Template recall@K=4 on the development sets for MTOP_boot, TOPv2_W and TOPv2_R. . . . . 56
5.4 Input TF-IDF retrieves an exemplar with lexical overlap (‘musicals’) that is not relevant to
the sample. The GandR retrieval balances lexical and label similarity and leads to a correct
prediction. Single representative exemplar out of 4 displayed for each method. . . . . . . . 57
B.1 Masked Language Model (MLM) accuracy on the held out set. SS mode corresponds to the
case when memories are collected only from the segment itself, effectively disabling any
information propagation between segments. . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.2 Ablation studies on variants of ReadTwice. We report F1 (answer only) score for HQA,
ROUGE-L and BLEU-1 for NQA (-R and -B correspondingly) and F1 for TQA. . . . . . . . 84
B.3 Number of parameters per model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
B.4 Results on the TriviaQA dataset in Wikipedia setting. . . . . . . . . . . . . . . . . . . . . . 85
C.1 Accuracy on claim verification datasets. #Encoded refers to the number of passages
encoded by a BERT reader to answer a single question. EaE stands for Entities as Experts
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
C.2 Accuracy on FM2 compared with original dataset baselines. Oracle refers to oracle
retrieval followed by a BERT-Base reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
C.3 EntityQuestions recall@20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
C.4 EntityQuestions top-1 accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
C.5 Performance ablations for pre-training objectives experiments. . . . . . . . . . . . . . . . 91
C.6 Proportion of time spent on ANNS for tome-1 pre-training setting. . . . . . . . . . . . 93
C.7 tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for
the first (→1) memory attention layer for passage mentions “the novel”, “the movie” and
“the album”. Memory mentions are in brackets. We can see that the model can retrieve
relevant mentions for non-named passage mentions, and generally understands it is
looking for mentions related to music. However, while the best retrieval for “album” is
from a passage that mentions sampling the Shining, it is quite far removed and it is likely
the retrieval is not sufficiently accurate here. . . . . . . . . . . . . . . . . . . . . . . . . . 93
D.1 Dataset statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
D.2 Input TF-IDF predicts the sl:ordinal slot, which exists in pre-training domains but does
not apply to the Weather domain. GandR has high slot coverage, so if a slot exists, it will
likely be present in at least one of the retrieved exemplars. The fact that GandR does not
retrieve an exemplar with the sl:ordinal slot (as it is not present in the Weather training
exemplars) provides a hint for the model that it may be an invalid slot and GandR’s
updated prediction eliminates it. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
D.3 Normally, the sl:contact slot for the in:create_call intent is paired with a name. In
this case, however, the contact is myself. The model with input TF-IDF retrieval generates
the correct slot as it retrieves another instance with slot myself due to lexical similarity
in the input. In contrast, GandR retrieves exemplars with perfectly matching templates,
but without the same slot, such that it does not assign myself to the sl:contact slot in its
prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
List of Figures
2.1 Models in the DocEnt family. Left: a baseline dual encoder model called DocEnt-Dual a.k.a.
RELIC, maximizing P(e|s) but not P(s|e). Center: DocEnt-Full—a model maximizing the joint
sentence-entity probability using full cross-attention. Right: DocEnt-Hybrid, designed to capture
the best of both worlds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 ReadTwice model architecture. The input is processed twice, with a memory table for
inter-segment information sharing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Overview of Mention Memory. A pre-trained mention encoder is used to generate dense
representations for each entity mention in Wikipedia (approximately 150 million total)
which are stored in a table. The tome model takes a passage annotated with entity
mention boundaries as input, and applies a Transformer block. Next, the tome model
applies one or more TOMEBlocks. Each TOMEBlock contains a memory attention layer
and a Transformer block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Claim verification accuracy as a function of fine-tuning memory size (in millions). . . . . 49
5.1 Overview of GandR. First, GandR generates a preliminary prediction using an input
augmented with exemplars with similar inputs. Then, GandR retrieves exemplars based
on a relevance measure balancing input similarity and similarity between the preliminary
prediction and exemplar outputs, and generates a final prediction based on these exemplars. 53
5.2 Performance on MTOP_boot dev set as a function of output similarity weight α. . . . . . . . 56
7.1 Performance of a model augmented with TF-IDF retrieval on the low resource version of
TOPv2_S, where we only used 1% of training data (roughly, 800 samples). . . . . . . . . . 62
7.2 Performance of retrieval augmented models on anonymized TOPv2_S dataset vs the number
of retrieved exemplars. Both models are using input TF-IDF retrieval. Append tokens
model appends retrieved exemplars to the input, while Dense memory compresses
exemplars in the multiple dense vectors and integrates into T5 encoder via cross-attention
layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 Performance of retrieval augmented models on anonymized TOPv2_S with different training
set sizes. Both models retrieve K = 4 exemplars using input TF-IDF similarity.
Append tokens model appends retrieved exemplars to the input, while Dense memory
compresses exemplars in the multiple dense vectors and integrates into T5 encoder via
cross-attention layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Abstract
Knowledge acquisition, the process of extracting, processing, and storing new information, is critical to
any intelligent system. Nonetheless, modern neural networks (e.g., BERT) used in natural language pro-
cessing typically do not have an explicit memory component. Instead, the knowledge about the world that
the models acquire is stored implicitly in the model’s parameters. This proves unreliable and makes the
models ill-suited for knowledge-intensive tasks that require reasoning over vast amounts of textual data.
My thesis explores alternative parametric and semi-parametric methods to extract and represent knowl-
edge from text. The main hypothesis is that we can improve the performance of modern NLP models
by representing acquired knowledge in a dedicated memory. The models can access knowledge explic-
itly through interacting with the memory. The thesis consists of three parts: the first part focuses
on parametric memory for a pre-defined set of entities. The second part explores a semi-parametric ap-
proach to representing entity-centric knowledge in a long document or entire corpus. Finally, the last part
discusses memory for structured prediction tasks.
Part I
Background
Chapter 1
Introduction
1.1 Overview
Knowledge acquisition, the process of extracting, processing, and storing new information, is critical to any
intelligent system. The world is complex, fast-changing, and uncertain. An intelligent agent will likely face
many unseen situations, such that it cannot simply memorize a comprehensive recipe for actions. Instead,
the agent must combine knowledge of the world in different ways. To do so, it must extract information
from experience and store that information in a manner that is both useful and accessible for the agent.
Much knowledge about the world is stored in text. For example, English Wikipedia alone has 6.5M
articles on a wide array of entities and concepts. In addition, one can find domain-specific knowledge in
textbooks, clinical study protocols, legal records and more. This thesis explores knowledge acquisition
methods from language so that intelligent systems can benefit from these resources.
Let us consider a concrete example of the use of information from a text corpus. Many essential ap-
plications, such as intelligent virtual assistants and modern search engines, contain a question-answering
component. Which knowledge acquisition steps does the system need to perform to answer questions
like “What is the nationality of the hero who killed Medusa?”. First, the system must process information
in a text corpus like Wikipedia and extract relevant facts about entities. Second, these facts need to be
represented and indexed for easy and efficient access. The goal is that, given a question, the system can
then locate facts pertinent to the question (Who killed Medusa? and What’s Perseus’s country of origin?)
and synthesize the answer.
Language models such as BERT and others [21, 57, 73, 51, 8] are current state-of-the-art methods for
many NLP applications including question answering, text classification, semantic parsing, and more. The
key to their success is twofold. First, these models are built on top of the powerful Transformer architecture
[89, 87]. Second, the approaches use the transfer learning paradigm. The idea is to train the models on a
large text corpus such as Wikipedia or CommonCrawl using an objective akin to language modeling. Then,
one uses learnt parameters as an initialization point before fine-tuning the model on tasks of interest.
However, one drawback of the above language models [21, 57, 73, 51] is the lack of an explicit memory com-
ponent. Instead, models represent acquired knowledge from text during training implicitly, in model pa-
rameters. Recent research [9, 24, 93] shows that modern neural networks unreliably memorize factual
information, making them unsuitable for knowledge-intensive tasks. For example, BERT pre-trained on
Wikipedia struggles to remember facts about entities from the corpus. The model answers the question
“Where the hero who killed Medusa was from” with “England”¹ instead of “Greece”.
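Footnote 1 describes the exact probe (a query against the AllenNLP masked-LM demo). As an illustration only, the short sketch below runs an equivalent masked-LM query through the Hugging Face transformers fill-mask pipeline; the library choice is an assumption made for the example, not the setup actually used.

# Hedged sketch: probing BERT's factual recall with a masked-LM query.
# The probe in footnote 1 used the AllenNLP online demo; the Hugging Face
# `transformers` pipeline below is an illustrative stand-in.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
prompt = "The hero who killed Medusa was from country [MASK]."
for prediction in fill_mask(prompt, top_k=5):
    # Each prediction is a dict with the filled token and its probability.
    print(f"{prediction['token_str']:>12s}  {prediction['score']:.3f}")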
Can we do better? The main hypothesis of my thesis is that we can improve performance of modern
NLP models by representing acquired knowledge in a dedicated memory. The models can access knowl-
edge explicitly through interacting with the memory. We outline the desiderata of such a memory.
Comprehensive. The memory is expected to reliably capture knowledge from the corpus without for-
getting useful information.
Amendable. The memory should be amendable and adapt to new information, such as a different do-
main, or an evolving state of the world. For standard neural models, expensive re-training is required to
incorporate new information. Many older popular models are starting to become outdated: for example,
BERT [21] and RoBERTa [57] have no knowledge of the COVID-19 pandemic because they were trained
in 2018 and 2019, respectively.
Searchable. Memory is organized such that the model can locate relevant information. For example,
in Part IV, we show that the quality of the retrieval procedure matters for retrieval augmented models
[34, 46, 65, 94, 33] and propose ways to improve it.
Usable. Memory represents knowledge in such a way that the model can extract relevant information.
For example, a memory we develop in Part III compresses facts into a dense vector format. One challenge
is to ensure that the model can recover relevant knowledge compressed in those vectors.
1 We used the BERT-base model available online at https://demo.allennlp.org/masked-lm and queried it using the text “The
hero who killed Medusa was from country [MASK]”.
Interpretable & Grounded. Memory grounded in a text corpus can help explain model predictions.
Interpretable models allow humans to judge whether input evidence or reasoning was flawed. A way to
provide such interpretations is by tracking which individual memories have been used by the model. It is
even more informative if those memories are grounded and can be traced back to passages in a text corpus.
Efficient. Memory should be computationally efficient to use. For example, retrieval augmented models
[34, 46, 65, 94, 33] store an entire corpus or training set as a memory. They retrieve and append relevant
passages for any new input. This makes memory access expensive since retrieved passages need to be
re-processed by these models. In contrast, the memory we develop in Part III is in dense vector format
containing pre-processed facts, which is much cheaper to integrate into neural models.
The next section describes the organization of my work on developing memory components that satisfy
the above properties.
1.2 Thesis Organization and Contribution
The thesis is organized as follows: the first part focuses on parametric memory for a pre-defined set of
entities. The second part explores a semi-parametric approach to representing entity-centric knowledge in
a long document or entire corpus. Finally, the last part discusses memory for structured prediction tasks.
Much human knowledge that is stored in text is inherently entity-centric. With this in mind, we ap-
proach Part II with the following motivation. Given a large and noisy collection of documents about an
entity (e.g., movie reviews), can we distill all the valuable information therein into a dense entity embed-
ding? Subsequently, we want to use these embeddings as a knowledge base in downstream tasks such as
vertical search and attribute prediction. We carry out experiments and evaluate several methods to learn
such embeddings for a pre-defined set of entities in Chapter 2. Note that this approach cannot adapt when
data distribution shifts (e.g., a new corpus with different entities) due to reliance on parametric represen-
tations.
In Part III we propose a semi-parametric method to extract and store knowledge from text using a
"mention-based memory." Every time an entity is mentioned in the text, its property or relation to other
entities is described. I provide several strategies to capture such information in a high-dimensional vector
using a parametric Mention Encoder model. Then the memory is a collection of these vectors for all
mentions in the text. I use this memory to improve performance for various challenging knowledge-
intensive tasks.
Chapter 3 employs memory to reason over long documents such as books or a collection of articles,
which is a challenge for modern language models. The main idea is to have texts read in small segments in
parallel, summarizing entity-related information from each segment into the memory. Then, the segments
are read a second time, with access to global entity information provided in the memory table.
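A heavily simplified sketch of this two-pass scheme is given below (in PyTorch); the encoder layers, dimensions, and the mean-pooled per-segment summary are illustrative assumptions that stand in for the actual ReadTwice architecture, which summarizes entity mentions rather than whole segments.

# Hedged sketch of a two-pass reader with a shared memory table.
# Mean pooling stands in for gathering entity-mention summaries.
import torch
import torch.nn as nn

class TwoPassReader(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.first_pass = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.memory_attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.second_pass = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, segments):
        # segments: (num_segments, segment_len, dim) token embeddings.
        encoded = self.first_pass(segments)                 # pass 1: segments in parallel
        memory = encoded.mean(dim=1)                        # one summary vector per segment
        memory = memory.unsqueeze(0).expand(segments.shape[0], -1, -1)  # shared global table
        # Pass 2: re-read each segment while attending over the global memory.
        attended, _ = self.memory_attention(encoded, memory, memory)
        return self.second_pass(encoded + attended)

reader = TwoPassReader()
print(reader(torch.randn(8, 128, 256)).shape)               # torch.Size([8, 128, 256])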
Chapter 4 focuses on open-domain question answering tasks that require retrieving and assimilating
factual information from multiple sources. The goal is to use memory to reliably incorporate information
from across a text corpus into a language model. The corresponding paper [20] is a joint first author
contribution with Michiel de Jong. My contribution is developing a memory of 150 million Wikipedia
mentions that contain knowledge of the entire corpus. While incorporating memory into a Transformer
is a challenging task, my focus is on the memory quality and how it affects the downstream performance.
In Part IV, we explore a memory tailored for downstream tasks. Instead of focusing on entity-centric
information, we explore the use of memory to address challenging tasks where the number of training
samples is small relative to possible input-output space. One example is a broad category of structured
prediction tasks, which includes semantic parsing.
In practice, training data is commonly limited due to the expense of data collection. The rapid rise of
virtual assistants and chatbot interfaces requires developing semantic parsing models for novel tasks. This
makes studying structured prediction in low-resource settings especially crucial.
A common approach to structured prediction problems is to treat the gold structure as a sequence and
then fine-tune a sequence-to-sequence model such as T5 [73] or BART [51]. Existing works have found
that one can improve performance by augmenting the model with the training set as an external memory.
The approach is commonly called retrieval augmented models [34, 65, 94, 33]. The idea is to find related
training samples, denoted exemplars, and append them to the sample input. In theory, all information from
exemplars is available to the model during training and could be stored in model parameters. In practice,
however, the model may not successfully retain all information, especially in the low-resource scenario.
Reminding the model of some data patterns at test time appears to help.
In Chapter 5 we propose a way to retrieve more "informative" exemplars that help produce the correct
parse. Existing retrieval is commonly based on the similarity of query and exemplar inputs. We propose
GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar. GandR first
generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars with outputs
similar to the preliminary prediction, which are used to generate a final prediction. GandR sets state of
the art on multiple low-resource semantic parsing benchmarks.
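The sketch below spells out the GandR loop on a toy example; the TF-IDF scoring uses scikit-learn, while generate is a hypothetical placeholder for the fine-tuned sequence-to-sequence parser, so treat it as a schematic of the procedure rather than the actual implementation.

# Hedged sketch of GandR: generate a preliminary parse with input-similar
# exemplars, then re-retrieve exemplars whose outputs resemble that parse.
# `generate` is a hypothetical stand-in for a T5/BART-style parser; the
# training pairs are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_inputs = ["play some jazz", "wake me at 7 am", "call my mom"]
train_outputs = ["[IN:PLAY_MUSIC [SL:GENRE jazz]]",
                 "[IN:CREATE_ALARM [SL:TIME 7 am]]",
                 "[IN:CREATE_CALL [SL:CONTACT mom]]"]

def generate(query, exemplars):
    return exemplars[0][1]        # placeholder for the seq2seq model

def topk(scores, k):
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def gandr_predict(query, k=2, alpha=0.5):
    # Stage 1: retrieve by input similarity and make a preliminary prediction.
    in_vec = TfidfVectorizer().fit(train_inputs + [query])
    s_in = cosine_similarity(in_vec.transform([query]),
                             in_vec.transform(train_inputs))[0]
    exemplars = [(train_inputs[i], train_outputs[i]) for i in topk(s_in, k)]
    preliminary = generate(query, exemplars)
    # Stage 2: blend input similarity with similarity of exemplar outputs
    # to the preliminary prediction (weight alpha), then generate again.
    out_vec = TfidfVectorizer().fit(train_outputs + [preliminary])
    s_out = cosine_similarity(out_vec.transform([preliminary]),
                              out_vec.transform(train_outputs))[0]
    blended = (1 - alpha) * s_in + alpha * s_out
    exemplars = [(train_inputs[i], train_outputs[i]) for i in topk(blended, k)]
    return generate(query, exemplars)

print(gandr_predict("please play country music"))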
Finally, in Chapter 6 we summarize our contributions and discuss promising future research di-
rections.
1.3 Relationship to Published Work
Chapter 2 Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj
Gottweis, Fei Sha, and Ilya Eckstein. DOCENT: learning self-supervised entity representations from large
document collections. In EACL, 2021
Chapter 3 Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha.
Readtwice: Reading very large documents with memories. In NAACL-HLT, 2021
Chapter 4 Michiel de Jong*, Yury Zemlyanskiy*, Nicholas FitzGerald, Fei Sha, and William Cohen. Men-
tion memory: incorporating textual knowledge into transformers through entity mention attention. In
ICLR, 2021
Chapter 5 This chapter corresponds to work currently under submission to COLING, 2022.
Other Works The following works are outside the scope of this dissertation but were published during
its preparation:
Yury Zemlyanskiy and Fei Sha. Aiming to Know You Better Perhaps Makes Me a More Engaging
Dialogue Partner. In CoNLL, 2018
Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, Marius Kloft. Self-attentive,
multi-context one-class classification for unsupervised anomaly detection on text. In ACL, 2019
Part II
Parametric entity knowledge representation
Chapter 2
DocEnt: Learning Self-Supervised Entity Representations from Large
Document Collections
In memory of Ilya Eckstein.
This paper explores learning rich self-supervised entity representations from large amounts of associ-
ated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked
retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest
self-supervision signals based merely on a local context within a sentence, we radically expand the notion
of context to include any available text related to an entity. This enables a new class of powerful, high-
capacity representations that can ultimately distill much of the useful information about an entity from
multiple text sources, without any human supervision.
We present several training strategies that, unlike prior approaches, learn to jointly predict words and
entities—strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as
MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results,
our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and can
scale to very large corpora.
Finally, we make our datasets and pre-trained models publicly available¹. This includes Reviews2Movielens,
mapping the ∼1B word corpus of Amazon movie reviews [36] to MovieLens tags [35], as well as Reddit
Movie Suggestions with natural language queries and corresponding community recommendations.

1 See http://goo.gle/research-docent for Reviews2Movielens and models. Scripts and Reddit Suggestions can be found
at https://urikz.github.io/docent
Review 1: “This movie develops its power best if you don’t try to look out for the “real” and “true” events
behind the four versions of the narration... shown in a very intelligent and artistic way, no silly plot-twists, no
explanation in the end — it is open to your fantasy... “$movie” is an important piece of cinematic storytelling
and a really interesting way to reflect on the origin of tales... Some scenes even remind me of Andrej Tarkovskijs
intensive style..”.
Review 2: “Just rented this, and at first I didn’t like very much, but then it starts to sink in for how good it is,
the acting is great especially Toshiro Mifune, it was shot very good for an older movie... it’s #62 on the top 250”
Review 3: “Saw this movie at my local video store... was placed on a waiting list, but when I returned to check
it out the video store had closed down over night. Actually whent out of business”
... More reviews ...
Summary tags: [nonlinear] [multiple storylines] [japan] [black and white] [surreal] [cerebral] [imdb top
250], ...
Table 2.1: Reviews2Movielens task, illustrated. Here are sample review snippets for a certain classic film which is
summarized using MovieLens tags. Notice that the tags may not appear in the input verbatim and can be thought of
as boolean questions about the film. Note also that Review 3 has zero relevant signal—a common challenge of low
SNR in this dataset. Bonus teaser: can you guess the $movie from these snippets? This little quiz alludes to a key
learning task in our approach.
2.1 Introduction
Much of the online information describing entities in domains such as music, movies, venues or con-
sumer products, is only available as unstructured text—a format that is human-readable but not machine-
understandable (yet). Consider online reviews—a rich source of mostly user-generated content about a vast number
of entities. Our key research question is: Can we learn strong models for entity understanding tasks such
as vertical search and question answering, solely from text? In other words, given a large and noisy col-
lection of documents about an entity, can we distill all the useful information therein into a dense entity
representation, so as to benefit multiple downstream tasks?
Traditionally, learning entity representations required supervised signals such as clicks, “likes" and
consumption behavior [2, 38, 49, 92], which are generally expensive and time consuming to obtain at
scale. To leapfrog these limitations, we draw inspiration from the recent progress in unsupervised learn-
ing of text, particularly contextualized representations via techniques such as ELMo [68], CoVe [61] and
BERT [21]. Many of these representations are learned by predicting a missing word from its context. More
recently, [84] showed that extending word masking strategies to entities can lead to superior language
models. Even more recent entity linking methods such as RELIC [54] and others, detailed in Section 2.6,
were shown to produce explicit encodings applicable to entity understanding tasks.
We start with RELIC-like approaches and generalize them into a family of models, collectively called
DocEnt, that jointly embed text and entities (Section 2.2) via self-supervised tasks. The first one, DocEnt-
Dual, is essentially RELIC, but trained with a much broader context to include any and all sentences
potentially related to an entity. Importantly, DocEnt-Dual/RELIC only optimizes a single task, namely
entity prediction given an associated sentence, effectively modeling P(Entity|Sentence).
Another natural way of jointly modelling entities and text is by directly tapping the cross-attention
mechanism in BERT, simply by extending the BERT vocabulary to include entity tokens V_E. Each entity-
related sentence can then be augmented with a corresponding token from V_E. We call this method
DocEnt-Full and, despite (or perhaps because of) its conceptual simplicity, it proves surprisingly effective
in semi-supervised tasks.
Finally, DocEnt-Hybrid aims to capture the best of both models by extending DocEnt-Dual with an
additional task of predicting words in a sentence, conditioned on its associated entity. This task encourages
the latter to “remember” salient phrases in its sentences.
We empirically evaluate these methods by learning entity representations for movies from a TV-Movies
portion of the Amazon Reviews Corpus [36]. To this end, we consider several movie-oriented tasks for
downstream evaluation, i.e. Reddit Movie Suggestions and MovieLens Tag Prediction [35], which we study
in both zero-shot, supervised and few-shot settings. We join the MovieLens dataset with the reviews cor-
pus [36] obtaining a mapping from movie reviews to user-generated tags. On the supervised tag prediction
task, our text-based model demonstrates SOTA performance, despite not using powerful user signals [92].
In fact, we are able to match or outperform baselines on all tasks where they are available.
2.1.1 Contributions
1. First, we propose a family of methods to train deep self-supervised entity representations purely
from related text documents, with strong zero-shot results on ranked retrieval with natural language
queries.
2. Secondly, we show that these pre-trained representations are amenable to fine-tuning on new tasks
such as MovieLens tag prediction, where we show state-of-the-art results. They are also effective
few-shot learners, which we demonstrate on a harder open-vocabulary² task akin to Boolean Question
Answering [17].
3. Next, we propose Reviews2Movielens—a new Text Based Entity Understanding task. The requisite
dataset, which we release publicly, effectively joins the Amazon Movie Reviews Corpus and MovieLens
into a large, sparsely supervised set with approximately 1B words and 470K movie-tag pairs.
4. Finally, we also release a dataset of user-generated Reddit Movie Suggestions, a benchmark for natural
language search and recommendation scenarios.

Figure 2.1: Models in the DocEnt family. Left: a baseline dual encoder model called DocEnt-Dual a.k.a. RELIC,
maximizing P(e|s) but not P(s|e). Center: DocEnt-Full—a model maximizing the joint sentence-entity probability
using full cross-attention. Right: DocEnt-Hybrid, designed to capture the best of both worlds.
2.2 Self-Supervised Entity Representations
Inspired by the success of self-supervised language models, we seek to extend them to jointly compute
text and entity representations. Recall that our input is a set of entities E where for every entity e ∈ E,
we have a collection of sentences, denoted by S_e, from all documents related to e. Intuitively, we want the
representation of e to be influenced by each associated sentence s ∈ S_e, and vice versa. To that end, we
explore two (self-)supervision signals: P(e|s) and P(s|e).

2 An open vocabulary allows any phrase to be a label.
2.2.1 DocEnt-Dual, Known as RELIC
At the core of DocEnt-Dual is a RELIC model that co-encodes an entity e and an associated sentence
s ∈ S_e so as to maximize their compatibility score, defined as the cosine similarity between the two
encodings:

s(e, s) = \frac{g(e)^T f_{CLS}(s)}{\|g(e)\| \, \|f_{CLS}(s)\|},

where g(e) is an embedding of e and f(s) is a BERT-based encoding of s, with its special [CLS] token
whose output representation is denoted by f_{CLS}. Then, the conditional probability of e given s is given
by a softmax over the set E³:

P(e|s) = \frac{\exp(s(e, s))}{\sum_{e' \in E} \exp(s(e', s))}.

Finally, RELIC is trained by maximizing log P(e|s) over all associated pairs e, s ∈ S_e:

L_E(e, s) = \log P(e|s).

Note that both g and f (initialized with a common BERT) are learned during training.
Our sole difference to the original RELIC is in training data: while RELIC only uses sentences contain-
ing entity mentions, we allow a radically broader context – all sentences associated with an entity – with
the goal of remembering all of its attributes. Crucially, no human labeling is required.
Despite its effectiveness (as demonstrated in Section 2.5), RELIC has one obvious limitation: it ignores
P(s|e), leaving a useful signal “on the table”. We therefore propose another way of co-encoding sentences
and entities by tapping the full cross-attention power of Transformers.

3 In practice, only a subset of entities in E is used in the denominator: the so called “in-batch negatives”.

2.2.2 DocEnt-Full
Before we proceed, let us revisit BERT’s Masked Language Model (MLM) training objective. Given a
sequence of input tokens s = [s_1, ..., s_n], a fraction of tokens s_J at randomly selected positions J is
replaced with a special [MASK] token. We denote this new sequence by s_{-J}.
Then, BERT predicts masked tokens based on their contextualized representations f(s_{-J}). The MLM
training objective to maximize is:

L_{MLM} = \log P(s_J | s_{-J}).

Enter DocEnt-Full. It follows the standard BERT architecture, with a twist. First, we expand the input
vocabulary to include all entity tokens in E. Then, during input sequence construction, each sentence
s ∈ S_e is prepended⁴ with the corresponding entity token e, as shown in Figure 2.1. This way, masking
and predicting this token (via softmax) effectively adds our new objective L_E to BERT. Further, the new e
token is now part of a sentence context, augmenting the original L_{MLM} to

L_{MLM+E}(s, e) = \log P(s_J | s_{-J}, e),

and

L_{Full} = L_E + \lambda \, L_{MLM+E}

becomes the combined loss function optimized using nothing but BERT’s standard MLM training, with a
hyperparameter λ to balance the two terms⁵.
This conceptual simplicity and full cross-attention power come with a cost: bundling wordpieces and
entities together forces the model to allocate an equal capacity to both types of tokens (e.g., 768D for
BERT-base), regardless of the size of E. As a result, a relatively small-sized E may be prone to overfitting⁶
in zero-shot scenarios, as we observe in Section 2.5.4.2.
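To make the dual-encoder objective of Section 2.2.1 concrete, here is a minimal PyTorch sketch of the DocEnt-Dual scoring and loss: cosine compatibility s(e, s) followed by a softmax over in-batch negatives. The tiny stand-in encoders and dimensions are illustrative assumptions, not the actual DocEnt training code (which uses a BERT [CLS] encoder for f).

# Hedged sketch of the DocEnt-Dual / RELIC objective: maximize log P(e|s)
# with a softmax over in-batch negatives. Encoders are simplified stubs.
import torch
import torch.nn.functional as F

num_entities, dim, batch = 1000, 64, 8
entity_emb = torch.nn.Embedding(num_entities, dim)      # g(e)
sentence_encoder = torch.nn.Linear(300, dim)            # stand-in for f_CLS(s)

def in_batch_loss(entity_ids, sentence_features):
    g = F.normalize(entity_emb(entity_ids), dim=-1)                  # (B, D)
    f = F.normalize(sentence_encoder(sentence_features), dim=-1)     # (B, D)
    scores = f @ g.t()                       # cosine s(e, s) for every pair in the batch
    targets = torch.arange(len(entity_ids))  # the matching entity is the positive class
    return F.cross_entropy(scores, targets)  # -log P(e|s) with in-batch negatives

loss = in_batch_loss(torch.randint(num_entities, (batch,)),
                     torch.randn(batch, 300))
loss.backward()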
2.2.3 DocEnt-Hybrid
Recall that RELIC avoids the above limitation by decoupling text and entity encoders. To get the best of
both worlds, we introduce DocEnt-Hybrid—a third model that sticks with the modular dual encoder ar-
chitecture while also modeling P(s|e). This is achieved by implementing a different variant of L_{MLM+E}
where, for every masked wordpiece token, the output of Transformer layers f(s_{-J}) is first concatenated
with the associated entity embedding g(e) before feeding into the final MLM prediction layer. By including
entity embeddings in the prediction of related text tokens, we get them to “remember” important aspects
from the text without sacrificing modularity.

4 Technically, we replace BERT’s standard (s_A, s_B) two-segment input structure with (e, s), for s ∈ S_e.
5 The relative masking frequency of entity tokens is another hyperparameter available to balance the two objectives.
6 Conversely, a very large E may require an optimized implementation of softmax to maintain scalability.
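As an illustration of the DocEnt-Full input construction from Section 2.2.2 (one new vocabulary token per entity, prepended to each associated sentence), here is a hedged sketch with the Hugging Face transformers tokenizer; the entity-token naming scheme and the example sentence are assumptions made for the example.

# Hedged sketch: extending the BERT vocabulary with entity tokens and
# prepending them to associated sentences, as in DocEnt-Full. Masking the
# prepended token during standard MLM training yields the L_E objective.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

entity_tokens = ["[ENT_tt0050613]", "[ENT_tt0133093]"]   # illustrative: one token per entity
tokenizer.add_tokens(entity_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))            # rows for the new entity tokens

text = "[ENT_tt0050613] shown in a very intelligent and artistic way"
inputs = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))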
2.3 Tasks
In this section, we define the three tasks used to evaluate pre-trained entity representations.
2.3.1 Supervised Task: Movielens Tag Prediction
The original MovieLens Tag Prediction task is to produce movie-tag scores for a set of movies and a canon-
ical vocabulary of tags (see examples in Table 2.1), based on a collection of crowdsourced (movie, tag, user)
votes, as well as (user, movie) star ratings. These tags are often not factual but may refer to plot elements,
qualitative aspects or reflect subjective opinions. The same can be said about user reviews, and indeed we
observe a non-trivial amount of textual entailment between the two sources. We therefore intentionally
exclude user ratings from the input. The new challenge is to complete the movie-tag relevance matrix
by leveraging movie reviews, hereafter referred to as the closed-vocabulary tag prediction task⁷. This is a
supervised setup where models are fine-tuned with tag labels and evaluated on a held-out subset of
movies, as elaborated in Section 2.5.
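Read as multi-label classification over the tag vocabulary, the fine-tuning step admits a very small sketch; the linear head, dimensions, and optimizer below are illustrative assumptions on top of a pre-trained entity embedding, not the exact fine-tuning recipe used in Section 2.5.

# Hedged sketch: closed-vocabulary tag prediction as multi-label
# classification on top of pre-trained entity embeddings g(e).
import torch

num_tags, dim = 1000, 768                  # illustrative sizes
tag_head = torch.nn.Linear(dim, num_tags)  # one logit per tag in the vocabulary
loss_fn = torch.nn.BCEWithLogitsLoss()     # independent per-tag relevance decisions

def fine_tune_step(entity_embeddings, tag_labels, optimizer):
    # entity_embeddings: (batch, dim) entity vectors; tag_labels: (batch, num_tags) in {0, 1}
    logits = tag_head(entity_embeddings)
    loss = loss_fn(logits, tag_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(tag_head.parameters(), lr=1e-3)
print(fine_tune_step(torch.randn(4, dim), torch.randint(0, 2, (4, num_tags)), optimizer))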
2.3.2 Few-Shot Task: Open Vocabulary Tag Prediction
In reality, the space of tags is not static. Rather, tags are a useful kind of user-generated content that evolves
to reflect the zeitgeist, much like human language. Many online platforms (e.g., Twitter and Instagram, to
name a few) have vibrant online communities that keep inventing new tags. We therefore propose a new
open-vocabulary formulation of the tag prediction problem where any phrase is allowed to be a tag.
This requires a small change in evaluation. Instead of held-out movies, we hold out a subset of tags and
fine-tune on the rest (and on all the movies). Note that this is no longer a classic multi-label classification
task as we never get to see the test labels during training. Rather, this open-vocabulary setup is akin to
answering boolean questions (about a movie) based on a text document [17].
7 One can also view this as a two-dimensional knowledge base (KB) completion problem, where relation types are not available
and the KB is reduced to a 2D matrix.
Query: Movies like [Whiplash] about an artist or a musician chasing an almost impossible dream and
nearly or does ruin his life because of it
Top 5 Results: Inside Llewyn Davis, Whiplash, A Young Man with a Horn, Hustle & Flow, Born to Be Blue

Query: Really dark, slow paced movies with minimal story, but incredible atmosphere, kinda like [Drive]
or [The Rover]
Top 5 Results: The Rover, Valhalla Rising, Only God Forgives, Blade Runner, Sicario

Query: Films like [Mission Impossible] or [The Italian Job] that have big scenes where the characters
must break in or infiltrate some place
Top 5 Results: National Treasure: Book of Secrets, Mission: Impossible – Rogue Nation, Ant-Man, The
Italian Job
Table 2.2: Qualitative examples illustrating zero-shot movie ranking by DocEnt-Full, with natural language
queries crawled from Reddit. The bracketed greyed-out movie mentions are users’ examples of desired recommen-
dations, removed from the queries to probe the model in what resembles a movie guessing game. Those obfuscated
entities were correctly guessed by the model based on remaining query terms, making it to the Top 5 in most cases.
Other top matches appear to be equally relevant.
2.3.3 Zero-Shot Task: Reddit Movie Suggestions
The purpose of this task is to evaluate pre-trained entity representations in the context of vertical search.
The classic entity ranking problem is, given a text query and a finite set of entities, to rank them according
to their relevance to the query. Recall that DocEnt models are naturally designed to make such relevance
predictions via P(Entity|Sentence) — without any fine-tuning, if necessary. We therefore leverage the
Reddit Movie Suggestions Dataset (detailed in Section 2.4.3) as a source of both queries and ground truth
to define a zero-shot movie ranking task. To clarify, the notion of zero shot implies a pre-trained but not
fine-tuned model in our context. This dataset is particularly interesting for its challenging queries, with
their distinctly natural, often conversational language (e.g., “Last week I watched the British cold war movie
Threads. I am scarred, but intrigued as well. Any similar deeply disturbing yet realistic movies you can recommend?”,
see Table 2.2 for more examples). Another challenge is an explicit recommendation intent present in many
of the queries (i.e., “Movies like ...”), making this task a mixture of Search and Recommendation. The latter
typically requires specialized recommendation models of entity-to-entity similarity, and cannot generally
be solved with keyword-based search.
2.4 Datasets
2.4.1 Amazon Movie Reviews Corpus
All our models are pretrained on Amazon Product Reviews [36] in the "Movies and TV" category, comprising 4,607,047 reviews for 208,321 movies collected during 1996–2014^8.
2.4.2 Reviews2Movielens
One of this paper's contributions is Reviews2Movielens — a new multi-document multi-label dataset created by joining Amazon Movie Reviews [36, 64] and MovieLens [35], a rich source of crowdsourced movie tags. The key challenge in joining the two datasets is establishing correspondences between their respective movie IDs, which turns out to be a many-to-one mapping^9. We have identified a subset of high-precision many-to-one correspondences by applying Named Entity Recognition techniques^10 to both Amazon product titles (incl. release years) and their product pages. The resulting mapping consists of 71,077 unique Amazon IDs and 28,918 unique MovieLens IDs. The mapping accuracy was manually verified to be 97% based on 200 random samples. Ultimately, the joined dataset contains nearly 2 million reviews and close to 1B words, significantly more than its IMDB counterpart [60].
Since both datasets are widely used as a source of data and academic benchmarks [62, 45, 4, 36, 64], we hope that this new mapping^11 will be useful to the community.
2.4.3 Reddit Movie Suggestions
This user-generated dataset contains a collection of 4765 movie-seeking queries and corresponding recommendations, collectively curated and voted on by the Reddit Movie Suggestions community^12. Worth noting are (a) the conversational, human-to-human language of the queries; (b) the community-recommended movies that, while sparse and possibly biased, can be used as a source of ground truth. While modest in
^8 We've used the 2016 version of the dataset from http://jmcauley.ucsd.edu/data/amazon.
^9 Each Amazon ID (ASIN) matches a canonical product URL, e.g., https://www.amazon.com/dp/B06XGG4FFD. However, these IDs correspond to specific product editions (typically DVDs) rather than unique titles, causing duplication issues. Some are collections of several titles.
^10 We use the public Google Cloud Natural Language API – https://cloud.google.com/natural-language/docs/basics#entity%20analysis.
^11 See http://goo.gle/research-docent
^12 https://www.reddit.com/r/MovieSuggestions
size, the dataset is well-suited to evaluate zero-shot performance on the movie ranking task defined in
Section 2.3.3.
2.5 Experiments
2.5.1 Pre-training
All our experiments start with pre-training models on the Amazon Movie Reviews corpus, followed by optional task-dependent fine-tuning. First, we apply some simple filtering to the input, removing reviews shorter than 5 words and movies with fewer than 5 reviews^13. This results in 81,057 Amazon movies, of which 17,131 have MovieLens correspondences, and 4,181,727 reviews in total. Further, we split reviews into individual sentences (or short paragraphs) so as to circumvent the BERT sequence length limit. Finally, since our goal is to learn non-obvious entity attributes, we remove movie names from their reviews.
All our models use the standard BERT-base configuration with 12 layers, 12 attention heads and a hidden size of 768, and are initialized with a publicly available BERT-base checkpoint^14.
2.5.2 Tag Prediction: Fine-tuning Strategies
We will now describe the fine-tuning strategies used to transfer pre-trained DocEnt models to downstream tag prediction tasks.
DocEnt-Full To generate movie-tag relevance scores, we need to predict P(Tag | Movie), which we cast as binary classification. Recall that BERT has a built-in binary classifier (for next-sentence prediction), implemented as a single-layer FFN^15 on top of its [CLS] output, with logistic loss. We simply repurpose that layer for our task.
DocEnt-Dual and DocEnt-Hybrid Recall that, during pre-training, DocEnt-Dual and DocEnt-Hybrid use softmax cross-entropy loss to predict P(Entity | Sentence). However, tag prediction poses the inverse problem: predict tags based on a movie entity. In our dual encoder framework, that can be done simply by computing softmax over all of the encoded tags rather than entities, without any changes to the architecture.
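For concreteness, the following numpy sketch shows the inversion described above: the same dual-encoder scoring is reused, but the softmax is taken over encoded tags instead of entities. All shapes, the random embeddings and variable names are placeholders, not the actual implementation.

import numpy as np

def softmax(logits):
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
d, num_entities, num_tags = 768, 1000, 1128      # illustrative sizes
entity_emb = rng.normal(size=(num_entities, d))  # learned movie (entity) embeddings
tag_emb = rng.normal(size=(num_tags, d))         # encoded tag phrases

# Pre-training direction: P(Entity | Sentence) -- softmax over entity encodings.
sentence_vec = rng.normal(size=d)
p_entity_given_sentence = softmax(entity_emb @ sentence_vec)

# Fine-tuning direction: P(Tag | Movie) -- same architecture, softmax over tag encodings.
movie_vec = entity_emb[42]
p_tag_given_movie = softmax(tag_emb @ movie_vec)
top5_tags = np.argsort(-p_tag_given_movie)[:5]
print(top5_tags)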
^13 This low-count filtering is applied after de-duplication and aggregation.
^14 https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
^15 Feed-Forward Neural Network
Task           Movies  Tags  M-T Pairs
Closed (test)    1000  1128      46359
Closed (dev)      380  1128      17943
Open (test)      6392   500     141618
Open (dev)       3362   100      25274
Table 2.3: Evaluation dataset sizes for Tag Prediction tasks. Closed / Open stand for the closed and open vocabulary tasks, respectively; M-T Pairs shows the number of corresponding movie-tag pairs. The top two rows describe movie holdout sets used in our closed vocabulary experiments; the bottom two rows show tag holdouts for open vocabulary experiments.
Shared Strategies For fine-tuning, all of the models share the following choices. First, we treat every
existing movie-tag pair in the training set as a positive example, weighted proportionally to the number of
user votes for that pair (or to the logarithm thereof). Next, for a given movie, about 10% of all vocabulary
tags are sampled as negative examples, excluding the known true positives for that movie. To prevent
overfitting, we fix entity embedding weights for all models during fine-tuning.
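A small sketch of how such fine-tuning examples could be assembled; the vote weighting and the 10% negative-sampling rate follow the text, while the input data structure and helper name are illustrative assumptions.

import numpy as np

def build_examples(movie_tag_votes, vocab_size, neg_rate=0.10, seed=0):
    """movie_tag_votes: dict mapping movie_id -> {tag_id: num_votes}.
    Returns (movie_id, tag_id, label, weight) tuples."""
    rng = np.random.default_rng(seed)
    examples = []
    for movie_id, votes in movie_tag_votes.items():
        positives = set(votes)
        # Positives, weighted by the (log) number of user votes.
        for tag_id, num_votes in votes.items():
            examples.append((movie_id, tag_id, 1, np.log1p(num_votes)))
        # Roughly 10% of the vocabulary sampled as negatives, excluding known positives.
        candidates = rng.choice(vocab_size, size=int(neg_rate * vocab_size), replace=False)
        for tag_id in candidates:
            if int(tag_id) not in positives:
                examples.append((movie_id, int(tag_id), 0, 1.0))
    return examples

# One movie with two voted tags out of a 1128-tag vocabulary.
print(len(build_examples({7: {3: 4, 90: 1}}, vocab_size=1128)))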
2.5.3 Entity-less Baselines
To corroborate the utility of explicit entity representations, we set out to evaluate a few baselines that circumvent them by representing each entity as a Bag-of-Sentences (BoS), computed over its related reviews with a sentence encoder of choice. Such a BoS encoder can replace entity embeddings in our architecture, yielding a naïve variant of DocEnt-Dual. We call these baselines BoS-GloVe, BoS-BERT and BoS-SentenceBert^16, reflecting their underlying sentence encoders.
2.5.4 Evaluation
2.5.4.1 Movielens Tag Prediction
The main challenge with evaluating tag prediction is the sparse and noisy nature of user-generated ground
truth. For instance, a certain movie tag having zero votes may still be relevant in reality. On the other hand,
some entities may have votes for contradictory tags (e.g., both “funny” and “not funny”). The original Tag
Genome baseline [91] mitigated this by collecting an additional dataset of unbiased movie-tag relevance
scores. Alas, that data has not been released. Instead, we propose two complementary metrics that cast
tag prediction either as binary classification or as a ranking problem.
^16 SentenceBERT [75] fine-tunes BERT on NLI to provide off-the-shelf semantic sentence representations.
For classification, we binarize labels as follows. Let #(m,t) be the number of users who assigned a tag t to a movie m. Then its binary counterpart l(m,t) is set to 1 iff #(m,t) > T, a threshold^17.
For the tag ranking formulation, we make the assumption that true movie-tag relevance is correlated with the number of movie-tag votes, and define our movie-tag relevance score as r(m,t) = #(m,t). Equipped with this score, we use Precision@k and NDCG metrics [40] to measure performance.
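The two evaluation views can be made concrete with a short sketch; T = 2 matches the footnote, while the particular NDCG discounting and the toy inputs are standard choices assumed here for illustration.

import numpy as np

def binarize(votes, T=2):
    """l(m, t) = 1 iff #(m, t) > T."""
    return (votes > T).astype(int)

def precision_at_k(scores, votes, k, T=2):
    """Fraction of the top-k predicted tags whose binarized label is 1."""
    top_k = np.argsort(-scores)[:k]
    return binarize(votes[top_k], T).mean()

def ndcg_at_k(scores, votes, k):
    """NDCG@k with graded relevance r(m, t) = #(m, t)."""
    order = np.argsort(-scores)[:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (votes[order] * discounts).sum()
    ideal = (np.sort(votes)[::-1][:k] * discounts).sum()
    return dcg / ideal if ideal > 0 else 0.0

# votes[t] = #(m, t) for one movie m; scores are the model's tag predictions.
votes = np.array([0, 5, 1, 3, 0, 8])
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8])
print(precision_at_k(scores, votes, k=3), ndcg_at_k(scores, votes, k=3))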
Tag prediction baselines include:
Movielens Top Tags — a fixed ordering of tags.
TF-IDF — scores for movie-tag pairs, based on tag frequencies in movie reviews.
BoS-BERT — as defined in Sec. 2.5.3, fine-tuned to estimate sentence-to-tag relevance directly^18. This setup is applicable to both open and closed vocabulary scenarios. During inference, a movie-tag prediction is obtained by averaging over sentence-wise predictions for the movie's reviews.
TagGenome — the original baseline from the MovieLens team [91]. The comparison is not entirely apt as that model was trained on additional movie-tag relevance data and user ratings, albeit with a smaller corpus of unsupervised reviews. Also, TagGenome was trained on all of MovieLens (no holdouts).
Humans — to simulate human performance, we apply cross-validation to ground-truth user votes, treating one of the folds as a quasi-model.
All models were evaluated on the same holdout sets, with averaging.
Closed Vocabulary Tag Prediction In this scenario, evaluation is done on a holdout set of movies (with a smaller development set used for hyperparameter tuning; see Table 2.3 for details).
Results for ranking (MAP) and binary classification (AUC) metrics are shown in Table 2.4. Collectively, DocEnt models outperform the strong TagGenome baseline on tag ranking (see also Fig. 2 (a) and (b)) and match (or slightly outperform) it in binary classification. It is a strong result, considering that DocEnt had no access to additional features used by TagGenome and employed no feature engineering. Of the three models, DocEnt-Dual scores the lowest on all metrics, likely due to not optimizing for P(Text | Entity)
^17 We use T = 2 to filter out noisy tags.
^18 We found it is best to encode a review sentence using BERT's [CLS] output, while tags are encoded by averaging individual tokens' output vectors.
Model                MAP   AUC
Movielens Top Tags   6.2   0.80
TF-IDF              32.3   0.86
BoS-BERT            39.3   0.91
TagGenome           43.9   0.98
DocEnt-Full         44.7   0.98
DocEnt-Dual         38.6   0.96
DocEnt-Hybrid       44.1   0.98
Human               76.6   0.99
Table 2.4: Mean Average Precision and ROC-AUC results on the closed-vocabulary tag prediction task. TagGenome is the original baseline from the MovieLens creators [91], trained on multiple additional features and considered SOTA. Despite using fewer features, DocEnt matches TagGenome performance on AUC and outperforms it on precision (MAP).
[Figure 2 shows three panels: (a) closed-vocabulary Precision@k, (b) closed-vocabulary NDCG@k, and (c) open-vocabulary AUC as a function of the number of training tags, comparing TF-IDF, TagGenome, BoS-BERT, DocEnt-Full, DocEnt-Dual and DocEnt-Hybrid.]
Figure 2: Performance on tag prediction tasks. Left and center: Precision and NDCG @ k, with a closed vocabulary. DocEnt-Full dominates the strong TagGenome baseline for smaller values of k, a concentration of gains typical for binary classification models. For perspective, human Precision@k ranges 80–95% for this task. Right: AUC for open vocabulary experiments, with models trained using a variable fraction of the tag vocabulary. DocEnt approaches closed-vocabulary AUC after training with only 10–50% of the vocabulary (showing all baselines that were available to us in this setting).
Model              MRR   Recall@50, %  Recall@100, %
Lucene (TF-IDF)    0.14          15.3           20.7
BoS-GloVe          0.04           4.1            6.6
BoS-BERT*          0.08           9.6           14.2
BoS-SentenceBert   0.07           7.6           11.7
DocEnt-Full        0.22          21.3           28.4
DocEnt-Dual        0.27          28.0           36.3
DocEnt-Hybrid      0.31          31.9           40.9
Table 2.5: Zero-shot results for DocEnt models vs. several baselines on Reddit Movie Suggestions. MRR stands for Mean Reciprocal Rank.
in pre-training. Finally, note that all models still score well below humans on the (harder) tag ranking task, indicating considerable headroom.
Open Vocabulary Tag Prediction This task is evaluated by withholding parts of the tag vocabulary so that those tags are never seen in training (consult Table 2.3 for details). Fig. 2 (c) shows our models' performance on the binary classification task based on the fraction of the vocabulary seen by a model in fine-tuning. The graph shows that training with only 100 of the 1124 tags results in reasonable performance. Of our three models, DocEnt-Full starts below the others but adapts the fastest, reaching near closed-vocabulary performance with less than 50% of the full tag vocabulary.
2.5.4.2 Reddit Movie Suggestions
Movie suggestion baselines Since this is a search task, we compare our models to an Apache Lucene^19 baseline, arguably the world's most widely used open-source search engine. For completeness, we also compare to BoS-BERT*^20, BoS-GloVe and BoS-SentenceBert, neural baselines defined in Sec. 2.5.3, whose query-movie relevance score is given by the maximum cosine similarity among the movie's review sentences^21.
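A minimal sketch of this Bag-of-Sentences scoring scheme, assuming sentence and query embeddings are already available from the chosen encoder; the max aggregation mirrors the description above and the shapes are illustrative.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def bos_score(query_vec, review_sentence_vecs):
    """Query-movie relevance = max cosine similarity over the movie's review sentences."""
    return max(cosine(query_vec, s) for s in review_sentence_vecs)

def rank_movies(query_vec, movie_to_sentence_vecs):
    scores = {m: bos_score(query_vec, vecs) for m, vecs in movie_to_sentence_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
query = rng.normal(size=384)                                           # encoded query
movies = {m: rng.normal(size=(10, 384)) for m in ["m1", "m2", "m3"]}   # encoded review sentences
print(rank_movies(query, movies))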
Table 2.5 shows the Mean Reciprocal Rank (MRR) as well as recall, metrics that suit the noisy ground
truth (for completeness, see also the qualitative results in Table 2.2). DocEnt models outperform the
^19 https://lucene.apache.org/
^20 In absence of a fine-tuned [CLS] output, this version of BoS-BERT encodes sentences by averaging their individual tokens' output vectors.
^21 In this case, we found that aggregating sentence-wise predictions with the L∞ norm is superior to averaging.
Lucene baseline on all metrics, with DocEnt-Hybrid leading by a large margin. Compared to DocEnt-Dual, its strong performance is not surprising since DocEnt-Hybrid optimizes both P(Entity | Text) and P(Text | Entity) — a combination of tasks that helps avoid overfitting.
Also expected is the relatively weak performance of DocEnt-Full. As discussed in Sec. 2.2, its high-capacity entity representations are prone to overfitting when the number of entities is relatively small. Still, this shortcoming can be remedied by fine-tuning, as evidenced by this model's superior results on tag prediction in Sec. 2.5.4.1. These results suggest that DocEnt-Full may be a good choice in semi-supervised scenarios.
2.6 Related Work
Much of the prior art in text-based entity understanding is motivated by the Entity Linking (EL) prob-
lem: predict a unique entity from its mention in text, assuming a single right answer. By contrast, tasks
like entity retrieval and tag prediction imply multiple valid matches and emphasize understanding enti-
ties through the prism of their attributes, expressed in natural language. Still, recent EL works propose
dual encoder approaches similar to ours [98, 54, 15, 83, 97, 12, 47, 37, 32], with [54] already discussed in
Section 2.2.1. Dual encoders have also been explored in zero-shot scenarios [30, 58, 96, 32], with entity
embeddings computed dynamically based on metadata such as dictionary definitions, entity name and/or
category. Others incorporate entity representations directly in the transformer by retrieving from an ex-
ternal memory [27, 66]. While clearly useful for EL, e.g., in sentences with multiple entity mentions, the
benefits to our applications are unclear. Finally, there is ERNIE [84] – a language model trained with
awareness of entity mentions. Alas, the lack of explicit entity representation limits its use in our tasks.
2.7 Conclusion & Future Work
This paper proposes a family of models to learn self-supervised entity representations from large document collections. We motivate these dedicated representations by contrasting them with naive text-as-a-proxy approaches, with clear gains on entity-centric tasks such as natural language search and movie tag prediction. We then show that achieving superior performance requires optimizing both P(Entity | Text) and P(Text | Entity) — in contrast to the baseline RELIC model (and similar prior dual encoders) having only a single objective. To that end, we propose two novel models and study them in zero-shot, few-shot
and supervised settings. We match or outperform competitive baselines, where available, with little or no
fine-tuning.
Future Work As shown qualitatively in Sec. 2.3.3, DocEnt has the potential to serve as a hybrid approach bridging entity retrieval and recommendation, an application worth exploring in depth (e.g., on the MovieLens Recommendation task, which can be readily integrated with DocEnt thanks to Reviews2Movielens). A larger entity retrieval study with heterogeneous entity types is another useful direction. Lastly, extending DocEnt to additional entity understanding tasks such as QA and summarization is yet another promising avenue.
Part III
Semi-parametric entity knowledge representation
Chapter 3
ReadTwice: Reading Very Large Documents with Memories
Knowledge-intensive tasks such as question answering often require assimilating information from dif-
ferent sections of large inputs such as books or article collections. We propose ReadTwice^1, a simple and
effective technique that combines several strengths of prior approaches to model long-range dependencies
with Transformers. The main idea is to read text in small segments, in parallel, summarizing each seg-
ment into a memory table to be used in a second read of the text. We show that the method outperforms
models of comparable size on several question answering (QA) datasets and sets a new state of the art on
the challenging NarrativeQA task, with questions about entire books.
3.1 Introduction
Transformer-based models such as BERT are very effective in capturing long-range dependencies in text
passages through the attention mechanism [89, 21]. However, the amount of compute in attention depends
quadratically on the number of tokens in an input text passage. As such, the standard BERT implementa-
tion limits input size to a fixed number (often 512) of tokens.
In reality, dependencies over significantly longer ranges are common and modeling them is crucial.
For instance, in a sentence like Inside the Sammath Naur, the Ring-bearer struggled to throw the Ring into
the volcano, the narrative interweaves several prior storylines from a book. Comprehending this sentence
therefore requires looking up previous mentions of Ring-bearer and Sammath Naur, located many tokens
away.
Several methods have been proposed to address this challenge; see [87] for a survey and §3.3 for a
detailed discussion. One popular strategy is to reduce the number of tokens attended to. Longer inputs
^1 Source code and pre-trained checkpoints for ReadTwice can be found at https://goo.gle/research-readtwice.
Figure 3.1: ReadTwice model architecture. The input is processed twice, with a memory table for inter-
segment information sharing.
can in fact be processed in this way – but only up to a limit of around 5,000 tokens, as used in [3, 101, 6]
– far below the context sizes required to model long documents such as books.
Another strategy such as HIBERT [103] splits inputs into smaller segments which are processed in-
dividually, then assembled into a hierarchical representation. As a downside, inter-segment context is
unavailable during encoding.
We propose ReadTwice, a simple approach that combines the strengths of both strategies. As its name suggests, the main idea is to process the input twice: a long text input (such as a document, or even a book) is treated as a collection of shorter text segments which are read independently and in parallel. Then, the encoder reads each segment again, now augmented with compressed information from other segments.
The crucial component in ReadTwice, as illustrated in Figure 3.1, is a memory module that holds compressed information from all segments. That compressed information is used only once: in the second
pass. Thus, ReadTwice is much more computationally efficient than models like ETC that rely on mem-
ory for all segments, in every layer. While ReadTwice requires two passes, it differs from hierarchical
models such as HIBERT that do not condition segment encoding on other segments. §3.3 contrasts these
approaches in more detail.
We validate the efficacy of ReadTwice on extractive question answering (QA) tasks, showing strong performance on HotpotQA [99], TriviaQA [43] and NarrativeQA [48]. In particular, ReadTwice significantly improves the state of the art on QA based on entire books in NarrativeQA, with absolute gains of 4.5 ROUGE-L points and 3 BLEU-1 points (relative improvements of 23% and 17%, respectively).
3.2 Method
We first describe the ReadTwice model, followed by its pre-training procedure.
3.2.1 ReadTwice
The model reads a large text document split into N segments x_1, ..., x_N; each x_i is limited to 512 tokens, as in a typical BERT model.
The model architecture is depicted in Figure 3.1. In the first read, each segment is encoded indepen-
dently with standard BERT. Then, memories are extracted from each segment—a process we describe in
detail later—and gathered into a global memory pool. For the second read, a MemoryAttention layer
(with a residual connection and a LayerNorm on top) is first used to merge the information from the former intra-segmental contextual token embeddings and the global memory. The merged result is then read by another small BERT model with only two Transformer layers to produce the final output. The rationale
is that the first read already generates rich contextualized embeddings, and the second read only needs to
incorporate information from the memory. More formally:
H^0_i = TokenEmbed(x_i),    H^1_i = BERT_1(x_i),    ∀i
M_i = ExtractMemories(H^1_i),    ∀i
M = Gather([M_1, ..., M_N])
H^2_i = MemoryAttention(H^1_i, M),    ∀i
H^3_i = LayerNorm(H^1_i + H^2_i),    ∀i
H^4_i = BERT_2(H^3_i),    ∀i
Next, we describe the newly introduced layers.
ExtractMemories and Gather Our aim is to compress the information in each segment and disseminate it to other segments to be used in the second read. We consider three types of memories:
• ReadTwice (CLS). One obvious choice is to use the CLS token representation associated with segment x_i as a summary of the segment.
• ReadTwice (STS). To obtain more fine-grained memories, we extract a memory vector for each
consecutive span of 32 tokens. Contextual embeddings of each span’s first and the last tokens are
concatenated and linearly projected to a single point in the token vector space as the span represen-
tation. The projection matrix is learned end to end.
27
• ReadTwice (E). In another variant of span-based memory, we memorize representations of entity
mention spans. To obtain these spans, we first annotate each segment with an external Named Entity
Recognition system. Then, each entity mention span is encoded in the same way as in ReadTwice
(STS). This design is motivated by the intuition that long-range dependencies primarily occur be-
tween entities.
Empirically, we find that ReadTwice (E) leads to the best performance (see the ablation in Section 3.4.4) and it is the memory type used in our headline results.
We collect all memories from all segments into a flat memory table. The table size is given by the
number of segments (CLS), the number of 32-token spans (STS), or the number of entity mentions (E).
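As a loose sketch of ExtractMemories and Gather in numpy, using random stand-ins for the first-read contextual embeddings; the concatenate-then-project scheme follows the STS/E description, while shapes, the projection initialization and variable names are assumptions.

import numpy as np

d = 768
rng = np.random.default_rng(0)
W_proj = rng.normal(scale=0.02, size=(2 * d, d))  # learned end to end in the real model

def extract_memories(H, spans):
    """H: [seq_len, d] first-read embeddings of one segment.
    spans: (start, end) token indices -- 32-token spans for STS, entity mentions for (E)."""
    memories = [np.concatenate([H[s], H[e]]) @ W_proj for s, e in spans]
    return np.stack(memories) if memories else np.zeros((0, d))

def gather(per_segment_memories):
    """Flatten per-segment memories into one global table, recording each
    memory's source segment for the position score used in the second read."""
    table = np.concatenate(per_segment_memories, axis=0)
    source = np.concatenate([np.full(len(m), i) for i, m in enumerate(per_segment_memories)])
    return table, source

segments = [rng.normal(size=(512, d)) for _ in range(3)]
spans = [[(i, min(i + 31, 511)) for i in range(0, 512, 32)] for _ in segments]
M, M_source = gather([extract_memories(H, s) for H, s in zip(segments, spans)])
print(M.shape, M_source.shape)  # (48, 768) (48,)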
MemoryAttention In this layer, we let contextual token embeddings from individual segments interact
with other segments’ memories via dot-product attention over the memory table.
Let h_ij be the contextual embedding of token j in segment i after the first read, and let M_m be a memory table entry whose source segment is given by m_s. We then define its attention weight as:
α_m = exp(h_ij^T M_m + r_{i,m_s}) / ( Σ_{m'} exp(h_ij^T M_{m'} + r_{i,m'_s}) + exp(h_ij^T M_0) )    (3.1)
where M_0 is a learnable no-op memory not associated with any specific text. r_{i,m_s} is a learned position score which captures the relative distance between segment i and the memory M_m, akin to [80]:

r_{i,m_s} = ω(dist(i, m_s))    (3.2)
where ω is a set of weights indexed by the distance

dist(i, m_s) = −B          if i − m_s < −B
               B           if i − m_s > B
               i − m_s     otherwise        (3.3)

where the cutoff threshold B clips the effect of distance to [−B, B]. We set B to 10 in this work.
Finally, the MemoryAttention layer output for a given token is given by

h^2_ij = Σ_m α_m M_m    (3.4)
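The following numpy sketch walks through equations (3.1)–(3.4), including the no-op memory and the clipped relative-distance score; it reuses the memory table and source-segment array from the previous sketch, and all parameter initializations are placeholders.

import numpy as np

d, B = 768, 10
rng = np.random.default_rng(1)
M0 = rng.normal(scale=0.02, size=d)             # learnable no-op memory
omega = rng.normal(scale=0.02, size=2 * B + 1)  # position weights, indexed by clipped distance

def memory_attention(h_ij, seg_i, M, M_source):
    """h_ij: [d] embedding of token j in segment seg_i after the first read.
    M: [num_mem, d] memory table; M_source: [num_mem] source segment per memory."""
    dist = np.clip(seg_i - M_source, -B, B)        # eq. (3.3)
    r = omega[dist + B]                            # eq. (3.2)
    logits = np.concatenate([M @ h_ij + r, [h_ij @ M0]])
    logits -= logits.max()                         # numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum()  # eq. (3.1); last entry is the no-op weight
    return alpha[:-1] @ M                          # eq. (3.4); the no-op memory adds nothing

M = rng.normal(size=(48, d))
M_source = np.repeat(np.arange(3), 16)
h2 = memory_attention(rng.normal(size=d), seg_i=1, M=M, M_source=M_source)
print(h2.shape)  # (768,)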
3.2.2 Pre-training
We pretrain ReadTwice similarly to [21], using the Wikipedia and BooksCorpus datasets. When entity mentions are used in the memory table, the texts are processed with the Entity Linking (EL) and Named Entity Recognition (NER) tools from the Google Cloud NLP API^2. Moreover, we use existing hyperlinks in Wikipedia as additional entity annotations. The first and the second BERT readers are trained end-to-end.
Our pre-training objective is the standard Masked Language Model (MLM) task, with the MLM pre-
diction loss computed based on the output of the second reader.
In order to encourage the model to rely on the memory, we increase the difficulty of the MLM task.
Following the entity masking procedure in [34, 85], we mask entity mention tokens more aggressively
at a 25% rate and jointly mask all tokens within a mention. By contrast, for non-entity tokens, we mask
contiguous sequences of random length at a 15% rate.
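A simplified sketch of this masking scheme; the 25% and 15% rates follow the text, while the [MASK] token id, the span-length sampling and the loop structure are illustrative assumptions.

import numpy as np

MASK_ID = 103  # assumed [MASK] token id

def mask_tokens(token_ids, mention_spans, seed=0):
    """Mask whole entity mentions at a 25% rate; mask random contiguous
    spans of non-entity tokens until roughly 15% of them are masked."""
    rng = np.random.default_rng(seed)
    ids = np.array(token_ids)
    is_entity = np.zeros(len(ids), dtype=bool)
    for s, e in mention_spans:
        is_entity[s:e + 1] = True

    masked = ids.copy()
    for s, e in mention_spans:                       # jointly mask all tokens of a mention
        if rng.random() < 0.25:
            masked[s:e + 1] = MASK_ID

    budget = int(0.15 * (~is_entity).sum())          # target number of non-entity masks
    while budget > 0:
        start = int(rng.integers(0, len(ids)))
        span = slice(start, min(start + int(rng.integers(1, 4)), len(ids)))
        if not is_entity[span].any():                # keep entity and non-entity masking separate
            masked[span] = MASK_ID
            budget -= span.stop - span.start
    return masked

print(mask_tokens(list(range(20)), mention_spans=[(3, 5), (12, 13)]))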
3.3 Related Work
One way to extend the limit on input size is by reducing the number of tokens attended to. ETC [3] and Longformer [6] allow standard attention only between tokens within a fixed distance. To allow information flow over longer distances, they use auxiliary global "memory" tokens which attend to all regular
tokens and vice versa. BigBird [101] additionally has each token attend to a random subset of other to-
kens. While reducing asymptotic complexity from quadratic to linear (in input size), these global tokens
are added at each attention layer, incurring a high computational cost.
Another approach is to split the input into multiple segments and then aggregate information across
segments. This is achieved through hierarchical modeling [11, 103]. While reducing the attention size
to the number of segments, each individual segment has no information about its siblings during token-
level encoding. Alternatively, recurrent models [18, 72] read a large input from left to right, dynamically
compressing faraway contexts, thus allowing unidirectional information aggregation (left to right). One
^2 https://cloud.google.com/natural-language/docs/basics#entity_analysis
disadvantage is that the input needs to be processed sequentially, which becomes time-consuming for
producing contextualized representations of a large input.
Our method brings these lines of work together. Processing segments independently and in parallel,
then memorizing their compressed representations and sharing memory across segments enables con-
textual embeddings to be updated based on faraway information. Enabling memory sharing only once—
during the second read—allows it to be done cheaply.
Note that the memory module here is internally generated from the input, as opposed to external
memory models which are orthogonal to our approach [67, 28].
3.4 Experiments
3.4.1 Pre-training setup
All ReadTwice models are initialized with the public RoBERTa (base) checkpoint^3 adapted to Tensorflow
by [76]. Further, models are pre-trained for 1M steps on 64 TPU cores using the LAMB optimizer [100].
Each batch contains 512 segments, with at most 128 segments per document. The segments are con-
secutive spans of 512 tokens. Therefore, the model can process documents up to 65k (≈ 128× 512) tokens.
Each batch contains the maximum number of documents such that the total number of segments is at most
512. Approximately half of Wikipedia articles fit in one segment (thus not needing memory), with a fat
tail of longer documents.
In terms of compute and memory overhead, ReadTwice is about 30% slower than the RoBERTa-base model and uses 15M (or 12%) more parameters: 14M owing to the second-read BERT_2 and 1M due to the ExtractMemories and MemoryAttention layers.
3.4.2 Evaluation setup
We evaluate ReadTwice on the downstream extractive question-answering task using several datasets:
HotpotQA (HQA) [99], TriviaQA (TQA) [43] and NarrativeQA (NQA) [48].
In HQA, questions are based on relatively short text passages (2 evidence paragraphs), with eight
additional distractor passages. In TQA, evidence text is medium-sized. NQA asks questions about entire
books, requiring a successful QA system to model very long-range dependencies. The NQA dataset has
^3 https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz
Model          HQA F1(ans)  TQA F1(dev)  TQA F1(test)
LF             74.3         75.2         -
ETC            75.1         -            -
BigBird        75.7         79.5         -
RoBERTa (us)   72.0         75.9         -
ReadTwice-E    75.9         80.7         80.9
Table 3.1: Results on the HotpotQA development set (answer-only F1 score) and on the TriviaQA development and test splits for the Wikipedia full setting. Additional test results are available on the public leaderboard^4.
Model            ROUGE-L      BLEU-1       BLEU-4     METEOR
BiDAF [48]       6.3 / 6.2    5.8 / 5.7    0.2 / 0.3  3.8 / 3.7
R^3 [95]         11.4 / 11.9  16.4 / 15.7  0.5 / 0.5  3.5 / 3.5
BM25+BERT [63]   14.8 / 15.5  14.6 / 14.5  1.8 / 1.4  5.1 / 5.0
RoBERTa (us)     17.4 / 18.0  18.2 / 18.0  2.4 / 2.6  5.4 / 5.4
ETC (us)         18.3 / 18.8  16.1 / 17.2  2.4 / 2.7  5.4 / 5.4
ReadTwice (E)    22.7 / 23.3  21.1 / 21.1  3.6 / 4.0  6.7 / 7.0
Table 3.2: Results on NarrativeQA's development / test splits.
an average of 62,000 words per document with a maximum of 400,000. Only 40% of NQA’s answers are
span-based – we use a ROUGE-L oracle as training labels for the other questions.
ReadTwice is fine-tuned on each task. QA-specific heads are used to generate span-based predictions,
consisting of fully-connected layers that take contextual embeddings from the second reader as inputs.
These layers output a score for whether the corresponding tokens are the beginning or ending of an answer
span. For a similar setup, see multi-segment based QA tasks [16, 14].
During fine-tuning, batches contain 128 segments for all tasks (also with up to 128 segments per doc-
ument). Every segment contains 512 tokens, but as neighboring segments have 128 token overlaps, the
model can process documents of up to 49K tokens (≈ 128× (512− 128)). For TQA and HQA, documents
have approximately 10 segments. For NQA, we split the documents into sub-documents with 49k tokens
and apply memory only within these sub-documents.
We perform hyperparameter search only over the learning rate λ ∈ {5e-6, 1e-5, 3e-5} and train for 6 epochs with a 10% warm-up proportion. Moreover, we use early stopping based on the performance
on the development set.
^4 See https://competitions.codalab.org/competitions/17208#results, tab "Wikipedia".
3.4.3 Main Results
Results for HQA and TQA are reported in Table 3.1. We compare to prior art (using reported results where available or from our own implementations otherwise, denoted as "us"): Longformer (LF) [6], ETC [3], BigBird [101], and RoBERTa [57]. By default, we compare against the "base" configuration of those models where the number of parameters is comparable to BERT-Base, as is the case for ReadTwice.
Table 3.1 shows that for small to medium sized text passages, the proposed ReadTwice outperforms
all models of comparable size.
Table 3.2 contrasts ReadTwice to other methods on extremely large contexts: BiDAF [48], R^3 [95], BM25 + BERT Reader / Ranker [63] and our own implementation of RoBERTa and ETC^5. ReadTwice
significantly outperforms all previous work and establishes new state-of-the-art results, demonstrating
the effectiveness of performing a second read conditioned on global memory for processing extremely
long texts.
3.4.4 Ablation Analysis & Discussion
To isolate individual components’ contributions, Table 3.3 contrasts several variants of ReadTwice. These
ablations lead to two key insights.
Inter-segment memory matters We introduce a variant ReadTwice-E(SS) (where SS stands for "Single Segment") to isolate the gains from the memory layer. ReadTwice-E(SS) prevents segments from attending to memories of other segments, thus disabling long-range dependency modeling. We observe
that ReadTwice-E improves over ReadTwice-E(SS) on all tasks, modestly but non-negligibly for TQA,
and significantly for HQA and especially NQA.
This matches our knowledge of those datasets: TQA questions are based on a relatively short context
and can typically be answered using a single passage in the context document. HQA questions have a
similarly sized context, but are explicitly constructed to require information from multiple paragraphs to
answer, and ReadTwice shows accordingly larger gains. Finally, NQA has much larger contexts, and its
questions generally require information from different parts of the document, increasing the importance
of long-range dependency modeling and accordingly, the performance boost fromReadTwice.
^5 For ETC we use the public (base configuration) checkpoint https://storage.googleapis.com/gresearch/etcmodel/checkpoints/etc_base_2x_pretrain.zip
Model HQA NQA-R NQA-B TQA
E 75.89 22.71 21.07 80.7
E(SS) 75.08 21.93 18.39 80.3
E(SS,10L) 74.70 21.39 18.37 80.4
RoBERTa 72.00 17.40 18.2 75.9
CLS 75.32 20.89 17.80 80.6
STS 75.39 21.08 18.38 80.4
Table 3.3: Ablation studies on variants of ReadTwice on the dev sets. We report F1 (answer only) score for HQA, ROUGE-L and BLEU-1 for NQA (denoted -R and -B respectively) and F1 for TQA.
Entities matter Entity mentions appear to be the most effective memory type in most experiments,
leading to noticeably improved performance on both HQA and NQA. The difference is most pronounced
in NQA whose particularly long and challenging contexts make it a perfect testbed.
Source of non-memory gains The non-memory gains over a baseline RoBERTa model originate from the two extra layers and the entity-based MLM objective. In order to disentangle the sources of gains, we train the ReadTwice-E(SS) model using a 10-layer Transformer for BERT_1 (denoted as E(SS,10L) in Table 3.3), with the same number of layers as RoBERTa. While the gains from 2 extra layers are significant (E(SS) vs E(SS,10L)), most of the gains appear to result from the custom pre-training procedure (E(SS,10L) vs RoBERTa).
3.5 Conclusion & Future Work
ReadTwice performs well on several QA tasks, particularly NarrativeQA where long-range dependencies
among entities appear to be very important. The proposed method is conceptually simple, easy to im-
plement and is capable of reading entire books. For future work, we plan to explore new memory types,
hierarchies and aggregation functions. We also aim to apply the model to other tasks, particularly long
text summarization, likely to benefit from a memory-forming mechanism.
Chapter 4
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
Natural language understanding tasks such as open-domain question answering often require retrieving
and assimilating factual information from multiple sources. We propose to address this problem by in-
tegrating a semi-parametric representation of a large text corpus into a Transformer model as a source
of factual knowledge. Specifically, our method represents knowledge with “mention memory”, a table
of dense vector representations of every entity mention in a corpus. The proposed model - tome - is a
Transformer that accesses the information through internal memory layers in which each entity mention
in the input passage attends to the mention memory. This approach enables synthesis of and reasoning
over many disparate sources of information within a single Transformer model. In experiments using a
memory of 150 million Wikipedia mentions, tome achieves strong performance on several open-domain
knowledge-intensive tasks, including the claim verification benchmarks HoVer and FEVER and several
entity-based QA benchmarks. We also show that the model learns to attend to informative mentions with-
out any direct supervision. Finally we demonstrate that the model can generalize to new unseen entities
by updating the memory without retraining.
4.1 Introduction
Neural models have greatly advanced the state of the art in natural language processing and generation
tasks. Accordingly, there has been increasing interest in applying neural language models to tasks which
require extensive world knowledge to solve [69]. Much of this world knowledge can be found distributed
over text corpora, which raises the question of whether language models pre-trained on text corpora capture this information. Recent work suggests that while language models may successfully predict facts about the world [70], such knowledge is superficial and unreliable [9]. Our goal is to reliably incorporate
information from across a text corpus into a language model.
Recent work has represented the information present in a text corpus explicitly by constructing a
virtual knowledge base (vkb) [22, 82]. A vkb consists of dense representations of entity mentions in
the text, designed to reflect the property or relation expressed by the entity mention. We propose to
incorporate a vkb into a language model by using it as an external memory, performing attention over
the entire vkb within a Transformer model. In this way the model can synthesise and reason over many
disparate sources of information from the text corpus. We refer to the vkb used in such a way as Mention
Memory, and the model as tome (Transformer Over Mention Encodings). We first pre-train a mention
encoder to specifically encourage mention representations that are useful for a Transformer model, and
construct a Mention Memory from 150 million entity mentions in English Wikipedia. Then we train a
tome model with attention layers over the Mention Memory, which is kept frozen (see Figure 4.1).
We argue that the Mention Memory approach has several appealing properties. First, tome retrieves
entity mention representations corresponding to specific entity attributes or relations described in the cor-
pus. This retrieval is much more fine-grained than aggregate entity retrieval methods such as Entities as
Experts (EaE) [25], and we show large improvements in accuracy over EaE on tasks that require detailed
entity information, such as claim verification and entity-based question answering. The fine-grained re-
trieval also allows potential users to see more precisely what knowledge the model's predictions are based
on (see Table 4.4). Second, tome retrieves dense representations, which are easy to incorporate into a
Transformer model without reprocessing the input, unlike raw text. Therefore, tome is able to retrieve,
assimilate and reason over information from many different sources within a single Transformer model,
allowing for multi-source and multi-hop reasoning without the beam search machinery that is required
for multi-hop retrieve-and-read [104]. This also makes tome much more scalable: retrieve-and-read ap-
proaches have to read many retrieved passages which becomes expensive with larger reader models, while
the cost of memory layers does not scale with reader size and is negligible for larger readers. Third, the
retrieval is latent, without direct or distant supervision on the retrieved results. We show that, even with-
out supervision, the model learns to retrieve highly specific and informative entity attributes and perform
Figure 4.1: Overview of Mention Memory. A pre-trained mention encoder is used to generate dense rep-
resentations for each entity mention in Wikipedia (approximately 150 million total) which are stored in a
table. The tome model takes a passage annotated with entity mention boundaries as input, and applies a
Transformer block. Next, the tome model applies one or more TOMEBlocks. Each TOMEBlock contains a
memory attention layer and a Transformer block.
multiple reasoning steps. Finally, the memory table is semi-parametric, so knowledge can be added or
updated by applying the mention encoder to new text without retraining.
In order to verify the model’s capacity to capture accurate factual information in the corpus, we start
by evaluating tome on the HoVer [41], FEVER [88] and FM2 [23] claim verification datasets, on which
it strongly improves performance over entity aggregate and comparable retrieve-and-read baselines. We
demonstrate that the model learns to attend to informative mentions for verifying claims using only the
verification accuracy as a signal. Ablations show the memory is crucial for performance, and that the
model can effectively use larger memory than it was pre-trained on. In a second set of experiments we
evaluate tome on question-answering benchmarks TriviaQA [44], ComplexWebQuestions [86] and Enti-
tyQuestions [78], improving performance over comparable baselines. Finally we show that the model can
be adapted to generalize to new unseen entities by updating the memory, without retraining.
4.2 Method
Our method represents knowledge in a corpus as a collection of "mention encodings" – dense vector repre-
sentations for every entity mention that appears in the corpus. Every time an entity appears in a passage
– "[Barack Obama] was elected president in 2008" – some property of the entity or its relation to other
entities is described. The first component of our method, the Mention Encoder model, is responsible for
distilling information from entity mentions in the corpus into high-dimensional mention encodings. We
use the Mention Encoder to encode each entity mention in English Wikipedia and gather encodings into a
Mention Memory. The purpose of the Mention Memory is to capture all knowledge contained in the corpus in a way that can be easily integrated into a Transformer. The second component of our method, the tome
model, applies sparse attention over the Mention Memory to incorporate external information from the
corpus into a Transformer model. An overview of the whole method is shown in Figure 4.1.
Jointly training the Mention Encoder and tome models is computationally costly, since it would require
backpropagating through the Mention Encoder for each attended mention. Consequently, we propose to
train the models in two stages. First, we pre-train the Mention Encoder and generate the Mention Memory.
Second, we pre-train the tome model while keeping the Mention Memory frozen: the gradient does not
propagate through it and the memories are not modified. Mention Encoder pre-training is specifically
designed such that mention encodings capture relevant contextual information about each mention and
are useful for tome even without joint training. We formally define these models in sections 4.2.1 and
4.2.2, and their pre-training procedures in 4.2.3 and 4.2.4.
Notation. An input to the model is a passage x = x_1, ..., x_T of length T. We assume that each passage has been annotated with an NER system. Following [5] we use special entity markers to highlight entity mentions in the passage. We introduce tokens [E_start] and [E_end] to the vocabulary and insert them before and after each mention in the passage. For example, the original passage "What is the nationality of the hero who killed Medusa" turns into "What is the [E_start] nationality [E_end] of the [E_start] hero [E_end] who killed [E_start] Medusa [E_end]". Each mention m in a passage is described by a tuple (s, e), where s and e are the start and end positions of the mention. We consider entity markers to be part of the corresponding mention, so that x_s = [E_start] and x_e = [E_end]. Representations of these tokens are later used to generate mention encodings.
4.2.1 Constructing mention memory from corpus
4.2.1.1 Mention Encoder
Let H ∈ R^{T×d} be the token representations, where d is the hidden dimension, such that H_i ∈ R^d is the contextualized embedding for the i-th token. Following [25] we compute the encoding of a span (s, e) as a learnable linear projection W of the concatenation of its start and end token representations H_s and H_e:

SpanEncodingLayer(H, (s, e)) = W [H_s; H_e]    (4.1)

The Mention Encoder is a Transformer model with two final SpanEncodingLayers that produce key and value mention encodings. Value mention encodings store context-level information about each mention and are used as inputs to the tome model. Key mention encodings identify the type of information stored in the value encodings and serve as attention keys for the memory layer. These two SpanEncodingLayers do not share weights.
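A compact sketch of equation (4.1) and the two non-shared heads; the key/value dimensions, initialization and inputs are placeholder assumptions.

import numpy as np

d, d_K, d_V = 768, 128, 512   # illustrative dimensions
rng = np.random.default_rng(0)
W_key = rng.normal(scale=0.02, size=(2 * d, d_K))    # SpanEncodingLayer for key encodings
W_value = rng.normal(scale=0.02, size=(2 * d, d_V))  # separate SpanEncodingLayer for value encodings

def span_encoding_layer(H, span, W):
    """Eq. (4.1): project the concatenated start/end token representations."""
    s, e = span
    return np.concatenate([H[s], H[e]]) @ W

H = rng.normal(size=(128, d))   # contextual token embeddings of one passage
mention = (17, 21)              # (start, end) positions of an [E_start] ... [E_end] span
key = span_encoding_layer(H, mention, W_key)
value = span_encoding_layer(H, mention, W_value)
print(key.shape, value.shape)   # (128,) (512,)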
4.2.1.2 Mention memory
After the Mention Encoder is pre-trained (see section 4.2.3), we use it to generate a Mention Memory from entity mentions in Wikipedia. While we could include encodings of any corpus mention in the Mention Memory, we focus on grounded mentions which can be linked to Wikipedia entities. We denote these as linked mentions, which we hypothesize contain information that can be retrieved and grounded. We gather mention encodings into matrices MemKey ∈ R^{N×d_K} and MemValue ∈ R^{N×d_V}, where N is the total number of linked entity mentions in English Wikipedia (approximately 150 million) and d_K and d_V are the dimensions of key and value encodings. Additionally, we record entity (Wikipedia) IDs of mentions in MemEnt ∈ R^N, which we use as labels for auxiliary losses, not as inputs to the model or supervision on retrieval. MemKey(i), MemValue(i) and MemEnt(i) correspond to the key encoding, value encoding and entity ID for the i-th linked mention in Wikipedia.
4.2.2 tome model
The tome model incorporates information from a text corpus into a Transformer by applying sparse at-
tention over the Mention Memory. The model consists of one or more TOMEBlocks, each containing
a memory attention layer followed by a post-processing Transformer block. Memory attention layers re-
trieve and attend to relevant “memories” for every mention in the input passage. The model then processes
the retrieval-augmented representation with the Transformer block, allowing it to access and combine in-
formation from multiple sources in the corpus. Finally, multiple TOMEBlocks enable the model to refine
retrievals and perform multi-hop reasoning. More formally, a TOMEBlock receives the output representation of the previous layer H and produces new representations H′:

M = MemoryAttention(H),    (4.2)
H′ = TransformerBlock(M)    (4.3)

The tome model encodes input passages x with the word embedding layer and initial Transformer block and then applies one or more TOMEBlocks:

H^0 = InitialTransformerBlock(TokenEmbedding(x)),    (4.4)
H^l = TOMEBlock_l(H^{l−1}),    l = 1, ..., L    (4.5)
In this work we consider two configurations of the tome model: tome-1 and tome-2, with one and
two TOMEBlocks respectively. Each TOMEBlock of tome-2 contains half as many Transformer layers as
in tome-1 to hold the total number of Transformer layers fixed between models.
4.2.2.1 Attention over memory
Each memory attention layer is implemented as a sparse dot-product attention layer that takes the output H of the previous Transformer block, incorporates information from the Mention Memory, and returns a representation M (omitting layer indices). Consider a mention m that starts at position s and ends at position e. We start by computing its query mention encoding Query(m) by applying a SpanEncodingLayer:

Query(m) = SpanEncodingLayer(H, (s, e))    (4.6)
Query mention encodings are used to retrieve relevant memories from the Mention Memory table. How-
ever, applying standard attention over 150 million mention encodings is infeasible. Instead, we first per-
form approximate nearest neighbor search to retrieve the top-K mentions with the largest dot product
between query Query(m) and key mention encodings from MemKey. We denote the set of these memories as TopMem(Query(m)). We compute attention over these memories and incorporate the result into the token contextual representation at position s:

α_i ∝ exp(Query(m) · MemKey(i)),    i ∈ TopMem(Query(m))    (4.7)

Value(m) = Σ_{i ∈ TopMem(Query(m))} α_i · MemValue(i)    (4.8)

M_s = LayerNorm(H_s + W_U Value(m))    (4.9)

where W_U is a learnable matrix of shape d × d_V.
4.2.2.2 Sparse large-scale retrieval
Approximate nearest neighbor search (ANNS) can be performed cheaply using one of multiple ANNS
libraries, for example ScaNN [31]. We implemented two on-device search methods to avoid the engineering
complexity of real-time communication with an ANNS server, though we have verified this is also viable.
The first naively computes a simple dot-product between passage queries and memory keys, and was
used in our main experiments as it was easiest to implement. We also implemented and will be releasing
a much faster version based on CPU ANNS methods. The memory is sharded over devices, so that the
device-memory overhead is negligible.
Holding the number of entries in memory fixed, the compute cost of retrieval from memory does not
grow with the size of the reader or the dimensionality of the memory values, so that the relative cost of the
memory layer becomes smaller with reader size. In particular, the overhead from the memory used in our
pre-training setting is small for BERT-Large and up. More details on ANNS implementation and overhead
can be found in Appendix C.3.
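The naive on-device variant can be sketched in a few lines of numpy, covering equations (4.7)–(4.9); the memory size, LayerNorm and shapes are simplified assumptions, and a real implementation would shard the memory and use an ANNS library such as ScaNN instead of a full scan.

import numpy as np

d, d_K, d_V, K = 768, 128, 512, 32
rng = np.random.default_rng(0)
MemKey = rng.normal(size=(100_000, d_K))     # stand-in for ~150M key encodings
MemValue = rng.normal(size=(100_000, d_V))
W_U = rng.normal(scale=0.02, size=(d, d_V))  # projects retrieved values back to model width

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / (x.std() + eps)

def memory_attention_layer(H_s, query):
    """H_s: [d] token representation at the mention start; query: [d_K] query encoding."""
    scores = MemKey @ query                       # naive full scan; ANNS in practice
    top = np.argpartition(-scores, K)[:K]         # TopMem(Query(m))
    alpha = np.exp(scores[top] - scores[top].max())
    alpha /= alpha.sum()                          # eq. (4.7)
    value = alpha @ MemValue[top]                 # eq. (4.8)
    return layer_norm(H_s + W_U @ value)          # eq. (4.9)

out = memory_attention_layer(rng.normal(size=d), rng.normal(size=d_K))
print(out.shape)  # (768,)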
4.2.3 Mention encoder pre-training
While backpropagating through a Wikipedia-scale mention memory is challenging, it is possible to train
smaller-scale memory architectures end-to-end. We take an approach inspired by Marge [50] and ReadTwice [102], which apply cross-attention over documents within a batch. In particular, we process pas-
sages in each batch twice. As a first step, the Mention Encoder model generates mention encodings from
each passage and aggregates the mention encodings into a batch-wide memory table. In the second step,
we apply a tome architecture that attends to the batch memory, which we call batch-tome. Note that
batch-tome is just used for pre-training the Mention Encoder and not evaluated on any downstream
tasks. Mention Encoder and batch-tome are jointly trained end-to-end so that the Mention Encoder is
encouraged to produce mention encodings that contain useful information for batch-tome.
We want to make sure the batch memory contains relevant mentions, so we pre-train the models on
batches of passages constructed from related Wikipedia articles with high entity overlap. Appendix C.1.1
provides more details on Mention Encoder data generation. We use the pre-trained Mention Encoder to
construct the Mention Memory table from the corpus, and use the batch-tome model as the initialization point for tome-specific pre-training (described in Section 4.2.4).
Masked language model. Our primary pre-training objective is the standard masked language mod-
eling task, with the loss computed based on the output of the second read (batch-tome). To encourage
the model to rely on memory, we increase the task’s difficulty relative to standard BERT pre-training by
masking entity mention tokens more aggressively.
Coreference resolution. We wish to encourage the Mention Encoder to represent the entity at-
tributes expressed by entity mentions, so we also employ an entity-oriented pre-training task to the output
of batch-tome for which such attribute information is likely to be especially helpful. Unlike Entities as
Experts [25], batch-tome does not use entity embeddings, so we cannot use the entity linking task. In-
stead, we apply a related entity coreference resolution objective, which asks the model to predict whether
two linked mentions correspond to the same entity based on the similarity of their encodings. Given that
entity surface forms are frequently masked, the model needs to instead use the properties of other men-
tions in the batch to determine which entity it is most compatible with, incentivizing the Mention Encoder
to encode such properties. We compute a coreference mention encoding for every linked mention in the
batch by applying a separate SpanEncodingLayer on the output of batch-tome. The loss is implemented
using cross-entropy over dot-product similarity scores. See Appendix C.1.2 for details.
4.2.4 tome pre-training
As tome attends to the full Mention Memory instead of in-batch memory, we do not employ the batching
procedure from Mention Encoder pre-training, instead sampling Wikipedia passages randomly. For the
same reason, we replace the in-batch entity coreference objective by Mention Memory entity coreference,
in which the model has to predict which mentions from the Mention Memory share an entity with the input
mention. The goal of this auxiliary objective is to incentivize the model to learn to retrieve informative
mention encodings to solve the semantically challenging task. Mention Memory entity coreference also
allows us to solve tasks like TriviaQA or ComplexWebQA without a decoder by directly predicting the
answer entity.
Entity prediction. Analogous to the batch coreference resolution loss, we compute a mention encoding z_m using the output of the tome model. As in section 4.2.2, TopMem(z_m) returns the top K memories with the largest dot product between the mention encoding z_m and the key mention encodings MemKey from the Mention Memory. The score EntProb(m, j) of entity j equals the sum of attention weights of memories corresponding to this entity:

EntProb(m, j) = Σ_{i ∈ TopMem(z_m)} exp(z_m · MemKey(i)) · 1{MemEnt(i) = j} / Σ_{i ∈ TopMem(z_m)} exp(z_m · MemKey(i))    (4.10)

The final entity prediction is argmax_j EntProb(m, j). The entity prediction loss L_ep(m) for a mention m of entity Ent(m) is L_ep(m) = − log EntProb(m, Ent(m)). The total loss equals the average loss over linked input mentions for which at least one memory of the same entity is retrieved.
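Continuing the previous sketch, entity prediction per equation (4.10) reduces to summing attention weights by entity ID over the top-K retrieved memories; the entity-ID array and sizes are illustrative.

import numpy as np

K = 32
rng = np.random.default_rng(1)
MemKey = rng.normal(size=(100_000, 128))
MemEnt = rng.integers(0, 5_000, size=100_000)   # Wikipedia entity id of each memory

def entity_probs(z_m):
    """Eq. (4.10): attention over the top-K memories, aggregated by entity id."""
    scores = MemKey @ z_m
    top = np.argpartition(-scores, K)[:K]
    alpha = np.exp(scores[top] - scores[top].max())
    alpha /= alpha.sum()
    probs = {}
    for weight, ent in zip(alpha, MemEnt[top]):
        probs[int(ent)] = probs.get(int(ent), 0.0) + float(weight)
    return probs

def entity_prediction_loss(z_m, true_entity):
    probs = entity_probs(z_m)
    if true_entity not in probs:  # loss only counted when a memory of the entity is retrieved
        return None
    return -np.log(probs[true_entity])

z = rng.normal(size=128)
probs = entity_probs(z)
prediction = max(probs, key=probs.get)
print(prediction, entity_prediction_loss(z, prediction))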
Disallowed same passage retrieval. For each passage in the pre-training corpus, there exist memo-
ries corresponding to mentions in the passage generated from the unmasked version of the same passage.
In order to prevent the model from ‘cheating’ by attending to such memories, we set the attention weight
for all memories from the same passage to zero.
4.3 Related Work
Our approach lies at the intersection of three lines of work: i) knowledge-augmented language models, ii)
employing a text corpus as a virtual knowledge base, iii) retrieve-and-read methods.
Knowledge-augmented language models. Entities as Experts (EaE) [25] injects information into a Transformer model with an intermediate attention layer over trainable entity embeddings, which serve as an aggregate representation of entity information in a text corpus. In contrast, tome attends to a
much larger table of mention encodings, allowing for retrieval of more fine-grained information. Attend-
ing to mentions as opposed to entity representations also enables tome to generalize to unseen entities.
FiLM [90] extends EaE by adding an attention layer over facts from a KB on the output of the Transformer.
The fact attention layer enables more fine-grained queries but still retrieves aggregate entity embeddings
as values, which are also not reasoned over by the Transformer. KnowBERT [67] is similar to EaE, but
with entity embeddings generated from a KB instead of trained end-to-end with a text corpus. Marge [50]
and ReadTwice [102] incorporate dense representations from other passages within the same batch into
a Transformer through sparse top-k attention. The first pre-training stage of our method for training the
Mention Encoder is similar to Marge and ReadTwice. However, tome performs global attention over a
full corpus, rather than a single batch. Furthermore, tome attends to a Mention Memory consisting of
pre-computed dense representations. Therefore tome is not limited to downstream task with batches of
relevant documents, and does not need to apply an expensive reader model to an entire batch of documents
for each input.
Text corpus as virtual knowledge base. DrKIT [22] performs multi-hop question answering by
using a text corpus as a virtual knowledge base. Similar to tome, the authors apply a mention encoder
to convert the corpus into a table of mention encodings. A Transformer model encodes the question into
dense queries, which are compared with the mention encodings to traverse the vkb. Conversely, tome
retrieves mention encodings, and then jointly processes them inside the Transformer. In follow-up work to
DrKIT, OPQL [82] uses a FiLM-like approach to access a memory of relation mentions, which are encoded
with a self-supervised relation encoder. However, the relation mention encoding combines a mention-
specific relation representation with EaE-like entity encodings, so they are less fine-grained than tome’s
encodings. Unlike tome, OPQL also lacks a sparse large-scale retrieval mechanism, and relies on ad hoc
43
heuristics to limit the size of the memory.^1 MOLEMAN [26] compares a passage mention encoding with
mention encodings from a corpus to perform entity linking, but does not retrieve the mentions.
Retrieve-and-read methods. REALM [34] learns to retrieve relevant passages from a text corpus in a
self-supervised manner. Retrieved passages are concatenated to the input passage which is then re-encoded
by a Transformer model to perform a task. The key difference between retrieve-and-read approaches [34,
46, 52, 39] and tome is that tome retrieves dense representations, as opposed to text. That means that
tome only applies a reader model once to a single input, while retrieve-and-read approaches have to apply
an expensive BERT reader to many different passages. In addition, Transformer models can only process
relatively short sequences, which imposes a binding constraint on the number of retrieved text passages
that can be processed together, whereas tome can retrieve and reason over information from many sources
inside the same reader. Generative models like RAG [52] or FiD [39] attend to different retrieved documents
in the decoder, but still have to apply a BERT read for every retrieved document, do not consider interaction
between retrievals while encoding the question, and cannot perform iterative retrieval.
4.4 Experiments
4.4.1 Experimental setup
The Mention Encoder is based on a BERT-base model with two final SpanEncodingLayers that produce
key and value encodings. Mention Encoder and batch-tome share Transformer weights during Mention
Encoder pre-training. The Mention Memory consists of mention encodings for N = 150 million linked
Wikipedia entity mentions. Transformer layers in tome and batch-tome models are equivalent to those
in the BERT-base model. The tome InitialTransformerBlock contains 4 Transformer layers. tome-1
has a single TOMEBlock with 8 Transformer layers, and tome-2 has two TOMEBlocks with 4 Transformer
layers each. Therefore, the number of trainable parameters in tome-1 and tome-2 is approximately the
same as in BERT-base. We use a smaller Mention Memory containing 38m uniformly sampled memories
for tome pre-training. During fine-tuning and evaluation we utilize the full Mention Memory. Appendix
C.1 contains more details.
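To make the memory attention concrete, the following is a minimal numpy sketch of sparse top-k attention over a table of precomputed mention encodings. It is illustrative only: the actual model searches the full 150-million-entry table far more efficiently, and the names used here (mention_memory_attention, memory_keys, memory_values, top_k) are ours rather than from the implementation.

    import numpy as np

    def mention_memory_attention(query, memory_keys, memory_values, top_k=128):
        # query:         (d,)   dense query derived from a passage mention span
        # memory_keys:   (N, d) precomputed key encodings of corpus mentions
        # memory_values: (N, d) precomputed value encodings of the same mentions
        scores = memory_keys @ query                   # dot-product retrieval scores, shape (N,)
        top = np.argpartition(-scores, top_k)[:top_k]  # indices of the top_k highest-scoring mentions
        weights = np.exp(scores[top] - scores[top].max())
        weights /= weights.sum()                       # softmax restricted to the retrieved mentions
        return weights @ memory_values[top]            # value readout, shape (d,), added to the hidden state

    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(1000, 8)), rng.normal(size=(1000, 8))
    print(mention_memory_attention(rng.normal(size=8), keys, values, top_k=16).shape)  # (8,)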
¹ It should be noted that absent heuristics, the number of potential relation mentions (i.e., entity mention pairs) is much larger than the number of entity mentions.
4.4.2 Baselines
We compare tome with existing methods that utilize textual information from a corpus in a language
model. These can be divided into generative LLMs (T5), entity embedding retrieval (Entities as Experts,
OPQL), extractive retrieve-and-read (REALM) and generative retrieve-and-read (RAG, Fusion-in-Decoder).
tome occupies a novel position in the space of retrieval models, being more fine-grained than entity embed-
ding retrieval methods, but performing all its reasoning with a single BERT read, unlike retrieve-and-read
methods. The most closely comparable models are Entities as Experts and REALM, and we use these as
our primary baselines. We report other baselines for reference, with the caveat that these results are not
apples-to-apples: RAG and Fusion-in-Decoder have large decoders and retrievers and encode a large num-
ber of passages with a BERT reader for each question compared to tome's single read. Fusion-in-Decoder
and RAG² also use ground-truth supervision for retrieval. We mark the number of parameters and BERT
applications for each baseline in the result tables. Consistent with retrieve-and-read, we count the pa-
rameters of the Mention Encoder and tome, but not the size of the non-trainable and sparsely accessed
Mention Memory.
4.4.3 Claim verification
Data. Our first set of experiments evaluates tome on the claim verification tasks FEVER [88], HoVer [41],
and FM2 [23] in which the model is provided with a claim and has to determine whether the claim is
supported by the Wikipedia corpus. FEVER is a larger dataset with 186k claims for which most of the
claims can be verified with a single Wikipedia passage. In contrast, HoVer is smaller with 26k claims, but
is explicitly constructed to require evidence from multiple sources and multiple reasoning steps. FM2 is
also smaller and is constructed through an adversarial game that leads to more challenging retrieval. The
claim verification training data contains gold evidence passages, but unlike most published results we do
not use these, leaving only the accuracy of the claim verification to guide the retrieval.
Results. Table 4.1 contains our claim verification results. tome outperforms both Entities as Experts
and REALM, especially on HoVer and FM2. This is consistent with the properties of tome: HoVer requires
² RAG is initialized from DPR which is trained with gold retrieval passages for TriviaQA.
Table 4.1: Accuracy on claim verification datasets. #Encoded refers to the number of passages encoded by
a BERT reader to answer a single question.

Model                #Params  #Encoded  HoVer (test)  FEVER (test)  FM2 (dev)
RAG                  620M     100       -             72.5          -
REALM                330M     5         66.1          67.1          65.8
Entities as Experts  360M     1         66.6          63.6          63.5
tome-1               220M     1         72.8          67.8          67.7
tome-2               220M     1         73.1          68.1          68.4
combining detailed information from multiple sources, which tome is especially well equipped to do com-
pared to aggregate entity-based or retrieve-and-read models. FM2 features generally challenging retrieval
and may benefit from contextualizing retrieved evidence.
4.4.4 Question Answering
Data. In a second set of experiments we evaluate tome on TriviaQA (TQA) [44], ComplexWebQuestions
(CWQ) [86] and EntityQuestions (EQ) [78], open-domain QA tasks for which most answers are Wikipedia
entities. We approach these datasets as entity-linking tasks, as in [25]. We append a mask token to each
question, which is marked as a question mention. The probability for each candidate entity is predicted as
the aggregate attention weight on mentions of the entity (Section 4.2.4). Questions with answers that do
not correspond to entities in our entity vocabulary are marked as answered incorrectly. TQA consists of
96k trivia questions, for which 84% of answers correspond to a Wikipedia entity. We use the open-domain
setting without gold evidence passages. In order to compare head-to-head performance, we also report
results on a subset of TQA with only questions with Wikipedia entities as an answer. CWQ consists of 35k
complex questions (compositions, conjunctions, etc.) for which 94% of answers correspond to a Wikipedia
entity. EQ contains challenging questions involving rare entities, with Wikipedia entities as answers.
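As a small illustration of this readout, the sketch below aggregates the attention weights placed on retrieved memory mentions into a score per candidate entity; the function name and array layout are assumptions, not the actual code.

    import numpy as np

    def entity_scores(attention_weights, mention_entity_ids, num_entities):
        # attention_weights:  (K,) attention from the question's mask mention onto K retrieved mentions
        # mention_entity_ids: (K,) entity id of each retrieved mention
        scores = np.zeros(num_entities)
        np.add.at(scores, mention_entity_ids, attention_weights)  # sum attention per entity
        return scores

    # Toy example: three retrieved mentions, two of which refer to entity 7.
    weights = np.array([0.5, 0.3, 0.2])
    entity_ids = np.array([7, 2, 7])
    print(entity_scores(weights, entity_ids, num_entities=10)[7])  # roughly 0.7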
Results. Table 4.2 contains the results for TQA, CWQ and EQ experiments. Like tome, Entities as
Experts and OPQL treat the above datasets as entity-linking tasks. REALM performs extractive QA, while
T5, RAG and Fusion-in-Decoder generate the answer. We note a similar pattern of results as for claim
verification. tome strongly outperforms Entities as Experts on all tasks. tome performs slightly better
than REALM on a simple task like TriviaQA (entity subset) and strongly outperforms REALM on more
challenging tasks that require multiple retrievals (CWQ) or more difficult retrieval (EQ).
Table 4.2: Accuracy on open-domain QA datasets TriviaQA (TQA), ComplexWebQuestions (CWQ) and
EntityQuestions (EQ). #Encoded refers to the number of passages encoded by a BERT reader to answer a
question. TQA (e-dev) corresponds to TQA with train and dev samples limited to those with a Wikipedia
entity as an answer. See Appendix C.2.3 for full results.

Model                #Params  #Encoded  TQA (dev)  TQA (test)  TQA (e-dev)  CWQ (dev)  EQ (dev)
RAG                  620M     100       56.8       68.0        -            -          -
Fusion-in-Decoder    440M     100       65.0       77.1        -            -          -
REALM                330M     5         55.8       67.1        63.4         46.7       59.0
T5-3B                3B       1         -          -           -            38.7       -
T5-11B               11B      1         42.3       50.1        -            -          -
Entities as Experts  360M     1         43.2       53.4        51.3         42.7       32.5
OPQL                 220M     1         -          -           -            41.1       -
tome-1               220M     1         50.8       61.1        60.3         44.9       62.1
tome-2               220M     1         54.6       65.8        64.8         47.7       66.0
Table 4.3: tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for the first
(→1) memory attention layer for two passage mentions. Memory mentions are in brackets.

Claim: Greater Swiss Mountain Dog and Harrier are both dog breeds. Label: TRUE

Greater Swiss Mountain Dog →1: Breed History the origin of the [Greater Swiss Mountain Dog] is not
definitively known. ...
Harrier →1: The harrier is a medium-sized dog breed of the [hound] class, used for hunting. ...
4.4.5 Qualitative properties of tome
What memories does tome retrieve? Given that tome retrieval is unsupervised, it is natural to ask
what memories it learns to retrieve. First, we observe that batch-tome and tome trained on just the
MLM objective learn to attend to memories of the same entity as the passage linked mention (55% and
41% average attention score). This is promising as entity mentions from the same entity often contain
mutually relevant information. Quantitative evaluation of downstream retrieval is challenging as tome
often retrieves mentions that are not part of, but equally informative as, the gold passages. Instead, we provide
tome retrievals on the first three samples of the HoVer dev set to demonstrate its retrieval behavior without
cherry-picking. Table 4.3 demonstrates a successful simple retrieval, while Table 4.4 displays interesting
multi-hop retrieval. The last is found in Appendix C.4.
Importance of memory size. Figure 4.2 shows claim verification performance as a function of
memory-size during fine-tuning (pre-training memory size is held constant). For smaller memory sizes,
entries in memory are uniformly sampled from the full Mention Memory. Performance increases smoothly
Table 4.4: tome-2 retrievals for the first HoVer dev sample. We show top-1 retrieval results for the first
(→1) and the second (→2) memory attention layers for passage mentions “Life Goes On” and “Hungry”³.
Memory mentions are in brackets. The first retrieval for “Life Goes On” is a different song with the
same name and the first retrieval for “Hungry” is related but not useful. However, the second retrieval
for “Life Goes On” identifies the correct song and describes its position on the album, while the second
retrieval for “Hungry” captures its position relative to “Life Goes On”.

Claim: The song recorded by Fergie that was produced by Polow Da Don and was followed by Life
Goes On was Hungry. Label: TRUE

Life Goes On →1: ...and Johnny J produced the chart topping hits “All Bout U”, “How Do U Want It”
and [“Life Goes On”]. ...
Life Goes On →2: ...On November 11, 2016, Fergie released the third single from the album, [“Life
Goes On”]...
Hungry →1: ...Polow da Don, is an American record producer, songwriter and rapper. His cousin is
[Atlanta] singer Monica. Jones has produced a variety of singles for a multitude of artists including
“Anaconda” by Nicki Minaj (2014), “Love In This Club” by Usher (2008), “Buttons” by the Pussycat Dolls
(2006), “Hungry” by Fergie ...
Hungry →2: ...“Life Goes On” is a song recorded by American singer Fergie for her second studio
album, Double Dutchess (2017). ...The song serves as the third single from [Fergie’s] second studio
album, following “Hungry”.
with memory size. Larger memory size yields diminishing returns, perhaps reflecting that entity mentions
may contain overlapping information.
Zero-shot transfer to unseen entities. An important advantage of memory architectures is that the
behavior of the model can be steered by deciding what to include in the memory. Here we show that the
tome model can use information that was not present in memory during training. We sample questions in
the TQA and CWQ dev sets, and generate a subset of the memory without any mentions corresponding to
the answer entities for those questions. Then we pre-train and fine-tune a model on this smaller memory,
which we call tome-unseen. We evaluate tome-unseen on the sampled questions using the full memory
for evaluation only, and compare to standard tome. Table 4.5 shows that using full memory only during
evaluation does not lower performance.
³ We replaced the original song title with the song “Hungry” as the original may be inappropriate.
Figure 4.2: Claim verification accuracy (%) on HoVer (dev) and FEVER (dev) as a function of fine-tuning memory size (in millions).
Table 4.5: Accuracy on held-out subset of TriviaQA and ComplexWebQuestions (CWQ) questions.
tome-1-unseen was pre-trained and fine-tuned with a memory without entities from the held-out set and
evaluated with the full memory. Note that performance is considerably lower than on the full dev set, as
answers in the held-out set (which are in dev but not train) are more likely to be rare entities.

Dataset         TriviaQA (dev)  CWQ (dev)
tome-1          17.4            16.4
tome-1-unseen   17.6            16.7
4.5 Conclusion
We introduced tome, a Transformer model that performs attention over a semi-parametric representation
of the entire Wikipedia text corpus. This representation, or Mention Memory, consists of a dense encod-
ing for each entity mention in Wikipedia. tome can retrieve information from multiple sources without
supervision, aggregate information within the Transformer, and reason over the retrieved information.
tome leads to strong improvements on multiple open-domain claim verification and entity-based question
answering tasks.
Part IV
Semi-parametric task-specific knowledge representation
Chapter 5
Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing
A common recent approach to semantic parsing augments sequence-to-sequence models by retrieving and
appending a set of training samples, called exemplars. The effectiveness of this recipe is limited by the abil-
ity to retrieve informative exemplars that help produce the correct parse, which is especially challenging
in low-resource settings. Existing retrieval is commonly based on similarity of query and exemplar in-
puts. We propose GandR, a retrieval procedure that retrieves exemplars for which outputs are also similar.
GandR first generates a preliminary prediction with input-based retrieval. Then, it retrieves exemplars
with outputs similar to the preliminary prediction which are used to generate a final prediction. GandR
sets the state of the art on multiple low-resource semantic parsing tasks.
5.1 Introduction
A common and successful approach to structured prediction problems [53, 13] is to treat the gold structure
as a sequence and fine-tune a sequence-to-sequence model such as T5 [73] or BART [51]. However, the
performance of fine-tuned models suffers in low resource scenarios where available training data is limited
relative to the complexity of the task [13].
Existing work [65, 33, 94] has found that retrieving related training samples, denoted exemplars, and
appending the retrieved input-output pairs to the sample input before processing the sample can improve
performance in low resource settings. In principle, all information from exemplars is available to the
model during training and could be stored in model parameters. However, in practice the model may not
successfully retain all information, and reminding the model of salient input-output patterns at test time
appears to help.
That raises the question: what exemplars are most informative for the model? Existing work focuses on
retrieving exemplars for which the input is partially similar to the test input, effectively answering “What
is the output for similar inputs?”. In this work we explore whether there is complementary information in
exemplars that answer the inverse question, “What is the input for similar outputs?”.
We propose Generate-and-Retrieve (GandR), a method to retrieve exemplars with similar output as
well as input. As the true output of a sample is in general unknown, GandR proceeds in two steps. First, a
preliminary prediction is generated using retrievals with similar input only. Then, a new set of exemplars
is retrieved based on a relevance measure that balances the similarity of the inputs and the similarity of
the preliminary prediction and the exemplar output. Figure 5.1 provides an overview of the method.
We evaluate GandR in the setting of task-oriented semantic parsing, a core component of widely used
virtual assistants. We show that similarity in output space provides a complementary signal to input sim-
ilarity, yielding retrievals that prove more informative for the model. Moreover, for many structured pre-
diction tasks the output space is more structured than the free-form input text, so that simple, non-learned
distance measures work well for outputs even when inputs are lexically dissimilar. Table 5.4 demonstrates
an example where our proposed similarity function retrieves an example that is somewhat less similarly
phrased but with more similar output, and the model produces a better prediction as a result. Finally,
the model has the opportunity to verify that its preliminary predictions are valid outputs in the target
language.
The proposed method strongly improves performance in low-resource settings for semantic parsing,
achieving state of the art results for low-resource and transfer benchmarks in MTOP [53] and TopV2 [13].
5.2 Method
We approach semantic parsing as a conditional language generation task and apply a T5 sequence-to-
sequence model [73] to predict a parse y given a query x. For each sample, we retrieve K = 4 relevant
training exemplars sampled according to a relevance scoring function. We append the retrieved input-
output pairs to the sample input and apply the T5 model to the augmented input to predict a parse output.
Figure 5.1: Overview of GandR. First, GandR generates a preliminary prediction using an input augmented
with exemplars with similar inputs. Then, GandR retrieves exemplars based on a relevance measure bal-
ancing input similarity and similarity between the preliminary prediction and exemplar outputs, and gen-
erates a final prediction based on these exemplars.
In particular, let (x′_1, y′_1), ..., (x′_K, y′_K) denote the retrieved input-output pairs; then the augmented
input is

    x′ = x || x′_1 & y′_1 || x′_2 & y′_2 || ...
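As a rough illustration, the augmented input can be assembled as in the sketch below; the separator strings stand in for whatever delimiter tokens the actual implementation uses.

    def build_augmented_input(query, exemplars, sep=" || ", io_sep=" & "):
        # exemplars is a list of (input, output) pairs retrieved from the training set.
        return sep.join([query] + [x + io_sep + y for x, y in exemplars])

    exemplars = [("musicals in windham this weekend",
                  "[in:get_event [sl:category_event musicals ] ]")]
    print(build_augmented_input("Could you connect me to the Musicals group", exemplars))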
Our approach closely follows that of [65], differing primarily in the choice of relevance function. During
evaluation we retrieve the top K most relevant exemplars. During training, we sample retrievals according
to a geometric distribution over the relevance score rank. In particular, the probability that we retrieve an
exemplar is given by p(1 − p)^r, where r is the rank of the exemplar according to relevance score and p is
a temperature hyperparameter.
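A minimal sketch of this rank-based sampling is shown below, assuming the candidates are already sorted by relevance; numpy is used for convenience and the function name is illustrative.

    import numpy as np

    def sample_exemplars(ranked_ids, k=4, p=0.5, seed=0):
        # Rank r is chosen with probability proportional to p * (1 - p) ** r.
        rng = np.random.default_rng(seed)
        ranks = np.arange(len(ranked_ids))
        probs = p * (1 - p) ** ranks
        probs = probs / probs.sum()        # renormalize over the finite candidate list
        chosen = rng.choice(len(ranked_ids), size=k, replace=False, p=probs)
        return [ranked_ids[i] for i in chosen]

    print(sample_exemplars(list(range(100)), k=4, p=0.3))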
In [65], the relevance score is given by the inner product of Universal Sentence Encoder [10] encod-
ings of the candidate input and the sample input. We found that a simple TF-IDF [74] similarity baseline
achieves comparable or better results.
Our proposed approach, GandR, builds on the input-similarity baseline by constructing a hybrid sim-
ilarity measure that takes into account not only the similarity between sample and candidate inputs, but
also the similarity between the sequence predicted by the model and the candidate output. See Figure 5.1
for an overview. First, GandR generates a preliminary prediction using an input augmented with exem-
plars with similar inputs. Then, GandR retrieves exemplars based on a hybrid similarity measure over
inputs and outputs, and generates a final prediction based on these exemplars.
Specifically, let ŷ_i be the preliminary prediction; then the proposed output similarity between samples i
and j is given by the TF-IDF similarity between the predicted structure (in our case, the set of intents and
slots) and the structure of the true parse y_j. Our proposed relevance score is a weighted sum of input and
output similarity:

    R_ij = (1 − α) · TF-IDF(x_i, x_j) + α · TF-IDF(ŷ_i, y_j)
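The hybrid score can be approximated with off-the-shelf TF-IDF as in the sketch below. It treats the parse string as plain text rather than the extracted set of intents and slots, so it is only an approximation of the setup described above, and the names are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def hybrid_relevance(query_input, prelim_prediction, cand_inputs, cand_outputs, alpha=0.5):
        # R = (1 - alpha) * sim(x_i, x_j) + alpha * sim(y_hat_i, y_j), with TF-IDF cosine similarity.
        in_vec = TfidfVectorizer().fit(cand_inputs + [query_input])
        out_vec = TfidfVectorizer().fit(cand_outputs + [prelim_prediction])
        in_sim = cosine_similarity(in_vec.transform([query_input]), in_vec.transform(cand_inputs))[0]
        out_sim = cosine_similarity(out_vec.transform([prelim_prediction]), out_vec.transform(cand_outputs))[0]
        return (1 - alpha) * in_sim + alpha * out_sim

    print(hybrid_relevance(
        "connect me to the musicals group",
        "[in:create_call [sl:group musicals]]",
        ["can you please send text to the development group", "musicals in windham this weekend"],
        ["[in:send_message [sl:group development]]", "[in:get_event [sl:category_event musicals]]"]))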
5.2.1 Training
For simplicity, we train GandR in two stages. We start training with TF-IDF input relevance scoring,
yielding model M_1. Model M_1 is used to generate GandR preliminary predictions during training and
evaluation. We continue training M_1 for the remaining training steps, yielding M_2, which is used to
generate final predictions augmented with retrievals from M_1. Note that this two-stage training is for
convenience only, and it is possible to use a single set of weights M_single to generate preliminary and final
GandR predictions. In that case, M_single needs to be trained with a mix of input-only and GandR retrieval
augmentations to ensure it is able to use either effectively.
5.3 Related Work
Sequence-to-sequence models [73, 51] have achieved state-of-the-art performance on task-oriented seman-
tic parsing [53, 13, 1] as well as other structured prediction tasks [73]. The general approach is to pre-train
on language modeling and perform fine-tuning on the specific domain of interest.
Several works augment the input with retrieved exemplars from the training data, with differing meth-
ods for selecting informative examples. [65] and [33] retrieve exemplars with similar input encodings from
a pre-trained neural encoder, evaluating on semantic parsing. [94] retrieves exemplars for which the in-
put has high BM25 similarity with the sample input, with good performance on language generation. We
adopt a similar approach with TF-IDF similarity as a baseline for semantic parsing.
[7] and [19] learn dense retrievers in the spirit of [46], providing another path to incorporate label
information for retrieval. These approaches require training a separate model specifically for retrieval,
possibly with additional learning signal. In contrast, we employ a sparse similarity measure over model
predictions that are produced incidentally in the course of fine-tuning the main model.
Model                 MTOP (boot)   MTOP (1k)  MTOP (25%)  TOPv2 (W)  TOPv2 (R)
Reptile [13]          -             -          -           77.7       70.5
RAF [81]              -             -          -           78.7       -
CASPER [65]           73.3 / 83.9   -          -           -          -
T5                    72.9 / 83.3   62.8       78.5        79.2       68.8
T5 with input TF-IDF  74.9 / 84.5   66.8       79.4        79.9       71.0
GandR                 76.4 / 84.6   67.5       80.1        80.5       71.7

Table 5.1: Results on semantic parsing benchmarks. We report the percentage exact match between true
and predicted labels as sequences. Results are on the test set for all benchmarks except MTOP (boot), where
we report on dev to remain comparable with CASPER.
Model            MTOP   TOPv2 (S)
RAF              87.1   -
CASPER           86.4   -
T5               85.7   86.9
T5 input TF-IDF  86.4   87.0
GandR            86.4   87.0

Table 5.2: Performance on high-resource settings.
Selecting relevant training exemplars is also important for in-context prompting [56]. Similar to related
fine-tuning literature, work in this direction uses either a pre-trained [29] or fine-tuned [55] sentence
encoder to retrieve exemplars.
5.4 Experiments
5.4.1 Setup
We evaluate GandR and baselines on semantic parsing benchmarks MTOP [53] and TOPv2 [13], focusing
on low-resource and transfer settings. MTOP is a medium-sized semantic parsing dataset used in [65],
for which we evaluate on the domain bootstrapping setting in which one of the domains is limited to a
very small amount of training data. We also evaluate on low-resource settings MTOP (1k) and MTOP (25%),
in which we randomly sample 1k and 25% of training samples, respectively. TOPv2 is centered on transfer to
low-resource domains: models are trained on a set of high-resource domains denoted as TOPv2 (S) and then
fine-tuned on the low-resource Weather and Reminder domains¹, denoted as TOPv2 (W) and TOPv2 (R). We show
the sizes of datasets and splits in Table D.1 in the Appendix.
¹ We are using the 25 SPIS low-resource split from [13].
Figure 5.2: Performance (exact match, %) on the MTOP (boot) dev set as a function of the output similarity weight α; the endpoints correspond to input TF-IDF and output TF-IDF retrieval.
Retriever      MTOP (boot)  TOPv2 (W)  TOPv2 (R)
input TF-IDF   35.9         55.1       20.1
output TF-IDF  70.3         74.8       53.7
GandR          70.0         68.7       52.5

Table 5.3: Template recall@K=4 on the development sets for MTOP (boot), TOPv2 (W) and TOPv2 (R).
5.4.2 Main results
The results of our primary experiments are shown in Table 5.1. We find that input TF-IDF is a strong
baseline, rivaling or improving over prior work. Further, GandR retrieval outperforms all baselines, setting
the state of the art on evaluated settings.
5.4.3 Ablations and discussion
Retrieval is less important for high-resource settings. Table 5.2 shows results on the high-resource
full MTOP and TOPv2 datasets. In higher-resource settings, augmenting the input with exemplars appears
to be both less effective and less sensitive to retrieval method, with almost identical results among methods
with and without retrieval for the highest resource TOPv2 dataset.
Using hybrid similarity leads to better retrieval quality. Figure 5.2 displays MTOP (boot) performance
as a function of TF-IDF output weight α. The results demonstrate that input and output similarity signals
are strongly complementary.
Input sample
x: Could you connect me to the Musicals group
y: [in:create_call [sl:group Musicals] ]

Training sample with similar input
x_1: musicals in windham this weekend
y_1: [in:get_event [sl:category_event musicals ] [sl:location windham ] [sl:date_time this weekend ] ]
ŷ: [in:create_call [sl:contact me] [sl:group Musicals] ]

Training sample with similar input and label
x_2: can you please send text to the development group
y_2: [in:send_message [sl:group development]]
ŷ: [in:create_call [sl:group Musicals] ]

Table 5.4: Input TF-IDF retrieves an exemplar with lexical overlap (‘musicals’) that is not relevant to the
sample. The GandR retrieval balances lexical and label similarity and leads to a correct prediction. A single
representative exemplar out of 4 is displayed for each method.
Considering output similarity leads to higher template recall. Following [65], we compute template
recall@K as a proxy metric for retrieval. This measure corresponds to the proportion of evaluation samples
for which at least one of the top K retrievals has the same template (identical intents and slots) as the
gold parse. Results show (Table 5.3) that considering output as well as input similarity increases template
recall. We note that output TF-IDF has similar or higher template recall than GandR even though it has
lower performance. Ultimately, template recall is only a proxy, and we are really interested in retrieval
informativeness; GandR’s performance shows that balancing input similarity and template recall leads to
exemplars that are most helpful for the model.
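A sketch of this metric is given below, under the assumption that a template is the set of intent and slot names extracted from the parse string; the exact extraction used in the experiments may differ.

    import re

    def template(parse):
        # Keep only intent/slot names, dropping argument values.
        return sorted(re.findall(r"\[(in:\w+|sl:\w+)", parse.lower()))

    def template_recall_at_k(gold_parses, retrieved_parses_per_sample):
        # Fraction of samples where at least one retrieved exemplar shares the gold template.
        hits = sum(
            any(template(r) == template(gold) for r in retrieved)
            for gold, retrieved in zip(gold_parses, retrieved_parses_per_sample))
        return hits / len(gold_parses)

    gold = ["[in:create_call [sl:group Musicals] ]"]
    retrieved = [["[in:send_message [sl:group development]]",
                  "[in:create_call [sl:group work] ]"]]
    print(template_recall_at_k(gold, retrieved))  # 1.0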
5.4.4 Error analysis
The primary motivation for GandR is that hybrid similarity leads to more informative exemplars. Infor-
mativeness can only be objectively measured through model performance, but our motivating intuition
appears to be borne out by samples in the data. We observe a number of different cases for which output
or hybrid-similarity retrieval can help. Table 5.4 shows an example of a case for which input TF-IDF re-
trieves an irrelevant example with lexical overlap, while GandR retrieves an example with both lexical and
57
parse overlap, leading to a correct prediction. Using preliminary predictions for retrieval can also allow
the model to verify whether its predictions are correct. A common simple case when this can help is if the
model generates a prediction that is dissimilar to any samples in the training set, in which case the model
may reconsider whether that prediction is correct (Table D.2). Considering output similarity does come
with tradeoffs. Table D.3 demonstrates a situation where output similarity distracts the model away from
a lexically similar and informative exemplar and the model is wrong as a result.
5.5 Conclusion
We propose GandR, a new method for structured prediction that generates a preliminary prediction, re-
trieves training exemplars with similar outputs (and similar inputs), and augments the input with the
retrieved exemplars to generate a final prediction. We demonstrate that using output similarity yields
improvements for semantic parsing in low-resource settings, achieving state of the art results on several
semantic parsing benchmarks.
Part V
Conclusion and Future Work
Chapter 6
Conclusion
In this thesis we developed multiple parametric and semi-parametric methods for knowledge acquisition
from text. We propose dedicated memory components that have several key properties that modern
neural models lack. This allows the proposed methods to significantly improve on various challenging
knowledge-intensive tasks.
The first method, DocEnt, introduced a parametric entity memory to aggregate information about
an entity (movie) from a collection of relevant documents (movie reviews). The first contribution is a
training method via joint modeling of entities and related text distribution instead standard approach of
the conditional distribution of entity given text. My second contribution is extensive experiments on the
MovieLens tag prediction tasks. These experiments showed that entity memory component significantly
improves neural networks performance on the tag prediction task. Also, the proposed way of learning
entity embeddings is superior to the existing methods. Finally, I showed that the model generalizes well
even to unseen tags by treating them as arbitrary text.
In Part III we developed a semi-parametric memory to capture entity-centric knowledge from a text
corpus. This memory improves processing of large documents and achieves state-of-the-art results on the
challenging NarrativeQA benchmark [48]. Moreover, we extend the method to extract and represent world
knowledge about entities from an entire corpus. The memory represents information in a form easily
accessible to another Transformer-like model [20] and allows the model to achieve state-of-the-art results
on multiple knowledge-intensive tasks like open-domain question answering. Finally, the mention encoder,
a parametric component of the memory, is trained to extract entity-related information from the surrounding context using
only textual data without explicit supervision. The method improves over standard models like BERT [21]
on mention-oriented tasks (like mention typing).
Finally, we explore retrieval-augmented models, a common memory-based augmentation, in the
context of semantic parsing. We propose a way to improve existing methods specifically in low-resource
scenarios. Combining the approach with the dense memory developed in Part III is a promising future
direction that we discuss in Section 7.1.
We hope that the thesis highlights a direction for improving NLP models in general. An existing
prominent research direction for achieving better performance is scaling up a model's size [73, 71, 8]. However,
that comes at the expense of computational resources and training time. Our memory-augmented models
(Chapter 3 and Chapter 4) outperform models with one to two orders of magnitude more learnable parameters
on several knowledge-intensive benchmarks (cf. Table 4.2). This suggests that memory-augmented models
are a promising future research direction.
Chapter 7
Future Work
7.1 Dense memory for structured prediction
One takeaway from Chapter 5 is that retrieval quality matters for low-resource semantic parsing. Our
experiments show that the number of retrieved exemplars also matters. For example, Figure 7.1 shows the
performance of the retrieval-augmented model on a low-resource version of TOPv2 (S). Note that the
gains from using 4 retrievals instead of 1 are roughly double the gains from using the retrieval in the first
place.
Figure 7.1: Exact match (%) of a model augmented with TF-IDF retrieval on the low-resource version of TOPv2 (S), where we only used 1% of the training data (roughly 800 samples), as a function of K, the number of retrieved exemplars.
Another situation where multiple exemplars are very likely to be beneficial is the few-shot in-context
learning setting for large language models [71, 8].
However, using multiple exemplars is challenging due to the high computational cost. The reason is
that the conventional way to integrate exemplars into a sequence-to-sequence model is to append them to
the input. However, compute cost in Transformer attention layers depends quadratically on the number of
tokens in an input text passage. As a result, the input length to the Transformer model is typically limited,
preventing us from using a larger number of exemplars.
We propose to apply a methodology developed in chapters 3 and 4 for retrieval augmented models.
The idea is to integrate retrieved exemplars into T5 at much lower cost by compressing them into dense
memory vectors using a trainable memory encoder model (analogous to the mention encoder, but not
focusing on entity mentions).
We show preliminary results indicating that the proposed compression step does not significantly
affect model performance given the same number of retrievals. The lower cost makes it possible to retrieve
more exemplars, leading to better performance than a standard approach in our synthetic benchmark. We
discuss the method’s limitations and propose the next steps required for applying the proposed method
on real datasets.
7.1.1 Method Sketch
The method performs retrieval based on input TF-IDF similarity as discussed in Chapter 5. The only
difference is how exemplars are integrated into the T5 model.
Following Chapter 4, the general idea is to compress each provided exemplar into a small number of
high-dimensional vectors that we call memories using a separate memory encoder. It then becomes much
cheaper to incorporate these memories into the T5 model. We do this by introducing cross-attention layers
over the memory vectors inside the T5 encoder.
We train this model in end-to-end fashion, so that the memory encoder learns to produce memories
that are helpful for the T5 encoder.
Finally, in our experiments we share parameters between memory and T5 encoders, yielding a model
with only 13% more parameters than T5.
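A rough numpy sketch of such a cross-attention layer is shown below, with the learned query/key/value projections and multiple heads omitted for brevity; this is not the actual implementation and the names are ours.

    import numpy as np

    def memory_cross_attention(hidden, memories):
        # hidden:   (T, d) token states inside the T5 encoder
        # memories: (M, d) dense memory vectors, a handful per retrieved exemplar
        d = hidden.shape[-1]
        scores = hidden @ memories.T / np.sqrt(d)       # (T, M) attention logits
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over memory vectors
        return hidden + weights @ memories              # residual update of the token states

    tokens = np.random.default_rng(0).normal(size=(6, 16))
    mems = np.random.default_rng(1).normal(size=(8, 16))
    print(memory_cross_attention(tokens, mems).shape)   # (6, 16)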
Figure 7.2: Exact match (%) of retrieval-augmented models on the anonymized TOPv2 (S) dataset vs. the number of retrieved exemplars K. Both models use input TF-IDF retrieval. The Append tokens model appends retrieved exemplars to the input, while Dense memory compresses exemplars into multiple dense vectors and integrates them into the T5 encoder via cross-attention layers.
7.1.2 Preliminary Experiments
As a synthetic proof-of-concept experiment, we consider a setting constructed to ensure that more re-
trievals are crucial. In particular, we use the TOPv2 (S) dataset with anonymization applied during both
training and evaluation.
Anonymization is a standard data augmentation technique for semantic parsing. Once exemplars have
been retrieved, the idea is to randomly rename intents and slots in sample and exemplar labels. Intents and
slots are renamed consistently, such that, for example, intent “IN:SEND_MESSAGE” is mapped to “IN:123”
in all exemplars’ labels and in the true label. Anonymization prevents the model from memorizing intent
and slot names and forces it to rely on the exemplars to provide a correct parse.
The setting is engineered to benefit strongly from multiple retrievals. If some intent or slot in the
label does not exist in the exemplars, then the model has no way to make a correct prediction. The more
exemplars are available to the model, the more likely the label's intents and slots are covered, and the more
likely they are expressed in similar wording, making it easier for the model to capture the new names.
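A sketch of the anonymization step is shown below, assuming intents and slots appear as in:... and sl:... tokens inside the bracketed parse; the particular renaming scheme (in:123-style ids) is illustrative.

    import random
    import re

    def anonymize(parses, seed=0):
        # Rename every intent/slot consistently across all given labels; values are left intact.
        rng = random.Random(seed)
        names = sorted({m for p in parses for m in re.findall(r"(?:in|sl):\w+", p)})
        ids = rng.sample(range(100, 100 + 10 * len(names)), len(names))
        mapping = {n: n.split(":")[0] + ":" + str(i) for n, i in zip(names, ids)}
        pattern = re.compile("|".join(re.escape(n) for n in sorted(names, key=len, reverse=True)))
        return [pattern.sub(lambda m: mapping[m.group(0)], p) for p in parses]

    label = "[in:send_message [sl:group development]]"
    exemplar_label = "[in:send_message [sl:group friends]]"
    print(anonymize([label, exemplar_label]))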
The results in Figure 7.2 show that when the number of retrieved exemplars is small (K = 1, K = 2,
and K = 4), the difference between the dense memory approach and a baseline that concatenates tokens
is small. However, the dense memory method is capable of performing more retrievals due to the lower
cost of integrating retrieved exemplars. For K = 8 or K = 16 retrieved exemplars the dense memory
method significantly outperforms the baseline.
7.1.3 Discussion
Figure 7.3: Exact match (%) of retrieval-augmented models on anonymized TOPv2 (S) with different training set sizes (percentage of available training data). Both models retrieve K = 4 exemplars using input TF-IDF similarity. The Append tokens model appends retrieved exemplars to the input, while Dense memory compresses exemplars into multiple dense vectors and integrates them into the T5 encoder via cross-attention layers.
One of the challenges currently preventing the dense memory method from being applied to real
datasets is that it requires a lot of training data. As we showed previously, the gap between dense memory
and the concat tokens baseline on the anonymized TOPv2 (S) dataset when both methods retrieve K = 4
exemplars is modest. However, for lower amounts of available training data the gap increases dramatically (see
Figure 7.3). This is not surprising since T5 was never pre-trained to generate and use dense memories, so
we rely only on the fine-tuning stage.
This presents a dichotomy. On the one hand, more fine-tuning data improves the dense memory
method, but as we have shown earlier, it also renders retrieval unnecessary. On the other hand, in low-
resource situations, when multiple retrievals are beneficial, the dense memory cannot be adequately fine-
tuned.
We believe a promising direction to tackle this challenge is to pre-train dense memory and T5 models
on unlabelled text data. If the model learns to produce and use memory during pre-training, it may be able
to reap the benefits of increased retrievals in low-resource settings without requiring fine-tuning data.
7.2 Dense memory: future directions & challenges
In Part III, we have shown that dense memory of mention encodings is a competitive method for knowledge
acquisition from text. Now, we highlight several promising future directions and challenges that need to be
addressed.
Generalize memory to include non-entity information. World knowledge expressed in text is not
limited to entities. For example, several NLP tasks like common sense reasoning [77] require the model
to be aware of physical laws and social norms. If we want the memory to capture this knowledge, we
must expand our choice of mentions. For example, we could incorporate encodings of all noun phrases
into the memory, allowing us to acquire information on arbitrary objects, abstract concepts, etc. However,
this could be challenging because a significant increase in memory size makes it harder to find relevant
information and makes the search more computationally expensive.
Memory for multi-hop reasoning. Questions requiring multi-hop reasoning might be challenging for
the flat mention memory. For example, the mention memory generated from Wikipedia contains a lot of
facts about Barack Obama, which allows the model to answer relevant questions. However, what if a
question requires multiple facts that are useless individually? Consider a question, “Does Barack Obama
have access to secret service protection?”. Answering this question requires first identifying that Barack
Obama is a US ex-president and second that US ex-presidents have access to secret service protection. Finding
the mentions containing these facts is challenging because of the enormous search space. One research
direction is to explore a more structured version of mention memory to reduce effective search space. For
example, one can group facts by related entities and immediately eliminate most entities that are not from
a question or previously retrieved facts.
Memory for aggregated queries. Another class of challenging questions requires aggregating and
summarizing information from many sources. For example, “Did most senators support Barack Obama’s
health policy?” or “Based on text reviews, do people generally prefer iPhone or Android phones?”. One possible
approach is to combine mention memory with parametric entity memory developed in Part 1. This way,
an entity embedding vector can gather aggregated information about an entity.
Bibliography
[1] Armen Aghajanyan, Jean Maillard, Akshat Shrivastava, Keith Diedrick, Michael Haeger, Haoran
Li, Yashar Mehdad, Veselin Stoyanov, Anuj Kumar, Mike Lewis, and Sonal Gupta. Conversational
semantic parsing. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings
of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online,
November 16-20, 2020, pages 5026–5035. Association for Computational Linguistics, 2020.
[2] Eugene Agichtein, Eric Brill, and Susan Dumais. Improving web search ranking by incorporating
user behavior information. In Proceedings of the 29th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 19–26. ACM, 2006.
[3] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured
inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan-
guage Processing (EMNLP), pages 268–284, Online, November 2020. Association for Computational
Linguistics.
[4] Deepa Anand and Deepan Naorem. Semi-supervised aspect based sentiment analysis for movies
using review filtering. Procedia Computer Science, 84:86–93, 2016.
[5] Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks:
Distributional similarity for relation learning. In ACL 2019, 2019.
[6] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer.
CoRR, abs/2004.05150, 2020.
[7] Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autore-
gressive Language Modeling with Mesh-Tensorflow. March 2021.
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray,
Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato,
Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual, 2020.
[9] Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu.
Knowledgeable or educated guess? revisiting language models as knowledge bases. In ACL/IJCNLP
2021, 2021.
[10] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant,
Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil.
Universal sentence encoder. CoRR, abs/1803.11175, 2018.
[11] Ming-Wei Chang, Kristina Toutanova, Kenton Lee, and Jacob Devlin. Language model pre-training
for hierarchical document representations. CoRR, abs/1901.09128, 2019.
[12] Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks
for embedding-based large-scale retrieval, 2020.
[13] Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettlemoyer, and Sonal Gupta. Low-resource do-
main adaptation for compositional task-oriented semantic parsing. In Bonnie Webber, Trevor Cohn,
Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 5090–5100. Association for
Computational Linguistics, 2020.
[14] Hao Cheng, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Probabilistic assumptions mat-
ter: Improved models for distantly-supervised document-level question answering. In Dan Jurafsky,
Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5657–5667.
Association for Computational Linguistics, 2020.
[15] Xiao Cheng and Dan Roth. Relational inference for wikification. In Proceedings of the 2013 Conference
on Empirical Methods in Natural Language Processing, pages 1787–1796, Seattle, Washington, USA,
October 2013. Association for Computational Linguistics.
[16] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension.
In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Associ-
ation for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long
Papers, pages 845–855. Association for Computational Linguistics, 2018.
[17] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein,
Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-
HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 2924–2936.
Association for Computational Linguistics, 2019.
[18] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdi-
nov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint
arXiv:1901.02860, 2019.
[19] Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya Godbole, Ethan Perez, Jay Yoon Lee, Lizhen Tan,
Lazaros Polymenakos, and Andrew McCallum. Case-based reasoning for natural language queries
over knowledge bases. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau
Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 9594–9611.
Association for Computational Linguistics, 2021.
[20] Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, and William W. Cohen. Mention
memory: incorporating textual knowledge into transformers through entity mention attention. In
International Conference on Learning Representations, 2022.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[22] Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov,
and William W. Cohen. Differentiable reasoning over a virtual knowledge base. In ICLR 2020, 2020.
[23] Julian Martin Eisenschlos, Bhuwan Dhingra, Jannis Bulian, Benjamin Börschinger, and Jordan Boyd-
Graber. Fool me twice: Entailment from wikipedia gamification. arXiv preprint arXiv:2104.04725,
2021.
[24] Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich
Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models.
CoRR, abs/2102.01017, 2021.
[25] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. En-
tities as experts: Sparse memory access with entity supervision. In EMNLP 2020, 2020.
[26] Nicholas FitzGerald, Daniel M. Bikel, Jan A. Botha, Daniel Gillick, Tom Kwiatkowski, and Andrew
McCallum. MOLEMAN: mention-only linking of entities with a mention annotation network. In
ACL-IJCNLP 2021, 2021.
[27] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. En-
tities as experts: Sparse memory access with entity supervision, 2020.
[28] Thibault Févry, Livio Baldini Soares, Nicholas Arthur FitzGerald, Eunsol Choi, and Tom
Kwiatkowski. Entities as experts: Sparse memory access with entity supervision. In EMNLP 2020 -
Conference on Empirical Methods in Natural Language Processing, 2020.
[29] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot
learners. arXiv preprint arXiv:2012.15723, 2020.
[30] Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessandro Presta, Jason Baldridge, Eugene Ie, and
Diego García-Olano. Learning dense representations for entity retrieval. In Proceedings of the 23rd
Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November
3-4, 2019, pages 528–537. Association for Computational Linguistics, 2019.
[31] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar.
Accelerating large-scale inference with anisotropic vector quantization. In ICML 2020, 2020.
[32] Nitish Gupta, Sameer Singh, and Dan Roth. Entity linking via joint encoding of types, descrip-
tions, and context. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of
the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen,
Denmark, September 9-11, 2017, pages 2681–2690. Association for Computational Linguistics, 2017.
[33] Vivek Gupta, Akshat Shrivastava, Adithya Sagar, Armen Aghajanyan, and Denis Savenkov.
RETRONLU: retrieval augmented task-oriented semantic parsing. CoRR, abs/2109.10410, 2021.
[34] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Retrieval augmented
language model pre-training. In ICML 2020, 2020.
[35] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. TiiS,
5(4):19:1–19:19, 2016.
[36] Ruining He and Julian J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends
with one-class collaborative filtering. In Jacqueline Bourdeau, Jim Hendler, Roger Nkambou, Ian
Horrocks, and Ben Y. Zhao, editors, Proceedings of the 25th International Conference on World Wide
Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, pages 507–517. ACM, 2016.
[37] Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, and Houfeng Wang. Learning entity
representation for entity disambiguation. In Proceedings of the 51st Annual Meeting of the Association
for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 2: Short Papers,
pages 30–34. The Association for Computer Linguistics, 2013.
[38] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep
structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM
international conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.
[39] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open
domain question answering. In EACL 2021, 2021.
[40] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM
Trans. Inf. Syst., 20(4):422–446, 2002.
[41] Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Kumar Singh, and Mohit
Bansal. Hover: A dataset for many-hop fact extraction and claim verification. In EMNLP2020, 2020.
[42] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv
preprint arXiv:1702.08734, 2017.
[43] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan,
editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL
2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611. Association
for Computational Linguistics, 2017.
[44] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In ACL 2017, 2017.
[45] Jason J Jung. Attribute selection-based recommendation framework for short-head user group: An
empirical study by movielens and imdb. Expert Systems with Applications, 39(4):4049–4054, 2012.
[46] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi
Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In EMNLP
2020, 2020.
[47] Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. Dynamic entity representation
with max-pooling improves machine reading. In Kevin Knight, Ani Nenkova, and Owen Rambow,
editors, NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17,
2016, pages 850–855. The Association for Computational Linguistics, 2016.
[48] Tomás Kociský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis,
and Edward Grefenstette. The narrativeqa reading comprehension challenge. Trans. Assoc. Comput.
Linguistics, 6:317–328, 2018.
[49] Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for recommender
systems. IEEE Computer, 42(8):30–37, 2009.
[50] Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida I. Wang, and Luke Zettle-
moyer. Pre-training via paraphrasing. In NeurIPS 2020, 2020.
[51] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for
natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie
Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for
Computational Linguistics, 2020.
[52] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe
Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS 2020, 2020.
[53] Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. MTOP:
A comprehensive multilingual task-oriented semantic parsing benchmark. In Paola Merlo, Jörg
Tiedemann, and Reut Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of
the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021,
pages 2950–2962. Association for Computational Linguistics, 2021.
[54] Jeffrey Ling, Nicholas FitzGerald, Zifei Shan, Livio Baldini Soares, Thibault Févry, David Weiss, and
Tom Kwiatkowski. Learning cross-context entity representations from text. CoRR, abs/2001.03765,
2020.
[55] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What
makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804, 2021.
[56] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-
train, prompt, and predict: A systematic survey of prompting methods in natural language process-
ing. arXiv preprint arXiv:2107.13586, 2021.
[57] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach.
CoRR, abs/1907.11692, 2019.
[58] Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and
Honglak Lee. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th
ConferenceoftheAssociationforComputationalLinguistics,ACL2019,Florence,Italy,July28-August
2, 2019, Volume 1: Long Papers, pages 3449–3460, 2019.
[59] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR 2019, 2019.
[60] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher
Potts. Learning word vectors for sentiment analysis. In Dekang Lin, Yuji Matsumoto, and Rada
Mihalcea, editors, The 49th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages
142–150. The Association for Computer Linguistics, 2011.
[61] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Con-
textualized word vectors. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach,
Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December
2017, Long Beach, CA, USA, pages 6294–6305, 2017.
[62] Bradley N Miller, Istvan Albert, Shyong K Lam, Joseph A Konstan, and John Riedl. Movielens un-
plugged: experiences with an occasionally connected recommender system. In Proceedings of the
8th international conference on Intelligent user interfaces, pages 263–266. ACM, 2003.
[63] Xiangyang Mou, Mo Yu, Bingsheng Yao, Chenghao Yang, Xiaoxiao Guo, Saloni Potdar, and Hui Su.
Frustratingly hard evidence retrieval for QA over books. CoRR, abs/2007.09878, 2020.
[64] Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled
reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Nat-
ural Language Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 188–197, 2019.
[65] Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval aug-
mentation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021,
Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7683–7698. Association
for Computational Linguistics, 2021.
[66] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh,
and Noah A. Smith. Knowledge enhanced contextual word representations, 2019.
[67] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh,
and Noah A. Smith. Knowledge enhanced contextual word representations. In EMNLP-IJCNLP 2019,
2019.
[68] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. Deep contextualized word representations. In Marilyn A. Walker, Heng Ji,
and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New
Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237. Association for
Computational Linguistics, 2018.
[69] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick S. H. Lewis, Majid Yazdani, Nicola De Cao,
James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rock-
täschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In
NAACL-HLT 2021, 2021.
[70] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander H. Miller. Language models as knowledge bases? In EMNLP-IJCNLP 2019, 2019.
[71] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. 2019.
[72] Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive trans-
formers for long-range sequence modelling. arXiv preprint arXiv:1911.05507, 2019.
[73] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. JMLR, 2020.
[74] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of
the first instructional conference on machine learning, volume 242, pages 29–48. Citeseer, 2003.
[75] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-
networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the
2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7,
2019, pages 3980–3990. Association for Computational Linguistics, 2019.
[76] Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for se-
quence generation tasks. Trans. Assoc. Comput. Linguistics, 8:264–280, 2020.
[77] Maarten Sap, Vered Shwartz, Antoine Bosselut, Yejin Choi, and Dan Roth. Commonsense reasoning
for natural language processing. In Agata Savary and Yue Zhang, editors, Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, ACL 2020, Online,
July 5, 2020, pages 27–33. Association for Computational Linguistics, 2020.
[78] Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. Simple entity-centric ques-
tions challenge dense retrievers. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and
Scott Wen-tau Yih, editors, EMNLP 2021, 2021.
[79] Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. Relevance of unsupervised
metrics in task-oriented dialogue for evaluating natural language generation. CoRR, abs/1706.09799,
2017.
[80] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representa-
tions. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech-
nologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages
464–468. Association for Computational Linguistics, 2018.
[81] Akshat Shrivastava, Shrey Desai, Anchit Gupta, Ali Elkahky, Aleksandr Livshits, Alexander Zo-
tov, and Ahmed Aly. Retrieve-and-fill for scenario-based task-oriented semantic parsing. CoRR,
abs/2202.00901, 2022.
[82] Haitian Sun, Patrick Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, and William W. Cohen. Rea-
soning over virtual knowledge bases with open predicate relations. In ICML 2021, 2021.
[83] Yaming Sun, Lei Lin, Duyu Tang, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. Modeling mention,
context and entity with neural networks for entity disambiguation. In Qiang Yang and Michael J.
Wooldridge, editors, Proceedings of the Twenty-Fourth International Joint Conference on Artificial In-
telligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 1333–1339. AAAI Press, 2015.
[84] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu,
Hao Tian, and Hua Wu. ERNIE: enhanced representation through knowledge integration. CoRR,
abs/1904.09223, 2019.
[85] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu,
Hao Tian, and Hua Wu. ERNIE: enhanced representation through knowledge integration. CoRR,
abs/1904.09223, 2019.
[86] Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions.
In NAACL-HLT 2018, 2018.
[87] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. CoRR,
abs/2009.06732, 2020.
[88] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-
scale dataset for fact extraction and verification. In NAACL-HLT 2018, 2018.
[89] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg,
Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors,
Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Pro-
cessing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[90] Pat Verga, Haitian Sun, Livio Baldini Soares, and William Cohen. Adaptable and interpretable neural
memory over symbolic knowledge. In ACL 2021, 2021.
[91] Jesse Vig, Shilad Sen, and John Riedl. The tag genome: Encoding community knowledge to support
novel interaction. TiiS, 2(3):13:1–13:44, 2012.
[92] Jesse Vig, Shilad Sen, and John Riedl. The tag genome: Encoding community knowledge to support
novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3):13, 2012.
[93] Jonas Wallat, Jaspreet Singh, and Avishek Anand. Bertnesia: Investigating the capture and forgetting
of knowledge in BERT. CoRR, abs/2106.02902, 2021.
[94] Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and
Michael Zeng. Training data is more valuable than you think: A simple and effective method by
retrieving from training data. CoRR, abs/2203.08773, 2022.
[95] Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry
Tesauro, Bowen Zhou, and Jing Jiang. R³: Reinforced ranker-reader for open-domain question
answering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial
Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence
(EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5981–5988. AAAI Press, 2018.
[96] Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. Zero-shot entity
linking with dense entity retrieval, 2019.
[97] Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. Joint learning of the
embedding of words and entities for named entity disambiguation. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Language Learning, pages 250–259, Berlin, Germany, August
2016. Association for Computational Linguistics.
[98] Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. Learning distributed
representations of texts and entities from knowledge base. Trans. Assoc. Comput. Linguistics, 5:397–
411, 2017.
[99] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov,
and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question an-
swering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October
31 - November 4, 2018, pages 2369–2380. Association for Computational Linguistics, 2018.
[100] Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan
Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning:
Training BERT in 76 minutes. In 8th International Conference on Learning Representations, ICLR 2020,
Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[101] Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón,
Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big bird: Transformers for
longer sequences. CoRR, abs/2007.14062, 2020.
[102] Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha. Read-
twice: Reading very large documents with memories. In NAACL-HLT 2021, 2021.
[103] Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: document level pre-training of hierarchical
bidirectional transformers for document summarization. CoRR, abs/1905.06566, 2019.
[104] Chen Zhao, Chenyan Xiong, Jordan L. Boyd-Graber, and Hal Daumé III. Multi-step reasoning over
unstructured text with beam dense retrieval. In NAACL-HLT 2021, 2021.
Appendices
A Appendix for DocEnt: Learning Self-Supervised Entity Representations from Large Document Collections
The table below shows examples of queries from our Reddit Movie Suggestions dataset, along with the cor-
responding top retrieval results by DocEnt-Hybrid. Note that, for recommendation-style queries (such as “movies like X”), the model does in fact act like a recommender, returning movie entities similar to X based purely on text queries, with no preprocessing or entity linking.
Query
DocEnt-Hybrid top predictions
Films about people that end up irreversibly trapped in a surreal reality?
Mulholland Drive, Being John
Malkovich, Lost Highway, In-
land Empire, Eternal Sunshine
of the Spotless Mind, Synec-
doche, New York, Eraserhead,
The Science of Sleep, Black
Swan, Barton Fink
What iconic movies in American culture should people visiting
the US watch? We’re hosting some foreign exchange students from
Korea and we want to get some of the most iconic movies that give
a good overview of pop culture here to show them. What’s on your
can’t-miss list? What movies encapsulate the American culture and
are fun to watch?
The Breakfast Club, Forrest
Gump, Mean Girls, Ferris
Bueller’s Day Off, Animal
House, Casablanca, Grease,
Pulp Fiction, Singin’ in the
Rain, Back to the Future
What are some movies that emotionally destroy you? I recently
watched Manchester By The Sea and it ruined me. I loved every minute
of it. Another if my favourites is The Green Mile. What are some sug-
gestions of sad movies?
Grave of the Fireflies, Requiem
for a Dream, House of Sand
and Fog, The Hours, Brokeback
Mountain, Million Dollar Baby,
A.I. Artificial Intelligence, Leav-
ing Las Vegas, The Elephant
Man, Titanic
Unreliable Narrator. Any suggestions for movies containing unre-
liable narrators? (Characters version of events is false through mental
illness, narcissism, naivety etc)
Memento, Mulholland Drive,
The Prestige, Following, Vanilla
Sky, Shutter Island, The Ma-
chinist, Inception, Hidden, Eter-
nal Sunshine of the Spotless
Mind
Looking for a noir movie that is somewhat comedic/not super
dark, like Inherent Vice.
Kiss Kiss Bang Bang, The Ice
Harvest, Bad Lieutenant: Port
of Call New Orleans, L.A. Confi-
dential, Assassination of a High
School President, One for the
Money, The Salton Sea, The
Man Who Wasn’t There, Seven
Psychopaths, Palmetto
Movies Like Four Weddings and a Funeral. This is on behalf of
someone else. She likes the highbrow British humour and that it isn’t
oversexed or necessarily focused on romance
Four Weddings and a Funeral,
Love Actually, Notting Hill,
Bridget Jones’s Diary, Shaun of
the Dead, About a Boy, The Best
Exotic Marigold Hotel, About
Time, Death at a Funeral, The
Wedding Date
Cynical Private Detective Noir Films? You know the stereotype: an
ex-cop private eye, whiskey flask and cigarettes, fedora and trench coat
under a street lamp, internal monologues full of metaphors and cyni-
cism while a lonely saxophone wails in the background. Real Noir De-
tective stuff. They parodied pretty much this exact stereotype in Who
Framed Roger Rabbit, and even Calvin and Hobbes made fun of it per-
fectly from time to time. I know the stereotype, but have never found
a single movie that plays it straight. I’ve tried to explore 1940’s noir
thinking that was it, but, even though they call it the "Bogart drawl",
Maltese Falcon and Casablanca don’t even come close to this style of
film. Anyone know some good examples of this kind of movie, or any
way that I can find movies like this more easily?
Chinatown, Kiss Me Deadly,
Kiss Kiss Bang Bang, L.A. Con-
fidential, The Man Who Wasn’t
There, Farewell, My Lovely, Sin
City, The Good German, Dead
Men Don’t Wear Plaid, The Mal-
tese Falcon
Quality, melancholic romcoms. Just got done watching About
Time, Eternal Sunshine of the Spotless Mind, and 500 Days of Summer.
I was wondering if there are any other similar movies in this genre. Bit-
tersweet, romantic, melancholic, and make you think about your own
life. It’s not just typical Nicholas Sparks romance crap, it has some
quality to it. Thanks!
About Time, 500 Days of Sum-
mer, Seeking a Friend for the
End of the World, One Day,
Like Crazy, The Time Traveler’s
Wife, Crazy, Stupid, Love, Her,
Liberal Arts, Love Happens
A movie with intercultural communication and intercultural
encounters. I am taking a globalization course at school and have
to write a paper on a movie that displays different cultures interacting
with each other. The brief description is "Choose and analyze a film
which portrays some of the characteristics of intercultural communi-
cation and intercultural encounters." Some of the examples are Lost in
Translation, Borat, Crash, Slumdog Millionaire, etc. Just wondering
if you guys had any recommendations because I’ve seen most of the
movies on the list and want something different. It can be any genre as
long as it displays intercultural interactions
Lost in Translation, Crash, Ba-
bel, Japanese Story, Outsourced,
The Visitor, The Best Exotic
Marigold Hotel, Gran Torino,
The Ramen Girl, The Harimaya
Bridge
I’m looking for some movies for my grandson and I this week.
He is 3, and loves Seasame Street, but I’d rather watch something
we both get chuckle at. My grandson is 4, and I was looking into
movies to rent this week.
Shark Tale, Cats & Dogs, Find-
ing Nemo, The Nut Job, Mada-
gascar, Monsters, Inc., Scooby-
Doo, Garfield: The Movie, Ice
Age, Rango
Movies that are pretty much people talking...about stuff. Genre
doesn’t matter. But, to give you some examples.. 12 Angry Men (1957)
Before trilogy Exam (2009) The Invitation I think I prefer something
slow-paced and laid back instead of something like Circle (2015), I
guess. EDIT: I am depressed. So, I don’t want to see anything depress-
ing.
August: Osage County, The
Man from Earth, Margin Call,
The Sunset Limited, 12 An-
gry Men, Coffee and Cigarettes,
The Station Agent, Killing Them
Softly, The Big Kahuna, The
Comedy
Movies similar to Amelie? I was wondering if there are any similar
movies to Amelie. I am probably gonna watch Amelie later this week.
Foreign titles are welcome.
Amélie, A Very Long Engage-
ment, Romantics Anonymous,
Delicatessen, Happenstance, He
Loves Me... He Loves Me Not,
Love Me If You Dare, Priceless,
Micmacs, Delicacy
Looking for a good ’Colonial Epic’ like Lawrence of Arabia or
’Out of Africa.’ I know I’ve just made up a genre there, but with hope
the two movies I’ve listed will go some distance toward explaining what
I’m looking for. Thanks in advance for your suggestions.
The Four Feathers, Indochine,
Out of Africa, The Mission, The
Four Feathers, A Passage to In-
dia, The English Patient, Black
Robe, Zulu Dawn, The Patriot
Movies centering about groups of seemingly different people
with different agendas. The movie generally starts with present-
ing every member of the group in some situation, then there is sort
of a meeting place where all the protagonists come to know each other
before the story unfolds. The characters involved are generally reclu-
sive, usually selfish. Agendas are unveiled progressively throughout
the movie as characters are either asked to cooperate or to single-out
one member of the group ("the killer is one of us !"). Examples would be
almost every adaptation of Agatha Christie’s Poirot novels, although I
should stress the fact that it shouldn’t necessarily be a whodunit (if it’s
a good one though, i’ll take it !). Thanks for any suggestion.
Gosford Park, Identity, And
Then There Were None, Mind-
hunters, Nine Dead, Clue, Mur-
der on the Orient Express, Devil,
Ten Little Indians
A movie where a loser/underdog becomes a winner/successful
and obtains everything he/she wanted but doesn’t become
happy. More specifically, a movie where the process of becoming a
winner requires a change in the character that creates internal disso-
nance (Maybe the character becomes something he hates). I’m inter-
ested in seeing if there are any movies that focus on the duality of man
or the internal struggle of doing social good vs obtaining personal grat-
ification.
Fight Club, There Will Be Blood,
The Master, Groundhog Day,
American Beauty, The Social
Network, Adaptation, Citizen
Kane, A Clockwork Orange, The
Wolf of Wall Street
Movies that ’feel’ weird like Being John Malkovich, The Game,
Eternal Sunshine, The Lobster, Truman Show... I really hope I can
accurately describe what I mean, because it’s more a feeling than any-
thing else, so here I go. I’m looking for movies that have a similar vibe
to these movies: Being John Malkovich The Game Eternal Sunshine of
the Spotless Mind The Box The Lobster Truman Show Her What these
have in common for me is that they feel weirdly realistic and unrealistic
at the same time. They are set in the world we all now - but somehow
it’s not the world that we know because it has some weird rules that the
film still just takes as a given, as if they were normal. I think Being John
Malkovich is the best example of what I’m looking for (have seen all
Charlie Kaufman movies by the way). The world looks just like ours.
The people look just like we do. But they all behave in such strange
ways, like people would never act in real life. But in the context of the
film it all makes sense. I really don’t know how to describe it, because
that’s what I’m looking for. Movies that make it hard to put your finger
on what it is that makes them so uncanny.
Eternal Sunshine of the
Spotless Mind, Being John
Malkovich, Synecdoche, New
York, Stranger than Fiction,
Adaptation, I Heart Huckabees,
The Science of Sleep, Her,
Donnie Darko, 500 Days of
Summer
Movies about dedication to a skill/craft/etc (example: Whiplash)
and/or about writers. Any movie with someone going to great
lengths to excel would be great. The writing part would be a welcome
bonus though. Any suggestions?
Magic Beyond Words: The JK
Rowling Story, Adaptation, The
Words, Stranger than Fiction,
Capote, Finding Forrester, Won-
der Boys, What Just Happened,
The Muse, Midnight in Paris
B Appendix for ReadTwice: Reading Very Large Documents with Memories
B.1 Method
Sparse MemoryAttention In the standard setting, ReadTwice (CLS) and ReadTwice (STS) apply the MemoryAttention layer to all tokens h_{ij}. Our preliminary experiments showed that ReadTwice (E) benefits from a sparse pattern, where the layer is applied only over tokens that belong to an entity mention and acts as an identity function for all other tokens. This follows the intuition that long-range dependencies in text mainly occur between entity mentions. Similar sparse patterns for ReadTwice (CLS) and ReadTwice (STS) (e.g., CLS tokens attending only to CLS-based memories) affected their performance negatively.
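For illustration, here is a minimal NumPy sketch of this sparse pattern; the function and array names are ours, and the real layer uses learned query/key/value projections and multi-head attention, which are omitted here.

```python
import numpy as np

def sparse_memory_attention(h, memory, mention_mask):
    """Apply memory attention only at entity-mention tokens (a simplified sketch).

    h:            [num_tokens, d]    token representations h_ij of one segment
    memory:       [num_memories, d]  global memory table M
    mention_mask: [num_tokens] bool, True where a token belongs to an entity mention
    Tokens outside mentions are passed through unchanged (identity).
    """
    out = h.copy()
    idx = np.where(mention_mask)[0]
    if idx.size == 0:
        return out
    scores = h[idx] @ memory.T                    # [num_mentions, num_memories]
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over memories
    out[idx] = h[idx] + attn @ memory             # residual update (simplification)
    return out
```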
B.2 Pre-training details
Entity mention specific pre-training We use coreference resolution as an auxiliary pre-training task
specifically for ReadTwice (E). If there are two entities in the memory table pointing to the same entity
(but in different segments), our co-reference resolution task will encourage their entity representations to
be close. This is achieved through a binary classification task on whether entries m and m' in M point to the same entity. The classification probability is modeled as

y_{mm'} = \sigma(M_m^\top M_{m'} + b_0)

where b_0 is a bias term. A logistic loss is formed (using the ground-truth as positive examples and all other
entities as negative examples) and added to the MLM learning objective, for every entry in the memory
table. Memories that correspond to mentions without a corresponding entity ID (meaning Entity Linking
has failed to link them to IDs) are ignored in this loss.
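A minimal NumPy sketch of this auxiliary objective is given below; the names are ours, and the real implementation operates on the memory table produced during the first read.

```python
import numpy as np

def coref_logistic_loss(M, entity_ids, segment_ids, b0=0.0):
    """Binary coreference loss over the memory table (a simplified sketch).

    M:           [n, d] mention encodings in the memory table
    entity_ids:  [n] entity ID per entry, -1 if entity linking failed
    segment_ids: [n] segment each entry came from
    """
    logits = M @ M.T + b0                                   # y_{mm'} before the sigmoid
    linked = entity_ids >= 0
    valid = segment_ids[:, None] != segment_ids[None, :]    # only pairs from different segments
    valid &= linked[:, None] & linked[None, :]              # ignore unlinked mentions
    labels = valid & (entity_ids[:, None] == entity_ids[None, :])
    # stable log sigmoid(x) = -log(1 + exp(-x)); log(1 - sigmoid(x)) = -log(1 + exp(x))
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    loss = -(labels * log_p + (valid & ~labels) * log_not_p)
    return loss.sum() / max(valid.sum(), 1)
```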
Ablation results (c.f. Table B.2) show that this auxiliary loss is not critical for model performance, although it does give a modest boost to the ROUGE-L score on the NarrativeQA task.
MLM Analysis We consider whether the model learns to use memories in its predictions, evaluating MLM accuracy on a held-out set of Wikipedia articles and books. We compare ReadTwice with a restricted version of itself that cannot access memories from different segments, denoted the Single Segment (SS) setting. The single-segment version of ReadTwice is essentially a standard RoBERTa model with an additional
Model | MLM acc (valid) on entity tokens, % | MLM acc (valid) on all tokens, %
ReadTwice (SS) | 42.9 | 48.9
ReadTwice | 50.3 | 50.5
Table B.1: Masked Language Model (MLM) accuracy on the held out set. SS mode corresponds to the case
when memories are collected only from the segment itself, effectively disabling any information propaga-
tion between segments.
Model | HQA | NQA-R | NQA-B | TQA
ReadTwice (E) | 75.89 | 22.71 | 21.07 | 80.7
w/o coref | 76.0 | 21.79 | 21.01 | 80.5
Table B.2: Ablation studies on variants of ReadTwice. We report F1 (answer only) for HQA, ROUGE-L and BLEU-1 for NQA (-R and -B respectively), and F1 for TQA.
attention mechanism over entity mentions within the segment, but different segments are processed com-
pletely independently. The results are reported in Table B.1. ReadTwice achieves a +1.5% accuracy gain
over ReadTwice (SS), rising to +7.4% for entity tokens, confirming the model learns to utilize mention
memory.
B.3 Question Answering
B.4 Extractive QA layers
The model is fine-tuned and evaluated on several extractive QA tasks. We introduce additional QA-specific
layers to generate span-based predictions. Let H_i be the model's output for segment i. The model generates
Model #Params
BERT [21] 110M
RoBERTa [57] 125M
LF [6] 149M
ETC [3] 166M
BigBird [101] 166M
ReadTwice (ENTITY) 145M
Table B.3: Number of parameters per model.
Model | dev F1 | dev EM | test (full) F1 | test (full) EM | test (verified) F1 | test (verified) EM
LF | 75.2 | - | - | - | - | -
BigBird | 79.5 | 75.7 | - | - | - | -
RoBERTa (us) | 75.9 | 71.3 | - | - | - | -
ReadTwice (ENTITY) | 80.7 | 76.47 | 80.9 | 76.7 | 89.2 | 86.8
Table B.4: Results on the TriviaQA dataset in Wikipedia setting.
separate scores (logits) for whether a token j is the beginning of an answer span, Z^{(b)}_{i,j}, and another score for the end of the span, Z^{(e)}_{i,j}:

Z^{(b)}_{i,j} = W_b \cdot \mathrm{FFN}(H_{i,j})    (B.1)

Z^{(e)}_{i,j} = W_e \cdot \mathrm{FFN}(H_{i,j})    (B.2)

where W_b, W_e are learnable weights and FFN(·) is a shared fully-connected layer.
For the loss function we largely follow works by [16, 14], which describe an efficient way to train
an extractive QA system in a multi-segment setting where there are multiple correct answer spans in the evidence. Let B = {(i,j) | an answer span starts at position j in segment i}. Then the loss is an OR-model with global normalization:

L^{(b)}_{\mathrm{span}} = -\log \frac{\sum_{(i,j) \in B} \exp Z^{(b)}_{i,j}}{\sum_{i,j} \exp Z^{(b)}_{i,j}}    (B.3)

The loss for the end position of the answer span, L^{(e)}_{\mathrm{span}}, is computed in a similar way. During inference, the model picks the most confident prediction amongst all the segments.
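For concreteness, a NumPy sketch of the globally normalized start loss is shown below; the names are ours, and the real model produces the logits with the learned layers from Equations B.1 and B.2.

```python
import numpy as np

def span_start_loss(start_logits, start_labels):
    """Globally normalized OR-model loss for answer-span start positions.

    start_logits: [num_segments, seg_len] scores Z^(b)_{i,j} for every token
    start_labels: [num_segments, seg_len] bool, True where a gold answer span starts
    Assumes at least one gold start position; normalization runs over all tokens
    of all segments belonging to one question.
    """
    z = start_logits.reshape(-1)
    b = start_labels.reshape(-1)
    z = z - z.max()                          # numerical stability
    log_denom = np.log(np.exp(z).sum())      # log sum over all (i, j)
    log_numer = np.log(np.exp(z[b]).sum())   # log sum over gold start positions B
    return -(log_numer - log_denom)

# The end-position loss is computed the same way from Z^(e); at inference, the span
# with the highest combined start/end score across all segments is returned.
```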
B.4.1 HotpotQA
Data download link: https://hotpotqa.github.io
Unlike other datasets, HotpotQA contains questions with “yes” / “no” answers. In order to handle them appropriately, we introduce an additional classification layer on top of the ReadTwice output CLS representation. The layer produces scores for three possible options: the answer is “yes”, the answer is “no”, and the answer is a span in the document. During training we normalize these scores globally across all passages. The loss function for the option classifier is the negative log-likelihood of the correct option, applied only to the two supporting paragraphs (not the distractors). During inference we select the option with the highest score across all paragraphs. If the selected option is “yes” or “no”, we use it as the model’s prediction. If the classifier predicts that the answer is a span, then we use the standard extractive QA layer to extract a span.
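The sketch below shows one possible reading of this globally normalized option classifier in NumPy; the names are ours, and the exact normalization used in the original implementation may differ.

```python
import numpy as np

def option_classifier_loss(option_logits, gold_option, supporting):
    """Yes/no/span option loss with global normalization across passages.

    option_logits: [num_passages, 3] scores for ("yes", "no", "span") per passage
    gold_option:   int in {0, 1, 2}, the correct option
    supporting:    [num_passages] bool, True for the two supporting paragraphs
    """
    z = option_logits - option_logits.max()
    log_probs = z - np.log(np.exp(z).sum())          # normalize over all passages and options
    # negative log-likelihood of the correct option, only on supporting paragraphs
    return -log_probs[supporting, gold_option].sum()

def predict_option(option_logits):
    """Pick the option with the highest score across all paragraphs."""
    _, j = np.unravel_index(option_logits.argmax(), option_logits.shape)
    return int(j)     # 0 = "yes", 1 = "no", 2 = the answer is a span
```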
Model selection The model was selected based on the highest F1 (answer only) score on the development set. The best model was trained with a learning rate of 3 × 10^{-5} for 6 epochs.
B.4.2 TriviaQA
Data download link: https://nlp.cs.washington.edu/triviaqa
A complete set of TriviaQA evaluation results is shown in Table B.4.
Model selection The model was selected based on the highest F1 score on the development set. The best model was trained with a learning rate of 1 × 10^{-5} for 5.3 epochs.
B.4.3 NarrativeQA
Data download link: https://github.com/deepmind/narrativeqa
Here we provide more information on the NarrativeQA dataset. In particular, we would like to point out that the manner in which NarrativeQA questions were generated encourages questions to require information from distant parts of the documents.
Every question was written given a short Wikipedia summary of a movie/book as context. Accordingly,
models can achieve high accuracy on NarrativeQA when given the summary as evidence, rather than the
whole book ([63] [14] report ROUGE-L scores of 57.19 and 60.5 on the dev set, respectively). However, each
sentence in the Wikipedia summary might correspond to a whole chapter of a book, so any question that
uses information from multiple summary sentences is likely to require information from different sections
of the book. Indeed, retrieve-and-read methods perform poorly in this setting [63]. On the other hand,
ReadTwice performs significantly better, demonstrating its ability to capture long-term dependencies.
Preprocessing In contrast to HotpotQA and TriviaQA, answers to NarrativeQA questions do not necessarily correspond to spans in the document. An exact answer can be found only in ≈ 40% of cases; for the rest of the questions we use a ROUGE-L oracle as labels.
Fine-tuning Similarly to [28], we found it helpful to enforce sparsity in the MemoryAttention layer by computing attention in Equations 1 and 4 only over the 100 memories m that have the largest dot product with the hidden state h_{ij}.
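As an illustration, a minimal NumPy sketch of this top-k restriction for a single token is given below; the names are ours, and the real layer applies the same idea inside multi-head attention with learned projections.

```python
import numpy as np

def topk_memory_attention(h_ij, memory, k=100):
    """Attend only over the k memories with the largest dot product with h_ij.

    h_ij:   [d] hidden state of one token
    memory: [num_memories, d] memory table (assumes num_memories >= k)
    Returns the attention-weighted sum over the selected memories.
    """
    scores = memory @ h_ij                        # [num_memories] dot products
    top = np.argpartition(-scores, k - 1)[:k]     # indices of the top-k memories
    s = scores[top] - scores[top].max()           # numerical stability
    w = np.exp(s) / np.exp(s).sum()               # softmax over the top-k only
    return w @ memory[top]
```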
Evaluation In line with previous work [48, 63], we convert both hypothesis and reference to lowercase and remove a trailing period before running the evaluation script. Following [63] we use an open-source library (https://github.com/Maluuba/nlg-eval by [79]) to perform evaluation, which includes ROUGE-L, BLEU-1, BLEU-4 and METEOR scores.
Model selection The model was selected based on the highest ROUGE-L score on the development set. The best model was trained with a learning rate of 5 × 10^{-6} for 2.2 epochs.
C Appendix for Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
C.1 Pre-training
We train on English Wikipedia, processed with the entity linking and named entity recognition tools from
the Google Cloud NLP API (https://cloud.google.com/natural-language/docs/basics#entity_analysis). We use existing hyperlinks in Wikipedia as additional entity annotations. All
models are pre-trained on 128 TPUs using AdamW optimizer [59] with learning rate 1e-4 and batch size
of 4096. Each passage in the batch has length T = 128, excluding entity tokens. The Mention Encoder
and batch-tome are pre-trained for 1 million steps with 50k warmup steps, and tome is trained for 500k
additional steps with 25k warmup steps after initialization from batch-tome. Both models are trained with linear learning rate decay. Mention Encoder and batch-tome share Transformer weights during Mention
Encoder pre-training. We apply gradient clipping with a norm of 1.0 and weight decay of 0.01. Weight
decay is applied to all weights except layer norm and bias weights.
batch-tome and tome are trained with weight 0.85 on the MLM objective and 0.15 on the entity
coreference resolution objective. We mask 20% of whole entity mentions and 10% of other tokens. We limit
the coreference resolution objective to mentions of the 1 million most frequent Wikipedia entities. We use
24 mentions per sample, with a batch size of 32 samples per TPU. We subsample mentions uniformly if
the average number of annotated mentions on a TPU exceeds 24. Key mention encodings have dimension d_K = 128, and value and coreference mention encodings have dimension d_V = d_C = 512.
Disallowed same passage retrieval for Mention Encoder. We want the model to use memory as a
source of additional information for processing a passage. Therefore, we explicitly set attention weights
to 0 for memories generated from the same passage as the current one.
C.1.1 Mention Encoder data generation
We pre-train Mention Encoder to produce mention encodings that are useful for batch-tome. In order to provide batch-tome with an incentive to use the memory, we need to ensure that mentions from different
samples within a batch are relevant to each other. We achieve this by batching passages from the same or
related Wikipedia articles.
We generate clusters of 256 passages from Wikipedia articles using a greedy method. First, we create
a cluster from the longest unused Wikipedia article and add related articles until the cluster consists of
256 passages. In particular, at each step we add the article with the largest Jaccard similarity between its
entity set and the entity set of articles in the current cluster.
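A simplified sketch of this greedy clustering is shown below; the data structures and names are ours, and the real pipeline operates on passages and entity annotations produced by the preprocessing step.

```python
def greedy_entity_clusters(articles, target_passages=256):
    """Group Wikipedia articles into clusters of roughly target_passages passages.

    articles: list of dicts with keys "num_passages" (int) and "entities" (set)
    Returns a list of clusters, each a list of article indices.
    """
    def jaccard(a, b):
        return len(a & b) / max(len(a | b), 1)

    unused = sorted(range(len(articles)),
                    key=lambda i: articles[i]["num_passages"], reverse=True)
    clusters = []
    while unused:
        seed = unused.pop(0)                          # longest remaining article
        cluster, entities = [seed], set(articles[seed]["entities"])
        size = articles[seed]["num_passages"]
        while size < target_passages and unused:
            # add the article whose entity set is most similar to the cluster's
            best = max(unused, key=lambda i: jaccard(articles[i]["entities"], entities))
            unused.remove(best)
            cluster.append(best)
            entities |= articles[best]["entities"]
            size += articles[best]["num_passages"]
        clusters.append(cluster)
    return clusters
```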
C.1.2 Coreference resolution loss
For every linked mention m in the batch we compute a mention encoding z_m by applying a separate SpanEncodingLayer on the output of batch-tome. First, we compute the loss for every linked mention m in the batch. To this end, we denote linked mentions in every other passage in the batch as positive, P^+(m), if they have the same entity ID as m, and as negative, P^-(m), otherwise. The loss per mention is an average of cross-entropy losses over every positive mention m^+ \in P^+(m):

L_{\mathrm{coref}}(m) = -\frac{1}{|P^+(m)|} \sum_{m^+ \in P^+(m)} \log \frac{\exp(z_m^\top z_{m^+})}{\exp(z_m^\top z_{m^+}) + \sum_{m^- \in P^-(m)} \exp(z_m^\top z_{m^-})}

The total loss is the average of losses over linked mentions that have at least one positive mention (the set P^+(m) is not empty).
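A minimal NumPy version of this loss for a single batch is given below; the names are ours and the bookkeeping that maps mentions back to passages is simplified.

```python
import numpy as np

def batch_coref_loss(z, entity_ids, passage_ids):
    """Entity coreference loss over linked mentions in a batch (a sketch).

    z:           [n, d] mention encodings z_m from the SpanEncodingLayer
    entity_ids:  [n] entity ID of each linked mention
    passage_ids: [n] index of the passage each mention comes from
    """
    sims = z @ z.T                                    # z_m^T z_m'
    losses = []
    for m in range(len(z)):
        other = passage_ids != passage_ids[m]         # only mentions from other passages
        pos = other & (entity_ids == entity_ids[m])
        neg = other & (entity_ids != entity_ids[m])
        if not pos.any():
            continue                                  # skip mentions without positives
        per_pos = []
        for p in np.where(pos)[0]:
            logits = np.concatenate(([sims[m, p]], sims[m, neg]))
            logits = logits - logits.max()            # numerical stability
            # cross-entropy of the positive against the negatives
            per_pos.append(-(logits[0] - np.log(np.exp(logits).sum())))
        losses.append(np.mean(per_pos))
    return float(np.mean(losses)) if losses else 0.0
```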
C.2 Experiments
C.2.1 Fine-tuning setup
tome is fine-tuned on 32 TPUs using the Adam optimizer with a learning rate of 1e-5 and total batch
size 32. In contrast to pre-training we set max mentions to 32 per sample for fine-tuning. We use 1000
warmup steps and linear learning rate decay. Gradient clipping and weight decay are the same as during
pre-training. We take the highest scoring checkpoint on dev sets and evaluate it on the test set. We use
the spaCy noun chunker to detect noun phrases and treat these as claim/question entity mentions.
The model can be fine-tuned with full memory on a server of 8 A100 GPUs or 16 v3/v4 TPUs. A model
with half memory (75M mentions) can be fine-tuned on 8 V100s/P100s or 8 TPUs.
C.2.2 Baselines
Following [34] we used REALM to perform extractive question answering on TriviaQA and ComplexWe-
bQuestions datasets. We also adapted the model to the classification setting in order to apply it to claim
verification tasks. Given an input claim X, we compute the probability of a prediction Y (whether the claim holds true or not) as a marginalization over the retrieval Z:

\Pr(Y \mid X) = \sum_{z \in Z} \Pr(Y \mid X, Z = z) \cdot \Pr(Z = z \mid X)

where Pr(Y|X,Z = z) is the output probability produced by the reader model and Pr(Z = z|X) is
produced by the retrieval model.
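For concreteness, the sketch below marginalizes reader predictions over retrievals; the array names are ours.

```python
import numpy as np

def marginal_claim_probability(reader_probs, retrieval_probs):
    """Marginalize the claim-verification prediction over retrieved passages.

    reader_probs:    [num_retrievals, num_classes] Pr(Y | X, Z = z) per retrieval
    retrieval_probs: [num_retrievals] Pr(Z = z | X), summing to 1
    Returns Pr(Y | X) as a [num_classes] vector.
    """
    return retrieval_probs @ reader_probs
```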
C.2.3 Claim verification
See Table C.1 for the results on development and test splits of claim verification datasets. Additionally,
Table C.2 compares our FM2 results to the original dataset baselines.
C.2.4 Question Answering
We report additional results on the EntityQuestions dataset from [78]. The dataset consists of questions
involving rare entities, making it especially challenging for modern retrieval methods such as DPR. Eval-
uation results for tome models and baselines are shown in Table C.3 and Table C.4. Following [78] we
Table C.1: Accuracy on claim verification datasets. #Encoded refers to the number of passages encoded by
a BERT reader to answer a single question. EaE stands for Entities as Experts model.
Model | #Params | #Encoded | HoVer dev | HoVer test | FEVER dev | FEVER test | FM2 dev
RAG | 620M | 100 | - | - | 74.5 | 72.5 | -
REALM | 330M | 5 | 67.3 | 66.1 | 70.4 | 67.1 | 65.8
EaE | 360M | 1 | 66.2 | 66.6 | 66.1 | 63.6 | 63.5
tome-1 | 220M | 1 | 73.6 | 72.8 | 70.5 | 67.8 | 67.7
tome-2 | 220M | 1 | 74.1 | 73.1 | 71.1 | 68.1 | 68.4
Table C.2: Accuracy on FM2 compared with original dataset baselines. Oracle refers to oracle retrieval
followed by a BERT-Base reader.
Model Accuracy
Oracle [23] 69.3
DPR [23] 64.2
EaE 63.5
REALM 65.8
tome-1 67.7
tome-2 68.4
report recall at 20 as an evaluation metric. Since tome retrieves mentions rather than passages, a direct
comparison is difficult. We evaluate tome conservatively, treating recall at 20 as successful if one of the 20
highest scoring mentions belongs to the correct entity (in contrast to DPR, for which the correct answer
only has to be somewhere in the retrieved 100 word document).
tome sets the state of the art on this dataset and outperforms DPR by a very large margin. REALM
cannot be fairly compared to DPR due to longer retrieved passages (100 vs 288 tokens). Therefore, we
perform a separate experiment using accuracy with REALM as a baseline, showing large performance
gains over REALM as well.
Table C.3: EntityQuestions recall@20
Model Recall@20
DPR [78] 65.4
BM25 [78] 71.2
tome-1 83.3
tome-2 83.8
Table C.4: EntityQuestions top-1 accuracy
Model Accuracy
Entities as Experts 32.5
REALM 59.0
tome-1 62.1
tome-2 66.0
Table C.5: Performance ablations for pre-training objectives experiments.
Dataset | HoVer dev | FEVER dev | TriviaQA dev | CWQ dev
tome-1 | 73.6 | 70.5 | 50.8 | 44.9
w/o entity coreference loss | 69.8 | 68.4 | 42.5 | 40.5
w/o entity prediction loss | 73.7 | 70.7 | 49.4 | 43.8
C.2.5 Importance of pre-training objectives
We perform several ablation experiments for the pre-training procedure (see Table C.5). First, results show
that the entity prediction objective (c.f. Section 4.2.4) is not essential for TOME pre-training. Performance
on claim verification datasets (FEVER and HoVer) is not affected by whether we use entity prediction for
pre-training. More surprisingly, removing this objective only slightly decreases the performance on entity
question answering datasets (TriviaQA and ComplexWebQuestions). We predict entities for question-
answering in the same way as we do for the entity prediction objective during pre-training (c.f. Equa-
tion 4.10), so we expected the entity prediction auxiliary loss to be important.
On the other hand, a related entity coreference objective (c.f. Section C.1.1 and Appendix C.1.2) is
crucial for Batch-TOME and Mention Encoder pre-training. That is consistent with our intuition that
semantically challenging tasks incentivize the model to store useful information in memory.
C.2.6 tome initialization
We initialize the tome model with a pre-trained batch-tome model, which we find to be especially important for warming up retrieval. If tome is initialized from scratch (or even from BERT weights), tome does not learn to use the memory. In fact, tome has to be initialized from the same batch-tome used to generate the memory. This implies that multi-stage training is a vital ingredient for tome to succeed.
Our explanation for why tome is sensitive to initialization is that tome needs to learn two skills: first, to
effectively use retrieved mentions for its predictions, and second, to retrieve relevant mentions. Learning
both capabilities end to end gives rise to a mutual dependence: to get a signal for learning how to use
retrieved mentions, the retrieved mentions have to be useful, and to learn to retrieve useful mentions, the
model needs to utilize retrieved mentions. If initialized from scratch, the model is not able to learn both
skills simultaneously. The pre-training stage with the smaller in-batch memory functions as a curriculum
to address that problem.
C.3 Nearest neighbor search
Nearest neighbor search is an extremely common problem, and there exist numerous approaches and
packages for fast approximate nearest neighbor search (ANNS) [31, 42]. Most approaches employ two
methods for fast search: 1) compress the search table through projecting to a lower dimension and quan-
tization and perform comparisons in this compressed space and 2) divide the search table in buckets of
similar items, and search only a subset of the buckets. Retrieve-and-read methods use ANNS packages to search for
related passages [34, 52].
Applying such packages for tome is slightly trickier, as tome needs to perform ANNS inside the model. One viable route is to compute queries on-device, transmit them to a separate ANNS server, and then transmit the results back. We would recommend this approach for GPU accelerators, with faster host-device
communication and slower device-device communication. As we are using TPU accelerators, we decided
to use on-device ANNS, which does not require coordinating additional servers and will potentially allow
for backpropagating through memory in future work.
C.3.1 On-device nearest neighbor search
We shard the Mention Memory over all TPU devices. We perform search by distributing each query to all
devices and retrieving top-K search results from each local memory shard. Then, the results are distributed
back to the original devices and the local search results are aggregated through another, global top-K.
Dot-product The first method we describe is naive dot-product search, taking advantage of the matrix multiplication capacity of TPU accelerators. In this method we perform search over local shards by taking the
dot product between the query and the local memory shard and performing an approximate top-k oper-
ation over the results. Dot-product search is easy to implement and fast for smaller memory sizes (up to
10 million entries). We implemented this method first due to its simplicity and our primary experimental
results employ this search method.
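A minimal NumPy sketch of this two-stage (local then global) top-k search is shown below; the names are ours, and on a real TPU setup the per-shard searches run in parallel across devices.

```python
import numpy as np

def sharded_topk_search(query, memory_shards, k=128, k_local=2):
    """Two-stage nearest neighbor search over a sharded memory table.

    query:         [d] query vector
    memory_shards: list of [n_shard, d] arrays, one per device
    Returns (scores, (shard_idx, row_idx)) of the global top-k entries.
    """
    candidates = []
    for shard_id, shard in enumerate(memory_shards):
        scores = shard @ query                                     # local dot products
        kth = min(k_local, len(scores)) - 1
        top = np.argpartition(-scores, kth)[:k_local]              # local top-k
        candidates.extend((scores[i], shard_id, i) for i in top)
    candidates.sort(key=lambda t: -t[0])                           # global top-k
    best = candidates[:k]
    return [c[0] for c in best], [(c[1], c[2]) for c in best]
```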
ANNS To speed up search we implemented method 2) from standard CPU-based ANNS, bucketing the
search table and searching only a subset of buckets. In particular we perform k-means clustering to divide
the Mention Memory into clusters, and perform dot-product search over the top n_s clusters on each device.
Overhead While the Mention Memory is stored on-device, memory overhead is negligible as the mem-
ory table is sharded. For pre-training the Mention Memory took up 2.2% of available device memory. Table
C.6 shows the percentage of time spent on ANNS in tome-1 pre-training for different reader architectures. The
relative overhead of search becomes smaller with reader size, and ANNS overhead in particular becomes
negligible for BERT-Large and up. We did not measure standard CPU ANNS overhead, but it should be
comparable to or faster than our ANNS numbers.
Table C.6: Proportion of time spent on ANNS for the tome-1 pre-training setting.
Model Dot-product ANNS
BERT-Base 0.79 0.22
BERT-Large 0.48 0.07
T5-11B Encoder 0.17 0.02
Hyperparameters For ANNS in TOMEBlocks we take top-2 search results from each local memory
shard, and apply top-128 over the retrieved results. For ANNS in the entity prediction layer we take top-32
search results from each local shard, and aggregate across shards without applying an additional top-K
operation.
C.4 Retrieval examples
Table C.7: tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for the first
(→_1) memory attention layer for passage mentions “the novel”, “the movie” and “the album”. Memory
mentions are in brackets. We can see that the model can retrieve relevant mentions for non-named passage
mentions, and generally understands it is looking for mentions related to music. However, while the best
retrieval for “album” is from a passage that mentions sampling the Shining, it is quite far removed and it
is likely the retrieval is not sufficiently accurate here.
Claim: Stephen King wrote the novel that the movie directed by Stanley Kubrick that was sampled in the album “Where Blood and Fire Bring Rest” was based on. Label: TRUE
the novel →_1 Music Video. The video is a homage to Stanley Kubrick’s 1980 film The Shining based on the [Stephen King] novel ...
the movie →_1 Music Video. The video is a homage to Stanley Kubrick’s 1980 film The Shining based on the [Stephen King] novel ...
the album →_1 Where Blood and Fire Bring Rest is the third full-length album released by [metalcore] band ZAO. It was the first album to feature vocalist Dan Weyandt after the departure of Shawn Jonas along with new bassists/guitarists, Russ Cogdell and Brett Detar. The album contains a sample from the film The Shining ...
D Appendix for Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing
D.1 Method
Output TF-IDF
We measure the similarity between two parses, denoted as output similarity, by their TF-IDF similarity.
More precisely, we process each parse as a sentence in which each intent and slot in the parse is treated as a single token. Then we simply compute the TF-IDF similarity between the constructed sentences. Slot values are discarded; the similarity between slot values should already be captured by the input TF-IDF similarity.
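The sketch below illustrates this output similarity with scikit-learn; the regular expression that extracts intent and slot tokens is our assumption about the bracketed parse format shown in Tables D.2 and D.3, and the example parses are illustrative.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def parse_to_tokens(parse):
    """Keep only intent/slot labels, e.g. in:get_weather, sl:weather_attribute."""
    return " ".join(re.findall(r"(?:in|sl):\w+", parse))

parses = [
    "[in:get_weather When will it [sl:weather_attribute rain ] next ? ]",
    "[in:get_weather Is it [sl:weather_attribute raining ] ? ]",
]
docs = [parse_to_tokens(p) for p in parses]
tfidf = TfidfVectorizer(token_pattern=r"[^ ]+").fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))   # output similarity between the two parses
```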
D.2 Data
Dataset | #Train | #Dev | #Test
MTOP | 15667 | 2234 | 4385
MTOP_1k | 1096 | 2234 | 4385
MTOP_25% | 3916 | 2234 | 4385
TOPv2_S | 83703 | 11967 | 27336
TOPv2_W | 176 | 147 | 5682
TOPv2_R | 493 | 337 | 5767
Table D.1: Dataset statistics.
We show the sizes of datasets and splits in Table D.1.
D.3 Training
We initialize our model from the public T5.1.1-base checkpoint (https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511). All models are fine-tuned using the
ADAM optimizer with dropout 0.1 and weight decay 0.01. In our experiments, we consider sampling tem-
perature p ∈ {0.5, 0.1, 0.05}, batch size ∈ {128, 256}, and output similarity weight α ∈ {0, 0.25, 0.5, 0.75, 0.9, 1.0}, where α = 0 is our input TF-IDF baseline.
For all experiments we train (or in case of early stopping, provide the opportunity to train) input TF-
IDF and GandR for the same number of steps. We train for 120k steps on MTOP_boot (90k first stage / 30k second stage), 40k (10k/30k) on MTOP_1k, and 60k (30k/30k) on MTOP_25% with early stopping. For TOPv2,
we train for 90k steps on the source domain followed by 7.5k (5k/2.5k) steps on TOPv2 transfer domains
without early stopping.
Hyperparameters were selected based on development set performance. We measure and report aver-
age performance over 3 random seeds.
D.4 Error analysis
We provide two additional retrieval examples (Table D.2 and Table D.3) to illustrate cases where GandR
retrieval can be beneficial or harmful.
Input sample
x: When will it rain next?
y: [in:get_weather When will it [sl:weather_attrubute rain ] next ? ]
Training sample with similar input
x_1: Will it be cold today?
y_1: [in:get_weather Will it be [sl:weather_attrubute cold ] [sl:date_time today ? ] ]
ŷ: [in:get_weather When will it [sl:weather_attrubute rain ] [sl:ordinal next] ? ]
Training sample with similar input and label
x_2: Is it raining?
y_2: [in:get_weather Is it [sl:weather_attrubute raining ] ? ]
ŷ: [in:get_weather When will it [sl:weather_attrubute rain ] next ? ]
Table D.2: Input TF-IDF predicts the sl:ordinal slot, which exists in pre-training domains but does not
apply to the Weather domain. GandR has high slot coverage, so if a slot exists, it will likely be present
in at least one of the retrieved exemplars. The fact that GandR does not retrieve an exemplar with the
sl:ordinal slot (as it is not present in the Weather training exemplars) provides a hint for the model that
it may be an invalid slot and GandR’s updated prediction eliminates it.
Input sample
x: call myself
y: [in:create_call [sl:contact myself ]]
Training sample with similar input
x_1: remind myself to get breakfast
y_1: [in:create_reminder [sl:person_reminded myself ] [sl:todo get breakfast]]
ŷ: [in:create_call [sl:contact myself ]]
Training sample with similar input and label
x_2: Call Jeremy
y_2: [in:create_call [sl:contact Jeremy]]
ŷ: [in:create_call myself ]
Table D.3: Normally, the sl:contact slot for the in:create_call intent is paired with a name. In this
case, however, the contact is myself . The model with input TF-IDF retrieval generates the correct slot as
it retrieves another instance with slot myself due to lexical similarity in the input. In contrast, GandR
retrieves exemplars with perfectly matching templates, but without the same slot, such that it does not
assign myself to the sl:contact slot in its prediction.
Abstract
Knowledge acquisition, the process of extracting, processing, and storing new information, is critical to any intelligent system. Nonetheless, modern neural networks (e.g., BERT) used in natural language processing typically do not have an explicit memory component. Instead, the knowledge about the world that the models acquire is stored implicitly in the model’s parameters. This proves unreliable and makes the models ill-suited for knowledge-intensive tasks that require reasoning over vast amounts of textual data. My thesis explores alternative parametric and semi-parametric methods to extract and represent knowledge from text. The main hypothesis is that we can improve the performance of modern NLP models by representing acquired knowledge in a dedicated memory. The models can access knowledge explicitly through interacting with the memory. The thesis consists of three sections: the first section focuses on parametric memory for a pre-defined set of entities. The second part explores a semi-parametric approach to representing entity-centric knowledge in a long document or entire corpus. Finally, the last part discusses memory for structured prediction tasks.