EXPANDING THE PERFORMANCE-COMPUTE FRONTIER FOR RETRIEVAL-AUGMENTED
LANGUAGE MODELS
by
Michiel de Jong
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2023
Copyright 2023 Michiel de Jong
Dedication
To my parents, Jan and Sandra, for unconditional support.
Acknowledgements
I want to start by thanking Fei Sha, for his advice and mentorship throughout the years. He always chal-
lenged me to improve, to set high standards, and to relentlessly clarify my own thinking and communi-
cation of those insights to others. He encouraged me to explore my current line of research, a suggestion
which I originally met with hesitance but for which I am extremely grateful, and introduced me to so many
wonderful collaborators. It’s safe to say I would not be where I am without him.
I’m very grateful to Leana Golubchik, who took over as my advisor in the last years of my PhD and
was endlessly patient, helpful, kind and inclusive.
I’m similarly thankful to the members of my dissertation, proposal and qualifying committees: Jacob
Bien, Muhao Chen, Keith Michael Chugg, Robin Jia, Jay Pujara, Jesse Thomason, and Dani Yogatama.
Through my PhD I have received endless support from the USC academic staff. Lizsl de Leon was
kind and extremely helpful during very challenging times. Asiroh Cham took over and resolved multiple
last second problems even though she was performing multiple roles. Finally, Nina Shilling made sure we
always had a welcoming atmosphere in the lab.
I’m happy to have met so many dear friends, collaborators and colleagues during my time at USC. Yury
Zemlyanskiy, in particular, has been my constant partner since the very first time we worked together, and
has contributed incredible joy to the research process and the PhD. It’s not an exaggeration to say that our
collaboration has been at the root of all my successes, such as they are.
I’m also grateful for the many other friends and colleagues I met at USC. Aaron Chan was an early
collaborator upon whom I inflicted many of my learning mistakes. Shariq Iqbal, Bowen Zhang and es-
pecially Seb Arnold helped keep me sane during the ups and downs of my PhD. I had many good times
and adventures with Melissa Ailem, Soravit (Beer) Changpinyo, Liyu Chen, Robby Costales, Jeremy Hsu,
Hexiang (Frank) Hu, Zhiyun Lu, Ivy Xiao, Yiming Yan, Ke Zhang, and Bill Zhu.
For the last two years of my degree I was an intern at Google Research, where I had the opportunity to
meet and work with many extremely competent as well as supportive people. I want to highlight William
Cohen, who was almost a second adviser, helping to guide my research, championing me at Google and
giving me so many opportunities. I am glad to know Nicholas FitzGerald and Joshua Ainslie as mentors,
collaborators and friends. Sumit Sanghai supported me for my last internship at Google to allow me to
complete my PhD, and was always willing to help. Livio Baldini Soares was universally kind and helpful
throughout my entire time at Google. I’ve had the privilege to hold many enlightening research discussions
with Markus Rabe, and insightful discussions or collaborations with Wenhu Chen, Santiago Ontañón, Pat
Verga, James Lee-Thorp, Tony Wu and others too numerous to mention!
Finally, I want to thank my family, my friends and especially my parents. They supported me through-
out five long years - and I needed a lot of support. I’m so grateful, and I will do my best to pay it forward.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
I Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Relationship to Published Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
II Computational trade-offs for retrieval-augmented language models . . . . . . . . . . . . . . . . 6
Chapter 2: FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference . . . 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Fusion-in-Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 FLOPs of FiD model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Effective computational throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.4 Operational intensity of FiD inference . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Layer-sparse cross-attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Multi-query attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Asymmetric Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.3 Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.4 Other analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
III Pre-computed memory representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Chapter 3: Mention Memory: incorporating textual knowledge into Transformers through entity
mention attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Constructing mention memory from corpus . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1.1 Mention Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1.2 Mention memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 tome model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2.1 Attention over memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.2.2 Sparse large-scale retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.3 Mention encoder pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.4 tome pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.3 Claim verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.5 Qualitative properties of tome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Chapter 4: Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval
augmentation makes the most of your compute . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Computational resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2 Fusion-in-Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 lumen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3.2 Computational analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.5.3 Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.4 Memory shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.5.5 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
IV Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 6: Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A Appendix for FiDO: Fusion-in-Decoder optimized for stronger performance and faster
inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
A.2 Other Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
B Appendix for Mention Memory: incorporating textual knowledge into Transformers
through entity mention attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
B.1 Pre-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
B.1.1 Mention Encoder data generation . . . . . . . . . . . . . . . . . . . . . . 72
B.1.2 Coreference resolution loss . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.2.1 Fine-tuning setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
B.2.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
B.2.3 Claim verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
B.2.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
B.2.5 Importance of pre-training objectives . . . . . . . . . . . . . . . . . . . . 74
B.2.6 tome initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
B.3 Nearest neighbor search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
B.3.1 On-device nearest neighbor search . . . . . . . . . . . . . . . . . . . . . 76
B.4 Retrieval examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
List of Tables
2.1 Maximum batch size for QA inference with 40 retrieved passages on a single TPUv4 for
FiD Base models with different FiDO components. . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Pre-training and fine-tuning samples per second per chip for FiD Base model with varying
FiDO components. We use 64 TPUv4 chips and batch size 2048 for pre-training and 32
chips and batch size 64 for fine-tuning. See Section 2.5.1 for training information. . . . . . 13
2.3 Inference time per sample, decoder time per sample (ms) and downstream QA exact match
for FiDO base-XL with different components ablated separately. FiDO is evaluated on dev
sets for ablation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Time per sample (ms) and QA exact match for FiD, FiD-Light, and FiD Base-sized models
with layer-sparse cross-attention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Comparison of FiDO with published results on Natural Questions, TriviaQA and
WebQuestions test sets. We focus on comparing with FiD as other works enhance
performance with improved retrieval (such as ATLAS), which is orthogonal to our
contributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Accuracy on claim verification datasets. #Encoded refers to the number of passages
encoded by a BERT reader to answer a single question. . . . . . . . . . . . . . . . . . . . . 36
3.2 Accuracy on open-domain QA datasets TriviaQA (TQA), ComplexWebQuestions (CWQ)
and EntityQuestions (EQ). #Encoded refers to the number of passages encoded by a BERT
reader to answer a question. TQA e-dev corresponds to TQA with train and dev samples
limited to those with a Wikipedia entity as an answer. See Appendix B.2.3 for full results. . 37
3.3 tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for
the first (→1) memory attention layer for two passage mentions. Memory mentions are
in brackets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 tome-2 retrievals for the first HoVer dev sample. We show top-1 retrieval results for the
first (→1) and the second (→2) memory attention layers for passage mentions “Life
Goes On” and “Hungry”. Memory mentions are in brackets. The first retrieval for the
“Life Goes On” is a different song with the same name and the first retrieval for “Hungry”
is related but not useful. However, the second retrieval for “Life Goes On” identifies
the correct song and describes its position on the album while the second retrieval for
“Hungry” captures its position relative to “Life Goes On”. . . . . . . . . . . . . . . . . . . 38
3.5 Accuracy on held-out subset of TriviaQA and ComplexWebQuestions (CWQ) questions.
tome-1-unseen was pre-trained and fine-tuned with memory without entities from
held-out set and evaluated with full memory. Note that performance is considerably lower
than on the full dev set as answers in the held-out set (which are in dev but not train) are
more likely to be rare entities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Adding memory to FiD leads to significant performance gains without additional
fine-tuning or inference FLOPs. Exact match performance on Natural Questions and
TriviaQA for FiD-Base and FiDO 1/3 with Base decoder and live encoder, and memory
encoder with 24 Base layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Comparison of FiDO with published results on Natural Questions and TriviaQA test sets.
We focus on comparing with FiD as other works enhance performance with improved
retrieval (such as ATLAS), which is orthogonal to our contributions. . . . . . . . . . . . . . 53
A.1 Inference time per sample (ms) with batch size 1 for Base FiD with varying FiDO components. 70
A.2 Decoder inference time (ms) and QA exact match for FiD Base models, comparing the
trade-offs of beam search versus scaling decoder size. . . . . . . . . . . . . . . . . . . . . . 71
B.1 Accuracy on claim verification datasets. #Encoded refers to the number of passages
encoded by a BERT reader to answer a single question. EaE stands for Entities as Experts
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.2 Accuracy on FM2 compared with original dataset baselines. Oracle refers to oracle
retrieval followed by a BERT-Base reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.3 EntityQuestions recall@20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.4 EntityQuestions top-1 accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
B.5 Performance ablations for pre-training objectives experiments. . . . . . . . . . . . . . . . 75
B.6 Proportion of time spent on ANNS for tome-1 pre-training setting. . . . . . . . . . . . . . 77
B.7 tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for
the first (→1) memory attention layer for passage mentions “the novel”, “the movie” and
“the album”. Memory mentions are in brackets. We can see that the model can retrieve
relevant mentions for non-named passage mentions, and generally understands it is
looking for mentions related to music. However, while the best retrieval for “album” is
from a passage that mentions sampling the Shining, it is quite far removed and it is likely
the retrieval is not sufficiently accurate here. . . . . . . . . . . . . . . . . . . . . . . . . . . 77
List of Figures
2.1 Shows the percentage of FLOPs in forward pass, training time and inference time for the
encoder and decoder for a Fusion-in-Decoder model with 40 retrieved passages and batch
size 24. The vast majority of FLOPs and training time originate from the encoder, but the
decoder is much more expensive for inference. . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 MAIN RESULT. Layer-sparse cross-attention (LSA) and multi-query (MQ)
attention eliminate the bulk of decoder inference cost with minor performance
penalty, and the decoder can then be massively scaled up (Dec XL) with only a
modest increase in inference time. To the left, encoder and decoder inference time
per sample on a single TPUv4 with batch size 24 and 40 retrieved passages for variants of
base-sized FiD model. To the right, corresponding exact match performance on Natural
Questions (NQ), TriviaQA (TQA) and WebQuestions (WQ) dev sets. . . . . . . . . . . . . . 9
2.3 MAIN RESULT. FiDO achieves much higher performance for any given inference
budget. Exact match on Natural Questions (NaturalQ), TriviaQA and WebQuestions
(WebQ) test sets as a function of inference budget (log scale). Compares FiD Small, Base
and Large models with FiDO Small-Large, Base-XL, Large-XXL and XL-XXL models. . . . 10
2.4 Cross-attention and total decoder inference time for FiDO Base-XL with varying factors
of layer-sparse cross-attention. The main FiDO configuration uses LSA-6 which has
cross-attention every 6 layers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Performance on Natural Questions dev set as a function of inference time for FiDO Small,
Base and Large models with and without asymmetric decoder. . . . . . . . . . . . . . . . . 20
2.6 Log time per sample (TPS) as a function of retrieved passages (left) or the number of
generated tokens (right) for Base FiD variants and FiDO-Base-XL. . . . . . . . . . . . . . . 21
3.1 Overview of Mention Memory. A pre-trained mention encoder is used to generate dense
representations for each entity mention in Wikipedia (approximately 150 million total)
which are stored in a table. The tome model takes a passage annotated with entity
mention boundaries as input, and applies a Transformer block. Next, the tome model
applies one or more TOMEBlocks. Each TOMEBlock contains a memory attention layer
and a Transformer block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Claim verification accuracy as a function of fine-tuning memory size (in millions). . . . . 39
4.1 Exact match on Natural Questions dev set for lumen-XXL as a function of proportion of
live (fine-tuned and conditioned on question) vs memory (pre-computed) encoder layers.
lumen closes the gap between pure memory and FiD approaches with a fraction of live
layers and therefore compute. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Overview of the lumen architecture. Before fine-tuning, each passage in the corpus
is encoded by a memory encoder. While processing a sample, a question encoder first
generates a representation of the question, which is then separately concatenated with
each pre-computed passage representation. A fine-tuned live encoder then updates the
passage representations conditioning on the question, which are finally fed into the
decoder as in standard FiD. Frozen components are in orange, fine-tuned components in
blue. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 MAIN RESULT: lumen achieves performance close to FiD with a fraction of live
layers. The required fraction decreases with scale. Exact match on Natural Questions
(NaturalQ) and TriviaQA validation sets as a function of proportion of live encoder layers
for lumen Base, Large, XL, and XXL models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 MAIN RESULT: lumen uses significantly less compute than FiD for the same
performance, and this advantage grows with scale. TFLOPs as a function of exact
match on Natural Questions (NaturalQ) and TriviaQA test sets. FLOPs are for single
forward step and exclude pre-computation. Compares FiD and lumen with live proportion
0.33 Large, XL and XXL models. Lower is better. . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 lumen achieves much better performance than MemoryFiD at any compute
budget. Exact match performance on the test set of Natural Questions as a function of
TFLOPs per sample comparing lumen 1/3 Base, Large and XL models with MemoryFiD
Large, XL, and XXL models. FLOPs are for single forward step and exclude pre-computation. 47
4.6 lumen closes the gap with FiD as scale increases. Proportion of exact match difference
on Natural Questions between MemoryFiD and FiD closed by lumen as a function of
model scale. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.7 Transferring memory and especially live encoder from a related dataset can
partially close the gap with FiD, with increased gains for lower live proportion
and smaller final dataset. Exact match on TriviaQA and WebQuestions dev sets with
and without transfer from Natural Questions for FiD and lumen XL models with live
proportion 1/3 and 1/8. Live keeps the memory encoder frozen during training on Natural
Questions while Memory also trains the memory on Natural Questions (still frozen after
transfer). The gains from transfer are much more pronounced for smaller live proportion
and on WebQuestions, the smaller dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.8 Neither conditioning memory on input nor fine-tuning memory is sufficient to
recover FiD performance. Both ingredients are important, although conditioning
appears to contribute more. Exact match on Natural Questions (NaturalQ) and
TriviaQA dev sets as a function of proportion of live encoder layers for lumen-Large and
two relaxations: one in which the memory layers are fine-tuned, and another in which
the memory layers are conditioned on the question. . . . . . . . . . . . . . . . . . . . . . . 51
4.9 The primary gains from the live encoder in lumen result from updating memory
representations conditioned on the question. Exact match on Natural Questions
dev set as a function of the proportion of live encoder layers for lumen-Large and two
modifications with restricted encoder self-attention. In the ‘no q2mem‘ setting question
tokens cannot attend to passage tokens, and vice versa for ‘no mem2q‘. . . . . . . . . . . . 54
4.10 Fine-tuning the question encoder improves performance significantly. Exact
match on Natural Questions dev set as a function of the proportion of live encoder layers
for lumen-Large and a modification for which the question encoder is frozen (so that the
memory encoder and question encoder are shared). . . . . . . . . . . . . . . . . . . . . . . 54
Abstract
Retrieval-augmented language models set the state-of-the-art on a broad spectrum of knowledge-intensive
tasks, outperforming orders of magnitude larger models. However, such models can also be expensive for
training and inference. Model performance and computational cost represent two sides of the coin: we
can generally improve performance through scale at the expense of an increased computational burden.
Therefore, we are really interested in pushing out the quality-compute frontier, improving performance at
any given level of computational resources.
In this dissertation, I analyze the factors that determine the computational burden of retrieval-augmented
language models and propose strategies to extract a better performance-compute trade-off. The disserta-
tion consists of three sections. The first section contains a detailed analysis of components of retrieval-
augmented models and introduces methods to improve generation efficiency. The second section explores
the use of dense memory to reduce the cost of encoding retrievals. Finally, the third section proposes a
hybrid between dense memory and text retrieval, combining lessons from previous chapters.
Part I
Background
Chapter 1
Introduction
1.1 Overview
Recent years have seen the Transformer architecture [80] advance state-of-the-art performance over a broad
spectrum of domains including vision [23, 11], protein modeling [46] and, as I focus on in this dissertation,
natural language processing [19, 13].
In machine learning, as in many endeavors, our goal is to carry out our tasks as well as possible,
conditional on resource and other constraints. The literature has consistently shown that performance
and reliability of Transformer language models can be improved by applying more resources in the form
of larger models [8, 12, 93], training for longer on more [38] and better [13, 56, 77] data, or expensive
sampling procedures [85, 90]. The key to expanding what is possible with language models, then, is to improve
the trade-off between performance and resources, either by improving model efficiency or adding model
capacity at low cost.
A large body of work directly aims to make Transformer models more efficient, through methods such
as quantization [18, 92], knowledge distillation [37, 31], efficient parallelization [67] and others. A comple-
mentary direction identifies specific weaknesses in large Transformer models. One prominent weakness
is that pre-trained Transformers imperfectly learn information about the world, holding wrong [65, 2, 70]
or inconsistent [25] beliefs and failing to capture time-sensitive information [20] or learn facts about rare
entities even for very large models [47].
Retrieval-augmented language models [55, 35, 48, 6, 40] aim to improve performance by retrieving
relevant text when processing an input, equipping the model with the ability to access information specific
to the input on the fly rather than storing all information in model parameters. Retrieval-augmented
language models outperform models with orders of magnitude more parameters [40, 41], especially on
knowledge-intensive tasks [64]. Models that rely on an external text corpus rather than internal parameters
also have other advantages, as they can attribute their predictions to sources [5] and their beliefs can be
edited by editing the corpus [82].
While retrieval-augmented language models can match performance with fewer parameters, they are
still very expensive for training and inference [39] due to the need to process and read the retrieved text.
Ten times fewer parameters applied to ten times as many passages may not present a better trade-off in
terms of resources. In this dissertation, I aim to understand and improve the performance-compute trade-
off for retrieval-augmented language models.
1.2 Dissertation Organization
The dissertation is divided into two parts. The first part investigates the factors that make up the performance-
compute trade-off for retrieval-augmented models, and proposes modifications to the decoder to increase
inference speed and performance. The second part focuses on reducing the reader cost for retrieval-
augmented models through use of pre-computed memory.
As a first step to expanding the quality-cost frontier, it is helpful to understand the factors that affect
the performance and speed of retrieval-augmented models. Part II performs such analysis for Fusion-
in-Decoder [40], the state-of-the-art retrieval-augmented model [41]. Chapter 2 finds that for standard
Fusion-in-Decoder (FiD), the bulk of computations are consumed by reading retrieved passages rather
than processing the input or generating a prediction. However, during inference the majority of time is
spent in the decoder, due to the fact that incremental autoregressive decoding with attention over retrieved
passages leads to a memory bandwidth constraint [86].
The second half of Chapter 2 proposes two modifications to the Fusion-in-Decoder architecture to al-
leviate the memory bandwidth constraint: eliminating most cross-attention layers, and employing multi-
query [73] attention. After applying these modifications the bulk of inference time is spent reading pas-
sages, consistent with FiD’s computational profile. The FiD decoder has a challenging job, being tasked
with fusing information from retrieved passages and producing an output while receiving only a fraction
of the compute of the reader. The final proposed model, FiDO [15] rebalances the compute budget by
massively increasing the decoder size, improving performance at modest cost.
Part III focuses on reducing the reader cost for retrieval-augmented models. Over the course of pre-
training, fine-tuning and inference the same passage may be retrieved a large number of times. Chapter 3
takes advantage of this observation by pre-computing dense representations of passages to be retrieved by a
model, amortizing the cost of reading passages across all times they are retrieved and sharply reducing the
total encoding resources. In particular, the TOME model attends over a memory of dense representations
of entity mentions from Wikipedia [17].
Pre-computing representations for retrieval implies that the retrieval representation is not conditioned
on the task or input it is used for, which can degrade quality. Chapter 4 proposes lumen [16], a hybrid
memory architecture that combines standard retrieval and pre-computed memory. lumen employs par-
tially pre-computed representations that are updated by a smaller live encoder that is applied on the fly
and conditions the memory on the input and task. Hybrid memory achieves much of the speed gains from
pre-computation with only minor quality degradation.
1.3 Relationship to Published Work
Chapter 2 Michiel de Jong, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei
Sha, William Cohen. FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference.
In ACL Findings. 2023
Chapter 3 Michiel de Jong*, Yury Zemlyanskiy*, Nicholas FitzGerald, Fei Sha, and William Cohen. Men-
tion memory: incorporating textual knowledge into transformers through entity mention attention. In
ICLR. 2021
Chapter 4 Michiel de Jong*, Yury Zemlyanskiy*, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai,
Fei Sha, William Cohen. Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval
augmentation makes the most of your compute. In ICML. 2023
Other Works The following works are outside the scope of this dissertation but were published during
its preparation.
Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, Fei Sha. ReadTwice:
Reading Very Large Documents with Memories. In NAACL. 2021
Yury Zemlyanskiy, Michiel de Jong, Joshua Ainslie, Panupong Pasupat, Peter Shaw, Linlu Qiu, Sumit
Sanghai, Fei Sha. Generate-and-Retrieve: use your predictions to improve retrieval for semantic parsing.
In COLING. 2022
Wenhu Chen, William W Cohen, Michiel De Jong, Nitish Gupta, Alessandro Presta, Pat Verga, John
Wieting. QA Is the New KR: Question-Answer Pairs as Knowledge Bases. In AAAI. 2023
Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, William Cohen. Augmenting Pre-trained
Language Models with QA-Memory for Open-Domain Question Answering. In EACL. 2023
Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy,
David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, Yun-Hsuan Sung, Sumit Sanghai. COLT5: Faster
Long-Range Transformers with Conditional Computation. Submitted to EMNLP. 2023
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Submitted
to EMNLP. 2023
Part II
Computational trade-offs for retrieval-augmented language models
Chapter 2
FiDO: Fusion-in-Decoder optimized for stronger performance and faster
inference
Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art
on many knowledge-intensive NLP tasks. However, FiD suffers from very expensive inference. We show
that the majority of inference time results from memory bandwidth constraints in the decoder, and propose
two simple changes to the FiD architecture to speed up inference by 7x. The faster decoder inference then
allows for a much larger decoder. We denote FiD with the above modifications as FiDO, and show that
it strongly improves performance over existing FiD models for a wide range of inference budgets. For
example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than
FiD-Large.
2.1 Introduction
A large body of work has demonstrated that language model performance on downstream tasks can be
improved by augmenting the model with relevant retrieved text [35, 55, 40, 41]. In particular, the Fusion-in-
Decoder (FiD) architecture [40] stands out for strong performance, even outperforming much larger models
on many knowledge-intensive tasks [41]. However, FiD is also expensive, and its high computational
burden leads to challenges in many practical settings. In particular, inference with FiD is costly, making
it uneconomical to use in the field.
It is usually possible to improve performance at the cost of increased computational resources, through
scale [8, 12], sampling [32] or increased processing [84, 90]. Therefore, performance and computational
cost are two sides of the coin: the computational burden of FiD consumes resources that could be spent on
[Figure 2.1: bar chart showing the encoder/decoder split of forward-pass FLOPs, training time, and inference time; see caption below.]
Figure 2.1: Shows the percentage of FLOPs in forward pass, training time and inference time for the en-
coder and decoder for a Fusion-in-Decoder model with 40 retrieved passages and batch size 24. The vast
majority of FLOPs and training time originate from the encoder, but the decoder is much more expensive
for inference.
other avenues to improve the model. Progress is made by pushing out the frontier of the performance-cost
tradeoff. That leads to the primary question investigated in this work: what are the factors that most affect
FiD’s inference cost, and how can we reduce that cost without impairing performance?
In our analysis, we show that because the encoder is applied to a large number of retrieved passages,
the encoder requires an order of magnitude more Floating Point Operations (FLOPs) than the decoder.
However, the majority of inference time is actually spent in the decoder, as has been observed in prior
work [39]. This surprising result is shown in Figure 2.1. Our analysis finds that for typical inference
settings the FiD decoder is memory bandwidth bound [86] as a result of multi-head attention [81] over a
large input sequence.
Based on this analysis, we propose two architectural changes. We first propose to reduce the cost of
cross-attention over retrieved passages by removing most cross-attention layers from the decoder. This re-
duces cost and yields much smaller losses in performance than FiD-Light [39], the best previously-proposed
approach for optimizing FiD. We also propose to replace multi-head attention with multi-query attention
[73]. With these changes to the model the memory-bandwidth bottleneck is eliminated: decoder inference
is now orders of magnitude faster and most inference time is spent in the encoder, consistent with the
balance of FLOPs between components.
Finally, we propose to take advantage of the faster decoder by scaling decoder size, using a smaller en-
coder to extract information from retrieved passages and a large decoder to assimilate the information and
[Figure 2.2: bar chart of encoder and decoder inference time per sample (ms) for FiD, +LSA, +MQ, and +Dec XL variants, with NQ, TQA, and WQ exact match; see caption below.]
Figure 2.2: MAIN RESULT. Layer-sparse cross-attention (LSA) and multi-query (MQ) attention
eliminate the bulk of decoder inference cost with minor performance penalty, and the decoder
can then be massively scaled up (Dec XL) with only a
left, encoder and decoder inference time per sample on a single TPUv4 with batch size 24 and 40 retrieved
passages for variants of base-sized FiD model. To the right, corresponding exact match performance on
Natural Questions (NQ), TriviaQA (TQA) and WebQuestions (WQ) dev sets.
reason about the desired output. We refer to the resulting series of models as FiDO (Fusion in Decoder Op-
timized) and show that FiDO strongly outperforms vanilla and efficient FiD models on question-answering
datasets Natural Questions [52], TriviaQA [44] and WebQuestions [4] for a wide range of inference budgets
and settings. Figure 2.2 summarizes some of these results.
2.2 Analysis
Retrieval-augmented models generally process many context tokens for each question or answer token,
which consumes the bulk of operations. However, past work has shown that most inference time for
Fusion-in-Decoder (FiD) is spent in the decoder [39]. Our own experiments yield the same conclusion
(Figure 2.1). This section investigates why FiD decoder inference is expensive and finds the slower decoder speed
to be the result of memory bandwidth constraints exacerbated by attention over retrieved documents.
2.2.1 Fusion-in-Decoder
The backbone of the Fusion-in-Decoder model [40] is a T5 encoder-decoder architecture. The model is
provided a question or other input, as well as a number of relevant retrieved text passages. The ques-
tion is prepended to each retrieved passage, and then the encoder is applied to each passage separately.
The resulting representations are concatenated. Finally, the decoder cross-attends to the large number of
[Figure 2.3: exact match vs. time per sample (ms) on NaturalQ, TriviaQA, and WebQ for FiD (S, B, L) and FiDO (S-L, B-XL, L-XXL, XL-XXL); see caption below.]
Figure 2.3: MAIN RESULT. FiDO achieves much higher performance for any given inference
budget. Exact match on Natural Questions (NaturalQ), TriviaQA and WebQuestions (WebQ) test sets as a
function of inference budget (log scale). Compares FiD Small, Base and Large models with FiDO Small-
Large, Base-XL, Large-XXL and XL-XXL models.
concatenated representations and assimilates the information from the different passages to generate an
answer, hence Fusion-in-Decoder.
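To make the data flow concrete, here is a minimal sketch of the FiD input layout; the encode step is a stand-in that returns token lists rather than a T5 encoder's output vectors, and the "question: ... context: ..." template is purely illustrative.

```python
def fake_encode(text, max_len=256):
    # Stand-in for a T5 encoder: returns truncated "tokens" instead of
    # d-dimensional vectors, just to illustrate the shape of the data flow.
    return text.split()[:max_len]

def fid_encode(question, passages):
    # The question is prepended to each retrieved passage, each pair is
    # encoded independently, and the per-passage outputs are concatenated so
    # that the decoder can cross-attend over all passages jointly.
    encoded = [fake_encode(f"question: {question} context: {p}") for p in passages]
    return [token for passage_repr in encoded for token in passage_repr]

reps = fid_encode(
    "who wrote Hamlet?",
    ["Hamlet is a tragedy written by William Shakespeare.",
     "Shakespeare was an English playwright and poet."],
)
print(len(reps))  # length of the concatenated encoder output fed to the decoder
```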
2.2.2 FLOPs of FiD model
Model speed is determined by the number of floating point operations (FLOPs) required relative to the
speed at which computations are performed, typically measured in floating point operations per second
(FLOP/s). Operations in a Transformer can be roughly divided into MLP layers, attention projection layers,
and attention operations.
Let d be the dimension of the model, $n_s$ the total number of tokens across all passages, $n_p$ the number
of tokens in a single retrieved passage, $n_t$ the number of tokens in the target, L the number of layers, and
assume the MLP dimension is 4d. The approximate FLOPs in an encoder layer are given by

$$\text{FLOPs}_{\text{encL}} = \underbrace{8 n_s d^2}_{\text{MLP}} + \underbrace{4 n_s d^2}_{\text{QKVO projections}} + \underbrace{2 n_s n_p d}_{\text{Attention}}$$

Since the size of each retrieved passage $n_p \ll d$, the computation of the attention score is negligible and
we can approximate total FLOPs in the encoder as

$$\text{FLOPs}_{\text{enc}} \approx 12 n_s d^2 \cdot L \qquad (2.1)$$
Decoder layers additionally have cross-attention layers, leading to FLOPs of

$$\text{FLOPs}_{\text{decL}} = \underbrace{8 n_t d^2 + 4 n_t d^2 + 2 n_t^2 d}_{\text{MLP and Self-attention}} + \underbrace{2 n_t d^2}_{\text{Cross-attention QO}} + \underbrace{2 n_s d^2}_{\text{Cross-attention KV}} + \underbrace{2 n_t n_s d}_{\text{Cross-attention}}$$

The output length $n_t \ll n_s, d$, so the only non-negligible term for decoder FLOPs originates from the cross-
attention key and value projections, which cost the same FLOPs as encoder key and value projections. We
see that the decoder consumes roughly 1/6 the FLOPs of the encoder:

$$\text{FLOPs}_{\text{dec}} \approx 2 n_s d^2 \cdot L \qquad (2.2)$$
Figure 2.1 shows that actual measured training time closely mirrors this FLOPs approximation. How-
ever, the decoder is much more expensive for inference. We argue this is because the decoder is memory
bandwidth constrained during inference, specifically the cross-attention layers.
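As an illustration, the short sketch below reproduces this back-of-the-envelope accounting. The function and example sizes (a Base-like dimension of 768 and 40 passages of 256 tokens) are ours, chosen only to show the roughly 6x per-layer FLOPs ratio between encoder and decoder.

```python
def fid_flops_per_layer(d, n_s):
    """Approximate per-layer FLOPs following Eqns 2.1 and 2.2.

    d: model dimension, n_s: total source tokens across all retrieved passages.
    Attention-score terms are dropped since n_p << d and n_t << n_s.
    """
    enc = 12 * n_s * d ** 2   # 8 n_s d^2 (MLP) + 4 n_s d^2 (QKVO projections)
    dec = 2 * n_s * d ** 2    # dominated by the cross-attention key/value projections
    return enc, dec

enc, dec = fid_flops_per_layer(d=768, n_s=40 * 256)
print(f"encoder {enc:.2e} vs decoder {dec:.2e} FLOPs per layer ({enc / dec:.0f}x)")
```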
2.2.3 Effective computational throughput
In order to perform computations accelerators must transmit data between global memory and registers,
which can be a limiting factor. The actual FLOP/s achieved can be usefully modeled with the roofline
model [86, 62, 60] as the lesser of peak FLOP/s the device is capable of and how fast required data can be
transferred.
$$\text{Actual FLOP/s} = \min\Bigl(\text{Peak FLOP/s},\; \underbrace{\text{Operational Intensity}}_{\text{operations per byte}} \cdot \underbrace{\text{Peak Memory Bandwidth}}_{\text{bytes per second}}\Bigr)$$
The data constraint is given by the product of device memory bandwidth – how fast data can be transferred
– and operational intensity – how many operations are performed per unit of data. The latter is determined
by an algorithm’s degree of data reuse, the number of operations that can be performed before new data
needs to be fetched.
High operational intensity is necessary for good performance on modern GPU/TPU hardware, for
which peak FLOP/s are usually two orders of magnitude larger than memory bandwidth [30, 61]. If
operational intensity is too low, the accelerator will spend the majority of its time waiting for data to be
transferred to registers. Usually, that happens when the model performs minor computations with large
tensors repeatedly, for example in normalization layers or during incremental decoding.
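A minimal sketch of the roofline estimate. The hardware constants are placeholders roughly representative of a modern accelerator, not measured TPUv4 figures.

```python
def achieved_flops(peak_flops, peak_bandwidth_bytes, operational_intensity):
    # Roofline model: achieved throughput is capped either by peak compute or by
    # how much data can be streamed in per second times the work done per byte.
    return min(peak_flops, operational_intensity * peak_bandwidth_bytes)

peak, bandwidth = 2.75e14, 1.2e12   # illustrative peak FLOP/s and bytes/s
for intensity in (1, 10, 100, 1000):
    print(f"intensity {intensity:>4}: {achieved_flops(peak, bandwidth, intensity):.2e} FLOP/s")
# At low operational intensity the accelerator is memory-bound and reaches only
# a small fraction of its peak FLOP/s.
```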
2.2.4 Operational intensity of FiD inference
[73] shows that the speed of incremental Transformer decoding is memory-bandwidth bound due to low
operational intensity. We follow their analysis and derive the asymptotic inverse of operational intensity
– the ratio of memory operations to the compute performed during each incremental decoding step – for
Fusion-in-Decoder. Let b be the batch size, h the number of attention heads and assume that attention
heads have dimension d/h.
Operational intensity of MLP layer. For each token the linear projections perform $O(bd^2)$ operations,
and load $O(bd + d^2)$ memory, where $bd$ corresponds to activations and $d^2$ to the weight matrices. During
training, sequence length effectively multiplies batch size as weights need to be loaded only once for the
entire sequence, but for inference each token is processed incrementally. The inverse operational intensity
is then

$$R_{\text{MLP}} = \frac{1}{b} + \frac{1}{d} \qquad (2.3)$$
Therefore, high operational intensity of the MLP layer ($R_{\text{MLP}} \ll 1$) during inference requires a sufficiently
large batch size.
Operational intensity of attention layers. Memory bandwidth is a more severe bottleneck for at-
tention inference, particularly cross-attention. At each decoding step the model applies projections for a
single token, and has to load all cached key and value projections from encoder tokens and prior decoder
tokens into memory. This leads to very low operational intensity.
Specifically, query/key/value/output projections for a single position take $O(bd^2)$ operations. As dis-
cussed earlier, we can ignore the attention computation itself. The model needs to load projection matrices
Model    Max Batch Size
Vanilla FiD 24
+ LSA 128
+ MQ 256
+ XL Decoder 128
Table 2.1: Maximum batch size for QA inference with 40 retrieved passages on a single TPUv4 for FiD Base
models with different FiDO components.
($O(d^2)$ memory) and past keys and values ($O(bnd)$ memory). Therefore, the inverse operational intensities
for self-attention layers, $R_{\text{S-MHA}}$, and cross-attention layers, $R_{\text{C-MHA}}$, are

$$R_{\text{S-MHA}} = \frac{1}{b} + \frac{n_t}{d}, \qquad R_{\text{C-MHA}} = \frac{1}{b} + \frac{n_s}{d} \qquad (2.4)$$
Because the source input length $n_s$ is extremely long for FiD, the cross-attention operational intensity
is very low, which bottlenecks inference.
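The sketch below plugs illustrative numbers into Equations 2.3 and 2.4; the batch size, model dimension, and sequence lengths are assumptions chosen to mirror the settings used in this chapter, and serve only to show why cross-attention is the memory-bound component.

```python
def r_mlp(b, d):
    return 1 / b + 1 / d

def r_self_attention(b, d, n_t):
    return 1 / b + n_t / d

def r_cross_attention(b, d, n_s):
    return 1 / b + n_s / d

b, d = 24, 768              # batch size and model dimension (illustrative)
n_t, n_s = 32, 40 * 256     # output length and total source length (40 passages x 256 tokens)
print("MLP:            ", r_mlp(b, d))
print("Self-attention: ", r_self_attention(b, d, n_t))
print("Cross-attention:", r_cross_attention(b, d, n_s))  # n_s / d >> 1: severely memory-bound
```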
2.3 Method
We have shown that the encoder accounts for the bulk of FiD FLOPs and training cost, while FiD spends
the majority of inference time in the decoder due to low operational intensity of cross-attention layers.
Next we propose several ways to alleviate the decoder bottleneck. This allows us to scale the decoder
without significantly increasing inference time. We denote Fusion-in-Decoder with the proposed
optimizations as FiDO (Fusion-in-Decoder Optimized).
Model    Pre-training    Fine-tuning
Vanilla FiD 219.9 9.7
+ LSA 247.0 11.8
+ MQ 248.0 11.8
+ XL Decoder 81.9 6.9
Table 2.2: Pre-training and fine-tuning samples per second per chip for FiD Base model with varying FiDO
components. We use 64 TPUv4 chips and batch size 2048 for pre-training and 32 chips and batch size 64
for fine-tuning. See Section 2.5.1 for training information.
[Figure 2.4: cross-attention and other decoder time per sample (ms) for no LSA, LSA-3, LSA-6, and LSA-12; see caption below.]
Figure 2.4: Cross-attention and total decoder inference time for FiDO Base-XL with varying factors of
layer-sparse cross-attention. The main FiDO configuration uses LSA-6 which has cross-attention every 6
layers.
2.3.1 Layer-sparse cross-attention
The decoder cross-attention layer is the primary bottleneck for inference due to its low operational in-
tensity. FiD-Light [39] improves the operational intensity by reducing the effective input length by a
factor of K. We instead propose to remove cross-attention from some decoder layers entirely, keeping
cross-attention only in one out of every K decoder layers. We call this layer-sparse cross-attention (LSA).
Section 2.5 provides evidence that LSA achieves similar speedups without FiD-Light’s drop in quality. For
FiDO we use LSA with sparsity K = 6, which means that a Large decoder has cross-attention at the 6th,
12th, 18th and 24th layer. In principle these methods can be combined, but we find that in practice after
applying layer-sparse cross-attention and multi-query attention the remaining cross-attention makes up
a small proportion of decoder inference cost and further speedups from reducing cross-attention cost are
modest (Figure 2.4).
Removing cross-attention layers also reduces FiD’s FLOPs and memory usage. Cross-attention layers
make up approximately 1/7 of total FiD FLOPs (see Eqn 2.2) and applying LSA-6 leads to a 12% reduction
in FLOPs. Table 2.2 shows the reduction in FLOPs is reflected by an increase in training speed. Moreover,
cross-attention keys and values make up a substantial proportion of memory usage during inference, and
LSA-6 enables a much larger batch size (Table 2.1).
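A minimal sketch of how the layer-sparse pattern could be expressed; the helper below is hypothetical (not the T5X implementation) and simply lists which decoder layers keep a cross-attention block for sparsity factor K.

```python
def cross_attention_layers(num_decoder_layers, sparsity_k=6):
    """Return the 1-indexed decoder layers that retain cross-attention.

    With K = 6 and a 24-layer (Large) decoder this gives layers 6, 12, 18 and 24,
    matching the LSA-6 configuration described above; the remaining decoder
    layers keep only self-attention and MLP blocks.
    """
    return [layer for layer in range(1, num_decoder_layers + 1) if layer % sparsity_k == 0]

print(cross_attention_layers(24))      # [6, 12, 18, 24]
print(cross_attention_layers(12, 6))   # [6, 12] for a 12-layer (Base) decoder
```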
2.3.2 Multi-query attention
[73] proposes to increase the operational intensity of decoder attention layers by applying multi-query
attention, in which keys and values share a single head each and only queries have multiple heads. With
a single head, keys and values use a factor of h less memory and are much faster to load. With multi-query
attention, keys and values occupy $O(bnd/h)$ memory, so that the inverse operational intensity of cross-
attention becomes

$$R_{\text{C-MQA}} = \frac{1}{b} + \frac{1}{d} + \frac{n_s}{dh} \qquad (2.5)$$

which has the problematic term $\frac{n_s}{d}$ reduced by a factor of h. Multi-query attention further reduces inference
cost (Figure 2.2) and memory (Table 2.1) on top of layer-sparse cross-attention, though not training speed
(Table 2.2).
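For illustration, a simplified jax.numpy sketch of a single incremental decoding step of multi-query cross-attention, in which the cached keys and values carry no head dimension; this is our own schematic, not the T5X implementation.

```python
import jax
import jax.numpy as jnp

def multi_query_cross_attention(q, k_cache, v_cache):
    """One incremental decoding step of multi-query cross-attention.

    q:       [batch, num_heads, head_dim]  query for the current target token
    k_cache: [batch, n_s, head_dim]        single shared key head over source tokens
    v_cache: [batch, n_s, head_dim]        single shared value head over source tokens
    """
    scores = jnp.einsum("bhd,bsd->bhs", q, k_cache) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("bhs,bsd->bhd", weights, v_cache)

# Toy shapes: batch 2, 12 query heads, head dimension 64, 40 passages x 256 source tokens.
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(kq, (2, 12, 64))
k_cache = jax.random.normal(kk, (2, 40 * 256, 64))
v_cache = jax.random.normal(kv, (2, 40 * 256, 64))
print(multi_query_cross_attention(q, k_cache, v_cache).shape)  # (2, 12, 64)
```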
2.3.3 Asymmetric Decoder
After applying layer-sparse cross-attention and multi-query attention, the decoder is vastly cheaper than
the encoder for both training and inference. We propose to exploit this computational structure by mas-
sively scaling the decoder by up to 15x, while only modestly increasing inference cost. For example, Fig-
ure 2.2 shows that replacing the Base-sized decoder with an XL-sized decoder increases the total inference
time per sample by only 21%. Fine-tuning costs also increase only modestly (Table 2.2). However, pre-
training costs increase more strongly (though still much less than the scaling factor of the decoder), as
T5 pre-training uses a much smaller ratio of input length to output length. After reducing the decoder
cross-attention memory costs, scaling the decoder only mildly increases activation memory, so that FiDO
can still fit much larger batch sizes than vanilla FiD (Table 2.1).
For the FiDO method we use decoders that are typically two T5 sizes larger than the encoder: Small-
Large, Base-XL, Large-XXL and XL-XXL (as XXL is the largest T5 model).
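The encoder/decoder pairings can be summarized in a small, purely illustrative configuration table; the mapping simply restates the variant names above.

```python
# FiDO encoder/decoder size pairings (Section 2.3.3): the decoder is typically
# two T5 sizes larger than the encoder.
FIDO_VARIANTS = {
    "FiDO Small-Large": ("Small", "Large"),
    "FiDO Base-XL":     ("Base",  "XL"),
    "FiDO Large-XXL":   ("Large", "XXL"),
    "FiDO XL-XXL":      ("XL",    "XXL"),  # XXL is the largest T5 size
}

for name, (encoder, decoder) in FIDO_VARIANTS.items():
    print(f"{name}: encoder={encoder}, decoder={decoder}")
```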
2.4 Related Work
Retrieval-augmented models There exists a large body of retrieval-augmented approaches. Some par-
ticularly well known models are REALM [35], RAG [55], RETRO [6] and Fusion-in-Decoder [40]. FiD in
particular has achieved state-of-the-art performance on a wide variety of tasks [40, 41, 90] and in this
Model    Total TPS    Decoder TPS    NaturalQ    TriviaQA    WebQ
FiDO (base-XL) 15.8 2.0 48.2 67.3 46.8
no LSA 19.2 5.4 47.9 67.4 46.3
no MQ 60.8 47.0 48.2 67.5 45.4
no Asym (base-base) 14.4 0.6 46.3 64.9 41.0
Table 2.3: Inference time per sample, decoder time per sample (ms) and downstream QA exact match for
FiDO base-XL with different components ablated separately. FiDO is evaluated on dev sets for ablation
results.
work we focus on improving the performance-efficiency trade-offs for FiD. RETRO is another closely re-
lated retrieval-augmented model, as it uses a small encoder for retrieved context and a larger primary
decoder like FiDO does. Unlike RETRO, FiDO’s efficiency improvements allow it to tractably attend to
many retrieved passages with a much larger decoder.
Efficient Transformers Our work builds heavily on existing insights into neural network and partic-
ularly Transformer speed. Previous work has found that data movement is often a constraining factor for
computations on modern devices [86, 14, 73]. [73] shows that autoregressive Transformers are particu-
larly bandwidth bound during inference, and proposes multi-query attention as a partial solution. We find
that this is exacerbated by the FiD setting, and adopt multi-query attention for FiDO to ameliorate the
problem. [67] also investigates multi-query attention, primarily in the context of efficient inference and
parallelization for very large language models, whereas we focus on performance/cost trade-offs for the
retrieval-augmented setting.
Another way to alleviate memory bandwidth constraints is to quantize model parameters and possibly
activations [18, 92]. Quantizing models reduces data that needs to be sent to device registers, and also
reduces overall memory usage which allows for larger, more efficient batch sizes. Finally, it is possible
to distill [37, 31] models into a smaller student model, which is cheaper for inference. However, knowl-
edge distillation requires labeling a very large number of samples with the larger model, so reducing the
inference costs of larger models is highly valuable.
Efficient retrieval-augmented models FiDO lies in a body of work that attempts to improve the ef-
ficiency of retrieval-augmented or long-input models. One direction focuses on reducing the cost of the
attention mechanism. LongT5 [33] routes long-range attention through a small number of global tokens.
FiD-Light [39], the most closely related work to FiDO, employs a similar mechanism for FiD, as the decoder
attends to only the first 1/K proportion of representations of each retrieved passage. We opt to introduce
sparsity in attention layers as in ReadTwice [91] instead of attention patterns. FiDO applies cross-attention
from the decoder to the encoder in one out of every K layers, which achieves a similar speedup to FiD-
Light but with only minor performance penalty. FiDO also incorporates multi-query attention leading to
a further order of magnitude reduction in decoder inference cost, and takes advantage of this to massively
scale the decoder.
A different and complementary direction is to reduce the cost of reading retrieved passages. KG-FiD
[89] reranks retrieved passages and reads only the top passages, while [79] reads more retrieved passages
only if it is not confident in its answer. Another approach is to pre-compute and store encoder representa-
tions in a memory and directly retrieve representations from memory, rather than re-encoding retrieved
text [17, 87, 57]. For standard FiD, the decoder actually makes up the bulk of the inference cost. FiDO
reduces the cost of the decoder such that encoding retrieved passages becomes the bottleneck, increasing
the benefit of the above approaches.
2.5 Experiments
2.5.1 Experiment Setup
Pre-training All models are based on the T5.1.1 architecture [68], pre-trained from scratch on C4 [22]
using JAX [7], FLAX [36], and T5X [69]. We employ the standard T5 training recipe except for a modified
Adafactor [74] optimizer. Appendix A.1 describes training in greater detail.
Downstream evaluation We evaluate FiDO on open-domain question-answering datasets Natural Ques-
tions [52], TriviaQA [44] and WebQuestions [4]. We report results on the open-domain QA splits from [53].
For all datasets, each sample is paired with a set of 100-word Wikipedia passages ranked by DPR [48] score.
The question is prepended to each retrieved passage, and then truncated to 256 tokens. The experiments
in the paper use 40 retrieved passages to balance performance and computational cost, but our results hold
across a wide range of retrieved passages. We fine-tune each model with batch size 64 and learning rate
0.001 with early stopping based on dev set performance.
Inference setup For our main results we choose a setting that we believe is most representative for
common use of retrieval-augmented models. We perform inference on a single TPUv4 and report inference
time per sample (TPS) as measured by xprof [28]. We use a batch size of 64 (or the largest batch size that
fits, if smaller) for the main experiments. Figure 2.1 and 2.2 use batch size 24 to ensure a like-for-like
comparison, as it is the largest batch size that fits for vanilla FiD. All experiments use 40 passages of
256 tokens and output size of 32 tokens. Predictions are generated with greedy decoding as we found
beam search did not meaningfully improve performance for considered tasks. Analysis in Section 2.5.4
investigates how trade-offs change with input and output length, low batch size and different sampling
methods.
2.5.2 Main results
Figure 2.3 shows performance as a function of inference budget for FiD and FiDO. FiDO strongly outper-
forms FiD at any given inference budget, and generally achieves the same performance with an order of
magnitude faster speed. The following section investigates how each component of FiDO contributes to
its performance. Table 2.5 compares FiDO to published results.
2.5.3 Components
Model      TPS    NQ    TQA   WebQ
FiD        101.8  46.5  65.8  41.83
FiD-Light   28.3  36.3  54.5  30.8
FiD-LSA     29.5  45.8  65.3  41.0
Table 2.4: Time per sample (ms) and QA exact match for FiD, FiD-Light, and FiD Base-sized models with layer-sparse cross-attention.
Layer-sparse cross-attention First, Table 2.3 shows that layer-sparse cross-attention significantly re-
duces inference cost with modest performance degradation. Separately, Table 2.4 compares the inference
speed and performance impact of layer-sparse cross-attention with the token-sparse cross-attention from
FiD-Light. Reducing cross-attention layers and inducing encoder output sparsity by the same factor lead
to similar speedups, but layer-sparse cross-attention achieves the inference speedup with much lower per-
formance penalty.
Note that we find a much larger performance degradation from compressing the encoder output in our setting compared to the experiments in [39]. Some exploratory experiments suggest that multi-task training on large amounts of data, as done in FiD-Light, may ameliorate the performance penalty from compressing the encoder output; however, even with such training [39] still report significant performance degradation, in contrast to LSA.
Layer-sparsity beyond a factor of 6 incurs greater performance penalties. However, as shown in Table 2.4, with layer-sparsity of 6, cross-attention already makes up only a small proportion of the total inference cost of the decoder.
Multi-query attention Table 2.3 shows that multi-query attention achieves a large cost reduction on
top of layer-sparse cross-attention with minimal performance degradation, consistent with our analysis
and findings from [73].
Model              NQ    TQA   WQ
REALM [35]         40.4  -     40.7
RAG [55]           44.5  56.8  45.2
RETRO [6]          45.5  -     -
T5-XXL [71]        35.2  51.9  42.8
ATLAS [41]         60.4  79.8  -
FiD-L [40]         51.4  67.6  -
FiD-L (ours)       51.5  68.2  44.3
FiDO (L-XXL)       53.2  70.7  49.7
Table 2.5: Comparison of FiDO with published results on Natural Questions, TriviaQA and WebQuestions test sets. We focus on comparing with FiD as other works enhance performance with improved retrieval (such as ATLAS), which is orthogonal to our contributions.
Decoder scale We can see in Table 2.3 that increasing the size of the decoder leads to a significant
improvement in performance at the cost of a modest increase in inference time. Figure 2.5 provides a
visual comparison of the performance-inference profile for FiDO with and without asymmetric decoders
and shows that asymmetric large decoders achieve a better trade-off.
2.5.4 Other analysis
Varying input and target length Our main results use a middle-of-the-road setting for FiD applications
with a medium number of retrievals and a relatively short output, reflecting common knowledge-intensive
tasks. However, it is interesting to ask how FiDO components affect speed for other settings. Figure 2.6 shows time per sample as a function of retrieved passages and length of the target output for each step from FiD to FiDO.
Figure 2.5: Performance on Natural Questions dev set as a function of inference time for FiDO Small, Base and Large models with and without asymmetric decoder.
Figure 2.6: Log time per sample (TPS) as a function of retrieved passages (left) or the number of generated tokens (right) for Base FiD variants and FiDO-Base-XL.
We first note that layer-sparse cross-attention and multi-query attention are critical across all settings.
For standard output length, the asymmetric decoder is cheap for any reasonable number of retrieved pas-
sages, becoming negligible as a fraction of total inference time as the number of retrievals increases. As
output length increases, the cost of the disproportionately large decoder rises, although it only becomes
a substantial proportion of inference time for output length of 256-512 and above. For tasks with long
outputs, such as summarization, one may want to reduce the level of decoder asymmetry (e.g. Base-Large
rather than Base-XL).
Low batch size setting For our primary investigation we focus on medium batch sizes (24+). There
are two reasons one might care about smaller batch sizes: either because larger batches do not fit on the
accelerator, or because there are latency constraints and larger batches take too long. The first constraint
is not binding for most retrieval-augmented models one might want to use at scale: due to FiDO’s memory
efficiency we are able to fit larger batches even for the XL-XXL model, and if necessary model size can be
further extended with quantization [92] and parallelization [69].
For real-time serving latency can be a constraint, but in those settings it is common practice to use
much smaller models which are distilled from larger teacher models [31]. The student models can utilize
a higher batch size, while the teacher models do not have latency constraints, so FiDO also applies to this
use case.
For rare cases where a lower batch size is required layer-sparse and multi-query attention are still
important, but cannot fully eliminate the decoder as a bottleneck for inference (Table A.1). The 1/b term in
Equation 2.5 dominates, reflecting the fact that the model has to repeatedly load model parameters without
spreading the cost over many samples.
Instead of scaling the decoder, it would be more cost-effective to apply more expensive sampling meth-
ods, because sampling methods increase the effective batch size. For example, beam search with large
beams is nearly free at lower batch sizes.
Sampling We do not apply beam search for our main experiments as decoder inference time is pro-
portional to beam width for medium batch sizes and beam search does not improve performance on the
considered set of tasks. Instead, we find that scaling decoder size provides a more cost-efficient way to add
decoder capacity. Table A.2 compares the performance vs time trade-offs from beam search and scaling the
decoder for Natural Questions, and shows that scaling the decoder is significantly more effective. Beam
search may be more important for other tasks, such as tasks with longer outputs.
2.6 Conclusion
We perform analysis of the performance-inference speed tradeoff for FiD, showing that the encoder re-
quires more FLOPs but most time is spent in the decoder due to memory bandwidth constraints. We
propose FiDO, an extension of FiD which removes most cross-attention layers and employs multi-query
attention to vastly reduce the cost of the decoder. The resulting model spends most time in the encoder,
consistent with compute analysis, which FiDO takes advantage of by strongly increasing the size of the de-
coder. We show that FiDO achieves much stronger performance given the same inference budget relative
to existing standard and efficient FiD models.
Part III
Pre-computed memory representations
Chapter 3
Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
Natural language understanding tasks such as open-domain question answering often require retrieving
and assimilating factual information from multiple sources. We propose to address this problem by in-
tegrating a semi-parametric representation of a large text corpus into a Transformer model as a source
of factual knowledge. Specifically, our method represents knowledge with “mention memory”, a table
of dense vector representations of every entity mention in a corpus. The proposed model - tome - is a
Transformer that accesses the information through internal memory layers in which each entity mention
in the input passage attends to the mention memory. This approach enables synthesis of and reasoning
over many disparate sources of information within a single Transformer model. In experiments using a
memory of 150 million Wikipedia mentions, tome achieves strong performance on several open-domain
knowledge-intensive tasks, including the claim verification benchmarks HoVer and FEVER and several
entity-based QA benchmarks. We also show that the model learns to attend to informative mentions with-
out any direct supervision. Finally we demonstrate that the model can generalize to new unseen entities
by updating the memory without retraining.
3.1 Introduction
Neural models have greatly advanced the state of the art in natural language processing and generation
tasks. Accordingly, there has been increasing interest in applying neural language models to tasks which
require extensive world knowledge to solve [64]. Much of this world knowledge can be found distributed
over text corpora, which raises the question whether language models pre-trained on text corpora cap-
ture this information. Recent work suggests that while language models may successfully predict facts
about the world [66] such knowledge is superficial and unreliable [9]. Our goal is to reliably incorporate
information from across a text corpus into a language model.
Recent work has represented the information present in a text corpus explicitly by constructing a
virtual knowledge base (vkb) [21, 75]. A vkb consists of dense representations of entity mentions in
the text, designed to reflect the property or relation expressed by the entity mention. We propose to
incorporate a vkb into a language model by using it as an external memory, performing attention over
the entire vkb within a Transformer model. In this way the model can synthesise and reason over many
disparate sources of information from the text corpus. We refer to the vkb used in such a way as Mention
Memory, and the model as tome (Transformer Over Mention Encodings). We first pre-train a mention
encoder to specifically encourage mention representations that are useful for a Transformer model, and
construct a Mention Memory from 150 million entity mentions in English Wikipedia. Then we train a
tome model with attention layers over the Mention Memory, which is kept frozen (see Figure 3.1).
We argue that the Mention Memory approach has several appealing properties. First, tome retrieves
entity mention representations corresponding to specific entity attributes or relations described in the cor-
pus. This retrieval is much more fine-grained than aggregate entity retrieval methods such as Entities as
Experts (EaE) [26], and we show large improvements in accuracy over EaE on tasks that require detailed
entity information, such as claim verification and entity-based question answering. The fine-grained re-
trieval also allows potential users to see more precisely what knowledge the model's predictions are based
on (see Table 3.4). Second, tome retrieves dense representations, which are easy to incorporate into a
Transformer model without reprocessing the input, unlike raw text. Therefore, tome is able to retrieve,
assimilate and reason over information from many different sources within a single Transformer model,
allowing for multi-source and multi-hop reasoning without the beam search machinery that is required for
multi-hop retrieve-and-read [94]. This also makes tome much more scalable: retrieve-and-read approaches
have to read many retrieved passages which becomes expensive with larger reader models, while the cost
of memory layers does not scale with reader size and is negligible for larger readers. Third, the retrieval is
latent, without direct or distant supervision on the retrieved results. We show that, even without super-
vision, the model learns to retrieve highly specific and informative entity attributes and perform multiple reasoning steps. Finally, the memory table is semi-parametric, so knowledge can be added or updated by applying the mention encoder to new text without retraining.
Figure 3.1: Overview of Mention Memory. A pre-trained mention encoder is used to generate dense representations for each entity mention in Wikipedia (approximately 150 million total) which are stored in a table. The tome model takes a passage annotated with entity mention boundaries as input, and applies a Transformer block. Next, the tome model applies one or more TOMEBlocks. Each TOMEBlock contains a memory attention layer and a Transformer block.
In order to verify the model’s capacity to capture accurate factual information in the corpus, we start
by evaluating tome on the HoVer [42], FEVER [78] and FM2 [24] claim verification datasets, on which
it strongly improves performance over entity aggregate and comparable retrieve-and-read baselines. We
demonstrate that the model learns to attend to informative mentions for verifying claims using only the
verification accuracy as a signal. Ablations show the memory is crucial for performance, and that the
model can effectively use larger memory than it was pre-trained on. In a second set of experiments we
evaluate tome on question-answering benchmarks TriviaQA [45], ComplexWebQuestions [76] and Enti-
tyQuestions [72], improving performance over comparable baselines. Finally we show that the model can
be adapted to generalize to new unseen entities by updating the memory, without retraining.
3.2 Method
Our method represents knowledge in a corpus as a collection of “mentionencodings” – dense vector repre-
sentations for every entity mention that appears in the corpus. Every time an entity appears in a passage
– "[Barack Obama] was elected president in 2008" – some property of the entity or its relation to other
entities is described. The first component of our method, the Mention Encoder model, is responsible for
distilling information from entity mentions in the corpus into high-dimensional mention encodings. We
use the Mention Encoder to encode each entity mention in English Wikipedia and gather encodings into a
MentionMemory. The purpose of the Mention Memory is to capture all knowledge contained in the corpus
in a way that can be easily integrated into a Transformer. The second component of our method, the tome
model, applies sparse attention over the Mention Memory to incorporate external information from the
corpus into a Transformer model. An overview of the whole method is shown in Figure 3.1.
Jointly training the Mention Encoder and tome models is computationally costly, since it would require
backpropagating through the Mention Encoder for each attended mention. Consequently, we propose to
train the models in two stages. First, we pre-train the Mention Encoder and generate the Mention Memory.
Second, we pre-train the tome model while keeping the Mention Memory frozen: the gradient does not
propagate through it and the memories are not modified. Mention Encoder pre-training is specifically
designed such that mention encodings capture relevant contextual information about each mention and
are useful for tome even without joint training. We formally define these models in sections 3.2.1 and
3.2.2, and their pre-training procedures in 3.2.3 and 3.2.4.
Notation. An input to the model is a passage x = x_1, ..., x_T of length T. We assume that each passage has been annotated with an NER system. Following [3] we use special entity markers to highlight entity mentions in the passage. We introduce tokens [E_start] and [E_end] to the vocabulary and insert them before and after each mention in the passage. For example, the original passage “What is the nationality of the hero who killed Medusa” turns into “What is the [E_start] nationality [E_end] of the [E_start] hero [E_end] who killed [E_start] Medusa [E_end]”. Each mention m in a passage is described by a tuple (s, e), where s and e are start and end positions of the mention. We consider entity markers to be part of the corresponding mention, so that x_s = [E_start] and x_e = [E_end]. Representations of these tokens are later used to generate mention encodings.
3.2.1 Constructing mention memory from corpus
3.2.1.1 Mention Encoder
Let H ∈ R^{T×d} be token representations where d is the hidden dimension, such that H_i ∈ R^d is the contextualized embedding for the i-th token. Following [26] we compute the encoding of a span (s, e) as a learnable linear projection W of the concatenation of its start and end token representations H_s and H_e
SpanEncodingLayer(H, (s, e)) = W[H_s; H_e]    (3.1)
The Mention Encoder is a Transformer model with two final SpanEncodingLayers that produce key and value mention encodings. Value mention encodings store context-level information about each mention and are used as inputs to the tome model. Key mention encodings identify the type of information stored in the value encodings and serve as attention keys for the memory layer. These two SpanEncodingLayers do not share weights.
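To make Equation 3.1 concrete, here is a minimal NumPy sketch of a SpanEncodingLayer; the function and variable names, and the output dimension, are illustrative assumptions rather than the exact implementation.

import numpy as np

def span_encoding_layer(H, span, W):
    """Encode a mention span (s, e) as W [H_s; H_e] (Equation 3.1).

    H:    (T, d) contextualized token representations.
    span: (s, e) start/end token positions of the mention.
    W:    (d_out, 2 * d) learnable projection (hypothetical shape).
    """
    s, e = span
    concat = np.concatenate([H[s], H[e]])   # [H_s; H_e], shape (2d,)
    return W @ concat                        # mention encoding, shape (d_out,)

# Toy usage: a 10-token passage with hidden size 4 and key encodings of size 3.
rng = np.random.default_rng(0)
H = rng.normal(size=(10, 4))
W_key = rng.normal(size=(3, 8))
key_encoding = span_encoding_layer(H, (2, 5), W_key)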
3.2.1.2 Mention memory
After the Mention Encoder is pre-trained (see section 3.2.3), we use it to generate a Mention Memory from entity mentions in Wikipedia. While we could include encodings of any corpus mention in the Mention Memory, we focus on grounded mentions which can be linked to Wikipedia entities. We denote these as linked mentions, which we hypothesize contain information that can be retrieved and grounded. We gather mention encodings into matrices MemKey ∈ R^{N×d_K} and MemValue ∈ R^{N×d_V}, where N is the total number of linked entity mentions in English Wikipedia (approximately 150 million) and d_K and d_V are the dimensions of key and value encodings. Additionally, we record entity (Wikipedia) IDs of mentions in MemEnt ∈ R^N, which we use as labels for auxiliary losses, not as inputs to the model or supervision on retrieval. MemKey(i), MemValue(i) and MemEnt(i) correspond to the key encoding, value encoding and entity ID for the i-th linked mention in Wikipedia.
3.2.2 tome model
The tome model incorporates information from a text corpus into a Transformer by applying sparse at-
tention over the Mention Memory. The model consists of one or more TOMEBlocks, each containing
a memory attention layer followed by a post-processing Transformer block. Memory attention layers re-
trieve and attend to relevant “memories” for every mention in the input passage. The model then processes
the retrieval-augmented representation with the Transformer block, allowing it to access and combine in-
formation from multiple sources in the corpus. Finally, multiple TOMEBlocks enable the model to refine
retrievals and perform multi-hop reasoning. More formally, a TOMEBlock receives the output representation H of the previous layer and produces new representations H′
M = MemoryAttention(H),    (3.2)
H′ = TransformerBlock(M)    (3.3)
The tome model encodes input passages x with the word embedding layer and initial Transformer block and then applies one or more TOMEBlocks
H^0 = InitialTransformerBlock(TokenEmbedding(x)),    (3.4)
H^l = TOMEBlock_l(H^{l−1}),  l = 1...L    (3.5)
In this work we consider two configurations of the tome model: tome-1 and tome-2, with one and
two TOMEBlocks respectively. Each TOMEBlock of tome-2 contains half as many Transformer layers as
in tome-1 to hold the total number of Transformer layers fixed between models.
3.2.2.1 Attention over memory
Each memory attention layer is implemented as a sparse dot-product attention layer that takes the output H of the previous Transformer block, incorporates information from the Mention Memory, and returns a representation M (omitting layer indices). Consider a mention m that starts at position s and ends at position e. We start by computing its query mention encoding Query(m) by applying a SpanEncodingLayer
Query(m) = SpanEncodingLayer(H, (s, e)),    (3.6)
Query mention encodings are used to retrieve relevant memories from the Mention Memory table. However, applying standard attention over 150 million mention encodings is infeasible. Instead, we first perform approximate nearest neighbor search to retrieve the top-K mentions with the largest dot product between the query Query(m) and key mention encodings from MemKey. We denote the set of these memories as TopMem(Query(m)). We compute attention over these memories and incorporate the result into the token contextual representation at position s
α_i ∝ exp(Query(m) · MemKey(i)),   i ∈ TopMem(Query(m))    (3.7)
Value(m) = Σ_{i ∈ TopMem(Query(m))} α_i · MemValue(i)    (3.8)
M_s = LayerNorm(H_s + W_U Value(m))    (3.9)
where W_U is a learnable matrix of shape d × d_V.
3.2.2.2 Sparse large-scale retrieval
Approximate nearest neighbor search (ANNS) can be performed cheaply using one of multiple ANNS
libraries, for example ScaNN [34]. We implemented two on-device search methods to avoid the engineering
complexity of real-time communication with an ANNS server, though we have verified this is also viable.
The first naively computes a simple dot-product between passage queries and memory keys, and was
used in our main experiments as it was easiest to implement. We also implemented and will be releasing
a much faster version based on CPU ANNS methods. The memory is sharded over devices, so that the
device-memory overhead is negligible.
Holding the number of entries in memory fixed, the compute cost of retrieval from memory does not
grow with the size of the reader or the dimensionality of the memory values, so that the relative cost of the
memory layer becomes smaller with reader size. In particular, the overhead from the memory used in our
pre-training setting is small for BERT-Large and up. More details on ANNS implementation and overhead
can be found in Appendix B.3.
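The naive on-device search described above can be sketched as a per-shard dot product, a per-shard top-K, and a merge; this illustrative version simulates device shards as array splits rather than real accelerators, and is not the released implementation.

import numpy as np

def sharded_topk(query, mem_key_shards, k=4):
    """Naive sharded dot-product search: per-shard top-K, then a global merge.

    query: (d_K,); mem_key_shards: list of (N_shard, d_K) arrays, one per device.
    Returns global indices and scores of the top-K memories.
    """
    cand_idx, cand_score, offset = [], [], 0
    for shard in mem_key_shards:
        scores = shard @ query
        local = np.argsort(-scores)[:k]        # per-shard top-K
        cand_idx.append(local + offset)        # convert to global memory indices
        cand_score.append(scores[local])
        offset += shard.shape[0]
    cand_idx = np.concatenate(cand_idx)
    cand_score = np.concatenate(cand_score)
    best = np.argsort(-cand_score)[:k]         # merge candidates across shards
    return cand_idx[best], cand_score[best]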
3.2.3 Mention encoder pre-training
While backpropagating through a Wikipedia-scale mention memory is challenging, it is possible to train
smaller-scale memory architectures end-to-end. We take an approach inspired by Marge [54] and ReadTwice [91], which apply cross-attention over documents within a batch. In particular, we process passages in each batch twice. As a first step, the Mention Encoder model generates mention encodings from each passage and aggregates the mention encodings into a batch-wide memory table. In the second step, we apply a tome architecture that attends to the batch memory, which we call batch-tome. Note that batch-tome is just used for pre-training the Mention Encoder and is not evaluated on any downstream tasks. Mention Encoder and batch-tome are jointly trained end-to-end so that the Mention Encoder is encouraged to produce mention encodings that contain useful information for batch-tome.
We want to make sure the batch memory contains relevant mentions, so we pre-train the models on batches of passages constructed from related Wikipedia articles with high entity overlap. Appendix B.1.1 provides more details on Mention Encoder data generation. We use the pre-trained Mention Encoder to construct the Mention Memory table from the corpus, and use the batch-tome model as the initialization point for tome-specific pre-training (described in Section 3.2.4).
Masked language model. Our primary pre-training objective is the standard masked language mod-
eling task, with the loss computed based on the output of the second read (batch-tome). To encourage
the model to rely on memory, we increase the task’s difficulty relative to standard BERT pre-training by
masking entity mention tokens more aggressively.
Coreference resolution. We wish to encourage the Mention Encoder to represent the entity at-
tributes expressed by entity mentions, so we also employ an entity-oriented pre-training task to the output
of batch-tome for which such attribute information is likely to be especially helpful. Unlike Entities as
Experts [26], batch-tome does not use entity embeddings, so we cannot use the entity linking task. In-
stead, we apply a related entity coreference resolution objective, which asks the model to predict whether
two linked mentions correspond to the same entity based on the similarity of their encodings. Given that
entity surface forms are frequently masked, the model needs to instead use the properties of other men-
tions in the batch to determine which entity it is most compatible with, incentivizing the Mention Encoder
to encode such properties. We compute a coreference mention encoding for every linked mention in the
batch by applying a separate SpanEncodingLayer on the output of batch-tome. The loss is implemented
using cross-entropy over dot-product similarity scores. See Appendix B.1.2 for details.
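The exact formulation is deferred to Appendix B.1.2; the sketch below shows one plausible instantiation of "cross-entropy over dot-product similarity scores", where other in-batch linked mentions of the same entity are treated as positives. It is an assumption made for illustration, not the actual objective.

import numpy as np

def batch_coref_loss(coref_encodings, entity_ids, target=0):
    """Hypothetical batch coreference loss: cross-entropy over dot-product
    similarities, treating other mentions of the target's entity as positives.

    coref_encodings: (M, d) coreference encodings of linked mentions in the batch.
    entity_ids:      (M,) entity id of each linked mention.
    target:          index of the mention whose coreference is being predicted.
    """
    entity_ids = np.asarray(entity_ids)
    M = coref_encodings.shape[0]
    sims = coref_encodings @ coref_encodings[target]       # (M,) dot-product scores
    mask = np.arange(M) != target                           # drop self-similarity
    sims, ids = sims[mask], entity_ids[mask]
    m = sims.max()
    log_probs = sims - (m + np.log(np.exp(sims - m).sum())) # log-softmax over the batch
    positives = ids == entity_ids[target]
    if not positives.any():
        return 0.0                                           # skip mentions with no positive
    return -np.log(np.exp(log_probs[positives]).sum())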
3.2.4 tome pre-training
As tome attends to the full Mention Memory instead of in-batch memory, we do not employ the batching
procedure from Mention Encoder pre-training, instead sampling Wikipedia passages randomly. For the
same reason, we replace the in-batch entity coreference objective by Mention Memory entity coreference,
in which the model has to predict which mentions from the Mention Memory share an entity with the input
mention. The goal of this auxiliary objective is to incentivize the model to learn to retrieve informative
mention encodings to solve the semantically challenging task. Mention Memory entity coreference also
allows us to solve tasks like TriviaQA or ComplexWebQA without a decoder by directly predicting the
answer entity.
Entity prediction. Analogous to the batch coreference resolution loss, we compute a mention encoding z_m using the output of the tome model. As in section 3.2.2, TopMem(z_m) returns the top K memories with the largest dot product between the mention encoding z_m and key mention encodings MemKey from the Mention Memory. The score EntProb(m, j) of entity j equals the sum of attention weights of memories corresponding to this entity.
EntProb(m, j) = [ Σ_{i ∈ TopMem(z_m)} exp(z_m · MemKey(i)) · 1{MemEnt(i) = j} ] / [ Σ_{i ∈ TopMem(z_m)} exp(z_m · MemKey(i)) ]    (3.10)
The final entity prediction is argmax_j EntProb(m, j). The entity prediction loss L_ep(m) for a mention m of entity Ent(m) is L_ep(m) = − log EntProb(m, Ent(m)). The total loss equals the average loss over linked input mentions for which at least one memory of the same entity is retrieved.
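Equation 3.10 amounts to softmaxing the top-K memory scores and summing the weights that fall on each entity; the sketch below does exactly that, with illustrative shapes and names.

import numpy as np

def entity_scores(z_m, mem_key, mem_ent, k=4):
    """EntProb(m, .) from Equation 3.10: attention mass per entity over TopMem.

    z_m:     (d_K,) mention encoding from the tome output.
    mem_key: (N, d_K) MemKey table; mem_ent: (N,) entity id per memory.
    Returns a dict mapping entity id -> probability mass.
    """
    scores = mem_key @ z_m
    top = np.argsort(-scores)[:k]                      # TopMem(z_m)
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                            # softmax over retrieved memories
    probs = {}
    for ent, w in zip(np.asarray(mem_ent)[top], weights):
        probs[int(ent)] = probs.get(int(ent), 0.0) + w  # sum weights per entity
    return probs

# The predicted entity is the argmax over probs; the loss for a mention with gold
# entity j is -log(probs[j]) when j appears among the retrieved memories.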
Disallowed same passage retrieval. For each passage in the pre-training corpus, there exist memo-
ries corresponding to mentions in the passage generated from the unmasked version of the same passage.
In order to prevent the model from ‘cheating’ by attending to such memories, we set the attention weight
for all memories from the same passage to zero.
3.3 Related Work
Our approach lies at the intersection of three lines of work: i) knowledge-augmented language models, ii)
employing a text corpus as a virtual knowledge base, iii) retrieve-and-read methods.
Knowledge-augmented language models. Entities as Experts (EaE) [26] injects information into a Transformer model with an intermediate attention layer over trainable entity embeddings, which serve as an aggregate representation of entity information in a text corpus. In contrast, tome attends to a
much larger table of mention encodings, allowing for retrieval of more fine-grained information. Attend-
ing to mentions as opposed to entity representations also enables tome to generalize to unseen entities.
FiLM [82] extends EaE by adding an attention layer over facts from a KB on the output of the Transformer.
The fact attention layer enables more fine-grained queries but still retrieves aggregate entity embeddings
as values, which are also not reasoned over by the Transformer. KnowBERT [63] is similar to EaE, but
with entity embeddings generated from a KB instead of trained end-to-end with a text corpus. Marge [54]
and ReadTwice [91] incorporate dense representations from other passages within the same batch into
a Transformer through sparse top-k attention. The first pre-training stage of our method for training the
Mention Encoder is similar to Marge and ReadTwice. However, tome performs global attention over a
full corpus, rather than a single batch. Furthermore, tome attends to a Mention Memory consisting of
pre-computed dense representations. Therefore tome is not limited to downstream tasks with batches of
relevant documents, and does not need to apply an expensive reader model to an entire batch of documents
for each input.
Text corpus as virtual knowledge base. DrKIT [21] performs multi-hop question answering by
using a text corpus as a virtual knowledge base. Similar to tome, the authors apply a mention encoder
to convert the corpus into a table of mention encodings. A Transformer model encodes the question into
dense queries, which are compared with the mention encodings to traverse the vkb. Conversely, tome
retrieves mention encodings, and then jointly processes them inside the Transformer. In follow-up work to
DrKIT, OPQL [75] uses a FiLM-like approach to access a memory of relation mentions, which are encoded
with a self-supervised relation encoder. However, the relation mention encoding combines a mention-
specific relation representation with EaE-like entity encodings, so they are less fine-grained than tome’s
encodings. Unlike tome, OPQL also lacks a sparse large-scale retrieval mechanism, and relies on ad hoc
heuristics to limit the size of the memory.¹ MOLEMAN [27] compares a passage mention encoding with
mention encodings from a corpus to perform entity linking, but does not retrieve the mentions.
Retrieve-and-read methods. REALM [35] learns to retrieve relevant passages from a text corpus in a
self-supervised manner. Retrieved passages are concatenated to the input passage which is then re-encoded
by a Transformer model to perform a task. The key difference between retrieve-and-read approaches [35,
48, 55, 40] and tome is that tome retrieves dense representations, as opposed to text. That means that
tome only applies a reader model once to a single input, while retrieve-and-read approaches have to apply
an expensive BERT reader to many different passages. In addition, Transformer models can only process
relatively short sequences, which imposes a binding constraint on the number of retrieved text passages
that can be processed together, whereas tome can retrieve and reason over information from many sources
inside the same reader. Generative models like RAG [55] or FiD [40] attend to different retrieved documents
in the decoder, but still have to apply a BERT read for every retrieved document, do not consider interaction
between retrievals while encoding the question, and cannot perform iterative retrieval.
3.4 Experiments
3.4.1 Experimental setup
The Mention Encoder is based on a BERT-base model with two final SpanEncodingLayers that produce
key and value encodings. Mention Encoder and batch-tome share Transformer weights during Mention
Encoder pre-training. The Mention Memory consists of mention encodings for N = 150 million linked
Wikipedia entity mentions. Transformer layers in tome and batch-tome models are equivalent to those
in the BERT-base model. The tome InitialTransformerBlock contains 4 Transformer layers. tome-1
has a single TOMEBlock with 8 Transformer layers, and tome-2 has two TOMEBlocks with 4 Transformer
layers each. Therefore, the number of trainable parameters in tome-1 and tome-2 is approximately the
same as in BERT-base. We use a smaller Mention Memory containing 38m uniformly sampled memories
for tome pre-training. During fine-tuning and evaluation we utilize the full Mention Memory. Appendix
B.1 contains more details.
¹ It should be noted that absent heuristics, the number of potential relation mentions (i.e., entity mention pairs) is much larger than the number of entity mentions.
3.4.2 Baselines
We compare tome with existing methods that utilize textual information from a corpus in a language
model. These can be divided into generative LLMs (T5), entity embedding retrieval (Entities as Experts,
OPQL), extractive retrieve-and-read (REALM) and generative retrieve-and-read (RAG, Fusion-in-Decoder).
tome occupies a novel position in the space of retrieval models, being more fine-grained than entity embed-
ding retrieval methods, but performing all its reasoning with a single BERT read, unlike retrieve-and-read
methods. The most closely comparable models are Entities as Experts and REALM, and we use these as
our primary baselines. We report other baselines for reference, with the caveat that these results are not
apples-to-apples: RAG and Fusion-in-Decoder have large decoders and retrievers and encode a large num-
ber of passages with a BERT reader for each question compared to tome's single read. Fusion-in-Decoder and RAG² also use ground-truth supervision for retrieval. We mark the number of parameters and BERT
applications for each baseline in the result tables. Consistent with retrieve-and-read, we count the pa-
rameters of the Mention Encoder and tome, but not the size of the non-trainable and sparsely accessed
Mention Memory.
3.4.3 Claim verification
Data. Our first set of experiments evaluates tome on the claim verification tasks FEVER [78], HoVer [42],
and FM2 [24] in which the model is provided with a claim and has to determine whether the claim is
supported by the Wikipedia corpus. FEVER is a larger dataset with 186k claims for which most of the
claims can be verified with a single Wikipedia passage. In contrast, HoVer is smaller with 26k claims, but
is explicitly constructed to require evidence from multiple sources and multiple reasoning steps. FM2 is
also smaller and is constructed through an adversarial game that leads to more challenging retrieval. The
claim verification training data contains gold evidence passages, but unlike most published results we do
not use these, leaving only the accuracy of the claim verification to guide the retrieval.
Results. Table 3.1 contains our claim verification results. tome outperforms both Entities as Experts
and REALM, especially on HoVer and FM2. This is consistent with the properties of tome: HoVer requires
² RAG is initialized from DPR, which is trained with gold retrieval passages for TriviaQA.
Table 3.1: Accuracy on claim verification datasets. #Encoded refers to the number of passages encoded by a BERT reader to answer a single question.
Model                #Params  #Encoded  HoVer test  FEVER test  FM2 dev
RAG                  620M     100       -           72.5        -
REALM                330M     5         66.1        67.1        65.8
Entities as Experts  360M     1         66.6        63.6        63.5
tome-1               220M     1         72.8        67.8        67.7
tome-2               220M     1         73.1        68.1        68.4
combining detailed information from multiple sources, which tome is especially well equipped to do com-
pared to aggregate entity-based or retrieve-and-read models. FM2 features generally challenging retrieval
and may benefit from contextualizing retrieved evidence.
3.4.4 Question Answering
Data. In a second set of experiments we evaluate tome on TriviaQA (TQA) [45], ComplexWebQuestions
(CWQ) [76] and EntityQuestions (EQ) [72], open-domain QA tasks for which most answers are Wikipedia
entities. We approach these datasets as entity-linking tasks, as in [26]. We append a mask token to each
question, which is marked as a question mention. The probability for each candidate entity is predicted as
the aggregate attention weight on mentions of the entity (Section 3.2.4). Questions with answers that do
not correspond to entities in our entity vocabulary are marked as answered incorrectly. TQA consists of
96k trivia questions, for which 84% of answers correspond to a Wikipedia entity. We use the open-domain
setting without gold evidence passages. In order to compare head-to-head performance, we also report
results on a subset of TQA with only questions with Wikipedia entities as an answer. CWQ consists of 35k
complex questions (compositions, conjunctions, etc.) for which 94% of answers correspond to a Wikipedia
entity. EQ contains challenging questions involving rare entities, with Wikipedia entities as answers.
Results. Table 3.2 contains the results for TQA, CWQ and EQ experiments. Like tome, Entities as
Experts and OPQL treat the above datasets as entity-linking tasks. REALM performs extractive QA, while
T5, RAG and Fusion-in-Decoder generate the answer. We note a similar pattern of results as for claim
verification. tome strongly outperforms Entities as Experts on all tasks. tome performs slightly better
than REALM on a simple task like TriviaQA (entity subset) and strongly outperforms REALM on more
challenging tasks that require multiple (CWQ) or challenging (EQ) retrieval.
Table 3.2: Accuracy on open-domain QA datasets TriviaQA (TQA), ComplexWebQuestions (CWQ) and EntityQuestions (EQ). #Encoded refers to the number of passages encoded by a BERT reader to answer a question. TQA e-dev corresponds to TQA with train and dev samples limited to those with a Wikipedia entity as an answer. See Appendix B.2.3 for full results.
Model                #Params  #Encoded  TQA dev  TQA test  TQA e-dev  CWQ dev  EQ dev
RAG                  620M     100       56.8     68.0      -          -        -
Fusion-in-Decoder    440M     100       65.0     77.1      -          -        -
REALM                330M     5         55.8     67.1      63.4       46.7     59.0
T5-3B                3B       1         -        -         -          38.7     -
T5-11B               11B      1         42.3     50.1      -          -        -
Entities as Experts  360M     1         43.2     53.4      51.3       42.7     32.5
OPQL                 220M     1         -        -         -          41.1     -
tome-1               220M     1         50.8     61.1      60.3       44.9     62.1
tome-2               220M     1         54.6     65.8      64.8       47.7     66.0
Table 3.3: tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for the first (→1) memory attention layer for two passage mentions. Memory mentions are in brackets.
Claim: Greater Swiss Mountain Dog and Harrier are both dog breeds. Label: TRUE
Greater Swiss Mountain Dog →1 Breed History the origin of the [Greater Swiss Mountain Dog] is not definitively known. ...
Harrier →1 The harrier is a medium-sized dog breed of the [hound] class, used for hunting. ...
3.4.5 Qualitative properties of tome
What memories does tome retrieve? Given that tome retrieval is unsupervised, it is natural to ask
what memories it learns to retrieve. First, we observe that batch-tome and tome trained on just the
MLM objective learn to attend to memories of the same entity as the passage linked mention (55% and
41% average attention score). This is promising as entity mentions from the same entity often contain
mutually relevant information. Quantitative evaluation of downstream retrieval is challenging as tome
often retrieves mentions that are not part of, but equally informative as gold passages. Instead, we provide
tome retrievals on the first three samples of the HoVer dev set to demonstrate its retrieval behavior without
cherry-picking. Table 3.3 demonstrates a successful simple retrieval, while Table 3.4 displays interesting
multi-hop retrieval. The last is found in Appendix B.4.
Importance of memory size. Figure 3.2 shows claim verification performance as a function of
memory-size during fine-tuning (pre-training memory size is held constant). For smaller memory sizes,
entries in memory are uniformly sampled from the full Mention Memory. Performance increases smoothly
Table 3.4: tome-2 retrievals for the first HoVer dev sample. We show top-1 retrieval results for the first (→1) and the second (→2) memory attention layers for passage mentions “Life Goes On” and “Hungry”.³ Memory mentions are in brackets. The first retrieval for “Life Goes On” is a different song with the same name and the first retrieval for “Hungry” is related but not useful. However, the second retrieval for “Life Goes On” identifies the correct song and describes its position on the album, while the second retrieval for “Hungry” captures its position relative to “Life Goes On”.
Claim: The song recorded by Fergie that was produced by Polow Da Don and was followed by Life Goes On was Hungry. Label: TRUE
Life Goes On →1 ...and Johnny J produced the chart topping hits “All Bout U”, “How Do U Want It” and [“Life Goes On”]. ...
Life Goes On →2 ...On November 11, 2016, Fergie released the third single from the album, [“Life Goes On”]...
Hungry →1 ...Polow da Don, is an American record producer, songwriter and rapper. His cousin is [Atlanta] singer Monica. Jones has produced a variety of singles for a multitude of artists including “Anaconda” by Nicki Minaj (2014), “Love In This Club” by Usher (2008), “Buttons” by the Pussycat Dolls (2006), “Hungry” by Fergie ...
Hungry →2 ...“Life Goes On” is a song recorded by American singer Fergie for her second studio album, Double Dutchess (2017). ...The song serves as the third single from [Fergie’s] second studio album, following “Hungry”.
with memory size. Larger memory size yields diminishing returns, perhaps reflecting that entity mentions
may contain overlapping information.
Zero-shot transfer to unseen entities. An important advantage of memory architectures is that the
behavior of the model can be steered by deciding what to include in the memory. Here we show that the
tome model can use information that was not present in memory during training. We sample questions in
the TQA and CWQ dev sets, and generate a subset of the memory without any mentions corresponding to
the answer entities for those questions. Then we pre-train and fine-tune a model on this smaller memory,
which we call tome-unseen. We evaluate tome-unseen on the sampled questions using the full memory
for evaluation only, and compare to standard tome. Table 3.5 shows that using full memory only during
evaluation does not lower performance.
³ We replaced the original song title with the song “Hungry” as the original may be inappropriate.
Figure 3.2: Claim verification accuracy as a function of fine-tuning memory size (in millions), for HoVer (dev) and FEVER (dev).
Table 3.5: Accuracy on held-out subset of TriviaQA and ComplexWebQuestions (CWQ) questions. tome-1-unseen was pre-trained and fine-tuned with memory without entities from the held-out set and evaluated with full memory. Note that performance is considerably lower than on the full dev set as answers in the held-out set (which are in dev but not train) are more likely to be rare entities.
Dataset        TriviaQA dev  CWQ dev
tome-1         17.4          16.4
tome-1-unseen  17.6          16.7
3.5 Conclusion
We introduced tome, a Transformer model that performs attention over a semi-parametric representation
of the entire Wikipedia text corpus. This representation, or Mention Memory, consists of a dense encod-
ing for each entity mention in Wikipedia. tome can retrieve information from multiple sources without
supervision, aggregate information within the Transformer, and reason over the retrieved information.
tome leads to strong improvements on multiple open-domain claim verification and entity-based question
answering tasks.
Chapter 4
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art
on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode
a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a
memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe
quality penalty as the memory representations are not conditioned on the current input. We propose
lumen, a hybrid between these two extremes, pre-computing the majority of the retrieval representation
and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-
tuned for the task. We show that lumen significantly outperforms pure memory on multiple question-
answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget.
Moreover, the advantage of lumen over FiD increases with model size.
4.1 Introduction
Retrieval-augmented language models such as Fusion-in-Decoder [40] achieve strong performance on
knowledge intensive tasks, often outperforming much larger models [41]. Retrieval-augmented models
retrieve related text passages and process the passages along with the input to extract relevant context
information. However, encoding retrieved passages can be computationally expensive. Recent work has
found that with an optimized decoder [73, 15, 67] the cost of encoding retrieved passages makes up the
bulk of total finetuning and inference cost.
Figure 4.1: Exact match on Natural Questions dev set for lumen-XXL as a function of proportion of live
(fine-tuned and conditioned on question) vs memory (pre-computed) encoder layers. lumen closes the gap
between pure memory and FiD approaches with a fraction of live layers and therefore compute.
Figure 4.2: Overview of the lumen architecture. Before fine-tuning, each passage in the corpus is encoded
by a memory encoder. While processing a sample, a question encoder first generates a representation of
the question, which is then separately concatenated with each pre-computed passage representation. A
fine-tuned live encoder then updates the passage representations conditioning on the question, which are
finally fed into the decoder as in standard FiD. Frozen components are in orange, fine-tuned components
in blue.
An increasingly common approach to reduce this encoding cost retrieves and extracts information from
a memory of pre-computed representations rather than raw text, amortizing the encoding of a passage over
every sample that retrieves the passage entry from memory [57, 17, 87, 95, 10, 88].¹
¹ Here we do not refer to pre-computing representations used to select passages for retrieval (as is common practice for dense retrieval methods), but rather pre-computing the actual representations to be retrieved and incorporated into the language model.
However, memory approaches incur a large quality penalty relative to retrieval-augmented models
such as Fusion-in-Decoder [40], because the pre-encoded memory is not conditioned on the task or on the
particular input or question. That means that the pre-encoded representation must be suitable to answer
any question, a challenging undertaking. The human analogue is the difference between reading an entire
book and being quizzed afterwards compared to looking up the answer to a question on the fly.
Memory-based approaches therefore need to massively scale model size in order to achieve comparable
performance. As we will show, this leads to higher overall net FLOPs due to cross-attention and decoding,
as well as impractical increases in pre-training, pre-computation, and storage costs.
We propose lumen (Live Update Memory Network), a middle ground between retrieval and memory.
lumen divides the task of encoding passages between a frozen memory encoder that pre-computes passage
memory representations, and a fine-tuned live encoder that updates the memory representations condi-
tioned on the question. Figure 4.2 provides a detailed overview of the architecture. As can be seen in Figure
4.1, a small proportion of live layers can already achieve performance close to standard Fusion-in-Decoder.
We start with a set of experiments initializing lumen from T5, partitioning the standard T5 encoder
into a memory and live encoder. We evaluate on question-answering datasets Natural Questions [52] and
TriviaQA [44] and find two interesting results. First, for any given proportion of live layers, the per-
formance gap between lumen and FiD becomes smaller with scale. Second, especially for larger models
lumen achieves significantly stronger performance than FiD and FiD with memory given the same com-
putational budget. At T5-XXL size lumen performs comparably to FiD with only a one-third proportion of
live layers and FLOPs.
Next, we experiment with improvements to the standard lumen setup, showing that the performance-
compute trade-off can be further improved relative to FiD by 1) transferring a trained memory and live
encoder from a related task and 2) employing a deep and narrow encoder. Ultimately, lumen represents
a desirable trade-off between retrieval and memory-based approaches, achieving better performance for
any given computational budget.
4.2 Background
We are interested in achieving the best possible performance for any given resource budget. However, there
are different types of computational resources, and varying algorithmic approaches yield distinct trade-offs
between those resources. In this section we provide background on existing retrieval-augmented models and describe the costs of those models along different computational dimensions.
Figure 4.3: MAIN RESULT: lumen achieves performance close to FiD with a fraction of live layers. The required fraction decreases with scale. Exact match on Natural Questions (NaturalQ) and TriviaQA validation sets as a function of proportion of live encoder layers for lumen Base, Large, XL, and XXL models.
4.2.1 Computational resources
The usual life-cycle of current models starts with pre-training, followed by fine-tuning on multiple tasks.
Finally, the model is used for inference, either online or for batch distillation to a smaller model. Each of
these stages features a different cost per sample. Let N_pt, N_ft and N_I be the number of processed samples for pre-training, fine-tuning and inference, and F_pt, F_ft and F_I the compute cost per sample for each stage, measured in FLOPs (floating point operations). Then the compute costs for the model are
FLOPs_pre-train = N_pt · F_pt
FLOPs_fine-tune = N_ft · F_ft · number of tasks
FLOPs_inference = N_I · F_I
As shown in past work, FiD inference can be slower than FLOPs would indicate due to decoder memory
bandwidth constraints [39, 15] but as this can be fixed with modifications to the decoder [15] we use FLOPs
as our measure of computational cost in line with related work [89, 79].
Figure 4.4: MAIN RESULT: lumen uses significantly less compute than FiD for the same performance, and this advantage grows with scale. TFLOPs as a function of exact match on Natural Questions (NaturalQ) and TriviaQA test sets. FLOPs are for a single forward step and exclude pre-computation. Compares FiD and lumen with live proportion 0.33 Large, XL and XXL models. Lower is better.
For retrieval-augmented models there are additional costs. The retrieval set must be stored and re-
trievals transmitted to the accelerator. There may also be preprocessing overhead for the retrieval set,
such as pre-computing memory representations. Let N_rs be the size of the retrieval set and F_precompute the FLOPs associated with preprocessing a retrieval candidate. Then storage requirements and pre-computation costs are given by
Storage = Corpus size · Size of a single sample
FLOPs_precompute = Corpus size · F_precompute
If retrieval representations are fine-tuned, then a different version of the retrieval set must be pre-computed
and stored for each task. Required bandwidth for transmission is determined by the product of the number
and size of retrieved representations.
4.2.2 Fusion-in-Decoder
Fusion-in-Decoder [40] consists of a T5 encoder-decoder model. For each input, a number of relevant text
passages are retrieved, and the input is prepended to each passage. The resulting passages are encoded
separately by the encoder, and the encoded representations are then concatenated and attended to by
the decoder to produce a target output. For each model, fine-tuned components are in blue and frozen components in orange.
G = Dec[ Enc(Q; Passage_1); ... ; Enc(Q; Passage_k) ]
Let n_s be the number of source tokens, n_t the number of target tokens, L the number of layers, and d the dimension of the model. Following analysis from [15], the FLOPs for a single inference sample of FiD (ignoring attention score computation) is given by²
F_I = n_s · L · 14d^2  (encoder and cross-attention)  +  n_t · L · 14d^2  (decoder)
with F_pt, F_ft = 3F_I due to the backward step. For fine-tuning and inference n_s ≫ n_t due to the large
number of tokens from retrieved passages. As a result, FiD fine-tuning and inference FLOPs per sample are
very high relative to pre-training. In contrast, storage and bandwidth requirements are low as the retrieval
set consists of passages of raw tokens. FiD has no pre-computation costs.
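As a rough illustration of this formula, the sketch below evaluates F_I for a hypothetical Large-sized configuration; the constants follow the approximation above and the numbers are placeholders, not measurements.

def fid_inference_flops(n_s, n_t, L, d):
    """Per-sample inference FLOPs for FiD, ignoring attention score computation:
    encoder + cross-attention on n_s source tokens, decoder on n_t target tokens."""
    encoder_and_cross = n_s * L * 14 * d**2
    decoder = n_t * L * 14 * d**2
    return encoder_and_cross + decoder

# Hypothetical setting: 40 passages x 256 tokens, 32 output tokens, 24 layers, d=1024.
flops = fid_inference_flops(n_s=40 * 256, n_t=32, L=24, d=1024)
print(f"{flops / 1e12:.2f} TFLOPs per sample")  # source tokens dominate since n_s >> n_t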
4.2.3 Memory
An increasing number of works reduce the cost of retrieval-augmented models by pre-computing dense
representations of retrieval candidates and storing them in a memory. One such work modifies FiD by
pre-computing passage encoder representations and providing the input as a prefix to the decoder [57].
We denote this approach as MemoryFiD.
G = Dec[ Q; MemEnc(Passage_1); ... ; MemEnc(Passage_k) ]
MemoryFiD saves fine-tuning and inference compute at the expense of increased pre-computation,
storage, and bandwidth requirements. Because MemoryFiD does not encode retrieved passages on the fly,
encoder costs are removed and only cross-attention and other decoder compute is left.
F_I = n_s · L · 2d^2  (cross-attention)  +  n_t · L · 14d^2  (decoder)
² We approximate the FLOPs of the MLP block as 8d^2, the FLOPs from the original Transformer MLP. The T5 MLP has dimension between 2.5d and 3d and three matrix multiplication operations including GEGLU, yielding total FLOPs close to 8d^2.
Instead, it pre-computes passage representations, using
FLOPs_precompute = Corpus size · n_p · L · 12d^2,
where n_p is the number of tokens in a single passage. MemoryFiD stores the final layer representations for each passage token, taking up
Storage = Corpus size · n_p · d
Holding model size fixed, MemoryFiD saves compute as long as the retrieval corpus is not too large relative
to the number of samples processed for fine-tuning and inference. However, as passage representations are
not conditioned on the question, MemoryFiD incurs a significant performance penalty relative to normal
FiD. Therefore, in order to reach equivalent performance to standard FiD, MemoryFiD must use a much
larger model, which incurs much larger cross-attention, decoder, pre-training, pre-computation, storage
and bandwidth costs. [57] also fine-tune the memory encoder, which requires pre-computing and storing
a separate memory for each task. This is intractable for real applications involving internet-sized corpora,
so for our main results we assume the memory is pre-computed from a single model without fine-tuning
on individual tasks. Without fine-tuning, the performance penalty is even higher. Figure 4.8 shows the
effect of fine-tuning memory; all our results still apply in that case.
4.3 lumen
Intuitively when reading a passage it is helpful to know what information is needed and for what purpose.
For Fusion-in-Decoder, this is achieved by prepending the input to retrieved passages and fine-tuning the
passage encoder, whereas MemoryFiD does not enjoy such an advantage. With lumen, we explore the
possibility that a similar effect can be achieved by a two-step process, in which a large model generates a
general representation for each passage that can be placed in memory, and a smaller model transforms this
general representation into an input-specific representation by conditioning on the input and task. Figure
4.2 provides an overview of the lumen architecture.
Figure 4.5: lumen achieves much better performance than MemoryFiD at any compute budget. Exact match performance on the test set of Natural Questions as a function of TFLOPs per sample, comparing lumen-1/3 Base, Large and XL models with MemoryFiD Large, XL, and XXL models. FLOPs are for a single forward step and exclude pre-computation.
4.3.1 Architecture
lumen is initialized from a pre-trained T5 encoder-decoder model. The decoder functions the same as the
standard FiD decoder, but lumen features three encoders. The T5 encoder is divided into a large memory encoder, which contains the first 1 − α proportion of layers, and a smaller live encoder with the remaining
α proportion of layers. The memory encoder is applied offline to passages in the corpus to pre-compute
memory representations, which are later updated conditioned on input and task on the fly by the fine-
tuned live encoder. In order to ensure that memory representations and input are compatible, lumen
applies a question encoder to the input before prepending the question representation to the memory
representation. The question encoder shares its structure and initial weights with the memory encoder,
but is fine-tuned.
$$
G = \text{Dec}\Big(\big[\,Q;\ \text{LiveEnc}(H_1);\ \ldots;\ \text{LiveEnc}(H_k)\,\big]\Big),
\qquad
H_i = \big[\,\text{QEnc}(Q);\ \text{MemEnc}(\text{Passage}_i)\,\big]
$$
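The equations above can be read as the following forward pass. This is a minimal sketch with hypothetical encoder and decoder callables (mem_enc, q_enc, live_enc, decoder) standing in for the actual T5X modules; it ignores batching, padding, and attention masks, and uses the question-encoder output as a stand-in for Q.

```python
import numpy as np

def lumen_forward(question_tokens, passages, mem_enc, q_enc, live_enc, decoder,
                  memory=None):
    # Offline step: memory encoder output can be pre-computed once per passage and
    # stored; here it is computed on the fly if no memory is supplied.
    if memory is None:
        memory = [mem_enc(p) for p in passages]

    # Online step: encode the question with the fine-tuned question encoder.
    q_repr = q_enc(question_tokens)

    # H_i = [QEnc(Q); MemEnc(Passage_i)], updated by the smaller live encoder.
    live = [live_enc(np.concatenate([q_repr, m], axis=0)) for m in memory]

    # G = Dec([Q; LiveEnc(H_1); ...; LiveEnc(H_k)]): the decoder attends over the
    # question representation and all live-encoded passages.
    return decoder(np.concatenate([q_repr] + live, axis=0))
```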
4.3.2 Computational analysis
During fine-tuning and inference lumen applies only a proportion α of the layers, leading to a fraction α of FiD reader FLOPs for any given model size.
$$
F_I \;=\; \underbrace{n_s \cdot \alpha L \cdot 12d^2}_{\text{Encoder}} \;+\; \underbrace{n_s \cdot L \cdot 2d^2}_{\text{Cross-attention}} \;+\; \underbrace{n_t \cdot L \cdot 14d^2}_{\text{Decoder}}
$$
Pre-computation costs at the same model size are a factor 1− α of MemoryFiD pre-computation costs
(without fine-tuning the memory encoder). Storage and bandwidth costs are the same as for MemoryFiD
(at same model size and without fine-tuning the memory encoder). However, as we will show, lumen can
match FiD performance with only a modest increase in size, leading to a large decrease in computational
cost without the commensurate increases in pre-training, pre-computation, and storage requirements in-
curred with MemoryFiD.
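A small helper makes this trade-off easy to inspect. The function below evaluates the per-sample FLOPs formula above for an arbitrary live proportion α (α = 1 recovers FiD, α = 0 recovers MemoryFiD); the shapes in the usage example are illustrative assumptions, and the small question-encoder cost over the short question is ignored.

```python
def reader_flops(n_s, n_t, L, d, alpha):
    """Per-sample fine-tuning/inference FLOPs under the approximations above."""
    encoder = n_s * alpha * L * 12 * d**2    # live encoder over retrieved tokens
    cross_attention = n_s * L * 2 * d**2     # decoder cross-attention keys/values
    decoder = n_t * L * 14 * d**2            # decoder self-attention, MLP, queries
    return encoder + cross_attention + decoder

# Hypothetical T5-Large-like shapes: 20 passages of 256 tokens, 32 output tokens.
n_s, n_t, L, d = 20 * 256, 32, 24, 1024
fid = reader_flops(n_s, n_t, L, d, alpha=1.0)
for alpha in (1.0, 1/3, 1/8, 0.0):
    ratio = reader_flops(n_s, n_t, L, d, alpha) / fid
    print(f"alpha = {alpha:.3f}: {ratio:.2f} x FiD reader FLOPs")
```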
4.4 Related Work
Retrieval-augmented models There is a significant amount of research on retrieval-augmented lan-
guage models. Some notable approaches include REALM [35], RAG [55], kNN-LM [49], RETRO [6], and
Fusion-in-Decoder (FiD) [40]. FiD in particular has demonstrated state of the art performance across a
range of tasks [40, 41, 90]. This work focuses on improving the efficiency of FiD through a hybrid memory
approach.
Efficient retrieval-augmented models Retrieval augmentation can be expensive for training and in-
ference, and a large body of work investigates more efficient retrieval-augmented models. The computa-
tional cost of retrieval-augmented models can be partitioned into the cost from reading retrieved passages,
decoding, and long-range attention. Recent work has shown that FiD spends the majority of inference
time in the decoder [39] due to memory bandwidth constraints in cross-attention [15]. However, with the
appropriate modifications [15] the constraint can be ameliorated, after which the majority of training and
inference costs result from reading retrieved passages.
The computational burden from encoding retrieved passages can be reduced by reranking and making
use of only the best retrievals [89, 83, 59]. Alternatively, the resources devoted to retrieval can be adapted
to the difficulty of the input, retrieving fewer or no passages if the model is confident it already knows the
answer [51, 79]. In order to efficiently model interaction between different retrieved passages it is common
to employ sparse long-range attention [33, 1, 91]. Finally, there is a large body of work that attempts to
improve the efficiency of Transformer models in general, for example through parallelization [67], quan-
tization [18, 92], and distillation [37, 31].
Memory models lumen is most nearly related to the literature on memory. Another method to reduce the encoding cost of retrieval-augmented models is to pre-compute representations for the retrieval corpus and collect these representations into a memory, thereby amortizing the encoding cost over all the instances for which a passage is retrieved. In particular, lumen is closely connected to [57], who propose a memory FiD model with pre-computed encoder representations. lumen can be seen as a hybrid of [57] and FiD that partially pre-computes encoder representations for efficiency, and finalizes the encoder representations on the fly conditioned on question and task to avoid the strong performance penalty from pre-computation.
lumen uses memory in a straightforward manner, simply pre-computing token representations from a pre-trained model and retrieving passages with a standard dense passage retriever. Other memory models can be more involved, incorporating end-to-end retrieval within the model [17, 87], storing higher-level latent representations [17, 10, 88], and specific pre-training for memory [17, 95]. The main idea behind lumen, updating retrieved memory representations conditioned on the input, is complementary to and can be combined with these more complex memory models.
4.5 Experiments
4.5.1 Experiment Setup
Training procedure All experiments use models based on the T5.1.1 architecture [68]. The main exper-
iments use models initialized from the public T5 checkpoints [29]. FiD is trained according to the standard
recipe [40]. For lumen, given a proportion of live layers α, the memory encoder and question encoder are each initialized with the first 1 − α proportion of layers of the T5 encoder, and the live encoder is initialized with the last α proportion of layers of the T5 encoder.
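As a concrete illustration of this initialization, the sketch below partitions a pre-trained encoder's layer stack by α; t5_encoder_layers is a hypothetical list of layer modules, and the copy semantics are simplified relative to the actual T5X checkpoint handling.

```python
import copy

def split_pretrained_encoder(t5_encoder_layers, alpha):
    """Split a pre-trained T5 encoder stack into lumen's three encoders."""
    num_layers = len(t5_encoder_layers)
    num_live = round(alpha * num_layers)
    num_memory = num_layers - num_live

    memory_encoder = t5_encoder_layers[:num_memory]    # applied offline, not fine-tuned
    question_encoder = copy.deepcopy(memory_encoder)   # same initialization, but fine-tuned
    live_encoder = t5_encoder_layers[num_memory:]      # fine-tuned, applied on the fly
    return memory_encoder, question_encoder, live_encoder

# e.g. T5-Large has 24 encoder layers, so alpha = 1/3 yields a 16-layer memory
# encoder (and question encoder) and an 8-layer live encoder.
```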
Models are fine-tuned with the T5X framework [69] based on JAX [7] and FLAX [36] using the Adafac-
tor [74] optimizer with batch size 64 and learning rate 0.0001. Test results are generated from checkpoints
with the best dev results. Experiments in Section 4.5.4 pre-train models from scratch. Pre-training fol-
lows the standard T5 training recipe except that we train for 500k steps, and disable the Adafactor second
moment update schedule and factoring.
Data We evaluate lumen on open-domain question-answering datasets Natural Questions [52], Trivi-
aQA [44], and WebQuestions [4] (in Section 4.5.3). For all datasets, each sample is paired with the 20 most
relevant 100-word Wikipedia passages ranked by DPR [48] score. For FiD, the concatenated question and
passage pairs are truncated to 256 tokens. For lumen, the question and passage are individually truncated
to 48 and 208 tokens to provide a fair comparison, as they are processed separately.
Figure 4.6: lumen closes the gap with FiD as scale increases. Proportion of the exact match difference on Natural Questions between MemoryFiD and FiD that is closed by lumen, as a function of model scale, for live proportions 1/3 (left) and 1/8 (right).
4.5.2 Main results
Figure 4.3 shows lumen performance as a function of live proportion for varying model sizes. The first
key observation is that a relatively small proportion of live layers is sufficient to achieve quality close to
FiD. The second key observation is that as the model size increases, the required live proportion to recover
FiD performance decreases. This pattern is further supported by results from Figure 4.6, which explicitly
measures how much of the gap between MemoryFiD and FiD is closed by lumen and shows that the proportion of the gap closed increases with scale.
Figure 4.4 compares FLOPs as a function of performance for lumen and FiD, demonstrating that lumen achieves similar performance at lower FLOPs for fine-tuning and inference (assuming pre-computation is sufficiently amortized to be effectively free). Moreover, the advantage becomes more pronounced with larger model size, consistent with the findings from Figures 4.3 and 4.6. Figure 4.5 shows that lumen also has much stronger performance than MemoryFiD for any FLOP value. Finally, Table 4.2 compares lumen
with published results in the literature.
Figure 4.7: Transferring memory and especially live encoder from a related dataset can partially close the gap with FiD, with increased gains for lower live proportion and smaller final dataset. Exact match on TriviaQA and WebQuestions dev sets with and without transfer from Natural Questions for FiD and lumen XL models with live proportions 1/3 and 1/8. Live keeps the memory encoder frozen during training on Natural Questions while Memory also trains the memory on Natural Questions (still frozen after transfer). The gains from transfer are much more pronounced for smaller live proportion and on WebQuestions, the smaller dataset.
Figure 4.8: Neither conditioning memory on the input nor fine-tuning memory is sufficient to recover FiD performance. Both ingredients are important, although conditioning appears to contribute more. Exact match on Natural Questions (NaturalQ) and TriviaQA dev sets as a function of the proportion of live encoder layers for lumen-Large and two relaxations: one in which the memory layers are fine-tuned, and another in which the memory layers are conditioned on the question.
4.5.3 Transfer
Since the memory encoder is not fine-tuned on each individual task, the live encoder must adapt the
memory representations to the task in addition to conditioning on the input. Especially for smaller live
encoders, this may be difficult to learn while fine-tuning on a single task. Here we evaluate whether lumen
can benefit from transferring from other knowledge-intensive tasks.
In particular, we consider two transfer settings. In the Live setting, we transfer the Live Encoder by
training on Natural Questions with frozen memory before transferring to the target task. In the Memory
setting, the model is trained on Natural Questions with fine-tuned memory before transferring both the
Live and Memory encoder to the target task. The Memory setting follows the intuition that, although it is
infeasible to use a different memory for every task, it may be possible to perform multi-task fine-tuning
before encoding memory.
Figure 4.7 shows the results of transfer from Natural Questions to TriviaQA and WebQuestions. We
note several interesting patterns. First, gains from transfer are higher for smaller live proportion, with
minimal gains for FiD and large gains for lumen-1/8. Second, transferring memory is only helpful for small
live proportion, where the Live Encoder does not contain sufficient layers to fully adapt the memory to
the task. Third, gains from transfer are significantly higher for WebQuestions, a task with a very small
amount of data.
4.5.4 Memory shape
Table 4.1: Adding memory to FiD leads to significant performance gains without additional fine-tuning or inference FLOPs. Exact match performance on Natural Questions and TriviaQA for FiD-Base and lumen-1/3 with Base decoder and live encoder, and a memory encoder with 24 Base layers.

Model             NQ    TQA
FiD Base          47.3  64.4
lumen Base 24-12  48.9  65.4
In our main experiments we initialize lumen from public T5 checkpoints to avoid costly pre-training
and partition the encoder into a memory encoder and live encoder. Can we achieve a better trade-off by pre-training a model with a custom configuration? Fixing the output of the live encoder to be narrow allows us to scale the memory encoder without using more FLOPs, as the cross-attention FLOPs are not affected by the size of the memory encoder. Table 4.1 shows the effect of adding a memory encoder consisting of 24 additional Base layers to an existing T5-Base configuration, improving performance without
increasing compute. Taken to an extreme, these results suggest that combining a large language model
with a moderately sized live encoder could yield strong results at modest cost.
4.5.5 Ablations
The two main differences between FiD, lumen, and MemoryFiD are the extent to which retrieved passages
are conditioned on the input and the extent to which passage encoders are fine-tuned on particular tasks.
Our first ablation investigates how performance differences between lumen and MemoryFiD on the one
hand and FiD on the other hand result from conditioning on the input and fine-tuning. We construct
two ablation settings as intermediate models between lumen and FiD: fine-tuning the memory encoder,
and conditioning the memory encoder on the question (but without fine-tuning it). Figure 4.8 compares
performance as a function of live proportion for these settings. Neither conditioning memory on the input
nor fine-tuning the memory comes close to recovering FiD performance by itself: both are necessary. However, it seems that conditioning may be more helpful on its own than fine-tuning memory.
The lumen live encoder jointly processes concatenated passage and input representations. The decoder therefore receives passages conditioned on the input as well as the input conditioned on the passages. In order to
disentangle these conditioning effects, we experiment with ablations that disallow attention from question
to passage (“no q2mem”) or passage to question (“no mem2q”). Figure 4.9 presents results that show that
conditioning the passage on the input is critical, although the passage-conditioned question is still helpful.
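To make the two ablations concrete, the sketch below builds the boolean self-attention masks implied by "no q2mem" and "no mem2q"; this is an illustrative reconstruction that assumes a [question; passage] token ordering, not the exact masking code used in the experiments.

```python
import numpy as np

def live_encoder_attention_mask(n_q, n_p, allow_q2mem=True, allow_mem2q=True):
    """mask[i, j] = True means token i may attend to token j in the live encoder."""
    n = n_q + n_p
    mask = np.ones((n, n), dtype=bool)
    if not allow_q2mem:
        mask[:n_q, n_q:] = False   # question tokens cannot attend to passage (memory) tokens
    if not allow_mem2q:
        mask[n_q:, :n_q] = False   # passage (memory) tokens cannot attend to question tokens
    return mask

# "no q2mem" ablation: live_encoder_attention_mask(48, 208, allow_q2mem=False)
# "no mem2q" ablation: live_encoder_attention_mask(48, 208, allow_mem2q=False)
```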
Finally, lumen also uses a fine-tuned question encoder to generate a question representation that is
optimized for the live encoder to condition the passage memories on. Figure 4.10 compares performance
between fine-tuning and freezing this question encoder, demonstrating the importance of adapting the
question encoder to the task.
Table 4.2: Comparison of lumen with published results on Natural Questions and TriviaQA test sets. We focus on comparing with FiD, as other works enhance performance with improved retrieval (such as ATLAS), which is orthogonal to our contributions.
Model NQ TQA
REALM [35] 40.4 -
RAG [55] 44.5 56.8
RETRO [6] 45.5 -
T5-XXL [71] 35.2 51.9
ATLAS [41] 60.4 79.8
FiD-L [40] 51.4 67.6
FiD-XXL (ours) 57.3 73.0
lumen-XXL 57.1 73.1
Figure 4.9: The primary gains from the live encoder in lumen result from updating memory representations conditioned on the question. Exact match on the Natural Questions dev set as a function of the proportion of live encoder layers for lumen-Large and two modifications with restricted encoder self-attention. In the 'no q2mem' setting question tokens cannot attend to passage tokens, and vice versa for 'no mem2q'.
Figure 4.10: Fine-tuning the question encoder improves performance significantly. Exact match on the Natural Questions dev set as a function of the proportion of live encoder layers for lumen-Large and a modification for which the question encoder is frozen (so that the memory encoder and question encoder are shared).
4.6 Conclusion
Retrieval-augmented language models such as Fusion-in-Decoder are powerful but expensive. Pre-computing
encoder representations into dense memory, a popular method for reducing computation costs of retrieval-
augmented models, leads to a sharp decrease in performance. We propose lumen, a hybrid between Fusion-
in-Decoder and dense memory. Passage representations are partially pre-encoded into a dense memory,
and then reprocessed on the fly by a fine-tuned encoder that conditions on the question. We show that
lumen achieves stronger performance for the same FLOPs, and that this advantage increases with scale.
Part IV
Conclusion and Future Work
Chapter 5
Conclusion
This thesis analyzed the factors that determine the computational cost of retrieval-augmented language
models, and proposed new methods that yield an improved quality-compute trade-off. The resulting mod-
els achieve the performance of previous retrieval-augmented approaches while being an order of magnitude faster.
First, Chapter 2 showed that the majority of FLOPs from a retrieval-augmented model arise from en-
coding retrieved passages, as Transformer FLOPs scale with input length and retrieved text is typically
much longer than the original input. In contrast, during inference the bulk of time is spent in the de-
coder due to the memory bandwidth overhead of repeatedly loading a long sequence of keys and values
in token-by-token decoding.
FiDO [15], our first proposed approach, drastically reduces memory bandwidth requirements by re-
moving most cross-attention layers and replacing the remaining cross-attention layers with multi-query
attention. With reduced memory bandwidth overhead, FiDO spends the majority of inference time in the encoder, roughly in proportion to FLOPs. FiDO exploits this imbalance by massively scaling the decoder, increasing
performance at modest computational cost.
The next two chapters introduced methods to decrease the reader cost for retrieved passages through
the use of memory. Chapter 3 proposed the TOME model. Passages that contain important information
are likely to be retrieved multiple times, especially during inference. Rather than re-encode such pas-
sages every time they are retrieved, TOME instead pre-computes dense representations of those passages
and retrieves the dense representations directly, amortizing the cost of encoding across all the times they
are retrieved. In particular, the TOME model attends over a memory of dense representations of entity
mentions from Wikipedia [17].
Chapter 4 compared performance of standard retrieval augmentation and memory models, and showed
that using pre-computed memory leads to quality degradation. Memory representations are pre-computed
without knowing what input or task they will be used for, and the resulting general-purpose representa-
tion is less suited to any specific input. To solve this, Chapter 4 proposed lumen [16], a hybrid memory
architecture that combines standard retrieval and pre-computed memory. lumen employs partially pre-
computed representations that are updated by a smaller live encoder that is applied on the fly and con-
ditions the memory on the input and task, an approach that we showed captures most of the quality of
retrieval-augmented models and the speed of pre-computed memory.
Chapter 6
Future Work
This dissertation has so far introduced methods to optimize autoregressive inference and reading retrieved
passages. Here, I reflect on what ingredients are still missing for an optimal retrieval-augmented model,
and propose possible directions for these missing ingredients.
Retrieval
First, even with lumen, encoding retrieved passages is still a major computational bottleneck, and is likely
to become more so as methods for efficient decoding improve further. Another direction to improve the
efficiency of encoding retrieved passages that I have not explored in this dissertation is to simply encode
fewer passages. This can be achieved without overly impacting performance in several ways: by improving the quality of retrieval such that fewer retrieved passages are required, by retrieving fewer passages when the model is confident, or by performing reranking, processing only a subset of retrieved passages that are deemed promising by a scoring model that takes the input into account.
Reranking is normally a challenging approach to use during inference, because the reranker must be
sufficiently smaller than the reader model to save on compute while still being of sufficient quality to
judge the relevance of each retrieved passage to the input. However, in lumen we already have access to
a powerful representation for retrieved passages for free: namely, the pre-computed memory. I believe a
shallow reranker that operates on top of these pre-computed representations can effectively rank retrieved
passages while being significantly cheaper than the live encoder such that it can be tractably applied
during inference. Adding reranking to lumen could provide another strong improvement in the quality-
performance trade-off.
Pre-training
Retrieval-augmented language models are typically initialized from a standard model pre-trained on lan-
guage modeling without retrieval, such as T5. A consistent theme throughout this dissertation is that we
can do better by designing models and training procedures with retrieval-augmentation in mind. FiDO’s
sparse cross-attention, multi-query attention and especially decoder asymmetry result in favourable trade-
offs specifically for retrieval-augmented models. TOME benefited from a memory-specific pre-training
phase with a training objective specifically designed to challenge memory. For lumen, we saw that trans-
ferring memory and live encoder from a larger dataset to a smaller dataset drastically improved perfor-
mance, which suggests performance might have improved further with appropriate pre-training.
I believe we can achieve strong improvements in retrieval-augmented models by going further in this direction and performing large-scale pre-training with retrieval as a first-class citizen, mimicking the down-
stream retrieval-augmented setting as much as possible. In particular, I see the missing ingredients as (1) a
challenging semantic objective that relies more on world knowledge than nearby syntax, such as heavily
masked entity prediction, (2) context from end-to-end retrieved relevant passages, instead of nearby text
in the same document, and (3) a lumen-like model that restricts attention between the context passages
and the masked input, such that the resulting model is amenable to pre-computation.
Bibliography
[1] Joshua Ainslie, Santiago Ontañón, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham,
Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: encoding long and structured inputs
in transformers. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedingsofthe
2020ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,EMNLP2020,Online,November
16-20, 2020, pages 268–284. Association for Computational Linguistics, 2020.
[2] Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona T. Diab, and Marjan Ghazvininejad. A review
on language models as knowledge bases. CoRR, abs/2204.06031, 2022.
[3] Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. Matching the blanks:
Distributional similarity for relation learning. In ACL 2019, 2019.
[4] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from
question-answer pairs. InProceedingsofthe2013ConferenceonEmpiricalMethodsinNaturalLanguage
Processing,EMNLP2013,18-21October2013,GrandHyattSeattle,Seattle,Washington,USA,Ameeting
of SIGDAT, a Special Interest Group of the ACL, pages 1533–1544. ACL, 2013.
[5] Bernd Bohnet, Vinh Q Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisen-
stein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, et al. Attributed question answering: Evaluation
and modeling for attributed large language models. arXiv preprint arXiv:2212.08037, 2022.
[6] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican,
George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas,
Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones,
Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen
Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from
trillions of tokens. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu,
and Sivan Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022,
Baltimore,Maryland,USA, volume 162 ofProceedingsofMachineLearningResearch, pages 2206–2240.
PMLR, 2022.
[7] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclau-
rin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX:
composable transformations of Python+NumPy programs, 2018.
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey
Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Ben-
jamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and
Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato,
Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors,AdvancesinNeuralInformationPro-
cessing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020,
December 6-12, 2020, virtual, 2020.
[9] Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. Knowl-
edgeable or educated guess? revisiting language models as knowledge bases. In ACL/IJCNLP 2021,
2021.
[10] Wenhu Chen, Pat Verga, Michiel de Jong, John Wieting, and William W. Cohen. Augment-
ing pre-trained language models with qa-memory for open-domain question answering. CoRR,
abs/2204.04581, 2022.
[11] Xi Chen, Xiao Wang, Soravit Changpinyo, A. J. Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian
Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan
Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury,
Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas
Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. Pali: A jointly-scaled mul-
tilingual language-image model. CoRR, abs/2209.06794, 2022.
[12] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh,
Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam
Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Brad-
bury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, San-
jay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander
Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanu-
malayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Olek-
sandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat,
Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah
Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022.
[13] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi
Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai,
Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu,
Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,
Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned
language models. CoRR, abs/2210.11416, 2022.
[14] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and
memory-efficient exact attention with io-awareness. CoRR, abs/2205.14135, 2022.
[15] Michiel de Jong, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, and
William Cohen. Fido: Fusion-in-decoder optimized for stronger performance and faster inference.
arXiv preprint arXiv:2212.08153, 2022.
[16] Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha,
and William Cohen. Pre-computed memory or on-the-fly encoding? a hybrid approach to retrieval
augmentation makes the most of your compute. arXiv preprint arXiv:2301.10448, 2023.
[17] Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, and William W. Cohen. Mention
memory: incorporating textual knowledge into transformers through entity mention attention. In
The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29,
2022. OpenReview.net, 2022.
[18] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multipli-
cation for transformers at scale. CoRR, abs/2208.07339, 2022.
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[20] Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and
William W. Cohen. Time-aware language models as temporal knowledge bases.Trans.Assoc.Comput.
Linguistics, 10:257–273, 2022.
[21] Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov,
and William W. Cohen. Differentiable reasoning over a virtual knowledge base. In ICLR 2020, 2020.
[22] Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Mar-
garet Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal
clean crawled corpus. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau
Yih, editors,Proceedingsofthe2021ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,
EMNLP2021,VirtualEvent/PuntaCana,DominicanRepublic,7-11November,2021, pages 1286–1305.
Association for Computational Linguistics, 2021.
[23] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In
9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,
2021. OpenReview.net, 2021.
[24] Julian Martin Eisenschlos, Bhuwan Dhingra, Jannis Bulian, Benjamin Börschinger, and Jordan Boyd-
Graber. Fool me twice: Entailment from wikipedia gamification. arXiv preprint arXiv:2104.04725,
2021.
[25] Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich
Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language mod-
els. Trans. Assoc. Comput. Linguistics, 9:1012–1031, 2021.
[26] Thibault Févry, Livio Baldini Soares, Nicholas FitzGerald, Eunsol Choi, and Tom Kwiatkowski. Enti-
ties as experts: Sparse memory access with entity supervision. In EMNLP 2020, 2020.
[27] Nicholas FitzGerald, Daniel M. Bikel, Jan A. Botha, Daniel Gillick, Tom Kwiatkowski, and Andrew
McCallum. MOLEMAN: mention-only linking of entities with a mention annotation network. In
ACL-IJCNLP 2021, 2021.
[28] Google. Profile your model with cloud tpu tools. https://cloud.google.com/tpu/docs/
cloud-tpu-tools, 2020. Accessed: 2022-11-11.
[29] Google. Pre-trained t5 models. https://github.com/google-research/t5x/blob/main/docs/models.md,
2022. Accessed: 2022-12-20.
[30] Google. System architecture tpu vm. https://cloud.google.com/tpu/docs/
system-architecture-tpu-vm, 2022. Accessed: 2022-11-19.
[31] Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey.
Int. J. Comput. Vis., 129(6):1789–1819, 2021.
[32] Alex Graves. Sequence transduction with recurrent neural networks. CoRR, abs/1211.3711, 2012.
[33] Mandy Guo, Joshua Ainslie, David C. Uthus, Santiago Ontañón, Jianmo Ni, Yun-Hsuan Sung, and
Yinfei Yang. Longt5: Efficient text-to-text transformer for long sequences. In Marine Carpuat, Marie-
Catherine de Marneffe, and Iván Vladimir Meza Ruíz, editors, FindingsoftheAssociationforComputa-
tionalLinguistics: NAACL2022,Seattle,WA,UnitedStates,July10-15,2022, pages 724–736. Association
for Computational Linguistics, 2022.
[34] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar.
Accelerating large-scale inference with anisotropic vector quantization. In ICML 2020, 2020.
[35] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented
language model pre-training. In International Conference on Machine Learning, pages 3929–3938.
PMLR, 2020.
[36] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas
Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020.
[37] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.
CoRR, abs/1503.02531, 2015.
[38] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Ruther-
ford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric
Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero,
Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-
optimal large language models. CoRR, abs/2203.15556, 2022.
[39] Sebastian Hofstätter, Jiecao Chen, Karthik Raman, and Hamed Zamani. Fid-light: Efficient and effec-
tive retrieval-augmented text generation. CoRR, abs/2209.14290, 2022.
[40] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open
domain question answering. In Paola Merlo, Jörg Tiedemann, and Reut Tsarfaty, editors, Proceed-
ings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, EACL 2021, Online, April 19 - 23, 2021, pages 874–880. Association for Computational
Linguistics, 2021.
[41] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane
Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval
augmented language models. CoRR, abs/2208.03299, 2022.
[42] Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Kumar Singh, and Mohit
Bansal. Hover: A dataset for many-hop fact extraction and claim verification. In EMNLP 2020, 2020.
[43] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. arXiv
preprint arXiv:1702.08734, 2017.
[44] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, ed-
itors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL
2017,Vancouver,Canada,July30-August4,Volume1: LongPapers, pages 1601–1611. Association for
Computational Linguistics, 2017.
[45] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly
supervised challenge dataset for reading comprehension. In ACL 2017, 2017.
[46] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger,
Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate
protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
[47] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models
struggle to learn long-tail knowledge. CoRR, abs/2211.08411, 2022.
[48] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi
Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie
Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedingsofthe2020ConferenceonEmpirical
MethodsinNaturalLanguageProcessing,EMNLP2020,Online,November16-20,2020, pages 6769–6781.
Association for Computational Linguistics, 2020.
[49] Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. Generalization
through memorization: Nearest neighbor language models. In8thInternationalConferenceonLearn-
ing Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[50] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio
and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[51] Bernhard Kratzwald and Stefan Feuerriegel. Adaptive document retrieval for deep question answer-
ing. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 -
November 4, 2018, pages 576–581. Association for Computational Linguistics, 2018.
[52] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Al-
berti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones,
Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Nat-
ural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics,
7:452–466, 2019.
[53] Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. Latent retrieval for weakly supervised open
domain question answering. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Pro-
ceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence,
Italy,July28-August2,2019,Volume1: LongPapers, pages 6086–6096. Association for Computational
Linguistics, 2019.
[54] Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida I. Wang, and Luke Zettle-
moyer. Pre-training via paraphrasing. In NeurIPS 2020, 2020.
[55] Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe
Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle,
Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in
NeuralInformationProcessingSystems33: AnnualConferenceonNeuralInformationProcessingSystems
2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[56] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V.
Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam
Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language
models. CoRR, abs/2206.14858, 2022.
[57] Zonglin Li, Ruiqi Guo, and Sanjiv Kumar. Decoupled context processing for context augmented
language modeling. CoRR, abs/2210.05758, 2022.
[58] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR 2019, 2019.
[59] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen.
Reader-guided passage reranking for open-domain question answering. In Chengqing Zong, Fei Xia,
Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics:
ACL/IJCNLP2021,OnlineEvent,August1-6,2021, volume ACL/IJCNLP 2021 ofFindingsofACL, pages
344–350. Association for Computational Linguistics, 2021.
[60] Ankur Mohan. Understanding roofline charts. 2018.
[61] NVIDIA. Nvidia a100 tensor core gpu. https://www.nvidia.com/en-us/data-center/a100/, 2022. Ac-
cessed: 2022-12-06.
[62] Georg Ofenbeck, Ruedi Steinmann, Victoria Caparrós Cabezas, Daniele G. Spampinato, and Markus
Püschel. Applying the roofline model. In 2014 IEEE International Symposium on Performance Anal-
ysis of Systems and Software, ISPASS 2014, Monterey, CA, USA, March 23-25, 2014, pages 76–85. IEEE
Computer Society, 2014.
[63] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh,
and Noah A. Smith. Knowledge enhanced contextual word representations. InEMNLP-IJCNLP2019,
2019.
[64] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick S. H. Lewis, Majid Yazdani, Nicola De Cao,
James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rock-
täschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. InNAACL-
HLT 2021, 2021.
[65] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander H. Miller. Language models as knowledge bases? In Kentaro Inui, Jing Jiang, Vincent
Ng, and Xiaojun Wan, editors,Proceedingsofthe2019ConferenceonEmpiricalMethodsinNaturalLan-
guage Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-
IJCNLP2019,HongKong,China,November3-7,2019, pages 2463–2473. Association for Computational
Linguistics, 2019.
[66] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu,
and Alexander H. Miller. Language models as knowledge bases? In EMNLP-IJCNLP 2019, 2019.
[67] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Lev-
skaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer
inference. CoRR, abs/2211.05102, 2022.
[68] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi
Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text
transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020.
[69] Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor,
Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz,
Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu,
Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo
Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-
Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-
Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua
Newlan, and Andrea Gesmundo. Scaling up models and data with t5x and seqio. arXiv preprint
arXiv:2203.17189, 2022.
[70] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the pa-
rameters of a language model? In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,
Proceedingsofthe2020ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,EMNLP2020,
Online, November 16-20, 2020, pages 5418–5426. Association for Computational Linguistics, 2020.
[71] Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the pa-
rameters of a language model? In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,
Proceedingsofthe2020ConferenceonEmpiricalMethodsinNaturalLanguageProcessing,EMNLP2020,
Online, November 16-20, 2020, pages 5418–5426. Association for Computational Linguistics, 2020.
[72] Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, and Danqi Chen. Simple entity-centric questions
challenge dense retrievers. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-
tau Yih, editors, EMNLP 2021, 2021.
[73] Noam Shazeer. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150,
2019.
[74] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost.
In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on
Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of
Proceedings of Machine Learning Research, pages 4603–4611. PMLR, 2018.
[75] Haitian Sun, Patrick Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, and William W. Cohen. Reason-
ing over virtual knowledge bases with open predicate relations. In ICML 2021, 2021.
[76] Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions.
In NAACL-HLT 2018, 2018.
[77] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia,
Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science.
CoRR, abs/2211.09085, 2022.
[78] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale
dataset for fact extraction and verification. In NAACL-HLT 2018, 2018.
[79] Neeraj Varshney, Man Luo, and Chitta Baral. Can open-domain QA reader utilize external knowledge
efficiently like humans? CoRR, abs/2211.12707, 2022.
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy
Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors,Advances
inNeuralInformationProcessingSystems30: AnnualConferenceonNeuralInformationProcessingSys-
tems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[81] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy
Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors,Advances
inNeuralInformationProcessingSystems30: AnnualConferenceonNeuralInformationProcessingSys-
tems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
[82] Pat Verga, Haitian Sun, Livio Baldini Soares, and William W. Cohen. Adaptable and interpretable
neural memoryover symbolic knowledge. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer,
Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao
Zhou, editors,Proceedingsofthe2021ConferenceoftheNorthAmericanChapteroftheAssociationfor
ComputationalLinguistics: HumanLanguageTechnologies,NAACL-HLT2021,Online,June6-11,2021,
pages 3678–3691. Association for Computational Linguistics, 2021.
[83] Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry
Tesauro, Bowen Zhou, and Jing Jiang. R³: Reinforced ranker-reader for open-domain question an-
swering. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial In-
telligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence
(EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5981–5988. AAAI Press, 2018.
[84] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. Self-consistency
improves chain of thought reasoning in language models. CoRR, abs/2203.11171, 2022.
[85] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou.
Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903, 2022.
[86] Samuel Williams, Andrew Waterman, and David A. Patterson. Roofline: an insightful visual perfor-
mance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009.
[87] Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing trans-
formers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event,
April 25-29, 2022. OpenReview.net, 2022.
[88] Yuxiang Wu, Yu Zhao, Baotian Hu, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. An
efficient memory-augmented transformer for knowledge-intensive NLP tasks. CoRR, abs/2210.16773,
2022.
[89] Donghan Yu, Chenguang Zhu, Yuwei Fang, Wenhao Yu, Shuohang Wang, Yichong Xu, Xiang Ren,
Yiming Yang, and Michael Zeng. Kg-fid: Infusing knowledge graph in fusion-in-decoder for open-
domain question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,
Proceedingsofthe60thAnnualMeetingoftheAssociationforComputationalLinguistics(Volume1: Long
Papers),ACL2022,Dublin,Ireland,May22-27,2022, pages 4961–4974. Association for Computational
Linguistics, 2022.
[90] Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu,
Michael Zeng, and Meng Jiang. Generate rather than retrieve: Large language models are strong
context generators. CoRR, abs/2209.10063, 2022.
[91] Yury Zemlyanskiy, Joshua Ainslie, Michiel de Jong, Philip Pham, Ilya Eckstein, and Fei Sha. Read-
twice: Reading very large documents with memories. In Kristina Toutanova, Anna Rumshisky, Luke
Zettlemoyer, Dilek Hakkani-Tür, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty,
and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online,
June 6-11, 2021, pages 5189–5195. Association for Computational Linguistics, 2021.
[92] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan
Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen,
Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: an open bilingual pre-trained model. CoRR,
abs/2210.02414, 2022.
[93] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher
Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster,
Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open
pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
[94] Chen Zhao, Chenyan Xiong, Jordan L. Boyd-Graber, and Hal Daumé III. Multi-step reasoning over
unstructured text with beam dense retrieval. In NAACL-HLT 2021, 2021.
[95] Zexuan Zhong, Tao Lei, and Danqi Chen. Training language models with memory augmentation.
CoRR, abs/2205.12674, 2022.
Appendices
A Appendix for FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference
A.1 Training
All experiments are built on the T5.1.1 architecture with the training recipe from T5 [68]. The first excep-
tion is the optimizer; we find that the second moment factoring and mixing schedule from Adafactor [74]
can lead to instability, especially with unbalanced encoder and decoder sizes. Instead, we disable factoring
and second moment mixing, leading to an optimizer that is a hybrid between Adafactor and Adam [50].
The second difference to the training recipe arises from the observation that FiDO XL-XXL is unstable
for the standard training regimen. We solve the instability by restarting from a recent healthy checkpoint
with a 10x decreased learning rate, which happened once.
During fine-tuning, we load not only model weights but also second moment estimates, which we find
leads to better fine-tuning in general and particularly for asymmetric models. We finetune with learning
rate 0.001 and batch size 64 for all datasets. For evaluation on test sets we select the checkpoint with the
best validation performance.
A.2 Other Analysis
Model          Total TPS   Decoder TPS
Vanilla FiD    135         123
+ LSA          51          39
+ MQ           35          23
+ Beam 16      35          23
+ XL Decoder   117         105
Table A.1: Inference time per sample (TPS, in ms) with batch size 1 for Base FiD with varying FiDO components.
Model              Decoder TPS   NaturalQ
FiD with LSA, MQ   0.6           46.3
+ Beam 4           2.4           46.2
FiDO               2.0           48.2
Table A.2: Decoder inference time (ms) and QA exact match for FiD Base models, comparing the trade-offs of beam search versus scaling decoder size.
B Appendix for Mention Memory: incorporating textual knowledge into Transformers through entity mention attention
B.1 Pre-training
We train on English Wikipedia, processed with the entity linking and named entity recognition tools from
the Google Cloud NLP API.¹ We use existing hyperlinks in Wikipedia as additional entity annotations. All
models are pre-trained on 128 TPUs using AdamW optimizer [58] with learning rate 1e-4 and batch size
of 4096. Each passage in the batch has length T = 128, excluding entity tokens. The Mention Encoder
and batch-tome are pre-trained for 1 million steps with 50k warmup steps, and tome is trained for 500k additional steps with 25k warmup steps after initialization from batch-tome. Both models are trained with linear learning rate decay. Mention Encoder and batch-tome share Transformer weights during Mention
Encoder pre-training. We apply gradient clipping with a norm of 1.0 and weight decay of 0.01. Weight
decay is applied to all weights except layer norm and bias weights.
batch-tome and tome are trained with weight 0.85 on the MLM objective and 0.15 on the entity
coreference resolution objective. We mask 20% of whole entity mentions and 10% of other tokens. We limit
the coreference resolution objective to mentions of the 1 million most frequent Wikipedia entities. We use
24 mentions per sample, with a batch size of 32 samples per TPU. We subsample mentions uniformly if
the average number of annotated mentions on a TPU exceeds 24. Key mention encodings have dimension $d_K = 128$, and value and coreference mention encodings have dimension $d_V = d_C = 512$.
Disallowed same passage retrieval for Mention Encoder. We want the model to use memory as a
source of additional information for processing a passage. Therefore, we explicitly set attention weights
to 0 for memories generated from the same passage as the current one.
¹ https://cloud.google.com/natural-language/docs/basics#entity_analysis
B.1.1 Mention Encoder data generation
We pre-train the Mention Encoder to produce mention encodings that are useful for batch-tome. In order to provide batch-tome with an incentive to use the memory, we need to ensure that mentions from different
samples within a batch are relevant to each other. We achieve this by batching passages from the same or
related Wikipedia articles.
We generate clusters of 256 passages from Wikipedia articles using a greedy method. First, we create
a cluster from the longest unused Wikipedia article and add related articles until the cluster consists of
256 passages. In particular, at each step we add the article with the largest Jaccard similarity between its
entity set and the entity set of articles in the current cluster.
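A minimal sketch of this greedy procedure is shown below; the article data format (a "passages" list and an "entities" set per article) and the function name are assumptions for illustration, not the actual data pipeline.

```python
def build_passage_clusters(articles, cluster_size=256):
    """Greedily group Wikipedia passages into clusters of related articles."""
    # Start from the longest unused article, then repeatedly add the article with
    # the largest Jaccard similarity between its entity set and the cluster's.
    unused = sorted(articles, key=lambda a: len(a["passages"]), reverse=True)
    clusters = []
    while unused:
        seed = unused.pop(0)
        cluster = list(seed["passages"])
        cluster_entities = set(seed["entities"])

        def jaccard(article):
            ents = set(article["entities"])
            union = cluster_entities | ents
            return len(cluster_entities & ents) / len(union) if union else 0.0

        while unused and len(cluster) < cluster_size:
            best = max(unused, key=jaccard)
            unused.remove(best)
            cluster.extend(best["passages"])
            cluster_entities |= set(best["entities"])
        clusters.append(cluster[:cluster_size])
    return clusters
```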
B.1.2 Coreference resolution loss
For every linked mention m in the batch we compute a mention encoding $z_m$ by applying a separate SpanEncodingLayer on the output of batch-tome. First, we compute the loss for every linked mention m in the batch. To this end, we denote linked mentions in every other passage in the batch as positive, $P^+(m)$, if they have the same entity ID as m, and negative, $P^-(m)$, otherwise. The loss per mention is an average of cross-entropy losses over the positive mentions $m^+ \in P^+(m)$:
$$
L_{\text{coref}}(m) \;=\; -\frac{1}{|P^+(m)|} \sum_{m^+ \in P^+(m)} \log \frac{\exp(z_m^\top z_{m^+})}{\exp(z_m^\top z_{m^+}) + \sum_{m^- \in P^-(m)} \exp(z_m^\top z_{m^-})}
$$
The total loss is the average of these losses over linked mentions that have at least one positive mention (i.e., for which $P^+(m)$ is not empty).
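A minimal NumPy sketch of this loss is given below; it ignores numerical stabilization and TPU sharding, and the array names and shapes are assumptions for illustration.

```python
import numpy as np

def coref_loss(z, entity_ids, passage_ids):
    """z: [M, d] mention encodings; entity_ids, passage_ids: [M] integer arrays."""
    logits = z @ z.T                                           # z_m^T z_m'
    same_entity = entity_ids[:, None] == entity_ids[None, :]
    other_passage = passage_ids[:, None] != passage_ids[None, :]
    positive = same_entity & other_passage                     # P+(m)
    negative = ~same_entity & other_passage                    # P-(m)

    total, count = 0.0, 0
    for m in range(z.shape[0]):
        pos = np.where(positive[m])[0]
        if pos.size == 0:
            continue                                           # skip mentions without positives
        neg_sum = np.sum(np.exp(logits[m, negative[m]]))
        pos_scores = np.exp(logits[m, pos])
        per_positive = -np.log(pos_scores / (pos_scores + neg_sum))
        total += per_positive.mean()                           # average over positives
        count += 1
    return total / max(count, 1)                               # average over valid mentions
```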
B.2 Experiments
B.2.1 Fine-tuning setup
tome is fine-tuned on 32 TPUs using the Adam optimizer with a learning rate of 1e-5 and total batch
size 32. In contrast to pre-training we set max mentions to 32 per sample for fine-tuning. We use 1000
warmup steps and linear learning rate decay. Gradient clipping and weight decay are the same as during
pre-training. We take the highest scoring checkpoint on dev sets and evaluate it on the test set. We use
the spaCy noun chunker to detect noun phrases and treat these as claim/question entity mentions.
The model can be fine-tuned with full memory on a server of 8 A100 GPUs or 16 v3/v4 TPUs. A model
with half memory (75M mentions) can be fine-tuned on 8 V100s/P100s or 8 TPUs.
B.2.2 Baselines
Following [35] we used REALM to perform extractive question answering on the TriviaQA and ComplexWebQuestions datasets. We also adapted the model to the classification setting in order to apply it to claim verification tasks. Given an input claim X we compute the probability of a prediction Y (whether the claim holds true or not) as a marginalization over retrievals Z:
$$
\Pr(Y \mid X) \;=\; \sum_{z \in Z} \Pr(Y \mid X, Z = z)\cdot \Pr(Z = z \mid X),
$$
where $\Pr(Y \mid X, Z = z)$ is the output probability produced by the reader model and $\Pr(Z = z \mid X)$ is produced by the retrieval model.
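The marginalization itself is just a weighted sum over retrievals; the sketch below illustrates it with made-up numbers (three retrieved passages, binary claim verification).

```python
import numpy as np

def marginal_class_probs(reader_probs, retrieval_probs):
    """reader_probs: [K, C] class probabilities per retrieved passage;
    retrieval_probs: [K] retrieval distribution. Returns [C] marginal probabilities."""
    return retrieval_probs @ reader_probs

reader_probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])   # hypothetical reader outputs
retrieval_probs = np.array([0.5, 0.3, 0.2])                      # hypothetical retrieval scores
print(marginal_class_probs(reader_probs, retrieval_probs))       # [0.67, 0.33]
```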
B.2.3 Claim verification
See Table B.1 for the results on development and test splits of claim verification datasets. Additionally,
Table B.2 compares our FM2 results to the original dataset baselines.
B.2.4 Question Answering
We report additional results on the EntityQuestions dataset from [72]. The dataset consists of questions
involving rare entities, making it especially challenging for modern retrieval methods such as DPR. Eval-
uation results for tome models and baselines are shown in Table B.3 and Table B.4. Following [72] we
report recall at 20 as an evaluation metric. Since tome retrieves mentions rather than passages, a direct
comparison is difficult. We evaluate tome conservatively, treating recall at 20 as successful if one of the 20
highest scoring mentions belongs to the correct entity (in contrast to DPR, for which the correct answer
only has to be somewhere in the retrieved 100 word document).
tome sets the state of the art on this dataset and outperforms DPR by a very large margin. REALM
cannot be fairly compared to DPR due to longer retrieved passages (100 vs 288 tokens). Therefore, we
perform a separate experiment using accuracy with REALM as a baseline, showing large performance
gains over REALM as well.
Table B.1: Accuracy on claim verification datasets. #Encoded refers to the number of passages encoded by
a BERT reader to answer a single question. EaE stands for Entities as Experts model.
Model    #Params  #Encoded  HoVer dev  HoVer test  FEVER dev  FEVER test  FM2 dev
RAG      620M     100       -          -           74.5       72.5        -
REALM    330M     5         67.3       66.1        70.4       67.1        65.8
EaE      360M     1         66.2       66.6        66.1       63.6        63.5
tome-1   220M     1         73.6       72.8        70.5       67.8        67.7
tome-2   220M     1         74.1       73.1        71.1       68.1        68.4
Table B.2: Accuracy on FM2 compared with original dataset baselines. Oracle refers to oracle retrieval
followed by a BERT-Base reader.
Model        Accuracy
Oracle [24]  69.3
DPR [24]     64.2
EaE          63.5
REALM        65.8
tome-1       67.7
tome-2       68.4
B.2.5 Importance of pre-training objectives
We perform several ablation experiments for the pre-training procedure (see Table B.5). First, the results show that the entity prediction objective (cf. Section 3.2.4) is not essential for tome pre-training. Performance on claim verification datasets (FEVER and HoVer) is not affected by whether we use entity prediction for pre-training. More surprisingly, removing this objective only slightly decreases performance on entity question answering datasets (TriviaQA and ComplexWebQuestions). We predict entities for question answering in the same way as we do for the entity prediction objective during pre-training (cf. Equation 3.10), so we expected the entity prediction auxiliary loss to be important.
Table B.3: EntityQuestions recall@20
Model Recall@20
DPR [72] 65.4
BM25 [72] 71.2
tome-1 83.3
tome-2 83.8
Table B.4: EntityQuestions top-1 accuracy
Model Accuracy
Entities as Experts 32.5
REALM 59.0
tome-1 62.1
tome-2 66.0
74
Table B.5: Performance ablations for pre-training objectives.
Model                        HoVer dev  FEVER dev  TriviaQA dev  CWQ dev
tome-1                       73.6       70.5       50.8          44.9
w/o entity coreference loss  69.8       68.4       42.5          40.5
w/o entity prediction loss   73.7       70.7       49.4          43.8
On the other hand, the related entity coreference objective (cf. Sections B.1.1 and B.1.2) is crucial for batch-tome and Mention Encoder pre-training. This is consistent with our intuition that semantically challenging tasks incentivize the model to store useful information in memory.
B.2.6 tome initialization
We initialize the tome model with a pre-trained batch-tome model, which we find to be especially important for warming up retrieval. If tome is initialized from scratch (or even from BERT weights), it does not learn to use the memory. In fact, tome has to be initialized from the same batch-tome model used to generate the memory. This implies that multi-stage training is a vital ingredient for tome to succeed. Our explanation for why tome is sensitive to initialization is that tome needs to learn two skills: first, to effectively use retrieved mentions for its predictions, and second, to retrieve relevant mentions. Learning both capabilities end to end gives rise to a mutual dependence: to get a signal for learning how to use retrieved mentions, the retrieved mentions have to be useful, and to learn to retrieve useful mentions, the model needs to utilize retrieved mentions. If initialized from scratch, the model is not able to learn both skills simultaneously. The pre-training stage with the smaller in-batch memory functions as a curriculum that addresses this problem.
B.3 Nearest neighbor search
Nearest neighbor search is an extremely common problem, and there exist numerous approaches and packages for fast approximate nearest neighbor search (ANNS) [34, 43]. Most approaches employ two methods for fast search: 1) compress the search table by projecting to a lower dimension and quantizing, and perform comparisons in this compressed space, and 2) divide the search table into buckets of similar items, and search only a subset of the buckets. Retrieve-and-read models use ANNS packages to search for related passages [35, 55].
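To illustrate the first family of methods (projection plus quantization), here is a toy NumPy sketch; real packages such as those in [34, 43] use far more sophisticated quantization schemes, and the function names here are illustrative.

```python
import numpy as np

def compress_table(table, out_dim=64, seed=0):
    """Compress a search table: random projection to `out_dim` dims, then int8 quantization."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(table.shape[1], out_dim)) / np.sqrt(out_dim)
    low = table @ proj                       # lower-dimensional table
    scale = np.abs(low).max() / 127.0        # symmetric scalar quantization
    return (low / scale).astype(np.int8), proj, scale

def compressed_search(query, quantized, proj, scale, k=10):
    """Score a query against the compressed table and return the top-k indices."""
    scores = (quantized.astype(np.float32) * scale) @ (query @ proj)
    return np.argsort(-scores)[:k]
```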
Applying such packages for tome is slightly trickier, as tome needs to perform ANNS inside the model. One viable route is to compute queries on-device, transmit them to a separate ANNS server, and then transmit the results back. We would recommend this approach for GPU accelerators, which have faster host-device communication and slower device-device communication. As we are using TPU accelerators, we decided to use on-device ANNS, which does not require coordinating additional servers and will potentially allow for backpropagating through memory in future work.
B.3.1 On-device nearest neighbor search
We shard the Mention Memory over all TPU devices. We perform search by distributing each query to all
devices and retrieving top-K search results from each local memory shard. Then, the results are distributed
back to the original devices and the local search results are aggregated through another, global top-K.
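A single-query NumPy sketch of this scheme is shown below; in the actual model each device issues a batch of queries and the exchange happens via TPU collectives, and the two local search variants described next (dot-product and bucketed ANNS) differ only in how the local scores are computed. The defaults mirror the hyperparameters listed at the end of this section.

```python
import numpy as np

def sharded_topk_search(query, memory_shards, local_k=2, global_k=128):
    """Sketch of sharded search: local top-k per shard, then a global top-k.

    query:         [d] query vector.
    memory_shards: list of [N_i, d] arrays, one Mention Memory shard per device.
    """
    cand_scores, cand_ids = [], []
    for shard_idx, shard in enumerate(memory_shards):
        scores = shard @ query                            # local dot-product search
        local_top = np.argsort(-scores)[:local_k]         # local top-k on each device
        cand_scores.append(scores[local_top])
        cand_ids.extend((shard_idx, int(i)) for i in local_top)
    all_scores = np.concatenate(cand_scores)
    order = np.argsort(-all_scores)[:global_k]            # global top-k over the union
    return [(cand_ids[i], float(all_scores[i])) for i in order]
```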
Dot-product The first method we describe is naive dot-product search, taking advantage of the matrix multiplication capacity of TPU accelerators. In this method we perform search over local shards by taking the dot product between the query and the local memory shard and performing an approximate top-k operation over the results. Dot-product search is easy to implement and fast for smaller memory sizes (up to 10 million entries). We implemented this method first due to its simplicity, and our primary experimental results employ this search method.
ANNS To speed up search we implemented method 2) from standard CPU-based ANNS, bucketing the search table and searching only a subset of buckets. In particular, we perform k-means clustering to divide the Mention Memory into clusters, and perform dot-product search over the top $n_s$ clusters on each device.
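A toy NumPy sketch of this bucketed search, assuming the memory fits in one array on a single device (in reality the memory is sharded and clustered offline); `n_search` stands in for $n_s$ and the function names are illustrative.

```python
import numpy as np

def kmeans(memory, num_clusters, iters=10, seed=0):
    """Toy k-means over the memory table (the real clustering is done offline)."""
    rng = np.random.default_rng(seed)
    centroids = memory[rng.choice(len(memory), num_clusters, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance up to a per-row constant: -2 x.c + ||c||^2
        dists = -2.0 * memory @ centroids.T + (centroids ** 2).sum(axis=1)
        assign = np.argmin(dists, axis=1)
        for c in range(num_clusters):
            members = memory[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # final assignment with the updated centroids
    dists = -2.0 * memory @ centroids.T + (centroids ** 2).sum(axis=1)
    return centroids, np.argmin(dists, axis=1)

def bucketed_search(query, memory, centroids, assign, n_search=4, k=2):
    """Search only the `n_search` highest-scoring clusters, then dot-product top-k."""
    top_clusters = np.argsort(-(centroids @ query))[:n_search]
    candidates = np.nonzero(np.isin(assign, top_clusters))[0]
    scores = memory[candidates] @ query
    top = np.argsort(-scores)[:k]
    return candidates[top], scores[top]
```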
Overhead While the Mention Memory is stored on-device, the memory overhead is negligible as the memory table is sharded. For pre-training, the Mention Memory took up 2.2% of available device memory. Table B.6 shows the percentage of time spent on ANNS in tome-1 pre-training for different reader architectures. The relative overhead of search becomes smaller with reader size, and ANNS overhead in particular becomes negligible for BERT-Large and up. We did not measure standard CPU ANNS overhead, but it should be comparable to or faster than our ANNS numbers.
Table B.6: Proportion of time spent on ANNS for the tome-1 pre-training setting.
Model Dot-product ANNS
BERT-Base 0.79 0.22
BERT-Large 0.48 0.07
T5-11B Encoder 0.17 0.02
Hyperparameters For ANNS in TOMEBlocks we take top-2 search results from each local memory
shard, and apply top-128 over the retrieved results. For ANNS in the entity prediction layer we take top-32
search results from each local shard, and aggregate across shards without applying an additional top-K
operation.
B.4 Retrieval examples
Table B.7: tome-2 retrievals for the second HoVer dev sample. We show top-1 retrieval results for the first (→1) memory attention layer for passage mentions "the novel", "the movie" and "the album". Memory mentions are in brackets. We can see that the model can retrieve relevant mentions for non-named passage mentions, and generally understands it is looking for mentions related to music. However, while the best retrieval for "album" is from a passage that mentions sampling The Shining, it is quite far removed and it is likely the retrieval is not sufficiently accurate here.
Claim: Stephen King wrote the novel that the movie directed by Stanley Kubrick that was sampled in the album "Where Blood and Fire Bring Rest" was based on. Label: TRUE
the novel →1 Music Video. The video is a homage to Stanley Kubrick's 1980 film The Shining based on the [Stephen King] novel ...
the movie →1 Music Video. The video is a homage to Stanley Kubrick's 1980 film The Shining based on the [Stephen King] novel ...
the album →1 Where Blood and Fire Bring Rest is the third full-length album released by [metalcore] band ZAO. It was the first album to feature vocalist Dan Weyandt after the departure of Shawn Jonas along with new bassists/guitarists, Russ Cogdell and Brett Detar. The album contains a sample from the film The Shining ...