Inductive Biases for Data- and Parameter-Efficient
Transfer Learning
by
Mozhdeh Gheini
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2025
Copyright 2025 Mozhdeh Gheini
To the beautiful Los Angeles, as a thank you for who I have become,
and to the darkest, most uncertain times, as a reminder of change.
Acknowledgments
Writing the Acknowledgments of a dissertation truly feels like the dessert after a meal—rounding
things out and making it, if I may attempt a bad pun, a complete “dessertation”. I am writing this
at a time that also marks the end of eight years that I have been lucky to call Los Angeles home.
Los Angeles was the first place I got to call home in the States, and looking back, I am speechless
at how fortunate I have been to meet and connect with the people I have, who made everything,
including this PhD, possible. While I am grateful to all beyond words, I am going to try.
My advisor, Jonathan May, took as big a bet on me as someone possibly could. The first
time I emailed Jon was in the summer of 2017, right after finishing my first semester of Master’s
at USC. In an ocean of uncertain thoughts, I was certain about one thing: I wanted to work on
natural language processing (NLP). Jon was teaching an NLP course that fall, which I planned
to take. But I felt I was letting days slip away without already working on an NLP research
project. So, I emailed Jon. Having heard about the fate of most emails in the unfathomably busy
academia, I expected no response. But Jon did respond. His response was filled with the same
support and reassurance I came to know more and more every day that I learned from and worked
with him. Jon’s superpower is to advise sympathetically and non-judgmentally. I thrived not only
academically, but in every other imaginable way, because of Jon. So, it comes as a surprise to no
one when I always talk about Jon as a friend and not an advisor.
I also benefited immensely from discussions with Xiang Ren and Xuezhe “Max” Ma that led
to publications included in this dissertation, and I am indebted to both. To this day, I am amazed
at Xiang’s ability to envision all the ways an idea can branch from early on and communicate
how it fits within a bigger puzzle. When Max joined ISI, he was working on a new optimization
algorithm and thinking of architectural modifications. He was changing things I had never dared
touch. His support in answering my amateur questions and patience with my often slower pace
made me not only comfortable with, but excited about diving deep into code that is often taken for
granted. Although I know I will never be as skilled as him, Max has inspired me to take on difficult
problems forever.
Besides Jon, Xiang, and Max, I also was fortunate to have Sven Koenig, Shri Narayanan,
Emilio Ferrara, Khalil Iskarous, and Swabha Swayamdipta on my committee. Their generosity
with their time, feedback, and insights has made me a better researcher.
Throughout my PhD, I was able to work on problems that I found engaging thanks to funding through DARPA LORELEI (go team ELISA!), IARPA MATERIAL (go team SARAL!), and
DARPA LwLL (go team CORAL!) programs, which I am grateful for. Before starting a PhD and
during my Master’s, Sheila Tejada and Mayank Kejriwal were the first to trust me with jobs that
paid: as a course producer for the AI course and as a student worker at ISI. They also provided
recommendation letters for my PhD applications. Their support was crucial in shaping my path
early on.
Navigating everything in the PhD was not only easier, but simply enjoyable because of my
lab mates and peers: Thamme Gowda, Xusen Yin, Nada Aldarrab, Meryem M'hamdi, Kushal
Chawla, Alex Spangher, Justin Cho, Katy Felkner, Jake Bremerman, Yanze Wang, Dhananjay
Ashok, Zhejian Zhou, Linghao Jin, Shushan Arakelyan, Brihi Joshi, Lee Kezar, Jaspreet Ranjit,
Johnny Wei, Qinyuan Ye, Pei Zhou, and all other USC/ISI NLP students. I thank them all for the
fabulous and fun conversations that we have had. They made sure I never felt alone in this journey.
USC NLP and ISI have been remarkably nurturing environments for me to grow. Robin Jia,
Jesse Thomason, Dani Yogatama, and Jieyu Zhao along with Jon, Xiang, Max, and Swabha, whom
I already mentioned, have created a uniquely diverse, strong, and supportive NLP research community at USC. I have learned from all and cannot imagine growing the way I have without the
team culture that they have created.
I was able to focus without logistical distractions thanks to the support from the administrative
and operations staff at USC and ISI: Karen Rawlins, Peter Zamar, Alma Nava, Amy Feng, Magali
Gruet, Jessica Madrigal, Melissa Snearl-Smith, Lizsl De Leon, Asiroh Cham, Ellecia Williams,
Felanté Charlemagne, and Lisa Avalos, who always helped resolve non-research matters smoothly.
I spent three wonderful summers as an intern at Apple, which heavily shaped my professional
career. I appreciate the unconditional support from my managers Qin Gao and Theo Rekatsinas
and the fruitful collaborations with Tatiana Likhomanenko, Hendra Setiawan, Matthias Sperber,
Sarthak Garg, and Omar Attia.
When I arrived in the United States, Betty Jaferi, Afshin Ashfaei, and their at-the-time four- and six-year-old sons, Suren and Shervin, welcomed me into their home. Little did I know then
that they would become like family, though their immense kindness should have been a giveaway.
They were there for countless firsts: my first burger in the US was with them, they were the first
people I called when I scratched another car with my first car in LA, and many more moments that
are all now cherished memories because of them.
My friends: Nazanin, Mehrnoosh, Pegah, Nazgol, Negar M., Sina H., Sina A., Omid, Hana,
Majid, Negar G., Roohy, Parima, and Saba, make me a happier and better person in so many ways.
I got through many seemingly endless tasks and days solely because I was looking forward to plans
to hang out with them.
I never understood how my parents, Minoo and Reza, were able to dream so big and sacrifice
so much for me and my sister. I am here because of their unwavering support and never-ending
encouragement to be ambitious and not afraid of the unknown. My sister, Mahya, makes me laugh
and puts up with me being the know-it-all big sister. Mahya might be the youngest in our family,
but she has taken care of all of us the most.
Finally, Aref, my spouse, is my why. He makes things meaningful and worthwhile. He is the
bright color in dim gray times. He is the energy and patience when I have none of my own left.
He is the proof that no decision could have been wrong, because it led to meeting him. He is my
partner in crime. He is my rock. He is my everything.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Thesis Statement and Contributions . . . . . . . . . . . . . . . . . . . 2
1.1.1 On Data Efficiency. . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 On Parameter Efficiency . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 On Computation Efficiency . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Efficiency as a Forethought . . . . . . . . . . . . . . . . . . . . 5
1.2 Past and Future . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Data Efficiency: A Universal Parent for Low-Resource Machine Translation . . 7
2.1 Transfer Learning for Low-Resource MT . . . . . . . . . . . . . . . . . 9
2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Data and NMT System . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Chapter Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . 20
Chapter 3: Parameter Efficiency: Exclusive Cross-Attention Transfer for Translation . . . 22
3.1 Cross-Attention Fine-Tuning for MT . . . . . . . . . . . . . . . . . . . 25
3.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 26
3.1.2 Analysis Setup . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Data and Model Details . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.1 Cross-attention’s Power and Importance . . . . . . . . . . . . . . . 30
3.3.2 Learned Representations Properties . . . . . . . . . . . . . . . . . 32
3.4 Utilities of Aligned Embeddings. . . . . . . . . . . . . . . . . . . . . 34
3.4.1 Mitigating Forgetting . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Zero-Shot Translation . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Chapter Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . 37
Chapter 4: Computation Efficiency: Transferring from Pre-trained MEGA . . . . . . . . 39
4.1 MEGA Background . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Experiments and Results. . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Small-Scale Pre-Training . . . . . . . . . . . . . . . . . . . . . 43
4.2.1.1 Pre-Training Details . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.1.2 Fine-Tuning Details . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1.3 Transfer Results . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 Large-Scale Pre-Training . . . . . . . . . . . . . . . . . . . . . 46
4.2.2.1 Pre-Training Details . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2.2 Fine-Tuning Details . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2.3 Transfer Results . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Chapter Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 5: Planning for Efficiency: Meta-Learning for Parameter-Efficient Fine-Tuning . . 50
5.1 Meta-Learning Background . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Priming for PE Fine-Tuning through Meta-Learning . . . . . . . . . . . . . 53
5.2.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.2 Priming Algorithm. . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3.1 Data Details . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 57
5.3.3 Baselines and Method Evaluation Settings . . . . . . . . . . . . . . 57
5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4.1 Ablation 1: Substitute MAML Inner Loop . . . . . . . . . . . . . . 59
5.4.2 Ablation 2: Number of Inner Steps . . . . . . . . . . . . . . . . . 61
5.5 Chapter Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 6: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.1 Low-Resource NMT . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Model Restricting & Architecture Understanding . . . . . . . . . . . . . . 65
6.3 Cross-Lingual Embeddings. . . . . . . . . . . . . . . . . . . . . . . 65
6.4 MEGA Pre-Training . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.5 Parameter-Efficient Fine-Tuning. . . . . . . . . . . . . . . . . . . . . 66
6.6 Meta-Learning for Parameter-Efficient Fine-Tuning . . . . . . . . . . . . . 67
Chapter 7: Takeaways and Future Directions . . . . . . . . . . . . . . . . . . . 69
7.1 Inductive Biases Elsewhere. . . . . . . . . . . . . . . . . . . . . . . 70
7.2 Inductive Biases or Scaling? . . . . . . . . . . . . . . . . . . . . . . 71
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
List of Tables
2.1 Detailed data sizes of the 19 LORELEI languages used to create the 2M-sentence
polyglot–English corpus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Training corpus information and test scores over 5 child languages. To simulate a
low-resource setting, we randomly sample a subset of the corpus available for each
language. Determined by paired bootstrap resampling (Koehn, 2004) with 100
bootstraps, the polyglot models (with polyglot–English data and polyglot-derived
vocabulary) are better than their French counterparts with 95% statistical significance in all cases (underlined) but Romanian (which is close to French), where
they are better with 75% statistical significance. . . . . . . . . . . . . . . . . . . . 14
2.3 Related-language parents trained for each language (French is close to Romanian,
so we do not include it in this table as well as Table 2.2). In each case, we provide the statistical significance of the superior parent between 2M-sentence polyglot and related-language parents (underlined) in parentheses. The 11M-sentence polyglot model, which comes out on top in all cases, is shown in bold. Results
summarize problems with training a parent on a related language that our method
mitigates. Related-language parent data is often in short supply (Hindi, Finnish),
and never in as large quantities as the 11M-sentence polyglot parent, which yields
equivalent or better results than all related parent models, without the cost of identifying and assembling data, and then building a specific parent model for a new
child language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Romanian analysis experiments. Fr′
refers to (2M) French replaced with non-Latin
characters to avoid sharing any subword information. Whenever training data is
concatenated to build a multilingual model, we use +, and whenever we transfer a
model and further fine-tune it, we use →. . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Comparing the effects of building subword vocabulary from a polyglot corpus and
using the polyglot corpus to train the parent model. In most cases the effects are
additive. Row 2 has the same shatter rate as row 3, as it uses the same vocabulary.
For shatter rate, lower is better. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Example sentences from the Finnish test set. The system transferred from French
parent with French subwords completely misses ‘congestion’ and ‘fuel reserves’
due to shattering ‘ruuhkien’ and ‘polttoainevarannot’, respectively. . . . . . . . . . 19
2.7 ‘Philippines’ in four languages romanized and their universal vocabulary segmentations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Data sources and statistics for each of the child language pairs. . . . . . . . . . . . 29
3.2 BLEU scores for each of the five experiments across six language pairs. Bold
numbers indicate the top two scoring approaches. Percentages in parentheses
next to each fine-tuning strategy show the fraction of parameters that had to be updated
and hence stored as new values for future use. Numbers in parentheses next to
{src,tgt}+xattn scores show the difference from {src,tgt}+body. . . . . . . . 31
3.3 Performance of zero-shot systems for three language pairs. De–Es is evaluated on
newstest2013 test set. Ro–Es and Ro–De are evaluated on respective TED talks
corpus test sets (Qi et al., 2018). . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Sampled German words and their equivalents based on the embeddings learned
by each of the models. The correct translations are highlighted. Each pair was
manually checked for correctness using an automatic translator. . . . . . . . . . . . 38
4.1 Comparison of final perplexity across pre-trained models. While eBART and
eMEGABART achieve similar performance, omitting EMA in MEGA-SkipEMA
results in a higher (worse) perplexity. . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Data sources used for fine-tuning eBART and eMEGABART. . . . . . . . . . . . 45
4.3 BLEU scores for each of the three fine-tuning settings across five language pairs;
newstest on the top, FLORES+ on the bottom. Bold numbers indicate the top two
scoring approaches. MEGA-based models consistently place at the top. . . . . . . . 46
4.4 BLEU scores for mBART and MEGAmBART fine-tuning experiments. Bold
numbers indicate the top scoring pre-trained model. While MEGAmBART is only
trained for 76K steps compared to 500K steps of mBART, it is able to perform
competitively. This, again, demonstrates that MEGAmBART is more compute-efficient. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Entity-level micro F1 under each of the fine-tuning settings for NER across six languages. Bold numbers indicate top-scoring methods in each category. Percentages
next to each setting are the fraction of parameters that are updated (all AT settings
have the same percentage). Priming as described in this work is most effective
in improving PE fine-tuning performance and closing the gap with Full FT. All
priming experiments are run twice (including the priming stage), and we report the
average score over two runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
List of Figures
2.1 Comparison between the ETA of systems using universal (T1) and customized (T2)
parents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Overview of our transfer learning experiments, depicting (a) training from scratch,
(b) conventional fine-tuning (src+body), (c) fine-tuning cross-attention (src+xattn),
(d) fine-tuning new vocabulary (src), (e) fine-tuning cross-attention when transferring target language (tgt+xattn), (f) transfer learning with updating cross-attention from scratch (src+randxattn). Dotted components are initialized randomly, while solid lines are initialized with parameters from a pre-trained model.
Shaded, underlined components are fine-tuned, while other components are frozen. 23
3.2 BLEU scores across different transfer settings using mBART as parent. Exclusive
fine-tuning of embeddings (embed) is not effective at all due to lack of translation
knowledge in the cross-attention layers. . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Accuracy of bilingual dictionaries induced through embeddings learned under tgt+body
and tgt+xattn settings. De and Es effectively get aligned with En under tgt+xattn
(left). As they are both aligned to En, we can also indirectly obtain a De–Es dictionary (right). Similar practice completely fails under tgt+body. . . . . . . . . . 34
3.4 Performance on the original language pair after transfer. The original Fr–En parent
model scores 35.0 BLEU on the Fr–En test set. {src,tgt}+xattn outperforms
{src,tgt}+body on the parent task. . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 How much influence the updated representation of token xt gets from surrounding tokens under the governance of the attention mechanism. Edge labels indicate
the attention score between xt and the respective token. The figure assumes bidirectional attention. In an autoregressive setting, the token only receives influence
from tokens to its left. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 How much influence the updated representation of token xt gets from surrounding tokens under the governance of EMA. Edge labels indicate the weight of the
respective token. Lighter edges signify more discounted influence. The figure assumes a bidirectional setting. In an autoregressive setting, the token only receives
influence from the recent tokens to its left. . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Perplexity curves of eBART and eMEGABART (labeled as eMEGABART in the
graph’s legend). eMEGABART converges significantly faster than its Transformer-based counterpart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Fr–En Loss (on the left vertical axis) and test BLEU (on the right vertical axis)
curves of MEGAmBART (labeled as MEGAmBART in the graph’s legend) and
mBART against epoch. MEGAmBART progresses faster until mBART catches up. 48
5.1 Transfer learning for NLP pipeline; the shaded block is our contribution. Conventional transfer practice (dashed arrows) does not differentiate between full fine-tuning and parameter-efficient fine-tuning in any way. This work proposes a meta-learning solution to further modify and prime a pretrained model’s parameters to
specifically target parameter-efficient fine-tuning. . . . . . . . . . . . . . . . . . . 51
5.2 Overall model architecture used in our experiments. θa comprises a single adapter
layer directly after the pretrained model. . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Comparison between different priming strategies for downstream full fine-tuning.
In this case, as opposed to parameter-efficient fine-tuning, it is usually beneficial
to use full fine-tuning in the inner loop. . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 Know Where You’re Going: Best performances appear on the diagonal, where a
homogeneous priming strategy is devised before downstream fine-tuning. Directly
comparable blocks are linked using arrows. . . . . . . . . . . . . . . . . . . . . . 60
5.5 First-order approximation of meta-gradient update. Illustration courtesy of Wild
(2020). After three steps of inner updates (θ →→→ φ), θ is updated by the gradient of the query set loss evaluated at φ (the green dashed arrow). . . . . . . . . . . 61
7.1 A schematic of an ever-repeating pattern: System 1 with more incorporated knowledge only performs well over a limited period of time, whereas System 2 performance scales consistently. The gray curve follows the ideal trajectory. . . . . . . . 71
Abstract
Data- and resource-intensive pre-training and fine-tuning applied upon Transformer-based models
is the dominant paradigm at the forefront of rapid advancements in natural language processing,
human language technologies, and most notably, large language models. Such reliance on massive
amounts of data, computation, and energy, while effective and impressive from a performance-only
perspective, can hinder open, nonexclusive, and sustainable development of these technologies. In
this work, we study how certain inductive biases can be devised to adjust current natural language
methods under resource-constrained scenarios and provide insights into why the proposed inductive biases are successful in such cases.
Specifically, this dissertation presents four research directions on data and parameter efficiency
of fine-tuning and transfer learning in natural language processing: (1) a universal regimen that
creates a single pre-trained checkpoint suitable for machine translation transfer to practically any
language pair and eliminates the need for ad hoc pre-training; (2) an architecture-guided parameter-efficient fine-tuning method that performs competitively with full fine-tuning while exclusively
updating cross-attention parameters; (3) an analysis of MEGA, a recently introduced augmentation
of the Transformer architecture to incorporate explicit recency bias, through the lens of transfer
learning; and (4) a meta-learning algorithm to prime pre-trained models for specific fine-tuning
strategies.
Combined with ablations that show how they are effective and analyses that demonstrate their
generalizability, these directions are meant to serve as tools for resource-efficient transfer learning
for natural language processing. Additionally, we will situate this dissertation’s contributions in the
current climate of scaling efforts in natural language processing to discuss possible paths forward
to evolve this research.
Chapter 1
Introduction
“One’s own troubles sharpen one’s eyes sometimes.”
Agatha Christie, The Murder at the Vicarage
The advent of the Transformer architecture (Vaswani et al., 2017) coincided with the start of the
research journey that led to this dissertation. A key design choice in the Transformer is to dispense
with the recurrence in the previously-dominant architectures in natural language processing (NLP),
recurrent neural networks (RNNs) (Rumelhart et al., 1986; Hochreiter and Schmidhuber, 1997;
Sutskever et al., 2014), and to solely rely on the attention mechanism (Bahdanau et al., 2015; Luong
et al., 2015). With attention, the representations of the input tokens can be computed in parallel
with a constant number of sequential operations, leading to significant gains in terms of training
cost on the hardware as well as task performance relative to recurrence-reliant architectures.
Concurrently, evidence increasingly showed that transfer learning (Pan and Yang, 2010) applied to deep learning representations (Bengio, 2012) is a powerful tool in NLP, as well as other
artificial intelligence (AI) areas. Transfer learning is embodied by means of unsupervised (Dai and
Le, 2015) or supervised (Zoph et al., 2016) pre-training, and subsequent fine-tuning: continuing
training on the desired downstream task using the initializations from the pre-trained parameters.
The effectiveness and relative efficiency of the Transformer streamlined the scaling of this process,
in particular the pre-training stage. This resulted in an ever-growing squad of pre-trained language
models (Radford et al., 2018a; Devlin et al., 2019; Radford et al., 2018b; Lewis et al., 2020; Brown
et al., 2020) that underlie the rapid growth in the performance of present-day NLP technologies,
e.g., GPT-4 (OpenAI, 2023).
The widespread success of these models, however, can easily distract from the sheer amount
of data and computational resources they rely on. Even on the more modest end of the spectrum,
these models rely on upwards of two trillion tokens of data, a billion parameters (Groeneveld et al.,
2024), and a substantial amount of power for training, e.g., 159 MWh for a one-billion parameter
model, equivalent to 9 tonnes of CO2eq (Luccioni et al., 2024)1—all privileges that low-resource languages, by definition, do not enjoy (Singh, 2008).
In stark contrast to this trend, human children are argued to be able to resolve ambiguities
and generalize to new sentences with seemingly insufficient data (i.e., poverty of the stimulus
(Chomsky, 1980)). While this argument has been at the center of fervent debate in cognitive
science and linguistics (Pearl, 2022), recent works in computational linguistics suggest that when
trained on textual data comparable to children’s linguistic input, neural network architectures for
NLP need specialized biases to generalize (Yedetore et al., 2023). In a similar spirit, the work in
this thesis focuses on the effectiveness of specialized biases for efficient transfer learning.
1.1 Thesis Statement and Contributions
The underlying theme behind the body of work presented in this thesis is that inductive biases
are indispensable for optimal transfer performance in resource-constrained scenarios, which
are by no means rare. To the best of our knowledge, the earliest phrasing of biases that lead to
effective generalization as “inductive” biases can be attributed to Mitchell (1980):
Learning involves the ability to generalize from past experience in order to deal with
new situations that are “related to” this experience. The inductive leap [emphasis
added] needed to deal with new situations seems to be possible only under certain
biases [emphasis added] for choosing one generalization of the situation over another.
1This equates to the greenhouse gas emissions from 2.1 gasoline-powered passenger vehicles driven for one year
as calculated by epa.gov.
Take the following problem as an example.2 If asked to connect two given points with a line, humans are significantly more likely to draw the straight segment between them than an arbitrary, winding curve.
This demonstrates an inductive bias in us—a preference for the shortest path—that effectively
limits the search space and pushes us towards a specific solution.
We posit that, similarly, inductive biases help in resource-constrained scenarios by restricting
the search space. This thesis makes the following empirical and technical contributions to support
its statement on data, parameter count, and computational resources fronts.
1.1.1 On Data Efficiency
While transfer learning from a high-resource language pair “parent”3
is an effective way to improve
neural machine translation quality for a low-resource language pair “child”,4 previous approaches
build a custom parent model or update an existing parent model’s vocabulary for each child language pair they wish to train, in an effort to align parent and child vocabularies (Zoph et al., 2016).
As parent models are often large, this is not a practical solution due to the overhead of the customization needed for each new child, which results in inefficient use of data and slow and costly
experimental cycles.
2This example is courtesy of Samira Abnar.
3We use ‘pre-trained’, ‘upstream’, and ‘parent’ model interchangeably to refer to the model that serves as the
initialization in transfer.
4We use ‘fine-tuned’, ‘downstream’, and ‘child’ model interchangeably to refer to the model that is the product of
transfer.
In Chapter 2, we present a “universal” pre-trained neural parent model with constant vocabulary
that can be used as a starting point for training practically any new low-resource language to a fixed
target language. Our approach, which leverages orthography unification and a broad-coverage
approach to subword identification, generalizes well to several languages from a variety of families,
yielding translation systems that are built more quickly and at higher quality than those built using
other methods.5
1.1.2 On Parameter Efficiency
In Chapter 3, we propose an architecture-guided parameter-efficient fine-tuning approach. Specifically, we study the power of cross-attention in the Transformer architecture within the context of
transfer learning for machine translation. We conduct a series of experiments through fine-tuning
a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning
all parameters (i.e., the entire translation model). We provide insights into why this is the case and
observe that limiting fine-tuning in this manner yields cross-lingually aligned embeddings. The
implications of this finding include a mitigation of catastrophic forgetting, the potential for zero-shot translation, and the ability to extend machine translation models to several new language pairs
with reduced parameter storage overhead.6
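As a concrete illustration of what exclusive cross-attention fine-tuning involves, the sketch below freezes every parameter of an encoder-decoder model except those in cross-attention modules. This is a minimal sketch rather than the exact recipe of Chapter 3; it assumes a PyTorch model whose cross-attention submodules carry "encoder_attn" in their parameter names, and that marker should be adjusted to the architecture at hand.

```python
import torch

def freeze_all_but_cross_attention(model: torch.nn.Module, marker: str = "encoder_attn"):
    """Freeze everything except parameters of cross-attention modules.

    `marker` is an assumed naming convention for the decoder's cross-attention
    submodules; adjust it to match the actual model definition.
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = marker in name
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    # The ratio below is the fraction of parameters that must be updated and stored.
    print(f"updating {trainable / total:.1%} of parameters")

# Only the unfrozen parameters are then handed to the optimizer:
# optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```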
1.1.3 On Computation Efficiency
While the Transformer architecture inherently computes token representation in a parallel manner
in each layer, it still has a quadratic time complexity with respect to sequence length n, as each
token in the sequence attends to every other token in the sequence. This makes it undesirable for
sequence modeling tasks with long sequences with complex long-range dependencies, e.g., protein
sequence modeling (Choromanski et al., 2021). There have been several lines of work attempting to reduce
5Chapter 2 is also available as a stand-alone paper accessible at Gheini and May (2019).
6Chapter 3 is also available as a stand-alone paper accessible at Gheini et al. (2021).
this complexity (Kitaev et al., 2020; Beltagy et al., 2020; Wang et al., 2020; Choromanski et al.,
2021; Gu et al., 2022; Ma et al., 2023; Gu and Dao, 2023).
In Chapter 4, we focus on the MEGA architecture (Ma et al., 2023), which achieves this feat
by incorporating recency bias in the form of exponential moving average (EMA) across the time
dimension and managing to avoid recurrence at the same time. By pre-training two language
models at different scales and fine-tuning them for machine translation across several language
pairs, we demonstrate the benefits of EMA specifically during transfer learning, which include faster convergence and, consequently, compute efficiency.
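For intuition about the recency bias that EMA contributes, the following sketch applies a plain single-dimensional exponential moving average over the time dimension of a sequence of token representations. MEGA's actual component is a multi-dimensional damped EMA computed without an explicit sequential loop, so this is only a simplified, illustrative version.

```python
import torch

def ema_over_time(x: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Simplified EMA across time: h_t = alpha * x_t + (1 - alpha) * h_{t-1}.

    x has shape (seq_len, dim). A token k steps in the past contributes to h_t
    with weight alpha * (1 - alpha)**k, i.e., exponentially discounted influence.
    """
    h = torch.zeros_like(x)
    prev = torch.zeros(x.shape[-1], dtype=x.dtype)
    for t in range(x.shape[0]):
        prev = alpha * x[t] + (1 - alpha) * prev
        h[t] = prev
    return h
```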
1.1.4 Efficiency as a Forethought
Of all the practices discussed for efficient transfer learning, parameter-efficient fine-tuning is
unique in one aspect: while we use different fine-tuning strategies, we do not account for that
difference in any way during pre-training. In Chapter 5 we examine if and how such knowledge
of the downstream fine-tuning approach calls for complementary measures after pre-training and
before fine-tuning.
We show that taking the ultimate choice of fine-tuning into consideration boosts the performance of parameter-efficient fine-tuning. By relying on optimization-based meta-learning using
MAML (Finn et al., 2017) with certain modifications for our distinct purpose, we prime the pretrained model specifically for parameter-efficient fine-tuning, resulting in gains of up to 4.96 points
on cross-lingual NER fine-tuning. Our ablation settings and analyses further reveal that the specific
approach we take to meta-learning is crucial for the attained gains, i.e., it is not meta-learning in
general, but how we formulate the meta-learning setup that leads to observed gains.7
7Chapter 5 is also available as a stand-alone paper accessible at Gheini et al. (2023).
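To give a rough picture of what priming for a specific fine-tuning strategy can look like, the sketch below follows a first-order, MAML-style loop whose inner step updates only a small parameter-efficient module (here an adapter) while the outer step updates the pretrained body. Names such as `adapter` and `loss_fn`, the single inner step, and the plain SGD optimizers are illustrative assumptions; this is not the exact algorithm of Chapter 5.

```python
import copy
import torch

def prime_step(model, adapter, support_batch, query_batch, loss_fn,
               inner_lr=1e-3, outer_lr=1e-4):
    """One first-order meta-update that primes `model` for adapter-only fine-tuning."""
    # Each episode starts the adapter from the same state, so back it up first.
    adapter_backup = copy.deepcopy(adapter.state_dict())

    # Inner loop: parameter-efficient adaptation (only the adapter is updated).
    inner_opt = torch.optim.SGD(adapter.parameters(), lr=inner_lr)
    inner_opt.zero_grad()
    loss_fn(model, adapter, support_batch).backward()
    inner_opt.step()

    # Outer loop: with the first-order approximation, the query-set loss after
    # adaptation is differentiated directly with respect to the body parameters.
    outer_opt = torch.optim.SGD(model.parameters(), lr=outer_lr)
    outer_opt.zero_grad()
    loss_fn(model, adapter, query_batch).backward()
    outer_opt.step()

    # Restore the adapter before the next episode.
    adapter.load_state_dict(adapter_backup)
```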
1.2 Past and Future
We dedicate Chapter 6 and Chapter 7 to discussing related pieces of work and future directions,
respectively. We contextualize how we think of this work in the larger backdrop of natural language
processing progression and how we imagine it to grow in the future.
Chapter 2
Data Efficiency:
A Universal Parent for Low-Resource Machine Translation
As with other central tasks in NLP, in machine translation (MT), transfer learning from a single
large “parent” language pair to many low-resource “child” language pairs leads to higher quality
results, particularly when the target language does not change (Zoph et al., 2016). While this
approach has the potential to enable rapid construction of good quality machine translation systems
and efficient experimentation across multiple language pairs, in practice, building new systems
requires building a new parent model for each child language pair (Zoph et al., 2016; Nguyen and
Chiang, 2017; Kocmi and Bojar, 2018). The time and computational cost required for building a
translation model for a low-resource language pair is thus equal to the sum of the costs for building
from both parent and child data sets. As a real example of a situation where this overhead can
be harmful, Lewis et al. (2011), Castillo (2016), and Christianson et al. (2018) note the potential
of rapidly-built MT to reduce preventable casualties shortly after humanitarian crises, and under
humanitarian assistance scenarios. Under such circumstances, the sooner a response team can find
out about the immediate needs of the locals through an MT system, the better the chance there is of being able to provide useful assistance.
Transfer learning approaches that require time-consuming parent model training would not be
useful in these scenarios. Rather than spend days or weeks to build a new parent model before
transferring to the intended language pair and fine-tuning model weights, we would prefer to have
Figure 2.1: Comparison between the ETA of systems using universal (T1) and customized (T2)
parents.
a single model at the ready, capable of being quickly fine-tuned toward any new language pair
and deployed within hours. Training new parent models also wastes resources; the parent models
discussed in this work use the modest Transformer base architecture (Vaswani et al., 2017) but still
generate 26 lbs. of CO2
every time they are trained (Strubell et al., 2019). We seek to avoid this
waste.
To that aim, in this chapter, we present our method for pre-training a universal parent model as a
solution. Our wide-coverage training regimen makes use of romanization (Hermjakob et al., 2018)
to avoid orthography issues by incorporating a universal subword vocabulary (Wu et al., 2016),
thus yielding a pre-trained parent that can be used out-of-the-box, and is suitable for translating
from nearly any language into English. This enables rapid generation of neural translation models
for new languages, without requiring any retraining of the parent or updating its vocabulary and
thus nets a 5-fold speedup and reduction of resources over previous approaches.
Figure 2.1 illustrates how advantageous a universal parent is over a customized parent. Each
parent first needs to be pre-trained (requiring TP units of time) and then fine-tuned (which takes less
time). The universal parent can be pre-trained well beforehand, and fine-tuned as soon as a crisis
hits, and data is available for the child language. However, even under the optimistic assumption
that customized parent pre-training can immediately start once we know what the child language
is, customized parent fine-tuning cannot start any sooner than T0 + TP. So using a customized
parent would cost the equivalent of TP additional units of time and resources (T1 − T0 = T2 − (T0 + TP)).
Our core contributions are:
• We propose a universal parent model that includes a pre-learned and fixed vocabulary, fitted
for fine-tuning without any customization needed (§2.1).
• We extensively evaluate our model’s improved quality on five low-resource language pairs.
We ablate different factors (i.e., size of training data and relatedness between parent and
child languages) to examine their effect on the final translation model (§2.2).
• We analyze the reasons that transfer learning helps neural machine translation (NMT) and
determine that transfer does in fact ensure the models are “ready-to-translate”, and are not
simply improved by language model enhancements (§2.3).
• We analyze lexical issues that can arise in transfer learning, specifically those from naively
building a subword tokenization model such as a byte-pair encoding (BPE) (Sennrich et al.,
2016b) for a parent language before a child language is known. We show how our universal
many-language vocabulary avoids these issues (§2.3).
2.1 Transfer Learning for Low-Resource MT
Parameter transfer as described by Zoph et al. (2016), which initializes a child model with the
trained parameters of a parent model and then fine-tunes on the child language pair, is trivial and
straightforward for all model parameters except for embeddings, which are tied to vocabulary
types. Concretely, for non-embedding parameters we provide the following definition:
Formal Definition. Consider a model f_θ trained on the parent dataset, where each training instance (x_{s_p}, y_{t_p}) is a pair of source and target sentences in the parent language pair s_p–t_p. Then fine-tuning is the practice of taking the model's parameters θ from the model f_θ to initialize another model g_θ. g_θ is then further optimized on a dataset of (x_{s_c}, y_{t_c}) instances in the child language pair s_c–t_c until it converges to g_φ. We assume either s_c = s_p or t_c = t_p (i.e., child and parent language pairs share one of the source or target sides).
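Operationally, this definition amounts to initializing the child model from the parent's parameters and continuing training on child data. The following is a minimal sketch under the assumption of PyTorch-style checkpoints; the function names, the optimizer, and the externally supplied `train_step` are illustrative, not the actual training code used in this chapter.

```python
import torch

def fine_tune(child_model, parent_checkpoint_path, child_dataloader, train_step, epochs=1):
    """Initialize g_theta from the parent model f_theta, then optimize it on child data."""
    parent_state = torch.load(parent_checkpoint_path, map_location="cpu")
    own_state = child_model.state_dict()
    # Copy every parent parameter whose name and shape match the child model;
    # with a shared (universal) vocabulary, the embedding tables match as well.
    compatible = {name: tensor for name, tensor in parent_state.items()
                  if name in own_state and tensor.shape == own_state[name].shape}
    child_model.load_state_dict(compatible, strict=False)

    optimizer = torch.optim.Adam(child_model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for batch in child_dataloader:  # (x_sc, y_tc) pairs in the child language pair
            optimizer.zero_grad()
            loss = train_step(child_model, batch)
            loss.backward()
            optimizer.step()
    return child_model  # the fine-tuned model g_phi
```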
With respect to embedding parameters, however, parent and child models translate different
languages. So, under normal circumstances, their vocabularies are, in fact, quite different. Hence,
it is not immediately clear how to map vocabulary embeddings from a parent model’s source
vocabulary to the child model’s source vocabulary. While in practice an arbitrary or frequency-based reassignment is done, were the parent and child vocabularies the same, we would avoid
the reassignment problem entirely. While this can be done using character models, we would
like the vocabulary to contain more “semantic heft” than characters can provide. The general
workaround (Zoph et al., 2016; Nguyen and Chiang, 2017; Kocmi and Bojar, 2018) biases the
parent vocabulary to be formed from the union of the parent and child corpora. This leads to the
undesirable requirement that a new parent model be built for each child model. When time is no
object and a new parent model can be constructed for each child model, this is a feasible, though
time- and resource-intensive approach (Figure 2.1).
Previously explored alternatives to pre-biasing the parent vocabulary update the vocabulary
of the parent model at the time of fine-tuning. This is accomplished by either simply adding the
child vocabulary to the parent’s (Neubig and Hu, 2018; Lakew et al., 2018) or by aligning parent
and child monolingual embedding spaces via a cross-lingual projection from child embedding
space to parent embedding space (Kim et al., 2019). However, the former approach causes a
significant increase in the model vocabulary size while denying new vocabulary the benefit of any
parent model pre-training, and the latter approach again delays the fine-tuning process, requiring
an extra mapping step and large monolingual corpora to form the embedding spaces. Specifically,
in the latter, we need to learn monolingual child embeddings (assuming we have a large enough
monolingual corpus) and cross-lingual linear mapping between the child and parent spaces.
Our method, by contrast, a priori constructs a single universal vocabulary. The method is
simple but effective: subwords (Wu et al., 2016) are constructed out of a polyglot corpus, that is,
one containing as many languages as available. This provides both an efficient vocabulary in the
face of a new language and a fast way to have a ready-to-go model with fixed-sized vocabulary that
one can fine-tune on any child. This universal vocabulary is formed without exposure to either the
child or parent language pairs (similar to Johnson et al. (2017)).1 To make sure that this universal
vocabulary works for all languages, we unite the orthographies by using the universal romanizer
uroman (Hermjakob et al., 2018).
We furthermore use the polyglot corpus as parent model training data. We show, in §2.3, that
the effects of polyglot word pieces and a polyglot parent model are additive improvements over
current practice. This suggests that simply incorporating a universal vocabulary can lead to improvements, independent of using a polyglot parent corpus. Obtaining a universal vocabulary is also very streamlined: it only requires feeding as many monolingual corpora as are available through uroman and concatenating the results. We therefore find this feature very attractive and suggest it be considered when training any future parent models.
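A minimal sketch of this vocabulary-construction pipeline is given below. It uses the sentencepiece library as a stand-in for the subword implementation actually used (tensor2tensor's) and assumes the corpora have already been romanized with uroman; the file paths are illustrative, while the 16k vocabulary size matches the parent models described in §2.2.1.

```python
import sentencepiece as spm

# 1) Concatenate the (already uroman-ized) corpora of as many languages as are
#    available, plus their English sides, into one polyglot text file.
romanized_files = ["aka.rom.txt", "amh.rom.txt", "eng.rom.txt"]  # illustrative paths
with open("polyglot.rom.txt", "w", encoding="utf-8") as out:
    for path in romanized_files:
        with open(path, encoding="utf-8") as f:
            out.write(f.read())

# 2) Learn a fixed universal subword vocabulary once; it is never rebuilt,
#    regardless of which child language arrives later.
spm.SentencePieceTrainer.train(
    input="polyglot.rom.txt",
    model_prefix="universal16k",
    vocab_size=16000,
    model_type="bpe",
)

# 3) Any new, romanized child-language text is segmented with that same model.
sp = spm.SentencePieceProcessor(model_file="universal16k.model")
print(sp.encode("Filippiinit", out_type=str))
```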
2.2 Experiments
To demonstrate the success of our universal approach in transfer learning, we run several sets of
experiments:
1. Baselines use no transfer and are trained on low-resource child language–English parallel
data only. They may be trained quickly, but with low vocabulary coverage they do not yield high-quality
results.
2. Single-language parent models trained on only one parallel corpus, such as French–
English. That particular language pair replicates the general conditions used in Zoph et al. (2016),
ported to the Transformer architecture. While a sensible general choice for a parent, due to the
large amount of parallel data available, we may also choose to use related-language parent models
such as is done in Nguyen and Chiang (2017). We examine four other single language pair parents,
which have less parallel data available than French–English, but have sources closer to some of the
child source languages in our experiment set.
1To reinforce this point, we only experiment on child languages not included in obtaining the universal vocabulary.
3. Polyglot parent models, our main contribution, which are trained on data from many
languages, each paired with English translations. We compare the polyglot–English parents to
French–English parents as well as to the other single-language parent models. Specifically, in the
case of French–English, we experiment at two different data points (2 million & 11 million training sentences) to determine if the boost given by switching from a French–English to a polyglot–
English parent model is mitigated by increasing training data.
4. Probing conditions such as varying the choice of vocabulary or manipulating the character sets, in order to tease apart the impact of the language model, subword model, and source
contributions.
2.2.1 Data and NMT System
For the universal parent model, we use the data from 19 language pairs (all with English as the
target side): Akan, Amharic, Arabic, Bengali, Persian, Hausa, Hungarian, Indonesian, Russian,
Somali, Spanish, Swahili, Tamil, Tagalog, Turkish, Uzbek, Vietnamese, Wolof, and Yoruba. These
were prepared for the DARPA-LORELEI program (Christianson et al., 2018; Strassel and Tracey,
2016). The size of each corpus (detailed in Table 2.1) ranges from ∼27k sentences (Indonesian)
to ∼387k sentences (Spanish). When put together, this gives us a parallel polyglot corpus of
around 2 million sentences. To prepare the larger corpus for the bigger universal parent model (to
be compared with the bigger French parent model), we append the aforementioned data with the
TED multilingual corpus (excluding the child languages we want to test on) (Qi et al., 2018),2
and
Europarl German3
and Spanish4
corpora. For the French parent, we use the Giga French–English
Corpus (Callison-Burch et al., 2009).5
We experiment with and report results on 5 child languages: Romanian, Hindi, Lithuanian,
Finnish, and Estonian. For each one, we use one of WMT16,6
IITB (Kunchukuttan et al., 2017),
2 https://github.com/neulab/word-embeddings-for-nmt
3 http://www.statmt.org/europarl/v9/training/
4 http://www.statmt.org/europarl/
5 http://www.statmt.org/wmt15/translation-task.html
6 http://www.statmt.org/wmt16/translation-task.html
Parallel Corpus
Source Language    Corpus ID    No. of Sentences
Akan LDC2018E07 235,382
Amharic LDC2016E87 58,451
Arabic LDC2016E89 56,698
Bengali LDC2017E60 107,565
Persian LDC2016E93 61,372
Hausa LDC2015E70 43,370
Hungarian LDC2016E99 165,445
Indonesian LDC2017E66 26,823
Russian LDC2016E95 199,141
Somali LDC2016E91 50,393
Spanish LDC2016E97 387,216
Swahili LDC2017E64 46,686
Tamil LDC2017E70 56,888
Tagalog LDC2017E68 49,902
Turkish LDC2014E115 59,194
Uzbek LDC2016E29 106,309
Vietnamese LDC2016E103 146,535
Wolof LDC2018E09 32,902
Yoruba LDC2016E105 43,438
Table 2.1: Detailed data sizes of the 19 LORELEI languages used to create the 2M-sentence
polyglot–English corpus.
and Europarl7
corpora, as indicated in Table 2.2. We also train four customized single-language
parents with our child languages taken into consideration to compare against our polyglot parent.
Romanian is close to French, so we do not train another separate parent for it. However, we do
train:
7 http://www.statmt.org/europarl
Ro–En Hi–En Lt–En Fi–En Et–En
Corpus WMT16 IITB Europarl
Subset Size Used (Sentences) 60k 35k 43k 20k 10k
Baseline 24.9 2.4 10.9 4.2 2.9
2M French Parent 26.8 6.8 12.6 7.6 5.7
2M polyglot Parent 27.4 9.1 13.8 8.5 6.5
11M French Parent 27.5 7.1 13.8 8.0 5.8
11M polyglot Parent 28.0 9.2 14.7 9.1 7.8
Table 2.2: Training corpus information and test scores over 5 child languages. To simulate a
low-resource setting, we randomly sample a subset of the corpus available for each language. Determined by paired bootstrap resampling (Koehn, 2004) with 100 bootstraps, the polyglot models
(with polyglot–English data and polyglot-derived vocabulary) are better than their French counterparts with 95% statistical significance in all cases (underlined) but Romanian (which is close to
French), where they are better with 75% statistical significance.
• a Latvian parent for Lithuanian using the Latvian–English parallel corpus from Digital Corpus of European Parliament (Hajlaoui et al., 2014)
• an Estonian parent for Finnish using the Europarl corpus
• a Finnish parent for Estonian using the Europarl corpus
• an Urdu-Panjabi-Bengali-Marathi parent for Hindi using the PMIndia Corpora (Haddow and
Kirefu, 2020).8
All data is romanized using uroman9
(Hermjakob et al., 2018), and then universal subwords
are formed from the concatenation of the 19 languages and their English translations.
We use the tensor2tensor (Vaswani et al., 2018) implementation of the Transformer (Vaswani
et al., 2017) to train our models. We use a vocabulary size of 8k subword pieces when training
8Since each of these languages alone has a quite small corpus, we are forced to use more than one language for
Hindi to have a fairly sized parent compared to the others. They are all modern Indo-Aryan languages descended from
Sanskrit.
9uroman cannot transliterate Linear B, Cuneiform, Egyptian Hieroglyphics, or Japanese Kanji well, but otherwise
universally transliterates into a common pronunciation space.
Hi–En Lt–En Fi–En Et–En
Related Parent Language(s) Ur, Pa, Bn, Mr Lv Et Fi
Parent Corpus Size (Sentences) 107k 2M 653k 1.8M
Baseline 2.4 10.9 4.2 2.9
Related Parent 5.4 12.9 9.1 (83%) 7.7 (98%)
2M polyglot Parent 9.1 (100%) 13.8 (81%) 8.5 6.5
11M polyglot Parent 9.2 14.7 9.1 7.8
Table 2.3: Related-language parents trained for each language (French is close to Romanian, so
we do not include it in this table as well as Table 2.2). In each case, we provide the statistical
significance of the superior parent between 2M-sentence polyglot and related-language parents
(underlined) in parentheses. The 11M-sentence polyglot model, which comes out on top in all cases, is shown in bold. Results summarize problems with training a parent on a related language
that our method mitigates. Related-language parent data is often in short supply (Hindi, Finnish),
and never in as large quantities as the 11M-sentence polyglot parent, which yields equivalent or
better results than all related parent models, without the cost of identifying and assembling data,
and then building a specific parent model for a new child language.
baseline models, and a vocabulary size of 16k subword pieces when training and transferring our
parent models. Other than that, we use the default hyperparameters, and train the Transformer base
model as described in Vaswani et al. (2017), maintaining a single merged embedding vocabulary
and parameter set for both source and target languages.
2.2.2 Results
Our core results measured in BLEU are in Table 2.2 and Table 2.3. From Table 2.2, firstly, we
find that using a polyglot parent model is generally better than using the French parent model only.
Furthermore, comparing between larger parents, except for Romanian, which is close to French,
even the small polyglot parent can beat the big French parent. So large amounts of monolingual
data cannot win over universality. Also, per our intuition, using the big universal parent is superior
in all cases. Overall we are able to gain as much as a 6.8 BLEU increase over the baseline (row 1)
and 2.1 BLEU increase over vanilla transfer learning (row 2 & 4) using the universal parent.
Regarding training parents on closely-related languages, and how that compares to our method,
Table 2.3 shows that our method is indeed universal and more effective. While we tried to find
as much data as possible for related parents to make them comparable with the polyglot parent
in size, there is just not enough available data more often than not. We were often not able to
find close to two million sentences of related-language parent parallel data. Also, the related parent
model only outperforms the polyglot parent model for the most closely-related language pairs we
experimented with, Finnish (transferred from Estonian), and Estonian (transferred from Finnish).
This all nicely illustrates the benefits of our polyglot model: it is competitive in performance, there
are no limitations in terms of the amount of training data that one can use to pre-train it (as many
languages as possible), and it provides generalized (to any child language of choice) hassle-free
fine-tuning (thanks to fixed-sized available vocabulary).
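The significance levels reported in the captions of Tables 2.2 and 2.3 are determined by paired bootstrap resampling (Koehn, 2004). A minimal sketch of that test follows, assuming the sacrebleu package for corpus-level BLEU; the function and variable names are ours and purely illustrative.

```python
import random
from sacrebleu.metrics import BLEU

def paired_bootstrap(sys_a, sys_b, refs, n_bootstraps=100, seed=0):
    """Fraction of bootstrap samples on which system A scores higher than system B."""
    assert len(sys_a) == len(sys_b) == len(refs)
    rng = random.Random(seed)
    bleu = BLEU()
    ids = list(range(len(refs)))
    wins_a = 0
    for _ in range(n_bootstraps):
        # Resample test sentences with replacement and rescore both systems.
        sample = [rng.choice(ids) for _ in ids]
        score_a = bleu.corpus_score([sys_a[i] for i in sample],
                                    [[refs[i] for i in sample]]).score
        score_b = bleu.corpus_score([sys_b[i] for i in sample],
                                    [[refs[i] for i in sample]]).score
        if score_a > score_b:
            wins_a += 1
    # e.g., 0.95 corresponds to the 95% significance level reported in the captions
    return wins_a / n_bootstraps
```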
2.3 Discussion
In this section we substantiate our results by answering two questions:
1. Do common subword pieces help translation, or are improvements by both polyglot training
and transfer learning just an effect of an improved target language model or parent training
data?
2. If subword pieces do help, how do they help?
We focus on the Ro–En data and carry out ablative experiments, the results of which we report
in Table 2.4.
To answer Question 1, we replace all non-punctuation characters from the source side of the
French–English corpus with otherwise unused non-Latin characters that are thus guaranteed to not
share any subsequences with the Romanian or English parts of the vocabulary10; we label this French′ (Fr′). French′ is virtually French, only with a new character set. If the improved results
10i.e., we pick an arbitrary character map
#    Model    Subwords Used    Ro Test BLEU
1 Ro Ro 24.9
2 Fr+Ro Fr+Ro 25.5
3 Fr′+Ro Fr′+Ro 25.0
4 Fr → Ro Fr+Ro 28.1
5 poly → Ro poly 27.4
6 Fr+Ro → Ro Fr+Ro 27.8
7 Fr′+Ro → Ro Fr′+Ro 27.4
Table 2.4: Romanian analysis experiments. Fr′
refers to (2M) French replaced with non-Latin
characters to avoid sharing any subword information. Whenever training data is concatenated to
build a multilingual model, we use +, and whenever we transfer a model and further fine-tune it,
we use →.
French, since the data on the target side is the same in both cases. However, we can see that both
in multilingual training (rows 2 and 3) and transfer learning (rows 6 and 7), French′
is not able to
perform as well as French.
We also investigate the gain due to training the “slow” way, i.e., incorporating child vocabulary
knowledge in the parent by training a vocabulary on concatenated corpora (row 4), and indeed it is
slightly better than using the polyglot model and vocabulary (row 5). So with enough time at hand,
it is best to train a custom parent with knowledge of the child vocabulary. However, in the face of
an emergency, a universal parent is able to significantly close the gap, and give comparable results
in much less time. In our experiments, while the child models need 20k-50k steps to converge, the
parent needs over 120k steps. The converged parent model gets a BLEU of 28.9, but it only scores
23.4 and 26.7 after training for 20k and 50k steps, respectively.
To determine how common subwords can help (Question 2), we define and investigate a ‘shatter
rate’ heuristic. We say a word is shattered by a vocabulary if the number of segments is more
than half of the length of the unsegmented word, meaning there is at least one single character
in the resulting segmentation. We ideally want words to not be shattered. The intuition behind
finding “less shattering” desirable lies in characters having the least amount of semantic weight: a
Parent Data    Parent Vocabulary    BLEU (↑)/shatter rate (↓): Ro–En    Hi–En    Lt–En    Fi–En    Et–En
French French 26.8/.04 6.8/.1 12.6/.09 7.6/.1 5.7/.09
French polyglot 27.1/.02 7.2/.03 13.2/.02 7.5/.01 5.9/.01
polyglot polyglot 27.4/.02 9.1/.03 13.8/.02 8.5/.01 6.5/.01
Table 2.5: Comparing the effects of building subword vocabulary from a polyglot corpus and using
the polyglot corpus to train the parent model. In most cases the effects are additive. Row 2 has the
same shatter rate as row 3, as it uses the same vocabulary. For shatter rate, lower is better.
word corrupted by simply distorting single characters can be more easily recognized than a word
corrupted by manipulating longer sequences of characters. When the subwords from French are
used we get a shatter rate of 0.04 on Romanian data vs a shatter rate of 0.02 when we use subwords
from the universal vocabulary. For Finnish, as an example of a language with high morphological
inflection, the shatter rates are 0.01 (using universal subwords) vs. 0.1 (using French subwords).
All shatter rates along with the effect of parent data vs. parent subword vocabulary are shown
in Table 2.5. The polyglot subword vocabulary is helpful by decreasing the shatter rate, even if
the parent data remains French–English, and the polyglot data provides additional benefit. So the
impacts and benefits are indeed additive.
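To make the heuristic concrete, the following is a minimal sketch of how a shatter rate could be computed; the `segment` callable (for example, a SentencePiece encoder) and the word-level input are assumptions of this illustration, not the exact scripts used in our experiments.

```python
def shatter_rate(words, segment):
    """Fraction of words that are "shattered": a word counts as shattered if its
    segmentation has more pieces than half the word's character length.
    `segment` maps a word to a list of subword pieces (e.g., via SentencePiece)."""
    if not words:
        return 0.0
    shattered = sum(len(segment(w)) > len(w) / 2 for w in words)
    return shattered / len(words)
```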
An example is illustrative. The word ‘liikenneruuhkien’ in Finnish, which has high morphological inflection, means ‘traffic jam’ and is constructed by concatenating ‘liikenne’ (transportation,
traffic) and ‘ruuhkien’ (congestion). When segmented by French subwords, it shatters into ‘li ik
enn er u u h ki en_’ and when segmented by universal subwords, it turns into ‘lii ken ner uu hki
en_’ (an _ indicates the end of the word). As another example, take ‘polttoainevarannot’, which
means ‘fuel reserves’ in Finnish and is made from smaller words ‘polttoaine’ (fuel) and ‘varannot’
(reserves). When segmented by French subwords, it turns into ‘pol tt oa ine va ran not_’ and when
segmented by universal subwords, it turns into ‘pol tto aine vara nno t_’. In both cases the French
subwords break the words into more segments than the universal subwords do; in the first case
even down to single characters. Table 2.6 shows how this can impact translation output.
Source | Yksi haenen tavoitteistaan olikin liikenneruuhkien vaelttaeminen.
French vocab. segmentation | Y ks i_ ha ene n_ ta voit te ist a an_ oli kin_ li ik enn er u u h ki en_ va el tta em ine n_ ._
Universal vocab. segmentation | Y ksi_ ha ene n_ tav oit te ista an_ oli kin_ lii ken ner uu hki en_ va elt tae mine n_ ._
Reference | One of his goals was to avoid traffic jam.
French parent output | One of his objectives was to avoid mobility.
Universal parent output | One of his goals was to avoid traffic jams.
Source | Myoes polttoainevarannot ovat ehtymaessae.
French vocab. segmentation | My oe s_ pol tt oa ine va ran not_ ov at_ e ht ym aes sa e_ ._
Universal vocab. segmentation | My oes_ pol tto aine vara nno t_ ovat_ eh ty mae ssa e_ ._
Reference | Also fuel reserves are running short.
French parent output | This too is the case.
Universal parent output | fuel reserves are also under pressure.
Table 2.6: Example sentences from the Finnish test set. The system transferred from French
parent with French subwords completely misses ‘congestion’ and ‘fuel reserves’ due to shattering
‘ruuhkien’ and ‘polttoainevarannot’, respectively.
We can gain even more insight into how these shared subwords get reused by looking at the
segmentations of the same word, romanized in several languages, when we feed them through the universal
vocabulary. Table 2.7 shows segmentations of the word ‘Philippines’ in four languages included in the
universal parent (namely Russian, Somali, Hausa, and Yoruba). We can observe that the subwords ‘Fili’ and ‘ppi’ are reused among these languages. More
importantly, to take again an example from Finnish, when the universal vocabulary faces the new word
‘Filippiinit’ (‘Philippines’ in Finnish), it breaks it down to ‘Fili ppi ini t’,11 reusing the same subwords it has learned from and seen before in the parent model languages. In contrast, the
French vocabulary breaks ‘Filippiinit’ down to ‘Fil ip pi ini t’; not only does its segmentation have
more subwords than the universal vocabulary’s segmentation (5 vs. 4), it is also unlikely those
11For clarity we omit word-ending markers when presenting subword tokenization in this narrative.
subwords carry much semantic weight given that ‘Philippines’ in French is ‘Philippines’ and does
not share any subwords included in ‘Fil ip pi ini t’. In fact, we find it worth noting that when
we feed ‘Filippiinit’ into the universal parent model, without any further fine-tuning, it correctly
translates it into ‘Philippines’. However, the French parent can do no more than simply output
‘Filippiinit’ as is with no further fine-tuning. This knowledge, of course, is the result of having a
polyglot vocabulary that is properly trained thanks to a polyglot corpus (e.g., the subword ‘Fili’ has been seen
before in sequences leading to generating ‘Philippines’). The fact that we maintain this baked-in
knowledge, by using this very same well-trained vocabulary in any future fine-tuning sessions, is
the key to the effectiveness of our method.
‘Philippines’ romanized in Language X | Universal Vocab. Segmentation
Filippiny | Fili ppi ny
Filibiin | Fili bi in
Filifin | Fili fin
Filippin | Fili ppi n
Table 2.7: ‘Philippines’ in four languages romanized and their universal vocabulary segmentations.
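Segmentations like the ones in Table 2.7 can be reproduced with the SentencePiece Python bindings; the model file name below is a hypothetical placeholder standing in for a trained universal vocabulary model.

```python
import sentencepiece as spm

# Load a trained SentencePiece BPE model (path is a placeholder).
sp = spm.SentencePieceProcessor(model_file="universal_vocab.model")

# Segment romanized forms of 'Philippines' plus the unseen Finnish word.
for word in ["Filippiny", "Filibiin", "Filifin", "Filippin", "Filippiinit"]:
    print(word, sp.encode(word, out_type=str))
```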
2.4 Chapter Conclusion
In this chapter, we propose a single parent model and vocabulary, which flexibly enables rapid
generation of low-resource NMT systems for nearly any source language, even with a small parallel
corpus.
The approach is at heart a resolution of the problem of vocabulary mismatch between parent
and child languages that occurs during transfer. Previous solutions lead to significant overhead
due to the need to train customized parent models. In contrast, our method, with a ready-for-use
parent, is applicable to triage in emergency scenarios and other cases where rapid deployment is
desired.
This chapter describes how our proposed solution relies on obtaining a universal vocabulary
from a heavily multilingual corpus, the languages in which need to be orthographically unified
using romanization. This vocabulary, accompanied by a model built from a multilingual parallel
corpus, is our ready-for-use package that can be employed in a matter of seconds for training
on virtually any new child language. Reporting the results of our numerous experiments, we
show the effectiveness and superiority of our approach across 5 languages. We also show that
the advantages that come from its universality cannot simply be matched by having a larger (in
terms of the number of sentences in the training corpus) parent model with a single language on
the source side, or a more closely related parent model. We provide a comprehensive discussion
and analysis of where our model gets its edge from; we show common subwords are indeed
helpful, and by defining a new metric, “shatter rate”, and providing examples, we are able to offer
intuitions on why that is so. We also show quantitatively that the benefits from the universal vocabulary
and from fine-tuning a polyglot parent are additive, and that using both of these elements provides the
best all-around solution.
Chapter 3
Parameter Efficiency:
Exclusive Cross-Attention Transfer for Translation
The Transformer (Vaswani et al., 2017) is the de facto architecture to use across tasks with sequential data. It has been used predominantly for natural language tasks, and has more recently also pushed
the state-of-the-art on vision tasks (Dosovitskiy et al., 2021). In particular, transfer learning from
large self-supervised pre-trained Transformer-based language models has been widely adopted to
train new models: adapting models such as BERT (Devlin et al., 2019) and XLM-R (Conneau et
al., 2020) for encoder-only tasks and models such as BART (Lewis et al., 2020) and mBART (Liu
et al., 2020) for encoder-decoder tasks like machine translation. As discussed in Chapter 2, this
transfer learning is predominantly performed in the form of fine-tuning: using the values of several
hundred million parameters from the pre-trained model to initialize a model and start training from
there.
Fine-tuning pre-trained models often involves updating all parameters of the model without
making a distinction between them based on their importance. However, several recent studies have looked into the relative importance of multi-headed self- and cross-attention layers when
training an MT model from scratch (Voita et al., 2019; Michel et al., 2019; You et al., 2020). Cross-attention (also known as encoder-decoder attention) layers are more important than self-attention
layers in the sense that they result in more degradation in quality when pruned, and hence, are
more sensitive to pruning (Voita et al., 2019; Michel et al., 2019). Also, cross-attention cannot be
Figure 3.1: Overview of our transfer learning experiments, depicting (a) training from scratch,
(b) conventional fine-tuning (src+body), (c) fine-tuning cross-attention (src+xattn), (d) fine-tuning new vocabulary (src), (e) fine-tuning cross-attention when transferring the target language
(tgt+xattn), (f) transfer learning with updating cross-attention from scratch (src+randxattn).
Dotted components are initialized randomly, while solid lines are initialized with parameters from
a pre-trained model. Shaded, underlined components are fine-tuned, while other components are
frozen.
replaced with hard-coded counterparts (e.g., an input-independent Gaussian distribution) without
significantly hurting the performance, while self-attention can (You et al., 2020). With the ubiquity of fine-tuning as a training tool, we find a similar investigation focused on transfer learning
missing. In this chapter, we inspect cross-attention and its importance and capabilities through the
lens of transfer learning for MT.
At a high level, we look at training a model for a new language pair by transferring from a pre-trained MT model built on a different language pair. Given that, our study frames and addresses
three questions: 1) How powerful is cross-attention alone in terms of adapting to the new language
pair while other modules are frozen? 2) How crucial are the cross-attention layers’ pre-trained
values with regard to successful adaptation to the new task? and 3) Are there any qualitative differences in the learned representations when cross-attention is the only module that gets updated?
To answer these questions, we compare multiple strategies of fine-tuning towards a new language pair from a pre-trained translation model that shares one language with the new pair. These
are depicted in Figure 3.1: a) Ignoring the pre-trained parameters and training entirely from randomly initialized parameters (that is, “from scratch”), b) Fine-tuning all parameters except the embeddings for the language in common,1
(that is, “regular” fine-tuning, our upper bound), c) Fine-tuning solely the cross-attention layers and new embeddings, and d) Fine-tuning only the new
embeddings. Here, new embeddings refer to randomly initialized embeddings corresponding to
the vocabulary of the new language. In Figures 3.1a–3.1d, we assume the new language pair has a
new source language and not a new target language; Figure 3.1e shows an example of target-side
transfer. In the experiments that follow we will always train new, randomly initialized embeddings
for the vocabulary of the newly introduced language. Generally, all other parameters are imported
from a previously built translation model and, depending on the experiment, some will remain
unchanged and others will be adjusted during training.
Our experiments and analyses show that fine-tuning the cross-attention layers while keeping
the encoder and decoder fixed results in MT quality that is close to what can be obtained when
fine-tuning all parameters (§3.3). Evidence also suggests that fine-tuning the previously trained
cross-attention values is in fact important—if we start with randomly initialized cross-attention
parameter values instead of the pre-trained ones, we see a quality drop.
Furthermore, intrinsic analysis of the embeddings learned under the two scenarios reveals that
full fine-tuning exhibits different behavior from cross-attention-only fine-tuning. When the encoder and decoder bodies are not fine-tuned, we show that the new language’s newly-learned embeddings align with the corresponding embeddings in the pre-trained model. That is, when we
transfer from Fr–En to Ro–En for instance, the resulting Romanian embeddings are aligned with
the French embeddings. However, we do not observe the same effect when fine-tuning the entire
body. In §3.4 we see how such aligned embeddings can be useful. We specifically show they can
be used to alleviate forgetting and perform zero-shot translation.
Finally, from a practical standpoint, our strategy of fine-tuning only cross-attention is also a
more lightweight2 fine-tuning approach (Houlsby et al., 2019) that reduces the storage overhead
1Freezing shared language embeddings is common practice (Zoph et al., 2016).
2We use descriptors “parameter-efficient” and “lightweight” interchangeably.
for extending models to new language pairs: by fine-tuning a subset of parameters, we only need
to keep a copy of those instead of a whole-model’s worth of values for the new pair. We quantify
this by reporting the fraction of parameters that is needed in our case relative to having to store a
full new model for each adapted task.
Our contributions are:
• We empirically show the competitive performance of fine-tuning the cross-attention layers
exclusively when contrasted with fine-tuning the entire Transformer body.
• We show that when fine-tuning only the cross-attention layers, the new embeddings get
aligned with the respective embeddings in the pre-trained model. The same effect does not
hold when fine-tuning the entire Transformer body.
• We demonstrate effective application of this alignment artifact in mitigating catastrophic forgetting (Goodfellow et al., 2014) and in zero-shot translation.
3.1 Cross-Attention Fine-Tuning for MT
Fine-tuning pre-trained Transformer models towards downstream tasks has pushed the limits of
NLP, and MT has been no exception (Liu et al., 2020). Despite the prevalence of using pre-trained
Transformers, recent studies focus on investigating the importance of self- and cross- attention
heads while training models from scratch (Voita et al., 2019; Michel et al., 2019; You et al.,
2020). These studies verify the relative importance of cross-attention over self-attention heads by
exploring either pruning (Voita et al., 2019; Michel et al., 2019) or hard-coding methods (You
et al., 2020). Considering these results and the popularity of pre-trained Transformers, our goal in
this work is to study the significance of cross-attention while focusing on transfer learning for MT.
This section formalizes our problem statement, introduces the notations we will use, and describes
our setup to address the questions we raise.
3.1.1 Problem Formulation
In this chapter, we focus on investigating the effects of the cross-attention layers when fine-tuning
pre-trained models towards new MT tasks. Fine-tuning for MT is a transfer learning method that,
in its simplest form (Zoph et al., 2016) as we saw in Chapter 2, involves training a model called
the “parent” model on a relatively high-resource language pair, and then using the obtained parameters to initialize a “child model” when further training towards a new, potentially low-resource,
language pair. Here, high-resource and low-resource refer to the amount of parallel data that is
available for the languages. Henceforth, we use “parent” and “child” when referring to training
components (e.g., model, data, etc.) in the pre-training and fine-tuning stages, respectively.
Granular Notations. It is common practice for fine-tuning to further update all parent parameters θ on the child data without making any distinction between them. We instead consider θ at a
more granular level, namely as:
θ = θsrc ∪ θtgt ∪ θenc ∪ θdec ∪ θxattn
where θsrc includes source-language token embeddings, source positional embeddings, and source
embeddings layer norm parameters; θtgt similarly includes target-language (tied) input and output
token embeddings, target positional embeddings, and target embeddings layer norm parameters;
θenc includes self-attention, layer norm, and feed-forward parameters in the encoder stack; θdec
includes self-attention, layer norm, and feed-forward parameters in the decoder stack; and θxattn
includes cross-attention and corresponding layer norm parameters.
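As a rough illustration of this partition, the sketch below groups the named parameters of a generic Fairseq-style Transformer by matching on parameter names; the name patterns are assumptions made for the sake of the example and would need to be adapted to the actual model definition.

```python
def partition_parameters(model):
    """Group a Transformer's parameters into the sets used in our notation."""
    groups = {"src": [], "tgt": [], "enc": [], "dec": [], "xattn": []}
    for name, param in model.named_parameters():
        if name.startswith("encoder.embed") or name.startswith("encoder.layernorm_embedding"):
            groups["src"].append((name, param))    # θ_src: source embeddings (+ layer norm)
        elif name.startswith("decoder.embed") or "output_projection" in name:
            groups["tgt"].append((name, param))    # θ_tgt: (tied) target embeddings
        elif "encoder_attn" in name:
            groups["xattn"].append((name, param))  # θ_xattn: cross-attention (+ its layer norms)
        elif name.startswith("encoder."):
            groups["enc"].append((name, param))    # θ_enc: encoder self-attn, FFN, layer norms
        else:
            groups["dec"].append((name, param))    # θ_dec: decoder self-attn, FFN, layer norms
    return groups
```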
3.1.2 Analysis Setup
Inspections like ours into individual modules of Transformer often rely on introducing some constraints in order to understand the module better. These constraints come in the form of full removal
or pruning (Tang et al., 2019; Voita et al., 2019), hard-coding (You et al., 2020), and freezing (Bogoychev, 2021). We rely on freezing. We proceed by taking pre-trained models, freezing certain
parts, and recording the effect on performance, measured by BLEU.
Within the framework of our problem, to address the questions raised at the beginning of this
chapter, our analysis compares full and partially-frozen fine-tuning for MT under several settings,
which we summarize here:
Cross-attention fine-tuning & embedding fine-tuning comparative performance. This is to
realize how much fine-tuning the cross-attention layers helps in addition to fine-tuning respective
embeddings alone.
Cross-attention fine-tuning & full fine-tuning comparative performance. We wish to find out
where fine-tuning cross-attention stands relative to fine-tuning the entire body. This is to confirm
whether or not cross-attention alone can adapt to the child language pair while the encoder and
decoder layers are frozen.
Pre-trained cross-attention layers & random cross-attention layers. We wish to understand
how important a role cross-attention’s pre-trained values play when single-handedly adapting to a
new language pair. This determines if the knowledge encoded in cross-attention itself has a part in
its power.
Translation cross-attention & language modelling cross-attention. Finally, we contrast the
knowledge encoded in cross-attention learned by different pre-training objectives. This is to evaluate if the knowledge brought about by a different pre-training objective affects the patterns observed from a cross-attention pre-trained on MT while fine-tuning for MT.
3.2 Experimental Setup
In this section, we describe our experiments and the data and model that we use to materialize the
analysis outlined in §3.1.2.
3.2.1 Methods
We first provide the details of our transfer setup, and then describe the specific fine-tuning baselines
and variants used in our experiments.
General Setup. An important concern when transferring is initializing the embeddings of the
new language (refer to §2.1 for a more detailed explanation). When initializing parameters in the
child model, there are several ways to address the vocabulary mismatch between the parent and the
child model: frequency-based assignment, random assignment (Zoph et al., 2016), joint (shared)
vocabularies (Nguyen and Chiang, 2017; Kocmi and Bojar, 2018; Neubig and Hu, 2018; Gheini
and May, 2019; Liu et al., 2020), and no assignment at all, which results in training randomly
initialized embeddings (Aji et al., 2020). In our experiments, we choose to always use new random
initialization for the new embeddings (including token embeddings, positional embeddings, and
corresponding layer norm parameters). This decision is made to later let us study what happens to
embeddings under each of the settings, independent of any pre-training artifacts that exist in them.
For instance, when transferring from Fr–En to {Ro–En, Fr–Es}, respectively, all parameters are
reused except for {θsrc, θtgt},3 which get re-initialized given the new {source, target} language.
The side that remains the same (e.g., En when going from Fr–En to Ro–En), uses the parent
vocabulary and keeps the corresponding embeddings frozen during fine-tuning.4
Fine-tuning Settings. With the general transfer setup, we employ different settings in our experiments to address the points in §3.1.2. Each fine-tuning method is clarified based on our notations in
§3.1.1: 1) {src,tgt} only updates the embeddings {θsrc, θtgt} (Figure 3.1d). 2) {src,tgt}+body
additionally updates the entire Transformer body ({θsrc, θtgt} + θenc + θdec + θxattn) (Figure 3.1b).
3) {src,tgt}+xattn only updates the cross-attention layers in addition to the first baseline ({θsrc,
θtgt} + θxattn), and keeps the encoder and decoder stacks frozen (Figure 3.1c, 3.1e). These collectively address the first and second settings in §3.1.2. 4) {src,tgt}+randxattn similarly only
3We drop the “respectively” henceforth and use {...} throughout to indicate alternation.
4Preliminary ablations fine-tuning all embeddings did not change the outcome or conclusions of our experiments.
Pair | Train Corpus (Sent. Count) | Test Corpus | Vocab. Size
Ro–En | WMT16 (612.4 K) | newstest2016 | 16 K / reuse tgt
Ja–En | IWSLT17 (223.1 K) | IWSLT17 | 8 K / reuse tgt
De–En | IWSLT16 (196.9 K) | IWSLT16 | 8 K / reuse tgt
Ha–En | ParaCrawl v8 (159.0 K) | newsdev2021 | 8 K / reuse tgt
Fr–Es | News Comm. v15 (283.5 K) | newstest2013 | reuse src / 8 K
Fr–De | News Comm. v15 (284.1 K) | newstest2020 | reuse src / 8 K
Table 3.1: Data sources and statistics for each of the child language pairs.
updates the cross-attention layers in addition to embeddings, but uses randomly initialized values
instead of pre-trained values (Figure 3.1f). This addresses the third setting in §3.1.2.
For all transfer experiments, we also conduct the scratch variant (Figure 3.1a), where we train
a model from scratch on the child dataset. This is to confirm the effectiveness of transfer under
each setting. We conduct all the above experiments using a French–English translation model as
parent and transferring to six different child language pairs. In §3.3.1 we conduct an ablation that
substitutes mBART (Liu et al., 2020) as a parent. mBART is trained with a denoising objective in
a self-supervised manner. In contrast to a translation model, the cross-attention layers in mBART
have thus not been learned using any parallel data. This enables us to distinguish between different
pre-training objectives, addressing the fourth setting in §3.1.2.
3.2.2 Data and Model Details
Dataset. For the choice of language pairs and datasets, we mostly follow You et al. (2020) (Fr–En,
Ro–En, Ja–En, De–En) and additionally include Ha–En, Fr–Es, and Fr–De. We designate Fr–En
as the parent language pair and Ro–En, Ja–En, De–En, Ha–En (new source), Fr–Es, Fr–De (new
target) as child language pairs. Our Fr–En parent model is trained on the Europarl + Common
Crawl subset of WMT14 Fr–En,5 which comprises 5,251,875 sentences. Details and statistics of
the data for the child language pairs are provided in Table 3.1.
Model Details. We use the Transformer base architecture (6 layers of encoder and decoder with
model dimension of 512 and 8 attention heads) (Vaswani et al., 2017) for all models, and the
Fairseq (Ott et al., 2019) toolkit for all our experiments.
All models rely on BPE subword vocabularies (Sennrich et al., 2016b) processed through the
SentencePiece (Kudo and Richardson, 2018) BPE implementation. The vocabulary for the parent
model consists of 32K French subwords on the source side, and 32K English subwords on the
target side. The sizes of the vocabularies for child models are also reported in Table 3.1. We
follow the advice from Gowda and May (2020) when deciding what vocabulary size to choose,
i.e., we choose the maximum number of operations to ensure a minimum of 100 tokens per type.
3.3 Results and Analysis
Our preliminary empirical results consist of five experiments for each child language pair based
on methods described in §3.2.1: scratch, {src,tgt}, {src,tgt}+body, {src,tgt}+xattn, and
{src,tgt}+randxattn. Our core results, which rely on transferring from the Fr–En parent under
each setting, are reported in Table 3.2. All scores are detokenized cased BLEU computed using
SACREBLEU (Post, 2018).6
3.3.1 Cross-attention’s Power and Importance
Translation Quality. Table 3.2 shows that the {src,tgt}+xattn setting substantially improves
upon {src,tgt} in all but one case (Ha–En), especially when transferring to a pair with a new
target language, and is competitive with {src,tgt}+body across all six language pairs, suggesting
5http://statmt.org/wmt14/translation-task.html
6Signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.8.
Ro–En Ja–En De–En Ha–En Fr–Es Fr–De
scratch (100%) 29.0 9.2 30.8 5.4 24.4 18.5
{src,tgt} (8%) 29.8 8.7 32.4 8.6 21.6 11.6
{src,tgt}+body (75%) 31.0 11.8 36.2 8.8 27.3 21.4
{src,tgt}+xattn (17%) (-0.1) 30.9 (-2.0) 9.8 (-1.2) 35.0 (-0.4) 8.4 (-0.8) 26.5 (-1.8) 19.6
{src,tgt}+randxattn (17%) 27.9 8.4 33.3 7.0 26.0 18.8
Table 3.2: BLEU scores for each of the five experiments across six language pairs. Bold numbers indicate the top two scoring approaches. The percentage in parentheses next to each fine-tuning
strategy is the fraction of parameters that had to be updated and hence stored as new values for
future use. Numbers in parentheses next to {src,tgt}+xattn scores show the difference from
{src,tgt}+body.
that cross-attention is capable of taking advantage of encoded generic translation knowledge in the
Transformer body to adapt to each child task. Performance gain from {src,tgt} and drop from
{src,tgt}+body when changing the target language (i.e., Fr–Es and Fr–De) are more pronounced
than when transferring the source. This is expected—when changing the target, two out of three
cross-attention matrices (key and value matrices) are now exposed to a new language. When
transferring source, only the query matrix is exposed to the new language.
Storage. We also report the fraction of the parameters that need to be updated in each case.
This is equivalent to the storage overhead that the training process incurs, as the updated parameters need to be stored to be used later. However, the parameters that are reused are only
stored once. The number of parameters updated is dependent on the size of the vocabulary in
each experiment, since embeddings for a new vocabulary are included. Hence, the single number reported for each fine-tuning strategy is the average across the six language pairs. Extending
to new language pairs following {src,tgt}+xattn is much more efficient in this regard, as expected. We concretely calculate the number of parameters that need to be stored combined for
the six new language pairs: {src,tgt}+xattn stores only 124,430,336 parameters compared to
{src,tgt}+body’s 313,583,616.
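The reported fractions can be computed directly from a configured model by counting only the parameters that will actually be updated (and therefore must be stored anew), as in this small sketch:

```python
def updated_fraction(model):
    """Fraction of parameters that require gradients, i.e., must be stored per child pair."""
    updated = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return updated / total
```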
Pre-trained and Random Values. Finally, {src,tgt}+randxattn experiments also offer perspective on the importance of translation knowledge encoded in cross-attention itself. Not only
does randomly initialized cross-attention fail to perform as well as pre-trained cross-attention when
being transferred, but in two cases, it even falls behind training from scratch.
Our results from transferring mBART (Liu et al., 2020) to the child language pairs also emphatically illustrate the importance of the type of knowledge encoded in cross-attention. mBART
is a 12-layer Transformer pre-trained with a denoising objective in a self-supervised manner using span masking and sentence permutation noising functions. Hence, its cross-attention does not
have any translation knowledge a priori, in contrast with the French–English MT parent model.
We transfer mBART to the same language pairs as in Table 3.2 and provide the results in Figure 3.2. Since mBART uses a shared vocabulary and tied embeddings between the encoder and
decoder, in Figure 3.2 we use embed in experiments’ names to signify all embeddings get updated
in the case of mBART (θsrc + θtgt).
mBART is a larger model than our Fr–En parent. So a higher range of scores is expected. While
the same patterns hold across embed+{body,xattn,randxattn} fine-tuning, the crux of the matter
is that embed fine-tuning fails in contrast to the comparable {src, tgt} fine-tuning setting of the
translation parent. src fine-tuning has higher BLEU than scratch in three cases (Ro–En, De–En,
Ha–En). However, embed fine-tuning has higher BLEU than the scratch baseline only in the Ja–
En case, and even then, very slightly so (only by 0.1 BLEU). This shows that the absence of translation
knowledge in mBART’s pre-trained cross-attention makes fine-tuning cross-attention more critical
for adapting mBART to translation: exclusively fine-tuning embeddings in mBART
simply fails, while doing the same with a translation parent model is more successful.
3.3.2 Learned Representations Properties
Given that besides cross-attention, embeddings are the only parameters that get updated in both
{src,tgt}+body and {src,tgt}+xattn settings, we take a closer look at them. We want to know
how embeddings change under each setting.
Figure 3.2: BLEU scores across different transfer settings using mBART as parent. Exclusive
fine-tuning of embeddings (embed) is not effective at all due to lack of translation knowledge in
the cross-attention layers.
To probe the relationship between embeddings learned as a result of different kinds of fine-tuning, we examine the quality of induced7 bilingual lexicons, a common practice in the cross-lingual
embeddings literature (Artetxe et al., 2017); here, however, the lexicons are learned incidentally rather than by design.
We use the bilingual dictionaries released as a resource in the MUSE (Lample et al., 2018)
repository.8 For instance, to compare the German embeddings from each of the src+body and
src+xattn De–En models to the French embeddings learned in the parent model, we use the
De–Fr dictionary. We filter our learned embeddings (which are, in general, of subwords) to be
compatible with the MUSE vocabulary. Of the 8,000 German subwords in the vocabulary, 2,025
are found in MUSE. For each of these, we find the closest French embedding by cosine similarity;
if the resulting (German, French) pair is in MUSE, we consider this a match. Via this method, we
find the accuracy of the bilingual lexicon induction through the embeddings of src+xattn model
is 55%. However, the accuracy through the embeddings of src+body is much lower at 19.7%. Due
to only considering the exact matches against the gold dictionary, this is a very strict evaluation.
We also manually look at a sample of 40 words from the German set and check for the correctness
of retrieved pairs for those using an automatic translator: while src+xattn scores in the range of
80%, src+body scores in the range of 30%. Details of this manual inspection are provided in
Table 3.4 at the end of this chapter. We further report the accuracy of the bilingual dictionaries of
three other pairs learned under the two fine-tuning settings for which gold dictionaries are available
7via nearest neighbor retrieval
8https://github.com/facebookresearch/MUSE
Figure 3.3: Accuracy of bilingual dictionaries induced through embeddings learned under
tgt+body and tgt+xattn settings. De and Es effectively get aligned with En under tgt+xattn
(left). As they are both aligned to En, we can also indirectly obtain a De–Es dictionary (right).
Similar practice completely fails under tgt+body.
in Figure 3.3. We do not limit ourselves to child-parent dictionary induction; we also consider
child-child dictionary induction (e.g., De–Es) which essentially relies on both languages being
aligned with the parent (i.e., En).
Overall, these results confirm that embeddings learned under {src,tgt}+xattn effectively
get aligned with corresponding parent embeddings. However, this is not the case with embeddings
learned under {src,tgt}+body. This suggests such an effect is not the default pattern in translation
models, but rather an artifact of the freezing choices made in {src,tgt}+xattn.
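The nearest-neighbor lexicon induction described above can be sketched as follows; the embedding matrices, vocabularies, and gold MUSE pairs are assumed to have been extracted beforehand, so this illustrates the retrieval step rather than our exact evaluation script.

```python
import torch
import torch.nn.functional as F

def lexicon_induction_accuracy(child_emb, parent_emb, child_vocab, parent_vocab, gold_pairs):
    """child_emb: (Vc, d) tensor, parent_emb: (Vp, d) tensor; *_vocab: index -> word;
    gold_pairs: set of (child_word, parent_word) entries from a MUSE dictionary."""
    covered = {c for c, _ in gold_pairs}
    child = F.normalize(child_emb, dim=-1)
    parent = F.normalize(parent_emb, dim=-1)
    nearest = (child @ parent.T).argmax(dim=-1)          # cosine nearest neighbor per child word
    hits, total = 0, 0
    for i, j in enumerate(nearest.tolist()):
        if child_vocab[i] in covered:                    # only score words present in MUSE
            total += 1
            hits += (child_vocab[i], parent_vocab[j]) in gold_pairs
    return hits / max(total, 1)
```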
3.4 Utilities of Aligned Embeddings
We saw how fine-tuning only cross-attention results in cross-lingual embeddings with respect to
parent embeddings. That is how cross-attention is able to use the baked-in knowledge in the encoder and decoder without any further updates to them. In this section, we discuss two areas where
this can be turned to our advantage: mitigating forgetting and performing zero-shot translation.
Figure 3.4: Performance on the original language pair after transfer. The original Fr–En parent
model scores 35.0 BLEU on the Fr–En test set. {src,tgt}+xattn outperforms {src,tgt}+body
on the parent task.
3.4.1 Mitigating Forgetting
One area where the discovery of §3.3.2 can be taken advantage of is mitigating catastrophic forgetting. Catastrophic forgetting refers to the loss of previously acquired knowledge in the model
during transfer to a new task. To the best of our knowledge, catastrophic forgetting in MT models
has only been studied within the context of inter-domain adaptation (Thompson et al., 2019; Gu
and Feng, 2020), and not inter-lingual adaptation.
The effectiveness of the cross-lingual embeddings learned under the {src,tgt}+xattn setting
at mitigating forgetting is evident from the results provided in Figure 3.4. Here we take three
of the transferred models, plug the appropriate embeddings back into them, and compare their
performance on the original language pair against the parent model. Specifically, we take the De–
En, Ro–En, and Fr–Es models transferred from Fr–En under each of the two {src,tgt}+xattn
and {src,tgt}+body settings, plug back in the original {Fr, En} embeddings, and evaluate
performance on the Fr–En test set. This score is then compared against the Fr–En parent model’s
performance on the Fr–En test set, which is 35.0 BLEU. While being comparable in terms of
performance on the child task as reported in Table 3.2, {src,tgt}+xattn consistently outperforms
{src,tgt}+body on Fr–En. Compared to the original Fr–En model, the source-transferred models
(De–En, Ro–En) outperform the target-transferred model (Fr–Es). However, tgt+xattn is much
more robust against forgetting compared to tgt+body, which remembers close to nothing (0.2
BLEU).
3.4.2 Zero-Shot Translation
Another area where well-aligned embeddings from the {src,tgt}+xattn setting can come in
handy is zero-shot translation. Since the source embeddings are aligned, we can, for instance, replace the French embeddings in the Fr–Es model learned via tgt+xattn with German embeddings
from the De–En model learned via src+xattn and form a De–Es translation model with no De–Es
training or direct De–Fr alignment. We additionally build two more zero-shot systems in the same
manner: Ro–Es (using transferred Ro–En and Fr–Es models) and Ro–De (using transferred Ro–En
and Fr–De models). To put zero-shot scores in context, for each pair we also train a model from
scratch: for De–Es using 294,216-sentence News Commentary v14 corpus, and for Ro–Es and
Ro–De using 387,653-sentence and 385,663-sentence Europarl corpora respectively. All scores
are provided in Table 3.3.
De–Es Ro–Es Ro–De
Zero-shot BLEU 9.2 14.7 9.8
Supervised BLEU 18.3 18.6 13.4
Table 3.3: Performance of zero-shot systems for three language pairs. De–Es is evaluated on
newstest2013 test set. Ro–Es and Ro–De are evaluated on respective TED talks corpus test sets
(Qi et al., 2018).
In the case of De–Es, we train two additional models from scratch on 50,000- and 100,000-
sentence subsets of the training corpus. These respectively score 7.2 and 12.0 BLEU on the newstest2013 De–Es test set (vs. zero-shot performance of 9.2). Taken together, these results show that
the zero-shot systems we obtain from cross-attention-based transfer can yield reasonable translation models in the absence of parallel data.
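Conceptually, assembling such a zero-shot system amounts to swapping embedding matrices between checkpoints, as in the minimal sketch below; the checkpoint paths and the Fairseq-style parameter key are assumptions, and in practice the matching source vocabulary files must be swapped in as well.

```python
import torch

# Hypothetical checkpoint paths for the two fine-tuned models.
fr_es = torch.load("fr_es.tgt_xattn.pt", map_location="cpu")   # Fr–Es, trained via tgt+xattn
de_en = torch.load("de_en.src_xattn.pt", map_location="cpu")   # De–En, trained via src+xattn

# Replace the (aligned) French source embeddings with the German ones.
key = "encoder.embed_tokens.weight"
fr_es["model"][key] = de_en["model"][key]

# Save the resulting zero-shot De–Es model (used together with the German source vocabulary).
torch.save(fr_es, "de_es.zero_shot.pt")
```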
3.5 Chapter Conclusion
We look at how powerful cross-attention can be under constrained transfer learning setups. We
empirically show that cross-attention can single-handedly result in comparable performance with
fine-tuning the entire Transformer body, and it is through no magic: it relies on translation knowledge in the pre-trained values to do so and has new embeddings align with corresponding parent
language embeddings. We furthermore show that such aligned embeddings can be used towards
catastrophic forgetting mitigation and zero-shot transfer. We hope this investigative study encourages more analyses in the same spirit towards insights into the inner workings of different modules
and how they can be put to good use.
German Word | src+xattn French Equivalent | src+body French Equivalent
Entdeckung découverte amende
Feind ennemi ennemi
Architekten architectes architecture
gibt existe jette
erforschen explorer sond
Philosoph philosophie philosophie
Cent centi centaines
formen forme forme
lassen laissez PCP
Nummer numéro Key
können puissent puisse
dasselbe mêmes lourds
gelöst résoud résoud
wenig peu peu
zerstört détruit dévas
Bericht reportage témoin
Mark Mark trailer
Brief lettre lettres
Linien lignes lignes
entworfen conçus monté
Dunkelheit ténèbres obscur
Kreis cercle rond
Haie requins Hun
spielt joue tragédie
Elektrizität électricité électriques
Solar solaire Arabes
Flügel ailes avion
Konzept concept alliance
Strukturen structures définit
will veut voulons
Hier Ici Vous
verlieren perdent perdent
unterstützen soutien appui
Planet planète planète
buchstäblich littéralement multimédia
Schuld blâ génére
dass que toi
plötzlich soudainement risques
Kann Pouvez ciel
Ball ballon ballon
Table 3.4: Sampled German words and their equivalents based on the embeddings learned by
each of the models. The correct translations are highlighted. Each pair was manually checked for
correctness using an automatic translator.
Chapter 4
Computation Efficiency:
Transferring from Pre-Trained MEGA
While Transformers are more efficient than RNNs from a parallelizability perspective, their time
complexity is quadratic with respect to sequence length—a drawback when compared to the linear
time complexity of RNNs. Specifically, for a sequence of length n, the self-attention layer in a
Transformer with hidden representation dimension d has a time complexity of O(n² · d), whereas
the RNN layer has a time complexity of O(n · d²) (Vaswani et al., 2017). If n is less than d, which
can often be the case,1 this presents no problem. However, for tasks where n > d and complex
long-distance dependency understanding is required, this poses a challenge.
This challenge can be addressed in a number of ways (Tay et al., 2023). Methods that do so can
be categorized into two groups: methods that introduce architectural modifications to reduce the
complexity of the self-attention in Transformers but do not depart from attention, and methods that
completely rely on alternative architectures, e.g., Structured State Space sequence models (S4) (Gu
et al., 2022). A common strategy to reduce the complexity from quadratic to linear in the methods
that still rely on attention is to limit attention to fixed patterns, e.g., fixed input blocks (Qiu et al.,
2020a) or fixed context intervals (Beltagy et al., 2020).
Limiting attention to a fixed window, however, can hurt performance over long-range context,
which was the problem to begin with. This is due to the fact that information from outside the
1d is in the order of thousands, whereas in many tasks (but not all), n is in the order of hundreds.
current attention window simply has no way to reach and affect the representation of the tokens
inside it. MEGA (Ma et al., 2023), a recent augmentation of the Transformer, tackles this by
incorporating exponential moving average (EMA) of “recent” (nearby) token representations in
the updated representations of the current token in addition to attention-guided updates. The key is
that EMA can efficiently cover tokens outside the attention window as well. Hence, in long-range
contexts, EMA provides the only path for information flow from tokens outside the attention span.
In contrast, methods that inject positional bias directly into the attention scores and solely in the
attention layer (Press et al., 2022) still get blocked by the permitted attention interval.
However, EMA is interesting from an inductive bias point of view as well: it incorporates
recency bias across the time dimension—something that attention lacks by itself. To study the
benefits of EMA for transfer learning, in this chapter, we first provide an overview of EMA in
MEGA in §4.1. We then report our findings from experiments at two different scales: 1) transferring from a pre-trained encoder-decoder MEGA English language model over 7.8 billion tokens
(§4.2.1), and 2) transferring from a pre-trained encoder-decoder MEGA over 25 languages and 180
billion tokens (§4.2.2). In both cases, our empirical results show that MEGA-based models are able
to converge faster than their Transformer-based counterparts while consistently performing better
or on par with them.
4.1 MEGA Background
In the Transformer architecture, the interdependence between tokens is modeled solely through
the attention mechanism, which does not inherently capture any position-dependent information.
Specifically, the attention weights in each head for weighted aggregation of the tokens in each layer
are calculated as:
softmax(ρ · QK^T)    (4.1)
where ρ is a scaling factor, and Q ∈ R^(n×dh) and K ∈ R^(n×dh), called the query and key matrix, respectively, are each linear transformations of the input X ∈ R^(n×d): Q = XWq; K = XWk.
Per Equation 4.1, attention weights have no dependency on the time step. MEGA addresses this
by including an EMA layer2 before the attention layer:
X′ = EMA(X),
and at each time step the EMA update is:
x′t = α ⊙ xt + (1 − α) ⊙ x′t−1    (4.2)
where α ∈ (0, 1)^d is learnable. A comparison of the two strategies is shown in Figures 4.1 and 4.2.
[Figure 4.1 shows token xt with attention edges att_{t→k} to each surrounding token xt−3 … xt+3.]
Figure 4.1: How much influence the updated representation of token xt gets from surrounding tokens under the governance of the attention mechanism. Edge labels indicate the attention score
between xt and the respective token. The figure assumes bidirectional attention. In an autoregressive setting, the token only receives influence from tokens to its left.
[Figure 4.2 shows token xt with EMA edges whose weights decay geometrically: α for xt, α(1−α) at distance one, α(1−α)^2 at distance two, and α(1−α)^3 at distance three.]
Figure 4.2: How much influence the updated representation of token xt gets from surrounding
tokens under the governance of EMA. Edge labels indicate the weight of the respective token.
Lighter edges signify more discounted influence. The figure assumes a bidirectional setting. In an
autoregressive setting, the token only receives influence from the recent tokens to its left.
2For our purposes, this simplified view of the EMA layer is sufficient. However, to be precise, MEGA uses damped
EMA, where a damping factor δ is used to further relax the coupling between α and 1−α:
x′t = α ⊙ xt + (1 − δ ⊙ α) ⊙ x′t−1.
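A minimal PyTorch sketch of the (undamped) recurrence in Equation 4.2 is given below to make the update concrete; MEGA's actual layer is a multi-dimensional damped EMA computed far more efficiently than this explicit loop, so the class and its parameterization are purely illustrative.

```python
import torch
import torch.nn as nn

class SimpleEMA(nn.Module):
    """Per-dimension exponential moving average in the spirit of Equation 4.2."""

    def __init__(self, dim):
        super().__init__()
        # Unconstrained parameter mapped through a sigmoid so that alpha stays in (0, 1).
        self.alpha_logit = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # x: (batch, seq_len, dim); returns X' with x'_t = a*x_t + (1-a)*x'_{t-1}.
        alpha = torch.sigmoid(self.alpha_logit)
        prev = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            prev = alpha * x[:, t] + (1 - alpha) * prev
            outputs.append(prev)
        return torch.stack(outputs, dim=1)
```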
EMA’s output can be computed efficiently using fast Fourier transform (Ma et al., 2023). It
is then fed into a gated variant of the attention mechanism with a single head, which is, as Ma
et al. (2023) theoretically prove, as expressive as multi-headed attention. The rest of MEGA’s
architecture matches that of the Transformer (the embeddings, the feedforward layers, the residual
connections, and the normalization layers).
4.2 Experiments and Results
To investigate whether EMA’s explicit positional inductive bias positively impacts transfer learning, we conduct self-supervised model pre-training at two scales: 1) using 7.8 billion tokens of
English data and 2) using 180 billion tokens of multilingual (25 languages) data before performing
fine-tuning. The reason we operate at two scales is that our larger model is particularly
computation-intensive, the exact details of which we will provide in §4.2.2. This makes it prohibitive for us to train different variants of it for ablation purposes. In fact, as we will see, we are
only able to produce one instance of our larger model using MEGA to compare against its industry-built Transformer counterpart, mBART (Liu et al., 2020)—as first introduced for experiments in
Chapter 3. In contrast, operating at the smaller scale, we are able to pre-train three models: two
based on MEGA and one based on Transformer for a more comprehensive investigation.
We provide the specific details of each category of experiments below. But in terms of workflow, experiments in both categories follow the same pattern:
• We first train a denoising autoencoder in the pre-training stage.
• We then fine-tune the resulting checkpoint on MT.
We use the Fairseq toolkit (Ott et al., 2019) for all experiments, which the MEGA implementation (Ma et al., 2023)3 is also based on.
3https://github.com/facebookresearch/mega
4.2.1 Small-Scale Pre-Training
To provide a complete picture of the effect of EMA, we pre-train three models:
• eBART: An English BART model (Lewis et al., 2020) (essentially, the monolingual counterpart of mBART), a Transformer encoder-decoder pre-trained with denoising objective.
• eMEGABART: MEGA-based equivalent of eBART.
• MEGA-SkipEMA: A variant of eMEGABART where we skip EMA, i.e., the transformation
in Equation 4.2 is omitted and the input is directly fed into the single-head gated attention.
This is to fully ablate EMA and assess how that affects the outcome.
All models have 6 encoder layers and 6 decoder layers with a hidden dimension of 512. eBART,
which is a Transformer, uses 8 attention heads. We adjust all the MEGA-specific model settings
affecting the total number of parameters in eMEGABART and MEGA-SkipEMA in a way that all
three models end up having roughly the same number of parameters (∼170 million parameters).
4.2.1.1 Pre-Training Details
We use a randomly sampled subset of the English portion of CC-100 (Wenzek et al., 2020; Conneau
et al., 2020),4 consisting of 7.8 billion tokens. We use the same vocabulary as mBART (Liu et al.,
2020) for subword tokenization. Each model is trained on a single NVIDIA 32GB V100 GPU
for 135,000 steps with a batch size of 2,048 tokens. We report the final perplexity, a measure of
language modeling performance, for each of the models in Table 4.1.
Additionally, we provide the perplexity curves for the best performing models, eBART and
eMEGABART, in Figure 4.3. We observe that while both architectures achieve similar final performance levels, eMEGABART with explicit positional bias converges faster, making it a more
compute-efficient choice.
4https://huggingface.co/datasets/statmt/cc100
Pre-Trained Model Perplexity (↓)
eBART 3.12
eMEGABART 3.17
MEGA-SkipEMA 3.31
Table 4.1: Comparison of final perplexity across pre-trained models. While eBART and
eMEGABART achieve similar performance, omitting EMA in MEGA-SkipEMA results in a
higher (worse) perplexity.
Figure 4.3: Perplexity curves of eBART and eMEGABART (labeled as eMEGABART in the
graph’s legend). eMEGABART converges significantly faster than its Transformer-based counterpart.
4.2.1.2 Fine-Tuning Details
Besides being compute-efficient, we want to make sure eMEGABART performs competitively in
downstream tasks as well. To demonstrate this, we transfer our small-scale pre-trained models to
MT over five language pair directions: Fr–En, En–Fr, De–En, Es–En, and Ru–En. Note that while
our models are pre-trained only on English, because we use mBART’s multilingual vocabulary,
they can ingest {Fr, De, Es, Ru} data for fine-tuning. Details of the data used during fine-tuning
are provided in Table 4.2. In all cases, we use two test sets: newstest and the devtest split of
the FLORES+ (NLLB Team et al., 2022) MT benchmark to evaluate both in-domain and out-of-domain performance.
Pair | Train Corpus (Sent. Count) | Test Corpora
Fr↔En | News Comm. v16 (156.1 K) | newstest2012, FLORES+
De–En | News Comm. v16 (294.5 K) | newstest2012, FLORES+
Es–En | News Comm. v16 (49.1 K) | newstest2012, FLORES+
Ru–En | News Comm. v16 (265.8 K) | newstest2013, FLORES+
Table 4.2: Data sources used for fine-tuning eBART and eMEGABART.
For each language pair, we carry out three fine-tuning settings:
• eBART full fine-tuning
• eMEGABART full fine-tuning
• eMEGABART fine-tuning sans EMA
In the last setting, we freeze all parameters in the EMA layer and then fine-tune everything else.
This setting is to examine the degree to which the biases that EMA captures are transferable.
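This freezing can be implemented by name before fine-tuning starts, roughly as in the sketch below; the substring check is an assumption about how the EMA submodule is named in the MEGA implementation and would need to match the actual codebase.

```python
def freeze_ema_parameters(model):
    """Keep EMA parameters at their pre-trained values; fine-tune everything else."""
    for name, param in model.named_parameters():
        if "ema" in name.lower():   # assumed naming of the EMA submodule
            param.requires_grad = False
```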
4.2.1.3 Transfer Results
Table 4.3 reports the BLEU scores5
for all fine-tuning settings (eMEGABART-EMA^ refers to
the setting in which we freeze the EMA). Across all language pairs and test sets, both settings
involving MEGA-based models perform better than eBART, which is Transformer-based.
Notably, freezing EMA negligibly affects final MT results. This suggests that the knowledge
captured by EMA during pre-training, albeit learned only from monolingual data, is highly transferable during MT fine-tuning. Additionally, it implies that we can also freeze EMA layers in the
encoder and decoder stacks and use exclusive cross-attention fine-tuning as described in Chapter 3.
5SACREBLEU Signature: BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1.
Fr–En En–Fr De–En Es–En Ru–En
newstest
eBART 15.4 11.6 13.1 18.5 11.9
eMEGABART 16.8 14.7 14.3 19.1 12.4
eMEGABART-EMA^ 16.8 14.6 14.1 19.1 12.4
FLORES+
eBART 17.7 10.8 14.9 12.2 11.0
eMEGABART 19.8 14.7 17.0 12.7 11.6
eMEGABART-EMA^ 19.8 14.8 16.9 12.7 11.7
Table 4.3: BLEU scores for each of the three fine-tuning settings across five language pairs; newstest on the top, FLORES+ on the bottom. Bold numbers indicate the top two scoring approaches.
MEGA-based models consistently place at the top.
4.2.2 Large-Scale Pre-Training
For large-scale pre-training, we undertake to replicate mBART (Liu et al., 2020) using MEGA as
the base model architecture to the best of our ability with the resources available to us. As Liu
et al. (2020) report, building mBART at FAIR6
took 256 NVIDIA 32GB V100 GPUs over 2.5
weeks for a total of 500K steps. Through an AWS7 grant, we were able to pre-train our MEGA-based mBART, MEGAmBART, using 80 NVIDIA 32GB V100 GPUs over 10 days for a total
of 76K steps. Therefore, due to our limited resources, our checkpoint of MEGAmBART is
under-trained compared to mBART.
4.2.2.1 Pre-Training Details
We mimic all data and model settings as described by Liu et al. (2020), including the batch size
by means of gradient accumulation. After 76K steps, our final checkpoint achieves a perplexity of
2.35.8 However, since Liu et al. (2020) do not report pre-training perplexity, we cannot provide
any comparison between MEGAmBART and mBART.
6Currently AI at Meta: https://ai.meta.com/.
7https://aws.amazon.com/
8At 76K steps, the training completes one epoch. This means each data point is seen (exactly) once by our model,
and there are no samples that are seen by mBART but not our model.
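The batch-size matching mentioned above is done via gradient accumulation; Fairseq exposes this through its update-frequency option, so the generic loop below is only a sketch of the underlying idea rather than the training code we ran.

```python
def train_with_accumulation(model, loader, optimizer, loss_fn, accum_steps=32):
    """Take one optimizer step per `accum_steps` micro-batches to emulate a larger batch."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader, start=1):
        loss = loss_fn(model(inputs), targets) / accum_steps   # scale so gradients average
        loss.backward()
        if step % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```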
Xs in Xs–En | Arabic | Czech | German | Spanish | Estonian | Finnish | French | Gujarati
mBART | 16.0 | 19.8 | 24.5 | 18.4 | 16.7 | 15.9 | 26.8 | 12.6
MEGAmBART | 15.0 | 19.3 | 24.5 | 17.6 | 15.6 | 14.5 | 25.9 | 11.4
Xs in Xs–En | Hindi | Italian | Japanese | Kazakh | Korean | Lithuanian | Latvian | Burmese
mBART | 16.0 | 19.4 | 8.3 | 10.3 | 9.5 | 15.3 | 17.3 | 8.2
MEGAmBART | 13.6 | 18.6 | 8.6 | 11.1 | 9.6 | 14.6 | 15.5 | 7.8
Xs in Xs–En | Nepali | Dutch | Romanian | Russian | Sinhala | Turkish | Vietnamese | Chinese
mBART | 14.9 | 18.6 | 24.8 | 15.3 | 9.7 | 16.1 | 19.7 | 10.2
MEGAmBART | 13.7 | 18.7 | 24.5 | 16.5 | 9.8 | 14.6 | 18.5 | 10.4
Table 4.4: BLEU scores for mBART and MEGAmBART fine-tuning experiments. Bold numbers
indicate the top scoring pre-trained model. While MEGAmBART is only trained for 76K steps
compared to 500K steps of mBART, it is able to perform competitively. This, again, demonstrates
that MEGAmBART is more compute-efficient.
4.2.2.2 Fine-Tuning Details
We transfer our large-scale pre-trained model to Xs–En for all Xs
that were involved in the multilingual denoising during pre-training (besides English). All experiments use 800 sentences from
the dev split of the FLORES+ benchmark for training and the remaining 197 sentences for validation (the dev split in FLORES+ contains a total of 997 sentences). The devtest split of FLORES+
remains as the test set as with the small-scale experiments.
4.2.2.3 Transfer Results
We present the results of fine-tuning both mBART and MEGAmBART on translation between
each of the 24 languages and English in Table 4.4. MEGAmBART, under-trained by more than
400K steps relative to mBART, performs very competitively. Averaged across all language pairs,
it scores only 0.6 BLEU shy of mBART’s performance. We also repeated the fine-tuning setting
where EMA is frozen, and observed the same pattern as we had in §4.2.1: freezing EMA never
caused a drop of more than 0.1 BLEU in the final score.
Figure 4.4: Fr–En Loss (on the left vertical axis) and test BLEU (on the right vertical axis) curves
of MEGAmBART (labeled as MEGAmBART in the graph’s legend) and mBART against epoch.
MEGAmBART progresses faster until mBART catches up.
Similar to the perplexity graph in §4.2.1, we plot the loss and BLEU curves of Fr–En, as
an example, in Figure 4.4. We find that as before, MEGAmBART starts off strong. However,
mBART, with its edge in terms of pre-training, catches up in terms of BLEU later in the training
process.
These results suggest that MEGA is very likely to be compute-efficient for pre-training at larger
scales as well. But we acknowledge that to paint a full picture and quantify by exactly how much, we
need to start from equally pre-trained checkpoints of both.
4.3 Chapter Conclusion
In this chapter, we assess MEGA—an extension of Transformer by Ma et al. (2023)—from a transfer learning perspective. We consider pre-trained MEGA models at two scales and compare them
against their Transformer counterparts. The small-scale pre-training allows us to create directly
comparable checkpoints of both architectures, including under ablation settings, with available
in-house computational resources. Our results show that MEGA is compute-efficient during pretraining and performs better during downstream fine-tuning.
For the large-scale pre-training, we pre-train a relatively under-trained but otherwise equivalent
replica of mBART (in terms of data and model size) using MEGA. Despite being under-trained
due to resource limitations, the results from downstream fine-tuning show that the MEGA-based
model remains competitive with mBART, highlighting its compute-efficiency again.
Chapter 5
Planning for Efficiency:
Meta-Learning for Parameter-Efficient Fine-Tuning
As we have seen in the previous chapters, the pre-training → fine-tuning paradigm is the dominant
practice in natural language processing, owing to state-of-the-art performance on a wide variety
of tasks (Qiu et al., 2020b). The impressive effectiveness of this approach does not come at a
low price. It requires iterative adjustment of anywhere between millions (Devlin et al., 2019)
to staggering billions of parameters (Chowdhery et al., 2024). With this many parameters, finetuning all parameters, as is common, becomes exceedingly computationally expensive: where
many models need to be fine-tuned, serving a separate copy of all a model’s parameters for each
instance is costly in terms of storage (an issue we brought up in Chapter 3 as well).
Recent works on parameter-efficient (PE) fine-tuning address this issue by introducing methods
that alternatively rely on only changing a tiny set of extra parameters (Houlsby et al., 2019; Li and
Liang, 2021; Hambardzumyan et al., 2021; Lester et al., 2021; Hu et al., 2022; He et al., 2022) or
a small fraction of the existing model’s parameters (Gheini et al., 2021; Ben Zaken et al., 2022).
These methods have been shown to be competitive with full fine-tuning despite modifying only as
little as 0.01% of all the parameters (Liu et al., 2022).
With this shift towards lightweight fine-tuning, we ask if the pre-training needs to be complemented in any way as well. Ought we further modify the pre-trained model, knowing that we
are going to opt for PE fine-tuning? Specifically, can we extend pre-training in a way that leads
Figure 5.1: Transfer learning for NLP pipeline; the shaded block is our contribution. Conventional
transfer practice (dashed arrows) does not differentiate between full fine-tuning and parameter-efficient fine-tuning in any way. This work proposes a meta-learning solution to further modify
and prime a pre-trained model’s parameters to specifically target parameter-efficient fine-tuning.
to parameter initializations that better suit PE fine-tuning than the initializations coming outright
from the pre-trained language model (PLM) and used by full fine-tuning?
In this chapter, we show that, in fact, we can use optimization-based meta-learning to further
modify the parameters from a PLM so that they are more beneficial for PE fine-tuning and result
in improved performance on the target task after transfer. We term this step, which sits between
conventional pre-training and fine-tuning, “priming” (see Figure 5.1). Specifically, as we describe
in §5.2.2, we tweak the popular meta-learning approach MAML (Finn et al., 2017) for priming
and crucially simulate the actual PE fine-tuning procedure in the inner loop of the algorithm. This
means that instead of including all the parameters in the inner loop gradient update, we only consider those that will be updated by the PE fine-tuning method. Thus, during the meta-gradient
update in the outer loop of the algorithm, this information about the ultimate fine-tuning approach
will be incorporated into the pre-trained values.
We choose cross-lingual transfer for named entity recognition (NER) as the testbed to show
the effectiveness of the priming stage. We show that priming a PLM boosts the performance of
cross-lingual PE fine-tuning for NER by up to 4.96 F1 points. We provide the details of our
lightweight fine-tuning setup in §5.3. Our ablation study in §5.4.1 reveals that simulating the fine-tuning procedure is indispensable to the observed improvements: it is not meta-learning in general,
but how we formulate the meta-learning setup that leads to observed gains.
Our contributions are:
• We propose a meta-learning-based mechanism termed "priming" to further update the parameters of a PLM in a way that improves the final PE transfer performance.
• We show the effectiveness of priming for cross-lingual transfer for NER as an exhibit.
• We justify and shed more light on the importance of the design elements in the priming
algorithm through an ablation analysis.
5.1 Meta-Learning Background
The meta-learning problem can be viewed as acquiring meta-parameters θ using meta-training
data Dmeta-train such that θ, when used for adaptation, improves performance on a new task with
training data Dtrain (Finn, 2019). Optimization-based meta-learning algorithms formulate adaptation as an optimization procedure during which task parameters φ are obtained by fine-tuning
meta-parameters θ:
φ = θ − α∇θL(θ, Dtrain)    (5.1)
where L is the task-dependent loss function.
Under this model of adaptation, meta-learning becomes a search for meta-parameters θ such
that when used as initialization, optimal φ may be found via fine-tuning over many tasks. During
meta-training, a “task” is modeled as a tuple of a training (support) set Dtr and a testing (query)
set Dts. Hence, Dmeta-train = {(Dtr1, Dts1), ··· , (Dtrn, Dtsn)}. Specifically, MAML (Finn et al., 2017), which we take inspiration from, moves towards solution θ⋆ for meta-parameters θ through a bilevel optimization procedure:

θ⋆ = argminθ Σ(Dtri, Dtsi) ∈ Dmeta-train L(θ − α∇θL(θ, Dtri), Dtsi)    (5.2)

where the adaptation term θ − α∇θL(θ, Dtri) constitutes the inner optimization loop and the minimization over Dmeta-train constitutes the outer optimization loop. The inner loop takes gradient steps with respect to θ using the support set of each task to obtain task parameters φi for each one. The outer loop optimization process then takes meta-gradient steps with respect to θ by evaluating post-inner-update performance on the query set of each task, modifying θ to be a better initialization.
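To make the bilevel structure of Equation 5.2 concrete, here is a minimal, self-contained sketch on synthetic linear-regression tasks; it uses the first-order approximation and toy sizes throughout, so it illustrates the loop structure rather than any particular published implementation.

# Minimal first-order MAML sketch of the bilevel loop in Equation 5.2 on toy
# linear-regression tasks. All tasks, sizes, and learning rates are toy choices.
import torch

torch.manual_seed(0)
theta = torch.zeros(2)                 # meta-parameters θ
alpha, beta, inner_steps = 0.1, 0.01, 5

def sample_task():
    """A task is a (support, query) pair drawn from a random linear function."""
    w = torch.randn(2)
    x_tr, x_ts = torch.randn(16, 2), torch.randn(16, 2)
    return (x_tr, x_tr @ w), (x_ts, x_ts @ w)

def loss(params, batch):
    x, y = batch
    return ((x @ params - y) ** 2).mean()

for _ in range(200):
    meta_grad = torch.zeros_like(theta)
    for _ in range(4):                                  # a batch of tasks
        support, query = sample_task()
        phi = theta.clone().requires_grad_(True)        # inner loop: θ -> φ_i
        for _ in range(inner_steps):
            (g,) = torch.autograd.grad(loss(phi, support), phi)
            phi = (phi - alpha * g).detach().requires_grad_(True)
        (g_query,) = torch.autograd.grad(loss(phi, query), phi)
        meta_grad += g_query                            # first-order meta-gradient
    theta = theta - beta * meta_grad / 4                # outer loop: meta-update θ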
5.2 Priming for PE Fine-Tuning through Meta-Learning
5.2.1 Problem Formulation
Provided with a PLM parameterized by parameters θp, and a dataset D for a target task, conventional fine-tuning practice adds a task-specific head parameterized by parameters θh (initialized
randomly) to the PLM and updates all parameters θp ∪θh. To avoid such expensive updates with
all parameters, PE fine-tuning designates an additional set of parameters (initialized randomly), θa, as the only parameters to be updated along with θh while keeping θp frozen. Note that θa is
deliberately added in such a way that |θh|+|θa| ≪ |θp|.
With this alteration, perhaps prior to fine-tuning, θp can first be further updated to reach θ⋆p, which, if transferred specifically under the parameter-efficient setting, results in better performance. We call this extra step between pre-training and fine-tuning, and the problem of finding such parameters, "priming". As an additional benefit, during priming we can also learn parameters θ⋆a to be used instead of random initializations θa. Priming does not take away the benefits of PE fine-tuning: ultimately, fine-tuning still relies on changing (and hence storing) the same number of parameters that would change without priming (|θh| + |θ⋆a|); it just starts from more suitable initializations θ⋆p and θ⋆a.

Algorithm 1 Priming for Lightweight Fine-Tuning (PE FT)
Require: model fθ with θ = θp ∪ θh ∪ θa: pre-trained params θp, task head params θh, and PE FT params θa
Require: Dmeta-train = {(Dtr1, Dts1), ··· , (Dtrn, Dtsn)}
Require: L = {L1, ..., Lt}: set of loss functions corresponding to all potential different tasks
Require: α, β: learning rates
Require: S: number of inner gradient steps
1: while not converged do
2:     Sample a batch of tasks T
3:     for all Ti ∈ T do
4:         θi = θ
5:         for s ← 1, ..., S do
6:             θia = θia − α∇θia LTi(fθi, DtrTi);    θih = θih − α∇θih LTi(fθi, DtrTi)
               [θip = θip − α∇θip LTi(fθi, DtrTi)]    ▷ In MAML, but not here, as we are simulating PE FT.
7:         end for
8:     end for
9:     Meta-gradient steps: θa = θa − β∇θa ΣTi LTi(fθi, DtsTi);    θp = θp − β∇θp ΣTi LTi(fθi, DtsTi)
10:    θh = θ1h
11: end while
12: return θp, θa
5.2.2 Priming Algorithm
We model priming as an optimization-based meta-learning problem. However, we refrain from
directly applying MAML to it. This is due to the key observation that under PE fine-tuning, the
adaptation procedure, as shown in Equation 5.1, has changed: only a subset of parameters are
updated during adaptation. Hence, it should be properly simulated in the inner loop in Equation 5.2.
So during priming, we only include θa and θh in the inner loop, mimicking PE fine-tuning and do
not include θp. θp and θa then receive the meta-gradients in the outer loop and change accordingly.
Algorithm 1 outlines the adaptations used for priming. The inner loop (lines 3-8) simulates
exactly how we are going to ultimately fine-tune in a lightweight fashion by only updating θa and
θh. The bracketed statement without a line number, which additionally updates pre-trained
parameters θp, would be executed by MAML. But we crucially omit it in our proposed priming
algorithm. At the end of the outer loop (line 9), we take meta-gradient steps with respect to the
parameters the initializations of which we are trying to enhance, θa and θp. As θh will be initialized
from scratch for each new task at the time of fine-tuning, we do not compute meta-gradients for it,
and simply assign it to one of the calculated sets in the inner loop, e.g., the first set corresponding
to the first task in the sampled batch of tasks (θh = θ1h on line 10).
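A toy, self-contained rendering of Algorithm 1 in code may help; it mirrors the generic MAML sketch in §5.1 except for the one change that matters, namely that the inner loop only moves the adapter and head parameters. The shapes, synthetic tasks, learning rates, and first-order approximation below are illustrative placeholders rather than the NER setup of §5.3.

# Toy, self-contained rendering of Algorithm 1 (first-order): the inner loop
# updates only adapter (θa) and head (θh) parameters, simulating PE fine-tuning,
# while the outer loop sends meta-gradients to the adapter and the pretrained
# body (θp). Shapes, data, and learning rates are illustrative placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, alpha, beta, S = 8, 0.03, 1e-3, 5
theta_p = torch.randn(d, d) * 0.1          # "pretrained" body
theta_a = torch.zeros(d, d)                # adapter parameters, meta-learned
theta_h = torch.zeros(2, d)                # task head, re-initialized per task

def forward(x, p, a, h):
    body = x @ p.T
    return (body + body @ a.T) @ h.T       # body + residual adapter, then head

def sample_task():
    w = torch.randn(2, d)
    x_tr, x_ts = torch.randn(32, d), torch.randn(32, d)
    return (x_tr, (x_tr @ w.T).argmax(-1)), (x_ts, (x_ts @ w.T).argmax(-1))

for _ in range(200):
    grad_a, grad_p, first_h = torch.zeros_like(theta_a), torch.zeros_like(theta_p), None
    for _ in range(4):                                       # a batch of tasks
        support, query = sample_task()
        p = theta_p.clone().requires_grad_(True)
        a = theta_a.clone().requires_grad_(True)
        h = torch.zeros_like(theta_h).requires_grad_(True)   # fresh head per task
        for _ in range(S):                                   # inner loop: only a and h move
            loss = F.cross_entropy(forward(support[0], p, a, h), support[1])
            ga, gh = torch.autograd.grad(loss, (a, h))
            a = (a - alpha * ga).detach().requires_grad_(True)
            h = (h - alpha * gh).detach().requires_grad_(True)
        query_loss = F.cross_entropy(forward(query[0], p, a, h), query[1])
        ga, gp = torch.autograd.grad(query_loss, (a, p))     # first-order meta-gradients
        grad_a, grad_p = grad_a + ga, grad_p + gp
        first_h = h.detach() if first_h is None else first_h
    theta_a = theta_a - beta * grad_a / 4                    # outer loop (line 9)
    theta_p = theta_p - beta * grad_p / 4
    theta_h = first_h                                        # line 10 of Algorithm 1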
5.3 Experimental Setup
While our proposed priming algorithm is model-agnostic, we need a concrete PE fine-tuning and
meta-training setup for empirical evaluation.
For lightweight fine-tuning, we choose adapters (Houlsby et al., 2019). In our experiments,
we add a single adapter after the last layer of the pre-trained Transformer (for an illustration, see
Figure 5.2). Our model then computes the logits for input as: h(g(f(x;θp);θa);θh), where f is the
pre-trained model, g is the single adapter layer at the top, and h is the task-specific head.
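A minimal PyTorch sketch of this composition, assuming the Transformers library, an mBERT checkpoint, and a Houlsby-style residual bottleneck adapter; the class name, the residual connection, and the label count are illustrative assumptions rather than the exact implementation.

# Sketch of h(g(f(x; θp); θa); θh): a pretrained encoder f, a single bottleneck
# adapter g after its last layer, and a token-classification head h.
import torch.nn as nn
from transformers import AutoModel

class AdapterTagger(nn.Module):
    def __init__(self, plm_name="bert-base-multilingual-cased",
                 bottleneck=64, num_labels=7):
        super().__init__()
        self.f = AutoModel.from_pretrained(plm_name)   # θp (frozen at fine-tuning time)
        hidden = self.f.config.hidden_size
        self.g = nn.Sequential(                        # θa: down-project, nonlinearity, up-project
            nn.Linear(hidden, bottleneck), nn.GELU(), nn.Linear(bottleneck, hidden)
        )
        self.h = nn.Linear(hidden, num_labels)         # θh: task-specific head

    def forward(self, input_ids, attention_mask):
        reps = self.f(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        reps = reps + self.g(reps)                     # residual bottleneck adapter after the last layer
        return self.h(reps)                            # per-token logits

The bottleneck dimension of 64 matches the choice reported in §5.3.2.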
As a testbed, we experiment with cross-lingual NER. For this case, we can design the priming
(meta-learning) and fine-tuning stages as follows:
Meta-Learning: Using one or more source languages, we construct the meta dataset and run priming. Per our problem formulation, θp and θa are shared among languages, but each source language
l has a separate head, parameterized by θhl.
Figure 5.2: Overall model architecture used in our experiments. θa comprises a single adapter
layer directly after the pretrained model.
Fine-Tuning: For each desired target language, we use the pre-trained and adapter parameter initializations acquired during meta-learning along with randomly initialized new head parameters as
the model’s starting point. We then fine-tune only the adapter parameters and the head parameters.
In our single adapter layer setup, this means only updating fewer than 0.4% of all the parameters.
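A figure like the quoted 0.4% can be sanity-checked by freezing θp and counting what remains trainable; the sketch below assumes the hypothetical AdapterTagger class from the previous snippet.

# Freeze the pretrained body (θp) and report the trainable fraction (θa ∪ θh).
# Assumes the hypothetical AdapterTagger class sketched above.
model = AdapterTagger()
for param in model.f.parameters():
    param.requires_grad = False          # θp stays frozen during PE fine-tuning

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"adapter + head: {trainable:,} / {total:,} parameters "
      f"({100 * trainable / total:.2f}% trainable)")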
5.3.1 Data Details
We use the WikiAnn multilingual NER dataset (Pan et al., 2017), which is available from the
Datasets Python library (Lhoest et al., 2021). The train, validation, and test splits, as provided by
Rahimi et al. (2019), range from 100 to 20k instances. In our experiments, we use the English and
Spanish sets as source languages, each with 20k instances during the priming stage. At fine-tuning,
we evaluate the quality of transfer for six target languages: Hindi (5k instances), Afrikaans (5k),
Azerbaijani (10k), Lithuanian (10k), Estonian (15k), and Dutch (20k).
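For reference, these splits can be pulled directly with the Datasets library; the dataset identifier and two-letter language configs below reflect common hub naming and are assumptions rather than the exact loading code used here.

# Load the WikiAnn (Pan et al., 2017) splits via the Datasets library.
# The "wikiann" identifier and two-letter config names are assumptions about
# the current hub naming, not necessarily the exact loading code used here.
from datasets import load_dataset

sources = ["en", "es"]                           # priming (meta-training) languages
targets = ["hi", "af", "az", "lt", "et", "nl"]   # fine-tuning target languages

data = {lang: load_dataset("wikiann", lang) for lang in sources + targets}
for lang, splits in data.items():
    print(lang, {name: len(split) for name, split in splits.items()})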
5.3.2 Implementation Details
We use mBERTBASE as the PLM. The meta-gradient in the outer loop relies on second-order gradients, which are expensive to compute. Thus, following Finn et al. (2017), we use a first-order
approximation in our implementation. For the inner loop, we take five steps of stochastic gradient
descent with a learning rate of 0.03. For the outer loop, we use the AdamW optimizer (Loshchilov
and Hutter, 2019) with a learning rate of 5e-5 and a linear learning rate scheduler.
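Put together with the adapter model sketched in §5.3, the two optimizers could be instantiated roughly as follows; the warmup and total step counts are placeholders not specified in the text, and in an actual MAML-style implementation the inner updates are typically applied functionally rather than through an optimizer object.

# Optimizer setup sketch matching the description above: plain SGD for the inner
# (adaptation) steps over θa ∪ θh, and AdamW with a linear schedule for the outer
# (meta) updates over θp ∪ θa. Warmup and total step counts are placeholders.
import torch
from transformers import get_linear_schedule_with_warmup

model = AdapterTagger()                                   # hypothetical class from §5.3
inner_params = list(model.g.parameters()) + list(model.h.parameters())   # θa ∪ θh
outer_params = list(model.f.parameters()) + list(model.g.parameters())   # θp ∪ θa

inner_optimizer = torch.optim.SGD(inner_params, lr=0.03)  # five inner steps per task
outer_optimizer = torch.optim.AdamW(outer_params, lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    outer_optimizer, num_warmup_steps=0, num_training_steps=10_000
)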
Our implementation is based on the Transformers (Wolf et al., 2020) and Lightning (Falcon
and The PyTorch Lightning team, 2019) libraries. For our pre-trained model, we use multilingual
BERT (mBERT, bert-base-multilingual-cased) (Devlin et al., 2019). For the adapter layer,
we set the bottleneck dimension as 64 in our experiments.
Our experiments (both priming and fine-tuning stages) are each run on one NVIDIA Quadro
RTX 8000 GPU, taking a maximum of twelve hours.
5.3.3 Baselines and Method Evaluation Settings
To assess the effectiveness of priming, we run two categories of experiments as listed in Table 5.1.
The setting numbers in the table match those used below (e.g., 1/Full FT ↔ 1/Full fine-tuning
baseline).
The first category includes no priming:
1/Full fine-tuning baseline corresponds to fine-tuning θp ∪θh, where θh is initialized randomly.
It provides an upper bound for PE fine-tuning, and notably is not parameter-efficient.
2/Head tuning (HT) baseline corresponds to freezing θp (treating the PLM as a feature extractor)
and fine-tuning θh, where θh is initialized randomly. It provides a lower bound for PE fine-tuning.
3/Adapter tuning (AT) baseline corresponds to fine-tuning θa ∪ θh. It is the baseline PE fine-tuning, and we investigate if priming improves upon it.
We also experiment with a second category, which incorporates priming:
4/Adapter tuning after priming as proposed corresponds to fine-tuning θa ∪ θh, where θp (frozen) and θa are acquired through priming, and θh is initialized randomly. Compared to the adapter tuning baseline (3), it measures how much priming can improve PE fine-tuning.
5/Adapter tuning after priming through fine-tuning is the same as setting 4, except that instead of priming as proposed, we simply fine-tune θp ∪ θa ∪ θh on the same data that would have constructed the meta dataset, before proceeding with PE fine-tuning just as in setting 4. This is to illustrate that mere exposure to data during priming is not enough, and that treating it as an optimization-based meta-learning problem is beneficial.

                              Hindi   Afrikaans   Azerbaijani   Lithuanian   Estonian   Dutch
Without Priming
1/Full FT (100%)              86.73   91.29       87.70         89.43        90.88      91.47
2/HT (3e-3%)                  72.71   79.11       74.24         78.34        81.23      78.90
3/AT (0.4%)                   77.76   84.10       81.08         83.00        85.13      83.89
With Priming
4/Meta Priming → AT           81.30   87.76       82.98         86.03        86.73      88.85
5/FT Priming → AT             80.34   87.70       81.74         85.84        86.43      88.61
6/MP [MAML Loop] → AT         80.15   86.10       81.54         85.66        86.06      88.15
7/MP [1 Inner Step] → AT      80.54   86.48       80.74         84.87        86.43      88.72

Table 5.1: Entity-level micro F1 under each of the fine-tuning settings for NER across six languages. Bold numbers indicate top-scoring methods in each category. Percentages next to each setting are the fraction of parameters that are updated (all AT settings have the same percentage). Priming as described in this work is most effective in improving PE fine-tuning performance and closing the gap with Full FT. All priming experiments are run twice (including the priming stage), and we report the average score over two runs.
Additionally, we have two ablation settings to study the effect of simulating PE fine-tuning in
the inner loop and the number of inner steps in the priming algorithm, which we will discuss in §5.4.1
and §5.4.2.
Figure 5.3: Comparison between different priming strategies for downstream full fine-tuning. In this case, as opposed to parameter-efficient fine-tuning, it is usually beneficial to use full fine-tuning in the inner loop.
5.4 Results and Analysis
Per Table 5.1, among all PE fine-tuning settings without any priming and those with priming,
4/Meta Priming → AT, which is the materialization of our priming algorithm, is the best-performing. In comparison with baseline PE fine-tuning (3/AT), our approach results in gains
of up to 4.96 points, indicating that priming with the knowledge of the ultimate transfer process is
substantially helpful. Additionally, the approach results in gains of up to 1.24 points compared to
fine-tuning-based priming (5/FT Priming → AT), signifying that it is not just a matter of exposure to more data, but a matter of appropriately using the extra exposure to simulate the eventual
fine-tuning approach.
5.4.1 Ablation 1: Substitute MAML Inner Loop
To highlight the importance of the change we introduce in MAML, we run the ablation setting
6/MP [MAML Loop] → AT (MP stands for Meta Priming). This is essentially 4/Meta Priming
→ AT where we update all parameters, and not only those involved in PE fine-tuning, in the inner
loop. It can be observed across the board that, in fact, simulating the downstream PE fine-tuning
setting is essential for superior performance.
We can also generalize the question at the core of this work: Can we expect gains by using
optimization-based meta-learning and simulating the eventual transfer method, whatever it might
be?
To determine the answer, we repeat the settings in this section (4/Meta Priming → AT and
6/MP [MAML Loop] → AT), but replace adapter tuning (AT) with full fine-tuning. As shown in
Figure 5.3, in most cases, matching downstream full fine-tuning with a parameter-dense MAML
inner loop (green bar in the middle in each series) is superior to mixing it with PE optimization in
the inner loop. We hypothesize that the discrepancy in the case of Lithuanian and Estonian is due
to the fact that full fine-tuning is powerful, and potentially more robust to heterogeneous priming
conditions.
Figure 5.4 provides an overview of what we recommend based on our experiments. It displays
all four possible combinations of priming strategy → ultimate fine-tuning strategy sequences. Each
block reports the average performance of downstream fine-tuning for NER across the six languages
in our experiments using the corresponding combination. The best-performing combination in each case falls on the main diagonal of the matrix. Therefore, for best performance, ideally, a priming stage
simulating the subsequent fine-tuning strategy should be included in the transfer pipeline.
Figure 5.4: Know Where You’re Going: Best performances appear on the diagonal, where a homogeneous priming strategy is devised before downstream fine-tuning. Directly comparable blocks
are linked using arrows.
5.4.2 Ablation 2: Number of Inner Steps
We find that under first-order MAML, the number of inner steps is critical for reaching a better initialization. The ablation setting 7/MP [1 Inner Step] → AT, which is identical to 4/Meta Priming → AT except with only one inner step, highlights this: 4/Meta Priming → AT, with five inner steps, always performs better.
Figure 5.5: First-order approximation of meta-gradient update. Illustration courtesy of Wild
(2020). After three steps of inner updates (θ →→→ φ), θ is updated by the gradient of the query
set loss evaluated at φ (the green dashed arrow).
To provide an intuition as to why that is, a visualization of how parameters receive updates
under first-order MAML by Wild (2020) is provided in Figure 5.5. Meta-parameters θ are updated
in the direction of the gradient of the query set loss calculated at the value reached at the end of
the inner loop. Hence, the fewer the number of inner steps, the more the updates will be similar to
those under regular fine-tuning (in the limit of zero inner steps, it will be equivalent to conventional
fine-tuning). So additional inner steps are beneficial.
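Written out in LaTeX notation (our own rendering of the same picture, not Wild's), the first-order update after S inner steps on a task's support set Dtr, followed by the meta-update on its query set Dts, is:

\begin{aligned}
\phi^{(0)} &= \theta, \qquad
\phi^{(s)} = \phi^{(s-1)} - \alpha\,\nabla_{\phi}\,\mathcal{L}\bigl(\phi^{(s-1)}, D^{\mathrm{tr}}\bigr), \quad s = 1,\dots,S, \\
\theta &\leftarrow \theta - \beta\,\nabla_{\phi}\,\mathcal{L}\bigl(\phi, D^{\mathrm{ts}}\bigr)\Big|_{\phi=\phi^{(S)}}.
\end{aligned}

With S = 0 the evaluation point is θ itself, so the meta-update is indistinguishable from a regular fine-tuning step on the query data, which is exactly the limit described above.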
5.5 Chapter Conclusion
We propose to add "priming" between the conventional pre-training and parameter-efficient fine-tuning to incorporate awareness of the transfer procedure in the pre-trained model. We model this
as optimization-based meta-learning, which integrates such knowledge by updating pre-trained
parameters under PE fine-tuning simulation. We show the effectiveness of priming in improving
baseline PE fine-tuning on cross-lingual transfer for NER. Further analysis reveals that our decisions to 1) model priming with meta-learning instead of simple fine-tuning and 2) simulate the
actual PE fine-tuning in the meta-learning inner loop instead of using MAML unadjusted both contribute to the effectiveness of priming.
Chapter 6
Related Work
Several works inspired the questions and ideas presented in this dissertation. In this chapter, we
present these works grouped by how they relate to ours.
6.1 Low-Resource NMT
Four methods dominate the struggle to improve low-resource NMT: polyglot training (Johnson et al., 2017), transfer learning (Zoph et al., 2016), meta-learning (Gu et al., 2018b), and back-translation (Sennrich et al., 2016a).
Close to the focus of our work in Chapter 2, transfer learning and enabling universal machine
translation, Nguyen and Chiang (2017) and Kocmi and Bojar (2018) separately extend Zoph et
al. (2016)’s approach. However, they both make their BPE/subword vocabulary using the union
of both parent and child corpora, which is memory-inefficient and leads to an undertrained child vocabulary.
Perhaps most in line with our efforts to apply multilinguality to achieve a transfer-ready model are the works by Neubig and Hu (2018) and Kim et al. (2019). Neubig and Hu (2018) train a
massively multilingual parent model on top of the TED corpus (Qi et al., 2018), and transfer it to
child languages under warm start (child language included at the time of training the parent)
and cold start (child language introduced only at the time of fine-tuning) scenarios. They also
propose similar language regularization, where at the time of fine-tuning, they append data from a
similar language to the child data to avoid over-fitting to the child language, as it is often a low-resource language. The main difference between our method and Neubig and Hu's is that while
we create our vocabulary once jointly on the polyglot parent corpus, they create their vocabulary
separately on each language in the polyglot corpus, and later further update it for each new child
language. This results in a huge vocabulary (above 300k), impacting training speed and memory
requirements. Our model uses a constant, fixed-sized vocabulary that accompanies our ready-to-go
parent model. One of the languages they experiment with is Azerbaijani. They report BLEU of 8.8
transferring their multilingual parent model to Azerbaijani trained on TED corpus with Azerbaijani
removed at the time of pre-training (cold-start scenario). For comparison, we also transferred our
2M polyglot model to Azerbaijani, which was able to score BLEU of 8.1 on the very same test
set. This again shows our approach is very competitive (8.1 vs. 8.8) despite the facts that (1) our
parent model has not seen and does not include Azerbaijani subwords (whereas those get added
at the time of fine-tuning in theirs), (2) our 2M parent model is pre-trained on less training data
(compared to ∼4M sentences in TED corpus), and (3) the data for our parent model is from a very
different domain than TED talks.
Kim et al. (2019) approach the vocabulary mismatch between the parent and child by learning a
cross-lingual mapping between their embedding spaces. This requires time- and resource-intensive
monolingual embedding training and cross-lingual projection training for each new child language.
Lakew et al. (2018) use a dynamic vocabulary (similar to Neubig and Hu (2018)), updated for
each new language pair. This increases the size of their model with each new child language and
requires a preprocessing phase for each new transfer session to update the vocabulary.
Gu et al. (2018a) extend multilingual NMT by suggesting a universal representation technique,
where all tokens in all languages are represented as a mixture of a basis for the universal token
space, which in their case is English embeddings. However, they first need large amounts of
monolingual data to train monolingual embeddings and also have a more time-consuming pipeline
in the face of a new language.
6.2 Model Restricting & Architecture Understanding
In terms of restrictions introduced in Chapter 3, our work is related to a group of recent works that
freeze certain modules while fine-tuning (Zoph et al., 2016; Artetxe et al., 2020; Lu et al., 2021).
Artetxe et al. (2020) conduct their study on an encoder-only architecture. They show that by
freezing a pre-trained English Transformer language model body and only lexically (embedding
layers) transferring it to another language, they can later plug those embeddings into a fine-tuned downstream English model, achieving zero-shot transfer on the downstream task in the other
language. Lu et al. (2021) also work with a decoder-only architecture. They show that by only
fine-tuning the input layer, output layer, positional embeddings, and layer norm parameters of an
otherwise frozen Transformer language model, they can match the performance of a model fully
trained on the downstream task in several modalities.
Several recent works consider the importance of self- and cross-attention heads in the Transformer architecture (Voita et al., 2019; Michel et al., 2019; You et al., 2020). The consensus
among these works is that cross-attention heads are relatively more important than self-attention
heads when it comes to introducing restrictions in terms of pruning and hard-coding.
6.3 Cross-Lingual Embeddings
While we were able to obtain cross-lingual embeddings through our transfer learning approach
without using any dictionaries or direct parallel corpora in Chapter 3, Wada et al. (2021) use a direct
parallel corpus and a shared LSTM model that does translation and reconstruction at the same
time to obtain aligned embeddings. Given tremendously large monolingual corpora for embedding
construction, cross-lingual embeddings can also be obtained by applying a linear transformation on
one language’s embedding space to map it to the second one in a way that minimizes the distance
between equivalents in the shared space according to a dictionary (Mikolov et al., 2013; Xing et al.,
2015; Artetxe et al., 2016). These works specifically targeted the parallel dictionary reconstruction
task, while we used the task incidentally, to intrinsically evaluate the parameters learned by our
methods.
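As a concrete sketch of the linear-mapping recipe just described: given matrices X and Y whose rows hold the embeddings of dictionary translation pairs, the orthogonality-constrained variant of this mapping (as in Xing et al., 2015) has a closed-form solution via SVD; the arrays below are random placeholders standing in for trained monolingual embeddings.

# Sketch of the dictionary-supervised linear mapping between embedding spaces:
# with an orthogonality constraint on W, minimizing ||XW - Y|| has the
# closed-form orthogonal Procrustes solution below. X and Y are random
# placeholders standing in for trained monolingual embeddings of dictionary pairs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))   # source-language embeddings (one row per pair)
Y = rng.standard_normal((5000, 300))   # embeddings of their dictionary translations

u, _, vt = np.linalg.svd(X.T @ Y)      # SVD of the cross-covariance
W = u @ vt                             # orthogonal map minimizing ||XW - Y||_F
mapped = X @ W                         # source embeddings projected into the target space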
6.4 MEGA Pre-Training
Concurrent with our efforts to build a pre-trained language model based on MEGA, Ma et al.
(2024) introduce MEGALODON. MEGALODON is a 7-billion parameter model (trained using 256
NVIDIA A100 GPUs) built upon MEGA with some novel technical extensions. At the 7-billion
scale, MEGALODON is shown to be more efficient than Llama 2 (Touvron et al., 2023). However,
as of the writing of this dissertation, a public checkpoint of MEGALODON is not available.
6.5 Parameter-Efficient Fine-Tuning
Parameter-efficient fine-tuning methods are a response to the ever-growing size of PLMs,
which makes full fine-tuning prohibitively expensive. Houlsby et al. (2019) reduce the number
of parameters to be updated by inserting adapter modules in every layer of the Transformer model.
Then during fine-tuning, they update the adapter parameters from scratch and fine-tune layer norm
parameters while keeping the rest of the parameters frozen. Since adapters are only inserted and
initialized at the time of fine-tuning, they are not able to reveal anything about the importance of
pre-trained modules. Our approach in Chapter 3, however, enables highlighting the crucial role of
the encoded translation knowledge by contrasting {src,tgt}+xattn and {src,tgt}+randxattn.
Bapna and Firat (2019) devise adapters for MT by inserting language pair-specific adapter parameters in the Transformer architecture. In the multilingual setting, they show that by fine-tuning
adapters in a shared pre-trained multilingual model, they can compensate for the performance drop
of high-resource languages incurred by shared training. Philip et al. (2020) replace language pair-specific adapters with monolingual adapters, which enables adapting under the zero-shot setting.
Another family of lightweight fine-tuning approaches (Li and Liang, 2021; Hambardzumyan
et al., 2021; Lester et al., 2021), inspired by prompt tuning (Brown et al., 2020), also relies on
updating a set of additional new parameters from scratch for each downstream task. Such sets of parameters amount to a very small fraction of the total parameters in the pre-trained model. Another method that solely updates a new set of parameters is LoRA (Hu et al., 2022). By contrast, our
approach in Chapter 3 updates a subset of the model’s own parameters instead of adding new ones.
From this perspective, it is in the same category as BitFit (Ben Zaken et al., 2022).
6.6 Meta-Learning for Parameter-Efficient Fine-Tuning
Despite the rich literature on different parameter-efficient transfer approaches, to the best of our
knowledge, no existing study investigates whether, in response, pre-training practices need to be updated in any way. In Chapter 5, we attempt to fill that void. He et al. (2022) provide a unified framework within which several flavors of lightweight fine-tuning can be interpreted. Therefore, while we study an adapter-based approach in this work, we expect priming to be fundamentally applicable and useful to other flavors too.
We are inspired by the body of work that takes advantage of optimization-based meta-learning
to come by initializations that are better suited for a specific objective. Xia et al. (2021) use meta-learning to learn transformations that map representations of a high-resource language into a form that is more beneficial for effective transfer to low-resource languages.
for zero-shot and few-shot cross-lingual transfer on Question Answering and Natural Language
Inference. Javed and White (2019) use a meta-objective to optimize representations for continual
learning.
Perhaps closest in spirit to our objective in Chapter 5 and trying to bring these two lines of
work together, Min et al. (2022) offer a meta-learning-like solution to “learn to learn in context”:
using our terminology, while we address priming for PE fine-tuning, they address priming for in-context learning (Brown et al., 2020). In-context learning is a few-shot learning technique with
no additional training required, where an LM is used to label new instances after conditioning on
only a few supervised examples. Min et al. (2022) propose to better prepare the model for such an
inference process on a new unseen task by including a tuning stage where the model is trained to
do the same on simulated input sequences from a set of available tasks. The extra training stage
that they include can be seen as equivalent to our priming stage, where in both cases, the goal is to
prepare the model for what is subsequently coming.
Chapter 7
Takeaways and Future Directions
ANTONIO: How are we going to get back Jeff’s $10,000?
J. CHEEVER LOOPHOLE: It’s very easy. Offer a reward of $15,000.
Chico Marx and Groucho Marx, At the Circus, 1939
In this dissertation, we presented four pieces of work focused on efficient transfer learning pipelines
for natural language processing from data, parameter count, and computation perspectives, comprising the following contributions:
• We proposed a universal pre-training framework that eliminates the need for disposable parent pre-training per child task. With machine translation as our central testbed, we comprehensively demonstrated the effectiveness of our method and dissected the benefits of a fixed
model vocabulary as the key part of our approach.
• We introduced exclusive cross-attention fine-tuning as a parameter-efficient fine-tuning strategy for machine translation. Not only did we show the necessity and sufficiency of cross-attention for machine translation transfer, but we also revealed the underlying process that
makes it effective, i.e., aligning parent and child embedding spaces.
• We investigated the benefits of the exponential moving average component for transfer learning and, through experiments across different scales, reported evidence that it leads to more
compute-efficient training.
• Finally, we addressed how the downstream fine-tuning decision as to whether to update all parameters or only some can be taken into account in advance and, by means of meta-learning, guide training even far ahead of fine-tuning, leading to improved final performance.
Based on our experimental findings throughout this dissertation, we provide a few final words
on what we think are some appropriate future directions and what we find is a constructive way to
think about inductive biases as one of many tools for developing machine learning systems.
7.1 Inductive Biases Elsewhere
This dissertation was solely focused on systems for natural language understanding and generation.
However, it is important to remember that while Transformers—the models at the center of it all—
were first developed and popularized in the NLP community, they are, first and foremost, sequence
models. They are routinely applied to problems beyond language and modalities beyond text.
Therefore, it is important to distinguish between inductive biases that are human language-specific and those that are not. For instance, romanization for orthography unification in Chapter 2
is unique to language. In contrast, the meta-learning-based solution presented in Chapter 5 is a
general framework for incorporating familiarity with downstream settings ahead of time.
In this spirit, a thrilling direction is to study inductive biases useful in other domains (and
perhaps propose new ones to contribute to them). For instance, for AlphaFold—a protein structure prediction model—Jumper et al. (2021) develop invariant point attention. This specialized
attention operation incorporates the geometric inductive bias that protein structures that can be
transformed into one another by rotation and translation are equivalent.
The overarching goal of investigating inductive biases that are useful in other fields and modalities is to detect common patterns across them and incorporate those patterns into novel architectures that unify them under one model (much like how the attention mechanism is now a ubiquitous module). This streamlines applying such an architecture to new problems in the future, where such inductive biases are also very likely to be helpful.
7.2 Inductive Biases or Scaling?
Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) refer to the empirical results that show
the performance (loss) of Transformer language models improves steadily when three factors are
scaled up in tandem: number of (non-embedding) parameters, dataset size, and amount of compute.
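To make "improves steadily" concrete, Kaplan et al. (2020) report that, when the other two factors are not the bottleneck, the loss is well described by power laws of the schematic form below (written in LaTeX notation; the fitted constants and exponents are not reproduced here):

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C},

where N, D, and C denote the parameter count, dataset size, and compute, and the N_c, D_c, C_c and α terms are empirically fitted constants.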
Scaling laws are indeed, for lack of a better word, fortuitous given “The Bitter Lesson”1—we
have a combination of architectures and optimization methods that continue to improve as more
computation resources become available. The Bitter Lesson, in contrast, warns against the temptation of trying to build in "how we [humans] think we think". Such shortcuts are very likely to turn into
constraints that cause the methods to plateau at some point, no matter how abundant the compute.
The distinction between these two approaches is visualized in Figure 7.1. System 1 represents
a system with more structure (perhaps human-imposed) and System 2 represents a more general-purpose system. System 1 performs better at the lower end of the x-axis, but its progress is hindered
later on where System 2 continues to improve.
[Figure 7.1 plots performance (y-axis) against resources: data, parameters, and compute (x-axis), with curves for System 1, System 2, and an ideal trajectory.]
Figure 7.1: A schematic of an ever-repeating pattern: System 1 with more incorporated knowledge
only performs well over a limited period of time, whereas System 2 performance scales consistently. The gray curve follows the ideal trajectory.
The graph in Figure 7.1, while overly simplified, resembles those in the early days of migrating
from statistical MT to neural MT (Koehn and Knowles, 2017), or the one in Figure 4.4 in Chapter 4,
for instance.
1http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Given the inevitability of such patterns, it is reasonable to take a step back and ask, “Is it
an unwise investment to work on inductive biases? Are they shackling structures? Will they at
some point lead to settling for worse performance?” We can only hope that at this point in the
dissertation, we have communicated effectively that our answer is No on all counts. However, we
share some additional thoughts.
We argue that inductive biases should not be thought of as rigid and untouchable additions to intelligent systems. They ought to be considered as instruments that can be adopted and disposed
of as necessary in order to create solutions that move along the ideal trajectory in Figure 7.1.
The question of when to remove inductive biases can be discussed from another perspective as
well. Back in Chapter 1, we deemed inductive biases to be tools for reducing the search space.
Under limited resources, this is critical for finding a solution at all. However, it is important to note
that perhaps not all inductive biases are created equal. While some prune only invalid solutions
in the hypothesis space, some might prune valid ones as well. With constrained resources, this
problem might not be apparent: the valid ones might have been among the options we would
not have been able to explore with available resources anyway. However, when we have enough
computational budget to explore more, the fact that some valid options are missing can impede our
progress. Take the two-point problem from Chapter 1, for instance. The straight-line solution there, guided by our inductive bias toward the shortest path, is indeed the optimal solution as well. However, if we were to follow the straight-line bias, we would still be fine for the two-point problem
but would also pick the same solution for the 2D map of the globe. In this case, our inductive bias would have failed us, pushing us away from the geodesic curve, which is the optimal solution:
[Figure: a world map contrasting the straight-line distance between two points with the geodesic distance (e.g., a flight path).]
Perhaps the most promising direction to pursue is developing and detecting the inductive biases
that only eliminate invalid solutions. These are more likely to be scalable as opposed to saturable.
They can also lead to developments in new architectures or novel optimization algorithms that
bring about better scaling ratios in the scaling laws. Being able to find the inductive biases that should stay makes continuing research efforts along the lines of those presented in this dissertation worthwhile.
References
Aji, Alham Fikri, Nikolay Bogoychev, Kenneth Heafield, and Rico Sennrich (2020). “In Neural
Machine Translation, What Does Transfer Learning Transfer?” In: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce
Chai, Natalie Schluter, and Joel Tetreault. Online: Association for Computational Linguistics,
pp. 7701–7710. URL: https://aclanthology.org/2020.acl-main.688/.
Artetxe, Mikel, Gorka Labaka, and Eneko Agirre (2017). “Learning bilingual word embeddings
with (almost) no bilingual data”. In: Proceedings of the 55th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers). Ed. by Regina Barzilay and Min-Yen Kan. Vancouver, Canada: Association for Computational Linguistics, pp. 451–462. URL:
https://aclanthology.org/P17-1042/.
Artetxe, Mikel, Gorka Labaka, and Eneko Agirre (2016). “Learning principled bilingual mappings
of word embeddings while preserving monolingual invariance”. In: Proceedings of the 2016
Conference on Empirical Methods in Natural Language Processing. Ed. by Jian Su, Kevin
Duh, and Xavier Carreras. Austin, Texas: Association for Computational Linguistics, pp. 2289–
2294. URL: https://aclanthology.org/D16-1250/.
Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama (2020). “On the Cross-lingual Transferability of Monolingual Representations”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce Chai, Natalie Schluter,
and Joel Tetreault. Online: Association for Computational Linguistics, pp. 4623–4637. URL:
https://aclanthology.org/2020.acl-main.421/.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2015). “Neural Machine Translation by
Jointly Learning to Align and Translate”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Ed. by Yoshua Bengio and Yann LeCun. URL: http://arxiv.org/abs/1409.0473.
Bapna, Ankur and Orhan Firat (2019). “Simple, Scalable Adaptation for Neural Machine Translation”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP). Ed. by Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan. Hong Kong, China:
Association for Computational Linguistics, pp. 1538–1548. URL: https://aclanthology.
org/D19-1165/.
Beltagy, Iz, Matthew E. Peters, and Arman Cohan (2020). “Longformer: The Long-Document
Transformer”. In: CoRR abs/2004.05150. URL: https://arxiv.org/abs/2004.05150.
Ben Zaken, Elad, Yoav Goldberg, and Shauli Ravfogel (2022). “BitFit: Simple Parameter-efficient
Fine-tuning for Transformer-based Masked Language-models”. In: Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Ed.
by Smaranda Muresan, Preslav Nakov, and Aline Villavicencio. Dublin, Ireland: Association
for Computational Linguistics, pp. 1–9. URL: https://aclanthology.org/2022.acl-short.1/.
Bengio, Yoshua (2012). “Deep Learning of Representations for Unsupervised and Transfer Learning”. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. Ed. by
Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor, and Daniel Silver. Vol. 27.
Proceedings of Machine Learning Research. Bellevue, Washington, USA: PMLR, pp. 17–36.
URL: https://proceedings.mlr.press/v27/bengio12a.html.
Bogoychev, Nikolay (2021). “Not all parameters are born equal: Attention is mostly what you
need”. In: Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting
Neural Networks for NLP. Ed. by Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux,
Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, and Hassan Sajjad. Punta Cana, Dominican
Republic: Association for Computational Linguistics, pp. 363–374. URL: https://aclanthology.
org/2021.blackboxnlp-1.28/.
Brown, Tom et al. (2020). “Language Models are Few-Shot Learners”. In: Advances in Neural
Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan,
and H. Lin. Vol. 33. Curran Associates, Inc., pp. 1877–1901. URL: https://proceedings.
neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Josh Schroeder (2009). “Findings of
the 2009 Workshop on Statistical Machine Translation”. In: Proceedings of the Fourth Workshop on Statistical Machine Translation. Ed. by Chris Callison-Burch, Philipp Koehn, Christof
Monz, and Josh Schroeder. Athens, Greece: Association for Computational Linguistics, pp. 1–
28. URL: https://aclanthology.org/W09-0401/.
Castillo, Carlos (2016). Big crisis data: social media in disasters and time-critical situations. Cambridge University Press.
Chomsky, Noam (1980). “Rules and representations”. In: THE BEHAVIORAL AND BRAIN SCIENCES 3, pp. 1–61.
Choromanski, Krzysztof Marcin et al. (2021). “Rethinking Attention with Performers”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May
3-7, 2021. OpenReview.net. URL: https://openreview.net/forum?id=Ua6zuk0WRH.
Chowdhery, Aakanksha et al. (2024). “PaLM: scaling language modeling with pathways”. In: J.
Mach. Learn. Res. 24.1.
Christianson, Caitlin, Jason Duncan, and Boyan Onyshkevych (2018). “Overview of the DARPA
LORELEI Program”. In: Machine Translation 32.1, pp. 3–9. URL: https://doi.org/10.
1007/s10590-017-9212-4.
Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov (2020).
“Unsupervised Cross-lingual Representation Learning at Scale”. In: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics. Ed. by Dan Jurafsky, Joyce
Chai, Natalie Schluter, and Joel Tetreault. Online: Association for Computational Linguistics,
pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/.
Dai, Andrew M and Quoc V Le (2015). “Semi-supervised Sequence Learning”. In: Advances in
Neural Information Processing Systems. Ed. by C. Cortes, N. Lawrence, D. Lee, M. Sugiyama,
and R. Garnett. Vol. 28. Curran Associates, Inc. URL: https://proceedings.neurips.cc/
paper_files/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). “BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Ed. by Jill Burstein,
Christy Doran, and Thamar Solorio. Minneapolis, Minnesota: Association for Computational
Linguistics, pp. 4171–4186. URL: https://aclanthology.org/N19-1423/.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby (2021). “An Image is Worth 16x16 Words: Transformers for Image
Recognition at Scale”. In: International Conference on Learning Representations. URL: https:
//openreview.net/forum?id=YicbFdNTTy.
Falcon, William and The PyTorch Lightning team (2019). PyTorch Lightning. Version 1.4. URL:
https://github.com/PyTorchLightning/pytorch-lightning.
Finn, Chelsea (2019). Meta-Learning Recipe, Black-Box Adaptation, Optimization-Based Approaches.
Last accessed 14 May 2022. URL: http://cs330.stanford.edu/fall2019/slides/cs330_
lecture3.pdf (visited on 05/14/2022).
Finn, Chelsea, Pieter Abbeel, and Sergey Levine (2017). “Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine
Learning Research. PMLR, pp. 1126–1135. URL: https://proceedings.mlr.press/v70/
finn17a.html.
Gheini, Mozhdeh, Xuezhe Ma, and Jonathan May (2023). “Know Where You‘re Going: MetaLearning for Parameter-Efficient Fine-Tuning”. In: Findings of the Association for Computational Linguistics: ACL 2023. Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki.
Toronto, Canada: Association for Computational Linguistics, pp. 11602–11612. URL: https:
//aclanthology.org/2023.findings-acl.737/.
Gheini, Mozhdeh and Jonathan May (2019). “A Universal Parent Model for Low-Resource Neural
Machine Translation Transfer”. In: CoRR abs/1909.06516. URL: http://arxiv.org/abs/
1909.06516.
Gheini, Mozhdeh, Xiang Ren, and Jonathan May (2021). “Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Ed. by Marie-Francine Moens,
Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 1754–1765. URL: https://aclanthology.org/2021.emnlp-main.132/.
Goodfellow, Ian J., Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio (2014). “An
empirical investigation of catastrophic forgetting in gradient-based neural networks". In: Proceedings of the International Conference on Learning Representations (ICLR).
Gowda, Thamme and Jonathan May (2020). “Finding the Optimal Vocabulary Size for Neural
Machine Translation”. In: Findings of the Association for Computational Linguistics: EMNLP
2020. Ed. by Trevor Cohn, Yulan He, and Yang Liu. Online: Association for Computational
Linguistics, pp. 3955–3964. URL: https://aclanthology.org/2020.findings-emnlp.352/.
Groeneveld, Dirk et al. (2024). “OLMo: Accelerating the Science of Language Models”. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers). Ed. by Lun-Wei Ku, Andre Martins, and Vivek Srikumar. Bangkok, Thailand:
Association for Computational Linguistics, pp. 15789–15809. URL: https://aclanthology.
org/2024.acl-long.841/.
Gu, Albert and Tri Dao (2023). “Mamba: Linear-Time Sequence Modeling with Selective State
Spaces”. In: CoRR abs/2312.00752. URL: https://doi.org/10.48550/arXiv.2312.00752.
Gu, Albert, Karan Goel, and Christopher Ré (2022). “Efficiently Modeling Long Sequences with
Structured State Spaces”. In: The International Conference on Learning Representations (ICLR).
Gu, Jiatao, Hany Hassan, Jacob Devlin, and Victor O.K. Li (2018a). “Universal Neural Machine
Translation for Extremely Low Resource Languages”. In: Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Ed. by Marilyn Walker, Heng Ji, and Amanda
Stent. New Orleans, Louisiana: Association for Computational Linguistics, pp. 344–354. URL:
https://aclanthology.org/N18-1032/.
Gu, Jiatao, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho (2018b). “Meta-Learning
for Low-Resource Neural Machine Translation”. In: Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing. Ed. by Ellen Riloff, David Chiang, Julia
Hockenmaier, and Jun’ichi Tsujii. Brussels, Belgium: Association for Computational Linguistics, pp. 3622–3631. URL: https://aclanthology.org/D18-1398/.
Gu, Shuhao and Yang Feng (2020). “Investigating Catastrophic Forgetting During Continual Training for Neural Machine Translation”. In: Proceedings of the 28th International Conference on
Computational Linguistics. Ed. by Donia Scott, Nuria Bel, and Chengqing Zong. Barcelona,
Spain (Online): International Committee on Computational Linguistics, pp. 4315–4326. URL:
https://aclanthology.org/2020.coling-main.381/.
Haddow, Barry and Faheem Kirefu (2020). “PMIndia - A Collection of Parallel Corpora of Languages of India”. In: CoRR abs/2001.09907. URL: https://arxiv.org/abs/2001.09907.
Hajlaoui, Najeh, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Daniel Varga (2014).
“DCEP-Digital Corpus of the European Parliament.” In: LREC, pp. 3164–3171.
Hambardzumyan, Karen, Hrant Khachatrian, and Jonathan May (2021). “WARP: Word-level Adversarial ReProgramming”. In: Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Ed. by Chengqing Zong, Fei Xia, Wenjie Li,
and Roberto Navigli. Online: Association for Computational Linguistics, pp. 4921–4933. URL:
https://aclanthology.org/2021.acl-long.381/.
He, Junxian, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig (2022).
“Towards a Unified View of Parameter-Efficient Transfer Learning”. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=0RDcd5Axok.
Hermjakob, Ulf, Jonathan May, and Kevin Knight (2018). “Out-of-the-box Universal Romanization Tool uroman”. In: Proceedings of ACL 2018, System Demonstrations. Ed. by Fei Liu and
Thamar Solorio. Melbourne, Australia: Association for Computational Linguistics, pp. 13–18.
URL: https://aclanthology.org/P18-4003/.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. In: Neural Computation 9.8, pp. 1735–1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735.
Hoffmann, Jordan et al. (2022). “Training Compute-Optimal Large Language Models”. In: CoRR
abs/2203.15556. URL: https://doi.org/10.48550/arXiv.2203.15556.
Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe,
Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly (2019). “Parameter-efficient transfer
learning for NLP”. In: International Conference on Machine Learning. PMLR, pp. 2790–2799.
Hu, Edward J, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang,
and Weizhu Chen (2022). “LoRA: Low-Rank Adaptation of Large Language Models”. In:
International Conference on Learning Representations. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
Javed, Khurram and Martha White (2019). “Meta-Learning Representations for Continual Learning”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc. URL:
https://proceedings.neurips.cc/paper/2019/file/f4dd765c12f2ef67f98f3558c282a9cd-Paper.pdf.
Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil
Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey
Dean (2017). “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot
Translation”. In: Transactions of the Association for Computational Linguistics 5, pp. 339–351.
URL: https://www.aclweb.org/anthology/Q17-1024.
Jumper, John et al. (2021). “Highly accurate protein structure prediction with AlphaFold”. In:
Nature 596.7873, pp. 583–589.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child,
Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei (2020). “Scaling Laws for Neural
Language Models”. In: CoRR abs/2001.08361. URL: https://arxiv.org/abs/2001.08361.
Kim, Yunsu, Yingbo Gao, and Hermann Ney (2019). “Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies”. In: Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics. Ed. by Anna Korhonen,
David Traum, and Lluís Màrquez. Florence, Italy: Association for Computational Linguistics,
pp. 1246–1257. URL: https://aclanthology.org/P19-1120/.
Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya (2020). “Reformer: The Efficient Transformer”. In: 8th International Conference on Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. URL: https://openreview.net/forum?id=rkgNKkHtvB.
Kocmi, Tom and Ondřej Bojar (2018). "Trivial Transfer Learning for Low-Resource Neural Machine Translation". In: Proceedings of the Third Conference on Machine Translation: Research Papers. Ed. by Ondřej Bojar et al. Brussels, Belgium: Association for Computational Linguistics, pp. 244–252. URL: https://aclanthology.org/W18-6325/.
Koehn, Philipp (2004). “Statistical Significance Tests for Machine Translation Evaluation”. In:
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
Ed. by Dekang Lin and Dekai Wu. Barcelona, Spain: Association for Computational Linguistics, pp. 388–395. URL: https://aclanthology.org/W04-3250/.
Koehn, Philipp and Rebecca Knowles (2017). “Six Challenges for Neural Machine Translation”.
In: Proceedings of the First Workshop on Neural Machine Translation. Ed. by Thang Luong,
Alexandra Birch, Graham Neubig, and Andrew Finch. Vancouver: Association for Computational Linguistics, pp. 28–39. URL: https://aclanthology.org/W17-3204/.
Kudo, Taku and John Richardson (2018). “SentencePiece: A simple and language independent
subword tokenizer and detokenizer for Neural Text Processing”. In: Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing: System Demonstrations.
Ed. by Eduardo Blanco and Wei Lu. Brussels, Belgium: Association for Computational Linguistics, pp. 66–71. URL: https://aclanthology.org/D18-2012/.
Kunchukuttan, Anoop, Pratik Mehta, and Pushpak Bhattacharyya (2017). "The IIT Bombay English-Hindi parallel corpus". In: Proc. LREC, pp. 3473–3476.
Lakew, Surafel M, Aliia Erofeeva, Matteo Negri, Marcello Federico, and Marco Turchi (2018).
“Transfer Learning in Multilingual Neural Machine Translation with Dynamic Vocabulary”.
In: International Workshop on Spoken Language Translation.
Lample, Guillaume, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou
(2018). “Word translation without parallel data”. In: International Conference on Learning
Representations. OpenReview.net. URL: https://openreview.net/forum?id=H196sainb.
Lester, Brian, Rami Al-Rfou, and Noah Constant (2021). “The Power of Scale for ParameterEfficient Prompt Tuning”. In: Proceedings of the 2021 Conference on Empirical Methods in
Natural Language Processing. Ed. by Marie-Francine Moens, Xuanjing Huang, Lucia Specia,
and Scott Wen-tau Yih. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 3045–3059. URL: https://aclanthology.org/2021.emnlp-main.243/.
Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer (2020). “BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Ed. by Dan
Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault. Online: Association for Computational Linguistics, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703/.
Lewis, Will, Robert Munro, and Stephan Vogel (2011). “Crisis MT: Developing A Cookbook for
MT in Crisis Situations”. In: Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics. URL: https://www.microsoft.com/enus/research/publication/crisis-mt-developing-a-cookbook-for-mt-in-crisissituations/.
Lhoest, Quentin et al. (2021). “Datasets: A Community Library for Natural Language Processing”.
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Ed. by Heike Adel and Shuming Shi. Online and Punta Cana,
Dominican Republic: Association for Computational Linguistics, pp. 175–184. URL: https:
//aclanthology.org/2021.emnlp-demo.21/.
Li, Xiang Lisa and Percy Liang (2021). “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers). Ed. by Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli. Online:
Association for Computational Linguistics, pp. 4582–4597. URL: https://aclanthology.
org/2021.acl-long.353/.
Liu, Chunxi, Qiaochu Zhang, Xiaohui Zhang, Kritika Singh, Yatharth Saraf, and Geoffrey Zweig
(2020). “Multilingual Graphemic Hybrid ASR with Massive Data Augmentation”. eng. In: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL).
Ed. by Dorothee Beermann, Laurent Besacier, Sakriani Sakti, and Claudia Soria. Marseille,
France: European Language Resources association, pp. 46–52. URL: https://aclanthology.
org/2020.sltu-1.7/.
Liu, Haokun, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and
Colin A Raffel (2022). “Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than
In-Context Learning”. In: Advances in Neural Information Processing Systems. Ed. by S.
Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh. Vol. 35. Curran Associates, Inc., pp. 1950–1965. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf.
Loshchilov, Ilya and Frank Hutter (2019). “Decoupled Weight Decay Regularization”. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=
Bkg6RiCqY7.
Lu, Kevin, Aditya Grover, Pieter Abbeel, and Igor Mordatch (2021). “Pretrained Transformers as
Universal Computation Engines”. In: CoRR abs/2103.05247. URL: https://arxiv.org/abs/
2103.05247.
Luccioni, Alexandra Sasha, Sylvain Viguier, and Anne-Laure Ligozat (2024). “Estimating the carbon footprint of BLOOM, a 176B parameter language model”. In: J. Mach. Learn. Res. 24.1.
Luong, Thang, Hieu Pham, and Christopher D. Manning (2015). “Effective Approaches to Attentionbased Neural Machine Translation”. In: Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing. Ed. by Lluís Màrquez, Chris Callison-Burch, and
Jian Su. Lisbon, Portugal: Association for Computational Linguistics, pp. 1412–1421. URL:
https://aclanthology.org/D15-1166/.
Ma, Xuezhe, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May,
Luke Zettlemoyer, Omer Levy, and Chunting Zhou (2024). “Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length”. In: CoRR abs/2404.08801. URL: https:
//doi.org/10.48550/arXiv.2404.08801.
Ma, Xuezhe, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan
May, and Luke Zettlemoyer (2023). “Mega: Moving Average Equipped Gated Attention”.
In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023. OpenReview.net. URL: https://openreview.net/forum?id=qNLe3iq2El.
Michel, Paul, Omer Levy, and Graham Neubig (2019). “Are Sixteen Heads Really Better than
One?” In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc. URL:
https://proceedings.neurips.cc/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf.
Mikolov, Tomás, Quoc V. Le, and Ilya Sutskever (2013). “Exploiting Similarities among Languages for Machine Translation”. In: CoRR abs/1309.4168. URL: http://arxiv.org/abs/
1309.4168.
Min, Sewon, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi (2022). “MetaICL: Learning to Learn In Context”. In: Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies.
Ed. by Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz. Seattle, United States: Association for Computational Linguistics, pp. 2791–2809. URL: https:
//aclanthology.org/2022.naacl-main.201/.
Mitchell, Tom M (1980). “The need for biases in learning generalizations”. In: Rutgers Computer
Science Department Technical Report CBM-TR-117. URL: https://www.cs.cmu.edu/~tom/
pubs/NeedForBias_1980.pdf.
Neubig, Graham and Junjie Hu (2018). “Rapid Adaptation of Neural Machine Translation to New
Languages”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Ed. by Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii.
Brussels, Belgium: Association for Computational Linguistics, pp. 875–880. URL: https://aclanthology.org/D18-1103/.
Nguyen, Toan Q. and David Chiang (2017). “Transfer Learning across Low-Resource, Related
Languages for Neural Machine Translation”. In: Proceedings of the Eighth International Joint
Conference on Natural Language Processing (Volume 2: Short Papers). Ed. by Greg Kondrak and Taro Watanabe. Taipei, Taiwan: Asian Federation of Natural Language Processing,
pp. 296–301. URL: https://aclanthology.org/I17-2050/.
NLLB Team et al. (2022). “No Language Left Behind: Scaling Human-Centered Machine Translation”.
Nooralahzadeh, Farhad, Giannis Bekoulis, Johannes Bjerva, and Isabelle Augenstein (2020). “Zero-Shot Cross-Lingual Transfer with Meta Learning”. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Ed. by Bonnie Webber, Trevor
Cohn, Yulan He, and Yang Liu. Online: Association for Computational Linguistics, pp. 4547–
4562. URL: https://aclanthology.org/2020.emnlp-main.368/.
OpenAI (2023). “GPT-4 Technical Report”. In: CoRR abs/2303.08774. URL: https://doi.org/
10.48550/arXiv.2303.08774.
Ott, Myle, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier,
and Michael Auli (2019). “fairseq: A Fast, Extensible Toolkit for Sequence Modeling”. In:
Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics (Demonstrations). Ed. by Waleed Ammar, Annie Louis, and Nasrin
Mostafazadeh. Minneapolis, Minnesota: Association for Computational Linguistics, pp. 48–
53. URL: https://aclanthology.org/N19-4009/.
Pan, Sinno Jialin and Qiang Yang (2010). “A Survey on Transfer Learning”. In: IEEE Transactions
on Knowledge and Data Engineering 22.10, pp. 1345–1359.
Pan, Xiaoman, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji (2017).
“Cross-lingual Name Tagging and Linking for 282 Languages”. In: Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Ed. by Regina Barzilay and Min-Yen Kan. Vancouver, Canada: Association for Computational
Linguistics, pp. 1946–1958. URL: https://aclanthology.org/P17-1178/.
Pearl, Lisa (2022). “Poverty of the Stimulus Without Tears”. In: Language Learning and Development 18.4, pp. 415–454. URL: https://doi.org/10.1080/15475441.2021.1981908.
Philip, Jerin, Alexandre Berard, Matthias Gallé, and Laurent Besacier (2020). “Monolingual Adapters
for Zero-Shot Neural Machine Translation”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Ed. by Bonnie Webber, Trevor Cohn,
Yulan He, and Yang Liu. Online: Association for Computational Linguistics, pp. 4465–4470.
URL: https://aclanthology.org/2020.emnlp-main.361/.
Post, Matt (2018). “A Call for Clarity in Reporting BLEU Scores”. In: Proceedings of the Third
Conference on Machine Translation: Research Papers. Ed. by Ondřej Bojar et al. Brussels, Belgium: Association for Computational Linguistics, pp. 186–191. URL: https://aclanthology.
org/W18-6319/.
Press, Ofir, Noah A. Smith, and Mike Lewis (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation”. In: The Tenth International Conference on
Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net. URL:
https://openreview.net/forum?id=R8sQPpGCv0.
Qi, Ye, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig (2018).
“When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?”
In: Proceedings of the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Ed.
by Marilyn Walker, Heng Ji, and Amanda Stent. New Orleans, Louisiana: Association for
Computational Linguistics, pp. 529–535. URL: https://aclanthology.org/N18-2084/.
Qiu, Jiezhong, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang (2020a). “Blockwise
Self-Attention for Long Document Understanding”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Ed. by Trevor Cohn, Yulan He, and Yang Liu. Online:
Association for Computational Linguistics, pp. 2555–2565. URL: https://aclanthology.
org/2020.findings-emnlp.232/.
Qiu, Xipeng, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang (2020b). “Pretrained models for natural language processing: A survey”. In: Science China technological
sciences 63.10, pp. 1872–1897.
Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever (2018a). “Improving language understanding with unsupervised learning”. In: URL: https://openai.com/index/
language-unsupervised/.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever (2018b).
“Language Models are Unsupervised Multitask Learners”. In: URL: https://openai.com/
index/better-language-models/.
Rahimi, Afshin, Yuan Li, and Trevor Cohn (2019). “Massively Multilingual Transfer for NER”.
In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Ed. by Anna Korhonen, David Traum, and Lluís Màrquez. Florence, Italy: Association for
Computational Linguistics, pp. 151–164. URL: https://aclanthology.org/P19-1015/.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams (1986). “Learning representations by back-propagating errors”. In: Nature 323.6088, pp. 533–536. URL: https://doi.
org/10.1038/323533a0.
Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016a). “Improving Neural Machine Translation Models with Monolingual Data”. In: Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Katrin Erk and
Noah A. Smith. Berlin, Germany: Association for Computational Linguistics, pp. 86–96. URL:
https://aclanthology.org/P16-1009/.
Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016b). “Neural Machine Translation of
Rare Words with Subword Units”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by Katrin Erk and Noah
A. Smith. Berlin, Germany: Association for Computational Linguistics, pp. 1715–1725. URL:
https://aclanthology.org/P16-1162/.
Singh, Anil Kumar (2008). “Natural Language Processing for Less Privileged Languages: Where
do we come from? Where are we going?” In: Proceedings of the IJCNLP-08 Workshop on NLP
for Less Privileged Languages. URL: https://aclanthology.org/I08-3004/.
Strassel, Stephanie and Jennifer Tracey (2016). “LORELEI Language Packs: Data, Tools, and
Resources for Technology Development in Low Resource Languages”. In: Proceedings of the
Tenth International Conference on Language Resources and Evaluation (LREC‘16). Ed. by
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente
Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis.
Portorož, Slovenia: European Language Resources Association (ELRA), pp. 3273–3280. URL:
https://aclanthology.org/L16-1521/.
Strubell, Emma, Ananya Ganesh, and Andrew McCallum (2019). “Energy and Policy Considerations for Deep Learning in NLP”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Ed. by Anna Korhonen, David Traum, and Lluís
Màrquez. Florence, Italy: Association for Computational Linguistics, pp. 3645–3650. URL:
https://aclanthology.org/P19-1355/.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le (2014). “Sequence to Sequence Learning with Neural Networks”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger. Vol. 27. Curran Associates,
Inc. URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
Tang, Gongbo, Rico Sennrich, and Joakim Nivre (2019). “Understanding Neural Machine Translation by Simplification: The Case of Encoder-free Models”. In: Proceedings of the International
Conference on Recent Advances in Natural Language Processing (RANLP 2019). Ed. by Ruslan Mitkov and Galia Angelova. Varna, Bulgaria: INCOMA Ltd., pp. 1186–1193. URL: https:
//aclanthology.org/R19-1136/.
Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler (2023). “Efficient Transformers: A
Survey”. In: ACM Comput. Surv. 55.6, 109:1–109:28. URL: https://doi.org/10.1145/
3530811.
Thompson, Brian, Jeremy Gwinnup, Huda Khayrallah, Kevin Duh, and Philipp Koehn (2019).
“Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short
Papers). Ed. by Jill Burstein, Christy Doran, and Thamar Solorio. Minneapolis, Minnesota: Association for Computational Linguistics, pp. 2062–2068. URL: https://aclanthology.org/
N19-1209/.
Tiedemann, Jörg (2012). “Parallel Data, Tools and Interfaces in OPUS”. In: Proceedings of the
Eighth International Conference on Language Resources and Evaluation (LREC‘12). Ed. by
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard,
Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Istanbul, Turkey: European Language Resources Association (ELRA), pp. 2214–2218. URL: https://aclanthology.
org/L12-1246/.
Touvron, Hugo et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. URL:
https://arxiv.org/abs/2307.09288.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin (2017). “Attention is All you Need”. In: Advances in Neural Information Processing Systems. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Vaswani, Ashish et al. (2018). “Tensor2Tensor for Neural Machine Translation”. In: Proceedings
of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1:
Research Papers). Boston, MA: Association for Machine Translation in the Americas, pp. 193–
199. URL: https://www.aclweb.org/anthology/W18-1819.
Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov (2019). “Analyzing
Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned”.
In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Ed. by Anna Korhonen, David Traum, and Lluís Màrquez. Florence, Italy: Association for
Computational Linguistics, pp. 5797–5808. URL: https://aclanthology.org/P19-1580/.
Wada, Takashi, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, and Jey Han Lau (2021).
“Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora”. In: Proceedings of the 1st Workshop on Multilingual Representation Learning. Ed. by Duygu Ataman, Alexandra Birch, Alexis Conneau,
Orhan Firat, Sebastian Ruder, and Gozde Gul Sahin. Punta Cana, Dominican Republic: Association for Computational Linguistics, pp. 16–31. URL: https://aclanthology.org/2021.
mrl-1.2/.
Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma (2020). “Linformer: Self-Attention with Linear Complexity”. In: CoRR abs/2006.04768. URL: https://arxiv.org/
abs/2006.04768.
Wenzek, Guillaume, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán,
Armand Joulin, and Edouard Grave (2020). “CCNet: Extracting High Quality Monolingual
Datasets from Web Crawl Data”. eng. In: Proceedings of the Twelfth Language Resources and
Evaluation Conference. Ed. by Nicoletta Calzolari et al. Marseille, France: European Language
Resources Association, pp. 4003–4012. URL: https://aclanthology.org/2020.lrec-1.494/.
Wild, Cody Marie (2020). A Search for Efficient Meta-Learning: MAMLs, Reptiles, and Related
Species. Last accessed 16 May 2022. URL: https://towardsdatascience.com/a-search-for-efficient-meta-learning-mamls-reptiles-and-related-species-e47b8fc454f2
(visited on 05/16/2022).
Wolf, Thomas et al. (2020). “Transformers: State-of-the-Art Natural Language Processing”. In:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing:
System Demonstrations. Ed. by Qun Liu and David Schlangen. Online: Association for Computational Linguistics, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6/.
Wu, Yonghui et al. (2016). “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”. In: CoRR abs/1609.08144. URL: http://arxiv.org/abs/1609.08144.
Xia, Mengzhou, Guoqing Zheng, Subhabrata Mukherjee, Milad Shokouhi, Graham Neubig, and
Ahmed Hassan Awadallah (2021). “MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Ed. by Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz
Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou. Online: Association for Computational Linguistics, pp. 499–511. URL: https://aclanthology.org/
2021.naacl-main.42/.
Xing, Chao, Dong Wang, Chao Liu, and Yiye Lin (2015). “Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation”. In: Proceedings of the 2015 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Ed. by Rada Mihalcea, Joyce Chai, and Anoop Sarkar. Denver, Colorado:
Association for Computational Linguistics, pp. 1006–1011. URL: https://aclanthology.
org/N15-1104/.
Yedetore, Aditya, Tal Linzen, Robert Frank, and R. Thomas McCoy (2023). “How poor is the
stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed
speech”. In: Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers). Ed. by Anna Rogers, Jordan Boyd-Graber, and Naoaki
Okazaki. Toronto, Canada: Association for Computational Linguistics, pp. 9370–9393. URL:
https://aclanthology.org/2023.acl-long.521/.
You, Weiqiu, Simeng Sun, and Mohit Iyyer (2020). “Hard-Coded Gaussian Attention for Neural Machine Translation”. In: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics. Ed. by Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel
Tetreault. Online: Association for Computational Linguistics, pp. 7689–7700. URL: https://aclanthology.org/2020.acl-main.687/.
Zoph, Barret, Deniz Yuret, Jonathan May, and Kevin Knight (2016). “Transfer Learning for Low-Resource Neural Machine Translation”. In: Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing. Ed. by Jian Su, Kevin Duh, and Xavier Carreras.
Austin, Texas: Association for Computational Linguistics, pp. 1568–1575. URL: https://aclanthology.org/D16-1163/.
Abstract
Data- and resource-intensive pre-training and fine-tuning applied to Transformer-based models is the dominant paradigm at the forefront of rapid advancements in natural language processing, human language technologies, and most notably, large language models. Such reliance on massive amounts of data, computation, and energy, while effective and impressive from a performance-only perspective, can hinder open, nonexclusive, and sustainable development of these technologies. In this work, we study how certain inductive biases can be devised to adapt current natural language methods to resource-constrained scenarios and provide insights into why the proposed inductive biases are successful in such cases.
Specifically, this dissertation presents four research directions on data and parameter efficiency of fine-tuning and transfer learning in natural language processing: (1) a universal regimen that creates a single pre-trained checkpoint suitable for machine translation transfer to practically any language pair and eliminates the need for ad hoc pre-training; (2) an architecture-guided parameter-efficient fine-tuning method that performs competitively with full fine-tuning while exclusively updating cross-attention parameters; (3) an analysis of MEGA, a recently introduced augmentation of the Transformer architecture to incorporate explicit recency bias, through the lens of transfer learning; and (4) a meta-learning algorithm to prime pre-trained models for specific fine-tuning strategies.
Combined with ablations that show how they are effective and analyses that demonstrate their generalizability, these directions are meant to serve as tools for resource-efficient transfer learning for natural language processing. Additionally, we will situate this dissertation's contributions in the current climate of scaling efforts in natural language processing to discuss possible paths forward to evolve this research.
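As a concrete, hedged illustration of direction (2) above, the sketch below freezes every parameter of a small sequence-to-sequence Transformer except the decoder's cross-attention weights before running one fine-tuning step. It uses PyTorch's generic nn.Transformer only as a stand-in; the model size, random data, learning rate, and the "multihead_attn" name filter are assumptions made for this toy example, not the dissertation's actual implementation.

# A minimal, hypothetical sketch of cross-attention-only fine-tuning in PyTorch.
# The tiny nn.Transformer, random tensors, and hyperparameters are placeholder
# assumptions for illustration; they do not reproduce the models or training
# setup used in this dissertation.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

# In nn.TransformerDecoderLayer, `self_attn` is decoder self-attention and
# `multihead_attn` is the encoder-decoder (cross-) attention block, so this
# filter leaves only cross-attention parameters trainable.
for name, param in model.named_parameters():
    param.requires_grad = "multihead_attn" in name

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"trainable: {sum(p.numel() for p in trainable)} / "
      f"{sum(p.numel() for p in model.parameters())} parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
loss_fn = nn.MSELoss()

# One toy update on random continuous sequences, just to show the loop.
src = torch.randn(8, 10, 64)   # (batch, source length, d_model)
tgt = torch.randn(8, 12, 64)   # (batch, target length, d_model)
optimizer.zero_grad()
out = model(src, tgt)          # gradients accumulate only in cross-attention
loss = loss_fn(out, tgt)
loss.backward()
optimizer.step()

The same name-based filtering idea carries over to pre-trained encoder-decoder checkpoints, where the cross-attention modules have library-specific names; the essential point is that the optimizer is handed only the cross-attention parameters.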