Towards More Human-Like Cross-lingual Transfer Learning
by
Meryem M’hamdi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Meryem M’hamdi
In loving memory of my late grandmother Fadma.
To my parents
and
anyone who encouraged me throughout this Ph.D.
Acknowledgments
First of all, I would like to express my deepest gratitude to my advisor, Jonathan May, for accepting
to be my PhD supervisor and for his support and patience. I was tremendously fortunate to have
such an incredibly understanding and dedicated advisor who helped me grow not only academically
but also professionally.
A heartfelt thanks goes out to all my committee members: Aiichiro Nakano, Khalil Iskarous,
Shri Narayanan, Kallirroi Georgila, Xuezhe Ma, Aram Galstyan, and Greg Ver Steeg for generously
serving on my committee for the thesis, proposal, qualification exam, and for their invaluable
feedback that sparked fruitful discussions.
I extend my appreciation to USC/ISI IARPA’s BETTER and DARPA’s GAIA programs for
funding the majority of my research. Special thanks to my ISI PI supervisors and mentors: Marjorie
Freedman and Elizabeth Boschee. I am also thankful for the exceptional mentorship of Prof. Xiang
Ren, who greatly contributed to the successful publication of two papers. Gratitude is also owed
to Adobe Research for two enriching summer internships and the outstanding mentorship of Doo
Soon Kim, David Seunghyun Yoon, Trung Bui, and Franck Dernoncourt, which resulted in two
papers and the filing of two patents.
I extend my wholehearted thanks to my mentors from my Master’s program at EPFL for
kickstarting my passion for research, particularly Claudiu Musat, who supervised my master’s thesis
leading to my first paper on churn detection, in collaboration with Christian Abbet, and encouraged
me to pursue a Ph.D.
I want to acknowledge all my fellow CUTELABNAME labmates, colleagues, and friends —
Mozhdeh Gheini, Justin Cho, Kushal Chawla, Katy Felkner, Thamme Gowda, Nada Aldarrab,
Xusen Yin, Alex Spangher, Jake Bremerman, Ulf Hermjakob, and Gleb Satyukov — for fruitful
discussions and their feedback during group meetings. The NLP program at ISI and USC has
provided me with an incredible opportunity not only to enhance my research skills but also to
expand my network connections.
Special thanks to Andy Shangson Chen, Lizsl De Leon, Asiroh Cham, Peter Zamar, and Melissa
Snearl-Smith for their assistance with administrative and onboarding tasks.
This PhD would not have been possible without the unconditional love and support of my parents
El Batoul Amharech and Hmida M’hamdi. I extend my appreciation to all my friends, especially
Qing Jin, Pegah Jendaghi, Amanda Rios, Cedric Klein, and Nazanin Alipourfard, for the engaging
and uplifting social activities involving cooking, hiking, and Zumba classes.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Cross-lingual Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Downstream Base Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Meta-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Continual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Human-Like Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.1 Forgetting Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5.2 Leitner Queues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Chapter 3: Cross-lingual and Multilingual Few-Shot Learning . . . . . . . . . . . . . . 17
3.1 Cross-lingual Direct Fine-Tuning: Contextualized Cross-Lingual Event Trigger
Extraction Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1.1 Bi-LSTM-Char-CRF networks . . . . . . . . . . . . . . . . . . . 19
3.1.1.2 BERT-CRF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2.3 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.2.4 Hyper-parameters and Embedding . . . . . . . . . . . . . . . . . 23
3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.3.1 Comparison with Feature-Based State-of-the-art . . . . . . . . . 24
3.1.3.2 Comparison between MUSE and BERT Embedding . . . . . . . 25
3.1.3.3 Cross-lingual Event Trigger Extraction . . . . . . . . . . . . . . 27
3.2 Cross-lingual Meta-Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1.1 Pseudo-task Datasets . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1.2 X-METRA-ADA Algorithm . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.2.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3.1 Zero-shot and Few-shot Cross-Lingual NLU and QA . . . . . . . 37
3.2.3.2 More Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Multilingual Meta-Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 Multilingual Semantic Search . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1.1 Task Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.1.2 Task Language Variants . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1.3 Supervision Degrees . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Multilingual Meta-Distillation Learning . . . . . . . . . . . . . . . . . . . 45
3.3.2.1 Original MAML Algorithm . . . . . . . . . . . . . . . . . . . . 46
3.3.2.2 MAML-Align Algorithm . . . . . . . . . . . . . . . . . . . . . 46
3.3.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3.1 Downstream Benchmarks . . . . . . . . . . . . . . . . . . . . . 46
3.3.3.2 Base Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.3.3 Meta-Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.3.4 Baselines & Model Variants . . . . . . . . . . . . . . . . . . . . 51
3.3.4 Results & Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.4.1 Multilingual Performance Evaluation . . . . . . . . . . . . . . . 52
3.3.4.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Chapter 4: Cross-lingual Lifelong Learning . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Cross-lingual Continual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1.4 Downstream Tasks and Datastreams . . . . . . . . . . . . . . . . . . . . . 61
4.1.5 Evaluation Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Baseline & Reference Models . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Continual Learning Approaches . . . . . . . . . . . . . . . . . . . . . . . 64
4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 Bootstrap Sampling & Statistical Significance . . . . . . . . . . . . . . . . 67
4.4 Results & Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4.1 How is a Multi-Hop Analysis Different from its One-Hop Counterpart? . . 69
4.4.2 Can a Multi-lingual Language Model Learn to Preserve and Accumulate
Knowledge across Different Languages? . . . . . . . . . . . . . . . . . . . 70
4.4.3 Is Continual Learning Effective in Boosting Knowledge Preservation,
Accumulation, and Model Utility? . . . . . . . . . . . . . . . . . . . . . . 71
4.4.4 Which Permutations Impose More Challenges on Knowledge Preservation,
Accumulation, and Model Utility? . . . . . . . . . . . . . . . . . . . . . . 72
4.4.5 How do Continual Learning Models Generalize to Unseen Languages? . . . 74
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 5: Cross-lingual Human Algorithms . . . . . . . . . . . . . . . . . . . . . . . 75
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.1 Cross-lingual Experience Replay . . . . . . . . . . . . . . . . . . . . . . . 76
5.1.2 Leitner-based Skill Rating System . . . . . . . . . . . . . . . . . . . . . . 77
5.1.3 Leitner-Guided Cross-lingual Experience Replay (LER) . . . . . . . . . . . 77
5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.1 Baselines & Model Variants . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.2 Benchmarks & Base Models . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.3 Datastreams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Results & Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 Average Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.2 Fine-grained Language Analysis . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 6: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Cross-lingual Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.1 Event Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.3 Semantic Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.1.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.2 Few-shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Lifelong Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.4 Human-Like Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Chapter 7: Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Appendix A: Cross-lingual Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.1 Results on In-House Intent Classification Dataset . . . . . . . . . . . . . . . . . . 121
A.2 Full Results for QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Appendix B: Multi-lingual Meta-Learning . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.1 More Experimental Setup Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.1.1 Upstream Meta-Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.1.2 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
B.2 More Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Appendix C: Cross-lingual Continual Learning . . . . . . . . . . . . . . . . . . . . . . 130
C.1 More Results & Analysis using Bootstrap Sampling . . . . . . . . . . . . . . . . . 130
C.1.1 Full Average Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
C.1.2 Per M-BERT Components Analysis . . . . . . . . . . . . . . . . . . . . . 130
C.1.3 Full Results on Language Permutations . . . . . . . . . . . . . . . . . . . 132
C.1.4 Per Language Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
C.1.5 More Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
C.1.6 Experience Replay Ablation Studies . . . . . . . . . . . . . . . . . . . . . 137
C.2 More Results using Multiple Seeds . . . . . . . . . . . . . . . . . . . . . . . . . . 138
C.3 Statistical Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Appendix D: Human-like Cross-lingual Continual Learning . . . . . . . . . . . . . . . 150
D.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
D.2 Dataset License . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
D.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
List of Tables
3.1 Number of documents and triggers per language and split in ACE2005 dataset. . . . 22
3.2 Comparison of performance testing on English using prior work baselines in the
first half and our method using Bi-LSTM-Char-CRF with MUSE embeddings,
BERT-CRF in the second half. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Comparison of performance testing on Chinese using prior work baselines in the
first half and our method using Bi-LSTM-Char-CRF with MUSE embeddings,
BERT-CRF in the second half. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Comparison of performance testing on Arabic using different training modes
comparing Bi-LSTM-Char-CRF with MUSE Embeddings to BERT-CRF. . . . . . 26
3.5 Examples of trigger extraction mislabelled with MUSE but disambiguated with
BERT and those missed/mislabeled with monolingual training only and corrected
with multilingual BERT model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.6 Statistics of MTOD dataset (Schuster et al., 2019a) per language and split. . . . . . 33
3.7 Statistics of TyDiQA-GoldP (Hu et al., 2020) dataset per language and split. Korean
is excluded due to some encoding issues. . . . . . . . . . . . . . . . . . . . . . . . 34
3.8 Performance evaluation on MTOD between meta-learning approaches, fine-tuning
internal baselines, and external baselines. All our internal experiments use 6 shots for both
the support and query sets. Zero-shot learning experiments that train only on English are
distinguished from few-shot learning experiments, which include a fair internal comparison. Models in bold
indicate our own internal models. MONO, FT, FT w/EN, X-METRA, and
X-METRA-ADA models include results for each test language when training on
that language. FT w/EN trains jointly on English and the target language only. We
highlight the best scores in bold and underline the second best for each language
and sub-task. The rest are reported from †Schuster et al. (2019a), ‡Liu et al. (2020),
and +Siddhant et al. (2020). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 F1 comparison on TyDiQA-GoldP between different meta-learning approaches, fine-tuning,
and external baselines. We highlight the best scores in bold and underline
the second best for each language. Our own models are in bold, whereas the rest are
reported from †Hu et al. (2020). This uses 6 shots for both the support and query sets. . . . 38
3.10 Statistics of LAReQA in each 5-fold cross-validation split. #Q denotes the number
of questions, whereas #C denotes the number of candidates. . . . . . . . . . . . . 48
3.11 Statistics of the STSBMulti from SemEval-2017 in each 5-fold cross-validation split.
* means that for Turkish-English, there are only 250 ground truth similarity scores,
while there are 500 sentence pairs. We assume that the ground truth scores are only
for the first 250 sentence pairs. In addition to that, we use 5749 train, 1500 dev, and
1379 test splits from the STSB original English benchmark. . . . . . . . . . . . . . 49
3.12 This is a comparison of different zero-shot baselines, few-shot learning, and
machine translation-enhanced models. Other zero-shot external models (Table B.2)
show sub-optimal results so we don’t include them. For LAReQA and STSBMulti,
we report mAP@20 and Pearson’s r × 100, respectively. All results are averaged
over 5-fold cross-validation and multiple language choices. Models in (*) are our
main contribution. We report the average over many model variants translating
from English to one target language at a time for T-Train model variants. Best and
second-best results for each benchmark are in bold and underlined, respectively. . . 52
4.1 Number of sentences in MTOP per language and split. . . . . . . . . . . . . . . . 62
4.2 Simulated language permutations. . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Runtime and parameters statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 The average final performance across different language permutations for the
baseline compared to reference models. We highlight the best scores in bold and
underline the second best across models. . . . . . . . . . . . . . . . . . . . . . . 70
4.5 Forgetting (F) and transfer (T) performance averaged across different language
permutations for sequential baseline and reference models. We highlight the best
models in bold for each subtask and metric. . . . . . . . . . . . . . . . . . . . . . 71
4.6 Comparison of intent classification for two language permutations. We highlight
in bold the best forgetting (F), highest transfer (T), and final performance (FP) of
accuracy scores among H2L and L2H, whereas the best and second best scores
across approaches for H2L and L2H separately are underlined and italicized,
respectively. We report mean performances for each metric and language order. All
95% confidence intervals range from ± 0.01 to ± 0.04. . . . . . . . . . . . . . . . 73
4.7 Performance on intent classification comparison between two versions of the
data: original data version and balanced data for Naive Seq FT across the same
permutations as Table 4.6. We bold the best among H2L and L2H for each metric. . 73
5.1 Statistics of MTOP, MultiATIS++, and TyDiQA per language and split. . . . . . . . 79
5.2 Language permutations for MTOP, MultiATIS++, and TyDiQA. . . . . . . . . . . . 80
5.3 Average Test forgetting scores based on the Dev data split performance of different
models and baselines. We compare two Leitner-guided memory replay variants
LER (Easy) and LER (Hard) to the baselines. Since no previous work on experience
replay in the cross-lingual setup reports any forgetting results, we implement in
addition to No ER our internal baselines: Balanced and Random adapted from
†Lopez-Paz and Ranzato (2017) and ‡Riemer et al. (2019), respectively. Best
(lowest ↓) forgetting scores are highlighted in bold for each task and subtask. . . . 81
5.4 Intractable examples from each category, with their ground-truth and predicted intent
labels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.1 Statistics of the In-House multilingual intent classification dataset per language and split. 121
A.2 X-METRA results on an In-House multilingual intent data. Best results are
highlighted in bold for each test language. . . . . . . . . . . . . . . . . . . . . . . 121
A.3 Full F1 results on TyDiQA-GoldP comparing X-METRA and X-METRA-ADA
to monolingual (MONO) and fine-tuning (FT) baselines. Best results for each
language across all models are in bold whereas the second best results are underlined. 122
A.4 Full EM results on TyDiQA-GoldP comparing X-METRA and X-METRA-ADA
to monolingual (MONO) and fine-tuning (FT) baselines. Best results for each
language across all models are in bold whereas the second best results are underlined. 123
B.1 Arrangements of languages for the different modes of transfer and meta-learning
stages for two standard benchmark datasets LAReQA and STSBMulti. X→Y denotes
transfer from an X model (for example a monolingual model) used to sample the
support set to a Y model (for example bilingual model) used to sample the query
set. We denote a support or query set in LAReQA by x_y where x and y are the
ISO language codes of the question and the candidate answers and x_y in STSBMulti
where x and y are the ISO language codes of sentence 1 and 2 respectively. We use
parenthesis to mean that the same language pairs cannot be used in both support
and query sets, brackets to denote non-exclusivity (or in other words the language
pairs used as a support can also be used as a query), and curly braces to mean the
query set may be sampled from more than one language. We do not experiment
with mono→multi, bi→multi, mixt, and trans for STSBMulti, since it is not a
multilingual parallel benchmark, but we still experiment with mono→bi→multi
using machine-translated data in that case. . . . . . . . . . . . . . . . . . . . . . . 125
B.2 Comparison of mAP@20 multilingual 5-fold cross-validation evaluation of different
S-BERT models compared to M-BERT model. Best results are highlighted in bold. 126
B.3 Runtime per model variant excluding evaluation. . . . . . . . . . . . . . . . . . . . 127
B.4 mAP@20 multilingual 5-fold cross-validated performance tested for different
languages. Best and second-best results for each language are highlighted in bold
and italicized respectively, whereas best results across categories of models are
underlined. Gains from meta-learning approaches are consistent across few-shot
and zero-shot languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
B.5 Pearson correlation (Pearson’s r × 100) 5-fold cross-validated performance on
STSBMulti benchmark using different models few-shot learned on STSBMulti or its
translation. Best and second-best results for each language are highlighted in bold
and italicized respectively, whereas best results across categories of models are
underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
C.1 A summary of results for different continual learning approaches over the average
across language order. For each metric and score, we highlight the best score in
bold and underline the second best score. . . . . . . . . . . . . . . . . . . . . . . 131
C.2 Per group layer analysis: ablation studies of different M-BERT’s components. Best,
second best, and third best scores for each metric are in bold, underlined, and
italicized respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
C.3 Per language permutation view: a pairwise comparison between H2L (English →
German → French → Hindi → Spanish → Thai) and L2H (Thai → Spanish →
Hindi → French → German → English). We highlight the best forgetting (lowest),
transfer (highest), zero-shot transfer (highest), and final performance (highest) of
accuracy and f1 scores among those two orders for each approach in bold, whereas
the best scores across approaches for the two orders separately are underlined. . . . 133
C.4 Per language permutation view: a pairwise comparison between Order 3 (Spanish
→ Hindi → English → German → Thai → French) and Order 4 (French → Thai
→ German → English → Hindi → Spanish). We highlight the best forgetting
(lowest), transfer (highest), zero-shot transfer (highest), and final performance
(highest) of accuracy and f1 scores among those two orders for each approach in
bold, whereas the best scores across approaches for the two orders separately are
underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
C.5 Per language permutation view: a pairwise comparison between Order 5(Hindi →
English → Spanish → Thai → French → German) and Order 6 (German → French
→ Thai → Spanish → English → Hindi). We highlight the best forgetting (lowest),
transfer (highest), zero-shot transfer (highest), and final performance (highest) of
accuracy and f1 scores among those two orders for each approach in bold, whereas
the best scores across approaches for the two orders separately are underlined. . . . 135
C.6 Impact of language order across the balanced dataset for Naive Seq FT. Best
and second best scores for each language for intent classification and slot filling
independently across approaches are highlighted in bold and underlined, respectively. 136
C.7 CCL per language analysis of forgetting. Best and second best scores for each
language are highlighted in bold and underlined respectively. . . . . . . . . . . . . 141
C.8 CCL per language analysis of transfer. Best and second best scores for each
language are highlighted in bold and underlined respectively. . . . . . . . . . . . . 142
C.9 CCL per language zero-shot forward transfer. Best and second best scores for each
language for intent classification and slot filling independently across approaches
are highlighted in bold and underlined respectively. . . . . . . . . . . . . . . . . . 143
C.10 Ablation Studies of Experience Replay where we experiment with different memory
sizes per language. For each metric and score, we highlight the best score in bold
and underline the second best score. . . . . . . . . . . . . . . . . . . . . . . . . . 144
C.11 The average final performance across different language permutations for the
baseline compared to reference models using multiple seeds. We highlight the best
scores in bold and underline the second best across models. We notice the same
findings as when using bootstrap sampling but with tighter confidence intervals as
shown in Table 4.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
C.12 Forgetting (F) and transfer (T) performance averaged across different language
permutations for sequential baseline and reference models using different seeds.
We highlight the best models in bold. We notice exactly the same trends as when
using bootstrap sampling for our analysis in Table 4.5. . . . . . . . . . . . . . . . . 144
C.13 Performance on intent classification comparison between the baseline and continual
learning algorithms across two language permutations using multiple seeds.
We highlight in bold the lowest forgetting (F), highest transfer (T), and final
performance (FP) of accuracy scores among H2L and L2H, whereas the best scores
across approaches for H2L and L2H separately are underlined. We notice the same
trends and findings as Table 4.6 where only bootstrap sampling is used to compute
the confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
D.1 Fine-grained runtime analysis per model for one single language order on MTOP. . 151
D.2 Fine-grained parameter analysis per benchmark. . . . . . . . . . . . . . . . . . . . 151
List of Figures
1.1 An overview of M-BERT and XLM-R zero-shot performances on the XTREME (Hu
et al., 2020) tasks grouped by language families. Branches of the Indo-European
language family, such as Germanic, Romance, and Slavic, exhibit the best transfer
performance. However, the quality of the transfer is reduced for other language
families, especially Niger-Congo, Kra-Dai, and Kartvelian for M-BERT and
Niger-Congo, Uralic for XLM-R on different downstream tasks. In general, a high
variance between different language families across tasks indicates the difficulty of
M-BERT and XLM-R generalizing to different scenarios. Image and analysis are
taken from Hu et al. (2020). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Outline of human-like cross-lingual transfer learning. . . . . . . . . . . . . . . . . 3
2.1 A conceptual view of the resources hierarchy. Image is taken from Ruder et al.
(2019b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Cross-lingual Transfer Learning Pipeline: for simplicity, we use 1 and 2 to denote
the high-resource and low-resource languages, respectively. The most desirable yet
challenging application is zero-shot if 1 is the source language the downstream
model is fine-tuned on and 2 is the target language the model is applied to; few-shot
if the model is fine-tuned and applied on language 2 in a low-resource setup;
or multi-tasking if the model leverages whatever languages it has access to and
fine-tunes one single joint model on them. . . . . . . . . . . . . . . . . . . . . . . 6
2.3 An example of an utterance in English and its translation to Thai labeled with intent:
SET_ALARM and slots using BIO annotation. . . . . . . . . . . . . . . . . . . . . 8
2.4 The architecture of MTOD model. An input utterance "Set alarm at 7 am" is
encoded using M-BERT. The [CLS] encoding is used to detect the utterance’s
intent: "Alarm/Set_Alarm"; whereas the other token encodings are fed into a CRF +
Slot Classifier to detect the slot labels in BIO annotation. . . . . . . . . . . . . . . 8
2.5 An example of question, context, and answer triplets in TyDiQA: The answer is
framed in green within the context. . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.6 Question answering base model: The question and its corresponding context are
encoded using M-BERT. Then, a linear layer classifies the answer span to the
question within the context. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 Illustration of continual learning: given a stream of tasks, forward transfer leverages
knowledge from old tasks to improve on new tasks, while backward transfer uses
new tasks to improve on older ones. Without loss of generality, we illustrate
forward and backward transfer for only task 1 and task n, respectively. . . . . . . . 13
2.8 Illustration of forgetting curve for humans. . . . . . . . . . . . . . . . . . . . . . . 14
2.9 Illustration of Leitner queues system. . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 Examples showing the context-dependent nature of trigger extraction: different
events are triggered by the same trigger word depending on the context. Event
triggers are circled and their corresponding articles are highlighted. . . . . . . . . . 18
3.2 Event trigger extraction base model architectures: In Bi-LSTM-Char-CRF
architecture, we feed word-level features to bidirectional-LSTMs (bi-LSTM). We
also use bi-LSTM to obtain character embeddings. For BERT-CRF, we use BERT
as an off-the-shelf encoder model which encapsulates different levels of encodings.
Then, CRF is used on top of both architectures to learn the inter-dependencies
between output trigger tags in BIO annotation and find the optimal tag sequence. . 20
3.3 A zero-shot transfer learning architecture for cross-lingual event trigger extraction:
We start by either pre-training a multilingual language model to capture the
hierarchical nature of natural language input by combining word and character
embeddings and learning the sequential interactions between them or reusing an
off-the-shelf multilingual language model, namely BERT. We then fine-tune an
event extraction model on source languages and adapt that to other languages
through different scenarios. We show in this figure one such scenario where the
model fine-tuned on English is applied directly to target languages in a zero-shot
manner or direct transfer of annotation. . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 An overview of the X-METRA-ADA framework: we use English as the source
and Spanish as the target language. The meta-train stage transfers from the source
to the target languages, while the meta-adaptation further adapts the model to the
target language. The application is few-shot if the test language is seen in any stage
of X-METRA-ADA; or zero-shot if the test language is unseen. . . . . . . . . . . 29
3.5 Species of Transfer-Learning: A conceptual comparison between different
variants of naive fine-tuning and meta-learning. Naive fine-tuning can take the
form of either PRE, MONO, FT w/EN, or FT (described in §3.2.2.2 as well).
While X-METRA is the closest form of meta-learning using the original MAML
formulation, X-METRA-ADA has two levels of optimization: meta-training and
meta-adaptation.
The English and Thai datasets are those from which batches are sampled and to which the
model is exposed during naive fine-tuning. During the meta-train stage, the support sets are
drawn from the English data, and the query sets are drawn from a percentage of the Thai
data. During the meta-adapt stage, the support and query sets are all drawn from the
remaining percentage of the Thai data. Blue denotes the optimization using English data,
while green is used to denote optimization on Thai data. Dashed lines denote the inner loop,
dotted lines denote evaluation on the query sets without a backward pass yet, and bold lines
denote the outer loop. 32
3.6 Ablation of the role of adaptation in X-METRA-ADA compared to X-METRA
(X-METRA-ADA with the meta-training stage only). X-METRA-ADA converges
faster than X-METRA which in turn is better than FT for both languages. . . . . . 39
3.7 K-Shot Analysis for different downstream tasks. For MTOD, the number of shots is
the same for both the support and query sets. For TyDiQA-GoldP, we use different numbers
of shots for the support and query sets. The best results across models
for each subtask and language are highlighted in bold. . . . . . . . . . . . . . . . . 40
3.8 Downsampling analysis on MTOD with different percentages of query data. The
best results across models for each subtask are highlighted in bold. . . . . . . . . . 41
3.9 The effect of freezing BERT layers of X-METRA-ADA during few-shot on intent
classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 A high-level diagram of our meta-distillation MAML-Align framework for
multilingual semantic search and some of its application scenarios. This differs
from standard cross-lingual transfer setups where the focus is on transferring
between individual languages. Given the nature of the downstream task where
multiple language combinations could be used in the query and content to be
retrieved, we study the transfer here between different variants of the task. As
illustrated above, we focus on the three most to least resourced variants where the
queries and contents are either from the same language (monolingual), two different
languages (bilingual), or multiple languages (multilingual). We leverage knowledge
distillation to align between the teacher T-MAML (Finn et al., 2017), specialized in
transferring from monolingual to bilingual, and the student S-MAML specialized
in transferring from bilingual to multilingual semantic search. We show the merit
of gradually transferring between those variants through few-shot and zero-shot
applications involving different language arrangements in the training and evaluation. 43
3.11 A conceptual comparison between MAML-Align and the original meta-learning
baseline MAML. A single iteration of MAML involves one inner loop optimizing
over a batch of support sets from a source language variant of the task followed up
by an outer loop optimizing over the batch query sets curated from the target task
variant. In MAML-Align, on the other hand, we curate two support sets and one
query set, where the second support set is used as both a query and support set in
T-MAML and S-MAML, respectively. We perform two inner loops. Then, in the
outer loop, we optimize jointly over the distillation and task-specific losses of the
query sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.12 Architecture of Transformer-based triplet encoder for asymmetric semantic search:
We use three towers encoding the question, answer combined with its context, and
the negative candidate and its context. On top of that, two Euclidean distances
are computed between the question and the answer and between the question and
the negative candidate, respectively. Then, triplet loss is optimized to minimize
the distance between the question and the answer encodings and to maximize the
distance between the question and the negative candidate encodings. . . . . . . . . 50
3.13 Architecture of Transformer-based dual-encoder for symmetric semantic search:
We use a dual-encoder model that encodes each sentence pair using the same shared
encoder. Then, we minimize the mean squared error between that similarity score
and the golden score for each sentence pair. . . . . . . . . . . . . . . . . . . . . . 50
3.14 mAP@20 and Pearson’s r × 100 5-fold cross-validated multilingual performance
evaluation evaluated on LAReQA and STSBMulti in the first and second subplots,
respectively. There are consistent gains in favor of MAML and MAML-Align
compared to their fine-tuning and Zero-Shot counterparts for all languages and
language-pairs. Languages in (*) are used for zero-shot evaluation, whereas other
languages are included either during Meta-train and Meta-valid stages or fine-tuned
on. Best results for each language or language pair are highlighted in Bold. . . . . 53
3.15 mAP@20 multilingual performance averaged over 5-fold cross-validation splits
on LAReQA comparing between different meta-transfer modes for Fine-tune
and MAML models. The gap is large between Fine-tune and MAML across all
meta-transfer modes and is even larger in favor of MAML when trans mode (uses
mono→bi and bi→multi in the meta-training and meta-validation, respectively) is
used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.16 mAP@20 multilingual 5-fold cross-validated performance on LAReQA between
different query set sampling modes in meta-tasks for MAML and MAML-Align.
We notice that random query sampling has better generalization for both models. . 55
3.17 mAP@20 5-fold cross-validated mean multilingual performance over different
triplet negative sampling modes on LAReQA tested on different languages using
MAML-Align. Random sampling seems best on average for few-shot learning,
whereas hard sampling is more stable across cross-validation splits. . . . . . . . . . 55
4.1 We present here an overview of cross-lingual continual learning, an extension of the
general continual learning paradigm illustrated in Figure 2.7. We use an example
of a non-stationary datastream moving from high to low resource languages.
Each bold and dashed box represents either a training or test data instance being
fine-tuned or evaluated on, respectively. To support this problem setup, we evaluate
the cross-lingual capabilities of continual approaches. Those capabilities include
knowledge preservation on old languages, accumulation to the current language,
and generalization to unseen languages at each point of the training. In addition to
that, we evaluate model utility at the end of continual learning. . . . . . . . . . . . 58
4.2 A comparison between different variants of model expansion for this problem setup:
either at the side of the input (Lang-Spec Trans), the output (Lang-Spec Task), or
using adapters (Lang-Spec Ada). . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Comparison between forgetting trends for intent classification using one-hop
(crossed boxplots on the left) and multi-hop analysis (dotted boxplots on the right),
showing the variance over different language permutations. One-hop analysis
exhibits higher variance than its multi-hop counterpart. . . . . . . . . . . . . . . . 69
4.4 Correlations between different pairs of metrics: (a) Final performance versus
negative forgetting for the task of intent classification. The lower the forgetting, the
higher the final performance. (b) Final performance versus transfer for the task
of intent classification. As hypothesized, there is no direct correlation between
final performance and transfer. (c) Transfer versus negative forgetting for intent
classification task. In general, there is no direct correlation between transfer
and forgetting. (d) Zero-shot generalization versus negative forgetting for intent
classification. Model expansion approaches are highlighted in shades of green. We
zoom in on the rest of the models in the main graph and show an overview of all
approaches in the lower right corner subplot. Mitigating forgetting leads to higher
generalization, with the exception of multi-headed models highlighted in green. . . 72
5.1 An overview of Leitner-guided memory replay for multi-phase cross-lingual
continual learning: an extension from the cross-lingual continual learning paradigm
illustrated in Figure 4.1. On top of a cross-lingual datastream, we build a skill
rating system to continually guide the memory population and update. Skill ratings
are scores from 1 to 5 obtained from Leitner queues; a higher score reflects greater
learnability. At the end of each phase, the skill ratings on the main data items from
the phase language are used to choose what goes in the memory, and the skill
ratings of data items already in the memory are re-evaluated to determine if they
can remain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Average forgetting and final performance of slot filling for different model variants
compared to the Random baseline averaged over different language orders. The
lower the forgetting and the higher the final performance the better. . . . . . . . . . 81
5.3 Fine-grained analysis of forgetting over different language orders as defined in
Table 5.2. Best (lowest) results for each language order are highlighted in bold. . . 82
5.4 Fine-grained analysis of forgetting over different languages. Best (lowest) results
for each language are highlighted in bold. . . . . . . . . . . . . . . . . . . . . . . 83
5.5 Percentages of examples that never get promoted past skill rating 1 (Skill never
promoted) and those that converge to the maximum skill rating 5 (Converged to
max skill) per language averaged over different language orders. . . . . . . . . . . 84
5.6 Distribution of different categories of intractable examples in the English data. . . . 84
5.7 t-SNE visualization of centroids of different intent labels highlighting some
ambiguous labels indistinguishable in the embeddings space. . . . . . . . . . . . . 85
C.1 Correlations between different pairs of metrics: (a) Final performance versus
negative forgetting for the task of slot filling. The lower the forgetting, the
higher the final performance. (b) Final performance versus transfer for the task
of slot filling. (c) Transfer versus negative forgetting for slot filling task. (d)
Zero-shot generalization versus negative forgetting for slot filling. Model expansion
approaches are highlighted in shades of green. We zoom in on the rest of the models
in the main graph and show an overview of all approaches in the lower right corner
subplot. The same trends observed for intent classification in Figure 4.4 can be
observed here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
C.2 Comparing cross-lingual generalization of Naive Seq FT across many hops and
different languages for intent classification and slot filling. . . . . . . . . . . . . . 138
C.3 Measuring cross-lingual generalization to new languages across many hops for
intent classification and slot filling. This is both in terms of zero-shot transfer
metric and plain accuracy and f1 scores. . . . . . . . . . . . . . . . . . . . . . . . 139
C.4 Comparison between different metrics using one-hop (crossed boxplots) and
multi-hop analysis (dotted boxplots), on the left and right respectively for each
approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
C.5 P-values for different pairwise comparison of different continual learning
approaches using Tukey’s honestly significant difference (HSD) test using bootstrap
sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
C.6 P-values for different pairwise comparison of different continual learning
approaches using Tukey’s honestly significant difference (HSD) test using bootstrap
sampling (Cont.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
C.7 P-values for different pairwise comparison of different continual learning
approaches using Tukey’s honestly significant difference (HSD) test using different
seeds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
C.8 P-values for different pairwise comparison of different continual learning
approaches using Tukey’s honestly significant difference (HSD) test using different
seeds (Cont.). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Abstract
Cross-lingual transfer learning comprises a set of techniques used to adapt a model trained on
(a) source language(s), enabling it to generalize to new target languages. With the emergence of
Transformer-based contextualized encoders, there has been a surge in multilingual representations
that adapt these encoders to various cross-lingual downstream applications. The surprising zeroshot capabilities of these encoders make them promising substitutes for other fully-supervised
techniques, bypassing the need for large-scale annotation. However, these representations are
still far from solving the long-tail of NLP phenomena, where models are biased more towards
high-resource and typologically similar languages to the source language. This bias can be attributed
to the over-reliance of current transfer learning pipelines on what we define as the “Data-Intensive
Identically-Distributed Minimally-Evaluated” paradigm. In other words, current cross-lingual
models often need a lot of training data to perform well, lack robustness to different language
distribution shifts, and are minimally evaluated, overlooking critical human-like generalization
capabilities.
In this thesis, we analyze and propose techniques to advance the capabilities of multilingual
language models beyond this traditional paradigm and more toward human-like cross-lingual transfer
learning. We achieve that through 1) human-inspired input requirements by using data-efficient
few-shot techniques, 2) human-inspired outcomes by defining a cross-lingual learning evaluation
paradigm for learning over a continuously evolving data stream of languages, and 3) human-inspired
approaches through devising cognitive strategies to consolidate retention of knowledge learned
across languages and balance between different cross-lingual capabilities.
Our contributions to advancing the current transfer learning paradigm towards human-like learning are four-fold: 1) We explore cross-lingual fine-tuning on low-resource multilingual applications
such as event trigger extraction and semantic search, shedding light on the strengths and limitations
of existing cross-lingual transfer learning techniques. 2) We propose language-agnostic meta-learning approaches to bridge the gap between typologically diverse source and target languages.
We show the merits of our approaches in reaching quicker and smoother generalization compared to
naive fine-tuning, especially under low-resource scenarios. 3) We are the first to define a lifelong
learning paradigm that analyzes language shifts. We show the merits and challenges of a multi-phase
analysis where the system continually learns over several languages one at a time in multiple phases.
4) We are the first to adapt a cognitively inspired technique based on Leitner queues to choose
what to repeat in a cross-lingual continual learning setup and investigate its impact on reducing the
forgetting of previously learned languages while maintaining transfer to new languages.
Chapter 1
Introduction
Learning in humans is the process of unlocking and consolidating new knowledge, skills, behaviors,
and ideas as a response to exposure to specific environmental stimuli (Rehman et al., 2023). One of
the most famous learning theories is classical conditioning, where learning is enabled through the
repeated pairings of a neutral stimulus with a reward (Ivan, 2010). When presented with new stimuli
identical to the original conditioned stimulus, biological systems respond similarly, building on past
knowledge. On the other hand, learning in the classic supervised machine learning paradigm is
defined as an independent process for each knowledge unit in isolation. Building expertise in a new
task or language often requires training from scratch using large amounts of training resources for
each new task or language. With more than 7000 languages spoken worldwide, training language-specific model experts from scratch is costly and highly biased towards high-resource languages.
This bias leads to a phenomenon known as the “long tail of NLP”, where models tend to perform
worse on low-resource languages, which are most of the world’s languages. Cross-lingual transfer
learning has evolved as a set of techniques to bridge the gap between high-resource and low-resource
languages. Approaches based on transfer learning provide a more pertinent alternative to fully
supervised paradigms by leveraging a learned transformation between languages via “Cross-lingual
embedding”. These approaches are variations of direct transfer of annotation or what is known as
“Zero-shot learning”.
The standard cross-lingual transfer learning pipeline includes pre-training, fine-tuning, and
application stages. During pre-training, a shared multilingual embedding model is learned to
generalize across different language spaces. During fine-tuning, that model is further adapted to
the downstream task of interest using more task-specific supervision in the source high-resource
language. Then, the model is applied in a zero-shot, few-shot, or joint manner to target languages.
Among the most ubiquitous models that follow this pipeline are multilingual Transformer-based
contextualized encoders such as M-BERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020), etc.
Despite their surprising zero-shot performance, those off-the-shelf models are subject to numerous
criticisms of their quality and generalization capabilities. While these multilingual representations
exhibit some cross-lingual capabilities even for languages with low lexical overlap with English,
the transfer quality is reduced for languages that exhibit different typological characteristics (Pires
et al., 2019), as shown in Figure 1.1. Numerous approaches have attempted to build more robust
cross-lingual representations on top of those multilingual models; however, most require parallel
corpora (Wang et al., 2019; Conneau and Lample, 2019) and are biased towards high-resource and
balanced setups, impeding their generalization capabilities.

(a) For M-BERT. (b) For XLM-R.
Figure 1.1: An overview of M-BERT and XLM-R zero-shot performances on the XTREME (Hu
et al., 2020) tasks grouped by language families. Branches of the Indo-European language family,
such as Germanic, Romance, and Slavic, exhibit the best transfer performance. However, the
quality of the transfer is reduced for other language families, especially Niger-Congo, Kra-Dai, and
Kartvelian for M-BERT and Niger-Congo, Uralic for XLM-R on different downstream tasks. In
general, a high variance between different language families across tasks indicates the difficulty of
M-BERT and XLM-R generalizing to different scenarios. Image and analysis are taken from Hu
et al. (2020).
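To make the pre-train, fine-tune, and apply pipeline described above concrete, the following is a minimal illustrative sketch of zero-shot cross-lingual transfer with an off-the-shelf multilingual encoder: the model is fine-tuned on a toy English intent-classification task and then applied unchanged to a target-language utterance. The HuggingFace transformers API, the label set, and the example sentences are assumptions made for illustration only and are not the exact setup used in this thesis.

# Minimal sketch of the standard cross-lingual transfer pipeline (illustrative only).
# Assumes the HuggingFace transformers and PyTorch libraries; labels and sentences are toy examples.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-multilingual-cased"  # pre-trained multilingual encoder (M-BERT)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# 1) Fine-tune on the high-resource source language (English) only.
src_texts = ["set an alarm for 7 am", "cancel my alarm"]
src_labels = torch.tensor([0, 1])  # toy intent ids, e.g., SET_ALARM / CANCEL_ALARM
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    batch = tokenizer(src_texts, return_tensors="pt", padding=True, truncation=True)
    out = model(**batch, labels=src_labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 2) Zero-shot application: the fine-tuned model is used as-is on a target language.
model.eval()
tgt_text = "ตั้งปลุกตอน 7 โมงเช้า"  # Thai utterance (toy example), never seen during fine-tuning
with torch.no_grad():
    logits = model(**tokenizer(tgt_text, return_tensors="pt")).logits
print("predicted intent id:", logits.argmax(-1).item())

Few-shot transfer differs only in that a handful of target-language examples are added to the fine-tuning step; the rest of this thesis studies how to make that adaptation more data-efficient and more robust over time.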
This never-ending quest for building models with better generalization to different languages is
driven by a misconception of what it takes for a model to be generalizable. Numerous claims about
the performance of such models make us question the tenets of this cross-lingual transfer paradigm in
reaching human-level generalization. Compared to a paradigm designed for maximizing human-like
generalization derived from psychological cognitive theories (Langley, 2022), the current crosslingual paradigm, which we denote as “Data-Intensive Identically-Distributed Minimally-Evaluated”
conflicts with that in many aspects:
• Data-Intensive: Current models rely on inefficient data-intensive pre-training or fine-tuning
processes. These processes are orthogonal to how cognitive structures are acquired and
refined rapidly, from small numbers of training cases.
• Identically-Distributed: The current cross-lingual transfer paradigm is not robust to continuously emerging language shifts and often needs to be pre-trained and fine-tuned from scratch
for each new language. Humans, on the other hand, are creatures capable of continually
learning from new experiences as they are encountered.
• Minimally-Evaluated: Cross-lingual transfer learning models are often evaluated with
respect to their generalization to new knowledge, overlooking knowledge retention on previously learned languages. On the other hand, humans use cognitive strategies to attain a
good compromise between mitigating forgetting of previously attained knowledge while
accommodating new information.
Figure 1.2: Outline of human-like cross-lingual transfer learning.
Thesis Statement: We can improve our ability to design cross-lingual models that reach human-level generalization by focusing on directly instilling human-inspired learning characteristics. As
illustrated in Figure 1.2, we propose to make cross-lingual transfer learning exhibit gradually more
human-inspired content on the following levels:
• Human-inspired requirements: We build cross-lingual models exhibiting human-inspired
input requirements by enforcing the provision of limited resources seen during fine-tuning on
new languages. Faced with the limitations of fine-tuning pre-trained multilingual language models
on target languages with fewer resources, we explore few-shot learning paradigms beyond
naive fine-tuning. Meta-learning, i.e. learning how to learn, is a set of techniques enabling
faster adaptation to low-resource setups. Through multiple simulations of data from such low-resource languages combined with a dedicated optimization process, we use meta-learning to
build on the structure among multiple languages the same way the process of compositionality
works for humans: the ability to produce novel combinations from known components.
• Human-inspired outputs: We explore human-inspired expected outcomes for cross-lingual
models through understanding what it means to learn continuously across languages like
humans. Faced with the need to incorporate resources from a stream of languages at different
stages of training, we are the first to comprehensively investigate the continual learning
capabilities of current cross-lingual transfer learning. Those capabilities include generalization
to new languages and preservation of knowledge learned from previously seen languages,
among other desiderata we define. We also analyze the effectiveness of different continual
learning algorithms at reaching human-inspired continual learning desiderata.
• Human-inspired approaches: We directly leverage a cognitively inspired strategy for choosing what to learn in cross-lingual continual learning. One major challenge in cross-lingual continual learning is catastrophic forgetting: a stability-plasticity dilemma, where performance on previously seen languages decreases as the model learns to transfer to new
languages. Experience replay, which revisits data from a fixed-size memory of old languages
while training on new ones, is among the most successful approaches for solving this dilemma.
Faced with the challenge of dynamically storing the memory with high-quality examples
while complying with its fixed size limitations, we consider Leitner queuing, a human-inspired
spaced-repetition technique, to carefully pick informative examples to be replayed at each
learning phase depending on the degree of demonstrated mastery of the prediction model on
them.
To summarize, we investigate, in this dissertation, the incorporation of different aspects of
human-inspired learning in cross-lingual transfer learning models. Here is the outline of the rest
of this document: In Chapter 2, we review key concepts, processes, and general algorithms for
different paradigms related to cross-lingual transfer learning, meta-learning, continual learning,
and human-like learning. In Chapter 3, we present our work on adapting cross-lingual fine-tuning
to under-explored downstream tasks with minimal resources. On top of that, we propose meta-learning approaches and investigate their ability to learn a fast adaptation and generalization to low-resource languages. Chapter 4 proposes the first comprehensive evaluation paradigm for cross-lingual continual learning focusing on language shifts. We evaluate the aggregated effectiveness of various continual learning approaches at surmounting the challenges of sequential cross-lingual fine-tuning. Then, we make concrete recommendations on model design to balance between different
continual learning desiderata. Chapter 5 presents a novel Leitner-guided memory replay approach
inspired by human-like learning for cross-lingual continual learning. We investigate the impact
of different memory design attributes on reducing forgetting and taming the stability-plasticity
dilemma. Chapter 6 discusses related work for each sub-field covered in this dissertation. We
summarize our findings and discuss future work ideas in Chapter 7.
Chapter 2
Background
In this Chapter, we introduce some key terminology and algorithms, laying the foundations for
different paradigms covered in this thesis. First, we define terminology related to cross-lingual
transfer learning and different steps in its standard pipeline (Section 2.1). Second, we define the
meta-learning paradigm and its stages, explain how it differs from conventional machine learning,
and describe a popular meta-learning algorithm (Section 2.3). Third, we define continual learning
and its desiderata and contrast them with standard transfer learning and meta-learning (Section 2.4).
Fourth, we present some terminology related to the field of human-like learning in addition to
techniques used to surmount its challenges (Section 2.5).
2.1 Cross-lingual Transfer Learning
Cross-lingual learning is a paradigm of transductive transfer learning, similar to domain adaptation.
However, in the case of cross-lingual transfer learning, the adaptation is learned between different languages instead of domains. Cross-lingual transfer learning distinguishes between source
language(s) and target language(s), where the overall goal is to use the knowledge learned in the
source language(s) to generalize to the target language(s). Ruder et al. (2019b) define a conceptual
pyramidal view of the NLP resource hierarchy, depicted in Figure 2.1.
Figure 2.1: A conceptual view of the resources hierarchy. Image is taken from Ruder et al. (2019b).
Figure 2.2: Cross-lingual Transfer Learning Pipeline: for simplicity, we use 1 and 2 to denote
the high-resource and low-resource languages, respectively. The most desirable yet challenging
application is zero-shot if 1 is the source language the downstream model is fine-tuned on and 2
is the target language the model is applied to; few-shot if the model is fine-tuned and applied on
language 2 in a low-resource setup; or multi-tasking if the model leverages whatever languages it
has access to and fine-tunes one single joint model on them.
Cross-lingual transfer learning is useful because of the great divide between languages. This
means current resources are unbalanced between languages and biased towards high-resource ones.
The higher the amount of data available for a particular language, the more high-resource that
language is. Those resources can be unlabelled monolingual online data (e.g., Wikipedia), curated
parallel corpora, and/or labeled task-specific data.
As shown in Figure 2.2, the general pipeline of cross-lingual transfer learning consists of the
following stages:
• Learning cross-lingual pre-trained language models: General-purpose cross-lingual language models are first pre-trained to generalize across languages and tasks. Learning them
can be done through incorporating different forms and degrees of supervision in the form of
alignment at different levels of granularity (e.g., word, sentence, or document and whether
the pre-training data is parallel or comparable) (Søgaard et al., 2019). Current state-of-the-art
pre-trained models, namely M-BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020),
are trained on large amounts of unlabelled data in different languages without the use of
explicit supervised alignment.
• Fine-tuning on task-specific data: This implies adapting the previously pre-trained model
to the labeled data of the downstream task. This involves adding a few task-specific layers on
top of the multilingual language model. This fine-tuning is usually carried on high-resource
languages (mostly English), where training resources are more available.
• Application to the target language(s): Ideally, if a model has learned sufficient information at the cross-lingual level, it can generalize to the target low-resource languages in a zero-shot manner when the target language is not seen during fine-tuning. Otherwise, if small amounts of training resources in the target languages are fine-tuned on, the application scenario corresponds to few-shot learning. In the case of models that multi-task on multiple languages simultaneously, we speak of cross-lingual multi-task learning. A minimal sketch of this pipeline is given after this list.
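To make the stages above concrete, the following is a minimal, illustrative sketch of the fine-tune-on-English, apply-zero-shot-to-a-target-language recipe on top of a pre-trained multilingual encoder. It is not the training code used in this thesis; the data variables `english_train_examples` and `target_lang_test_examples`, as well as the label count, are placeholders.

```python
# Minimal sketch of the standard cross-lingual transfer recipe (illustrative only):
# fine-tune a pre-trained multilingual encoder on English task data, then apply it
# zero-shot to a target language. Dataset variables are hypothetical placeholders.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

encoder_name = "bert-base-multilingual-cased"  # M-BERT; XLM-R would be "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
model = AutoModelForSequenceClassification.from_pretrained(encoder_name, num_labels=12)
optimizer = AdamW(model.parameters(), lr=2e-5)

def batches(examples, batch_size=32):
    """Yield (encoded_texts, labels) from a list of (text, label) pairs."""
    for i in range(0, len(examples), batch_size):
        texts, labels = zip(*examples[i:i + batch_size])
        enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
        yield enc, torch.tensor(labels)

# 1) Fine-tune on the high-resource source language (English).
model.train()
for enc, labels in batches(english_train_examples):          # placeholder data
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# 2) Zero-shot application: evaluate directly on the target language.
model.eval()
correct = total = 0
with torch.no_grad():
    for enc, labels in batches(target_lang_test_examples):   # placeholder data
        preds = model(**enc).logits.argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"Zero-shot accuracy: {correct / total:.3f}")
```

Few-shot and multi-task variants only change the fine-tuning step: the training data would additionally include a small amount of target-language data, or all available languages, respectively.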
Cross-lingual transfer learning is especially successful when source and target languages share
commonalities: lexical, structural, and/or semantic properties. However, it becomes more challenging as the similarities between source and target languages decrease, as shown in Pires et al. (2019).
Although multilingual language models like M-BERT are successful in zero-shot applications to
the target language(s), even with no significant lexical overlap with the source language(s), the
quality of the transfer is reduced for languages exhibiting different word orders. This suggests that
pre-trained multilingual models like M-BERT do not effectively learn transformations of linguistic
structures to accommodate new languages.
2.2 Downstream Base Models
In this subsection, we define the architectures of some commonly used downstream tasks in our
evaluation.
Multilingual Task-Oriented Dialog (MTOD) A task-oriented dialogue system is a goal-oriented
system, often decomposed into three components: natural language understanding (NLU), dialogue
management (DM), and natural language generation (NLG). The Multilingual Task-Oriented
Dialogue (MTOD) benchmark, defined by Schuster et al. (2019a), focuses on the NLU module. This
component is critical to the well-functioning of goal-oriented systems. It consists of two sub-tasks:
intent classification and slot filling. Together, they are relaxed forms of predicate extraction and
semantic role labeling in frame semantic parsing. Formally, given an utterance of length $n$, $x = (x_1, x_2, \ldots, x_n)$, intent classification is a classification problem that predicts the speaker's intention (along with its domain) label $y^{int}$. Slot filling is a sequence labeling problem that assigns a slot label to each word in the utterance: $y^{slot} = (y^{slot}_1, y^{slot}_2, \ldots, y^{slot}_n)$. A BIO annotation is used to ease the one-to-one mapping between words and slot labels. This makes it possible to deal with slot labels spanning over more than one word (labeled with 'B' for the beginning word and 'I' for every other word) and words that cannot be mapped to a slot label (labeled with 'O'). Figure 2.3 shows an example of an utterance in English and its translation to Thai, along with their intent and slot labels.
Similar to the architecture in Castellucci et al. (2019), we jointly model MTOD's intent classification and slot-filling subtasks. We use a joint text classification and sequence labeling framework with feature representation based on a Transformer (Vaswani et al., 2017) encoder. More specifically, given a multilingual pre-trained encoder model, we use it to initialize the word-piece embedding layer. Then, we add on top of it a text classifier to predict the intent from the [CLS] token representation and a sequence labeling layer in the form of a linear layer to predict the slot spans (in BIO annotation), as shown in Figure 2.4. We optimize parameters using the sum of both intent and CRF-based slot losses.
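The joint architecture just described can be sketched roughly as follows. This is not the thesis implementation: the CRF transition layer over slot tags is replaced with a plain softmax head for brevity, and the intent/slot class counts are placeholder values.

```python
# Compact sketch of a joint intent + slot model: the [CLS] encoding feeds an intent
# classifier, and per-token encodings feed a slot classifier. The CRF layer used in
# the thesis is omitted here for brevity.
import torch.nn as nn
from transformers import AutoModel

class JointIntentSlotModel(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 num_intents=12, num_slots=23):          # placeholder label counts
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)
        self.slot_head = nn.Linear(hidden, num_slots)

    def forward(self, input_ids, attention_mask, intent_labels=None, slot_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state                    # (batch, seq_len, hidden)
        intent_logits = self.intent_head(tokens[:, 0])    # [CLS] position
        slot_logits = self.slot_head(tokens)              # one prediction per word piece
        loss = None
        if intent_labels is not None and slot_labels is not None:
            ce = nn.CrossEntropyLoss(ignore_index=-100)   # -100 masks padding/sub-tokens
            loss = ce(intent_logits, intent_labels) + \
                   ce(slot_logits.view(-1, slot_logits.size(-1)), slot_labels.view(-1))
        return intent_logits, slot_logits, loss
```

In the thesis model, the slot head's emissions feed a CRF layer, and the training objective is the sum of the intent loss and the CRF-based slot loss, as stated above.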
Figure 2.3: An example of an utterance in English and its translation to Thai labeled with intent:
SET_ALARM and slots using BIO annotation.
Figure 2.4: The architecture of MTOD model. An input utterance "Set alarm at 7 am" is encoded
using M-BERT. The [CLS] encoding is used to detect the utterance’s intent: "Alarm/Set_Alarm";
whereas the other token encodings are fed into a CRF + Slot Classifier to detect the slot labels in
BIO annotation.
Multilingual Question Answering (MQA) Question answering is a reading comprehension task
consisting of answering a question given a document or a specific context. This is a challenging
task, as it requires understanding documents in natural language and a grounded knowledge of the
world. Figure 2.5 shows examples in English and Arabic taken from TyDiQA (Clark et al., 2020).
TyDiQA is an extension of general question answering (often language-specific) to typologically
diverse languages without relying on machine translation. Given the question “Who is the mayor
of Toronto?” and the context “The 65th and current mayor of Toronto is John Tory, in office since
December 1, 2014.”, the goal is to teach a machine to detect the answer span in the form of word
start and end offsets from the context. The answer to the previous question is “John Tory”.
Inspired by Hu et al. (2020), we apply to MQA the same architecture as the original BERT fine-tuning procedure for question answering on SQuAD (Devlin et al., 2019). Specifically, the input question (after prepending it with a [CLS] token) and the context are concatenated as a single packed sequence separated by a [SEP] token. Then, the embeddings of the context are fed to a linear layer plus a softmax to compute the probability that each token is the START or END of the answer. The whole architecture is fine-tuned by optimizing for the joint loss over the START and END predictions. Figure 2.6 illustrates the architecture.
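A minimal sketch of this span-prediction setup is shown below, reusing the TyDiQA example from Figure 2.5. The checkpoint name is a placeholder, and its QA head is randomly initialized, so the decoded span is arbitrary before fine-tuning.

```python
# Sketch of the span-prediction formulation: the question and context are packed into one
# sequence, and a linear layer scores each token as a potential START or END of the answer.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

encoder_name = "bert-base-multilingual-cased"   # placeholder multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
model = AutoModelForQuestionAnswering.from_pretrained(encoder_name)

question = "Who is the mayor of Toronto?"
context = ("The 65th and current mayor of Toronto is John Tory, "
           "in office since December 1, 2014.")

# [CLS] question [SEP] context [SEP] packed into a single sequence.
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Greedy decoding of the most probable start/end offsets, for illustration only.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
answer_ids = inputs["input_ids"][0, start:end + 1]
print(tokenizer.decode(answer_ids))   # arbitrary before the QA head is fine-tuned
```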
Figure 2.5: An example of question, context, and answer triplets in TyDiQA: The answer is framed
in green within the context.
Figure 2.6: Question answering base model: The question and its corresponding context are encoded
using M-BERT. Then, a linear layer classifies the answer span to the question within the context.
2.3 Meta-learning
In the early days of machine learning, learning was focused on optimizing over examples from a given training dataset for a particular task. Given a training dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, the goal is to find the optimal parameters $\theta^*$ that minimize the loss $\mathcal{L}$ of the predictive model, as follows:

$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta; \omega; \mathcal{D}), \qquad (2.1)$

where $\omega$ is some already acquired prior knowledge or assumption on how to learn (Hospedales et al., 2020).
Unlike conventional machine learning, which assumes prior knowledge to perform well on a particular task, meta-learning or learning to learn aims to learn with less training data and assumptions.
This is achieved by intelligently anticipating a process for learning prior knowledge, whether through better optimization strategies, parameter initialization, generalizable model architectures,
and more. While conventional machine learning focuses on one task at a time, which can be
anything from low-resource to high-resource, meta-learning considers distributions over many
meta-tasks, focusing more on low-resource setups.
Meta-tasks are made up of many simulations of the same task. Each meta-task is defined as a tuple made of a support set $S$ and a query set $Q$, which are labeled samples. The support and query sets can differ in the task, domain, or language used to sample them. Let $X$ and $X'$ denote the tasks, domains, or languages used to sample $S$ and $Q$, respectively. In the inner loops of meta-learning, the loss on $Q$ from a model trained on $S$ is used to adapt the initialization point. $S$ and $Q$ simulate the role of train and validation subsets. The end goal is to learn from the support sets to generalize on the query set's domain. Like conventional learning, meta-learning has meta-training $\mathcal{D}_{\text{meta-training}} = \{\mathcal{D}^{\text{train}}_{\text{support}}, \mathcal{D}^{\text{train}}_{\text{query}}\}$, meta-testing $\mathcal{D}_{\text{meta-testing}} = \{\mathcal{D}^{\text{test}}_{\text{support}}, \mathcal{D}^{\text{test}}_{\text{query}}\}$, and optionally meta-validation $\mathcal{D}_{\text{meta-validation}}$ sets and phases. During meta-training, we start by optimizing the prior knowledge $\omega$:

$\omega^* = \arg\min_{\omega} \mathcal{L}(\omega \mid \mathcal{D}_{\text{meta-training}}). \qquad (2.2)$

$\omega^*$ is the meta-knowledge acquired through meta-training. This is leveraged along with the new support set in the meta-testing dataset $\mathcal{D}^{\text{test}}_{\text{support}}$ during meta-testing to quickly adapt to $\mathcal{D}^{\text{test}}_{\text{query}}$, as follows:

$\theta^* = \arg\min_{\theta} \mathcal{L}(\theta \mid \omega^*, \mathcal{D}^{\text{test}}_{\text{support}}). \qquad (2.3)$

Compared with Equation 2.1, Equation 2.3 is conditioned on the prior knowledge (i.e., meta-knowledge) which is already learned. It is also worth noting the distinction between applying all acquired knowledge from all seen support tasks directly to a new set versus using the acquired meta-knowledge to learn how to learn from a new support set $\mathcal{D}^{\text{test}}_{\text{support}}$. The former is undesirable, denoted as over-fitting, and shifts away from the promised task generalization we are vigorously seeking.
We can define meta-learning using a bi-level optimization view. This is one way to perform the
meta-training. It is mainly based on optimization-based approaches simulating the inner and outer
loops. The inner loop is specialized in learning optimization over the support set; the outer loop, on
the other hand, learns the generalization over the query set in a leader-follower manner.
Model-Agnostic Meta-Learner (MAML) (Finn et al., 2017) is an instance of optimization-based
approaches. It is a model-agnostic approach, which implies it can be compatible with any network
architecture. We denote the downstream network architecture that MAML is applied to as the base
model. Thus, we distinguish between an upstream learning stage consisting of MAML applied to
a downstream learning stage consisting of the base model. As depicted in Algorithm 1, MAML
learns an optimal initialization using the following steps:
• A copy of the base model with its initial parameters $\theta$ is created. In the original MAML algorithm, those parameters are usually initialized randomly.
• Given a batch of tasks $\mathcal{T} = \{T_1, T_2, \ldots, T_B\}$, the algorithm performs a forward pass over the model using the support set of each task and performs gradient updates with step size $\alpha$ for a few steps $n$. For each task $T_i$, the updated parameters are stored in $\theta'_i$. This process is referred to as the inner loop and explained in Algorithm 2.
• At the end of each task, the model is applied to the query set of each task. The accumulation
of the task-specific losses on the query sets is saved for later use (Line 10 in Algorithm 2).
• At the end of all tasks, the algorithm updates the initial parameters using one pass over the
accumulated query loss. This is referred to as the outer loop or meta-optimization (Line 5 in
Algorithm 1). This optimization with respect to the initial model parameters over an objective
function computed using the updated parameters makes it possible to maximize the effect of
small changes to the parameters (Finn et al., 2017).
Algorithm 1 MAML Algorithm for Transfer Learning from $X$ to $X'$
Require: Meta-task set distribution $\mathcal{D}_{X \rightarrow X'}$ simulating transfer from $X$ to $X'$ task, domain, or language variants; a parameterized downstream base model with parameters $\theta$; step size hyper-parameters $\alpha$ and $\beta$; and number of training steps $n$.
1: Randomly initialize $\theta$.
2: while not done do
3:   Sample a batch of meta-tasks $\mathcal{T} = \{T_1, \ldots, T_B\} \sim \mathcal{D}_{X \rightarrow X'}$.
4:   $\mathcal{L}_{\mathcal{T}}, \mathcal{L}'_{\mathcal{T}}$ = INNER_LOOP($\mathcal{T}$, $\theta$, $\alpha$, $n$).
5:   Outer Loop: Update $\theta \leftarrow \theta - \beta \nabla_{\theta} \mathcal{L}'_{\mathcal{T}}$.
6: end while
The original MAML optimization in the outer loop has expensive computation, as it is hard to optimize for the accumulation of the losses evaluated on $\theta'_n$ with respect to $\theta$. Expanding the outer loop requires using the chain rule as follows:1

$\theta \leftarrow \theta - \beta \, \frac{\partial \mathcal{L}'(\theta'_n)}{\partial \theta'_n} \cdot \frac{\partial \theta'_n}{\partial \theta}. \qquad (2.4)$

1For simplicity of notation, we drop the sum and consider the update rule based on one task $T_i$.
Algorithm 2 INNER_LOOP
1: function INNER_LOOP($\mathcal{T}$, $\theta$, $\alpha$, $n$)
2:   $\mathcal{L}_{\mathcal{T}} \leftarrow 0$.
3:   $\mathcal{L}'_{\mathcal{T}} \leftarrow 0$.
4:   for each $T_i = (S_i, Q_i)$ in $\mathcal{T}$ do
5:     Initialize $\theta'_0 \leftarrow \theta$.
6:     for $k = 1 \ldots n$ do
7:       Update $\theta'_k = \theta'_{k-1} - \alpha \nabla_{\theta'_{k-1}} \mathcal{L}_{T_i}(\theta'_{k-1})$.
8:       $\mathcal{L}_{\mathcal{T}} \leftarrow \mathcal{L}_{\mathcal{T}} + \mathcal{L}_{T_i}(\theta'_k)$. ⊲ Modified from the original algorithm, we save the accumulated support loss when the knowledge distillation extension is used (Algorithm 4).
9:     end for
10:    $\mathcal{L}'_{\mathcal{T}} \leftarrow \mathcal{L}'_{\mathcal{T}} + \mathcal{L}'_{T_i}(\theta'_n)$. ⊲ Accumulated query loss is saved for the outer loop.
11:  end for
12:  return $\mathcal{L}_{\mathcal{T}}$, $\mathcal{L}'_{\mathcal{T}}$.
13: end function
where $\frac{\partial \theta'_n}{\partial \theta}$ can be further expanded as follows:

$\frac{\partial \theta'_n}{\partial \theta} = \left(\frac{\partial \theta'_n}{\partial \theta'_{n-1}}\right) \cdot \left(\frac{\partial \theta'_{n-1}}{\partial \theta'_{n-2}}\right) \cdots \left(\frac{\partial \theta'_2}{\partial \theta'_1}\right) \cdot \left(\frac{\partial \theta'_1}{\partial \theta}\right) = \prod_{k=1}^{n} \left(I - \alpha \nabla_{\theta'_{k-1}}\big(\nabla \mathcal{L}(\theta'_{k-1})\big)\right),$

where $\theta'_0 = \theta$ and the terms $\nabla_{\theta'_{k-1}}\big(\nabla \mathcal{L}(\theta'_{k-1})\big)$ are second-order gradients.
First Order MAML (FoMAML) (also proposed in Finn et al. (2017)) relaxes it by omitting those second-order gradient terms and taking the gradients directly with respect to $\theta'_n$ instead of $\theta$. In other words, it only computes the gradients on the last version of the "fake" parameters (i.e., the parameters at the last task, which are $\theta'_B$ in a batch of tasks of size $B$). So, the equation in Line 5 in Algorithm 1 becomes as follows:

$\theta \leftarrow \theta - \beta \, \nabla_{\theta'} \sum_{i=1}^{B} \mathcal{L}'_{T_i}(\theta'_i). \qquad (2.5)$
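As a rough illustration of Algorithms 1-2 under the first-order approximation of Equation 2.5, the sketch below adapts a copy of the model on each task's support set and applies the accumulated query-set gradients to the initial parameters. The function `compute_loss` and the structure of `meta_batch` are assumptions for the sketch, not part of the thesis code.

```python
# A first-order MAML (FoMAML) sketch in the spirit of Algorithms 1-2 and Equation 2.5.
# Assumptions: `compute_loss(model, batch)` returns a scalar loss, and `meta_batch` is a
# list of (support_batches, query_batch) pseudo-tasks; both are placeholders.
import copy
import torch

def fomaml_step(model, meta_batch, alpha=1e-3, beta=1e-3, inner_steps=3):
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support_batches, query_batch in meta_batch:
        # Inner loop: adapt a copy of the initial parameters on the support set.
        learner = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=alpha)
        for _ in range(inner_steps):
            for batch in support_batches:
                inner_opt.zero_grad()
                compute_loss(learner, batch).backward()
                inner_opt.step()
        # Query loss on the adapted parameters; first-order approximation: gradients are
        # taken w.r.t. the adapted parameters and applied to the initial ones.
        learner.zero_grad()
        compute_loss(learner, query_batch).backward()
        for acc, p in zip(meta_grads, learner.parameters()):
            acc += p.grad.detach()
    # Outer loop: update the initial parameters with the accumulated query gradients.
    with torch.no_grad():
        for p, acc in zip(model.parameters(), meta_grads):
            p -= beta * acc / len(meta_batch)
```

Full (second-order) MAML would instead back-propagate through the inner updates, which is precisely what makes the original outer loop expensive.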
2.4 Continual Learning
Continual learning is a machine learning paradigm, first introduced in Ring (1995), which aims to
learn continuously over evolving datastream from different tasks, domains, and/or classes. Unlike
transfer learning and meta-learning paradigms, where the goal is mainly to generalize well to new
tasks, continual learning is concerned with comprehensively improving different types of transfer in
addition to other practical considerations. Continual learning aims to improve the generalization on future tasks while continuously improving on previously seen tasks, or at least not hurting the performance on them; this degradation of performance on previously seen tasks is known as catastrophic forgetting (French, 1993). We include below a more comprehensive list of continual learning desiderata as defined by Biesialska et al. (2020):
• Knowledge retention: The capability to combat catastrophic forgetting on previously seen
tasks.
• Backward transfer: The capability not only to combat catastrophic forgetting but also to
improve performance on previously seen tasks.
• Forward transfer: The capability to leverage knowledge from previously seen tasks to
generalize to new tasks.
• On-line learning: The ability to learn from a continuously evolving data stream.
• No task boundaries: The ability to learn without requiring clear task or data boundaries.
• Fixed model capacity: The ability to keep the memory size or model capacity fixed regardless
of the length of the datastream or the number of tasks involved.
Figure 2.7: Illustration of continual learning: given a stream of tasks, forward transfer leverages knowledge from old tasks to improve on new tasks, while backward transfer uses new tasks to improve on older ones. Without loss of generality, we illustrate forward and backward transfer for only task 1 and task n, respectively.
The above desiderata are not meant to be all satisfied by all continual learning systems. Diverse
continual learning systems often look at different subsets depending on the application. We illustrate
forward and backward transfer in continual learning in Figure 2.7.
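The exact metrics used later in this dissertation are defined in the corresponding chapters; as a purely illustrative aid, the sketch below shows one common way to quantify backward and forward transfer from a results matrix, where R[i][j] is the score on language j after sequentially training through the i-th language. All numbers in the usage example are invented.

```python
# Illustrative computation of backward and forward transfer from a results matrix.
# R[i][j]: test score on language j after training sequentially on languages 0..i.
# baseline_scores[j]: score on language j before any training (e.g., an untrained model).
def transfer_metrics(R, baseline_scores):
    T = len(R)
    # Backward transfer: final score vs. the score obtained right after each language
    # was learned (negative values indicate forgetting).
    backward = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
    # Forward transfer: score on a language just before training on it, relative to an
    # untrained baseline (positive values indicate helpful previously seen languages).
    forward = sum(R[j - 1][j] - baseline_scores[j] for j in range(1, T)) / (T - 1)
    return {"backward_transfer": backward, "forward_transfer": forward}

# Toy example with a stream of three languages and accuracies in [0, 1]:
R = [[0.80, 0.31, 0.25],
     [0.74, 0.78, 0.30],
     [0.70, 0.72, 0.77]]
print(transfer_metrics(R, baseline_scores=[0.25, 0.24, 0.26]))
```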
2.5 Human-Like Learning
Human-like learning is a field that closely verifies and questions the analogy between how machines
learn and how humans actually learn. It investigates how replicating psychologically inspired
methodologies in machine learning setups can be beneficial (Langley, 2022). Other closely related
fields include:
• Human-in-the-loop: This refers to the set of approaches that use human feedback to improve
machine learning processes. Unlike human-like learning, this field is not concerned with
understanding how humans learn but uses the explicit supervision of humans in the form of
direct feedback.
• Human-computer interaction: This is concerned with the design of computers that can
better interact with humans. Unlike human-like learning, this does not rely on human learning
strategies but rather on understanding human needs.
• Machine-assisted learning: This differs from human-like learning in that it is not about using
human processes to inspire machines but the design of AI that surpasses human capabilities.
2.5.1 Forgetting Curve
One of the simplest and most famous models of human memory retention is Ebbinghaus's exponential forgetting curve (Ebbinghaus, 1885), portrayed in Figure 2.8. Suppose we want to learn a new concept. Usually, the retention
starts decaying after some time of no exposure. How can we flatten this curve towards long-term
retention? The idea is to schedule revisions strategically. Suppose we can predict the time when
the review should happen. In that case, we can plan prospective revision intervals, making them
gradually longer as the information becomes more familiar to achieve long-term memory retention
eventually. This is what spaced repetition is all about.
Figure 2.8: Illustration of forgetting curve for humans.
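For reference, the curve in Figure 2.8 is commonly parameterized (this functional form is an illustrative assumption on our part, not one stated in the text) as an exponential decay of retention $R$ over time $t$ with a memory-stability constant $S$; each well-timed review effectively increases $S$ and flattens the curve:

```latex
% Common exponential parameterization of the forgetting curve (illustrative assumption):
% R(t) is retention after t units of time without review; S is the memory stability.
R(t) = e^{-t/S}
```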
2.5.2 Leitner Queues
One implementation of spaced repetition in machine learning is Leitner Queues, for strategically scheduling when humans should review flashcards more effectively than cramming or massed learning. Algorithm 3 describes how the general Leitner queues system is used as a spaced repetition mechanism to sample data and schedule when to review it. The Leitner queues system $Q$ consists of $n$ queues $Q = \{q_1, q_2, \ldots, q_n\}$. Initially, all training instances from $D$ are placed in the lowest queue $q_1$. Then, at each epoch, the Leitner scheduler chooses a batch of data to train from. Often, a sampling policy based on a static scheduler is used to pick the queues to review from as a function of the training epoch (Reddy et al., 2016). The most popular scheduler samples the dataset $\mathcal{D}(ep)$ to be reviewed at each epoch $ep$ as follows:

$\mathcal{D}(ep) = \begin{cases} q_1 & ep = 1, 3, 5, 7, 9, \ldots \\ q_1 \cup q_2 & ep = 2, 6, 10, \ldots \\ q_1 \cup q_2 \cup q_3 & ep = 4, 8, \ldots \\ \;\vdots \end{cases} \qquad (2.6)$
At the end of each epoch, instances in all queues are evaluated. Instances correctly recalled
are promoted to the next queue, whereas instances not correctly recalled are demoted to the
previous queue (Reddy et al., 2016). As the training progresses, more items make their way to
later queues such that higher queues will accumulate easier instances for the learner while harder
instances stay in lower queues. Our preliminary evaluation shows that the direct application of
this spaced repetition technique doesn’t lead to a gain in performance for our use cases. So, we
repurpose Leitner queues as a technique for choosing what to replay instead of when to replay it.
Figure 2.9: Illustration of Leitner queues system.
Algorithm 3 Leitner Queues System
Require: Training data $D$, base model $M$, number of training epochs $k$, and number of queues $n$.
Ensure: Trained model $M$.
1: $Q = \{q_1, q_2, \ldots, q_n\}$, where $q_i = [\,]$, $\forall i \in [1, n]$
2: Populate $q_1 = [D]$
3: for $ep \leftarrow 1$ to $k$ do
4:   $\mathcal{D}(ep) \leftarrow$ SAMPLE($ep$, $Q$, $n$) ⊲ SAMPLE follows Equation 2.6
5:   for batch $\in \mathcal{D}(ep)$ do
6:     $M \leftarrow$ train($M$, batch)
7:   end for
8:   $\mathcal{D}_{eval} \leftarrow q_1 + q_2 + \ldots + q_n$
9:   for $d \in \mathcal{D}_{eval}$ do
10:    if $M(d) == 0$ then ⊲ Item d is not correctly recalled by the current model.
11:      q[d] = max(q[d] - 1, 1) ⊲ Demote the item d to the previous queue.
12:    else
13:      q[d] = min(q[d] + 1, n) ⊲ Promote the item d to the next queue.
14:    end if
15:  end for
16: end for
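A compact Python rendering of Algorithm 3 is given below, assuming that queue $q_i$ is reviewed every $2^{i-1}$ epochs (matching the schedule in Equation 2.6) and that `train_on` and `is_correct` stand in for the task-specific training step and per-item evaluation.

```python
# Sketch of the Leitner queues system in Algorithm 3. `train_on` and `is_correct` are
# placeholders for the task-specific training step and per-item model evaluation.
def review_schedule(epoch, num_queues):
    """Queue q_i (0-indexed) is reviewed every 2**i epochs: q_1 every epoch, q_2 every 2nd, ..."""
    return [i for i in range(num_queues) if epoch % (2 ** i) == 0]

def leitner_training(data, model, num_epochs, num_queues):
    queue_of = [0] * len(data)                     # all items start in the lowest queue
    for epoch in range(1, num_epochs + 1):
        reviewed = review_schedule(epoch, num_queues)
        for idx, item in enumerate(data):          # train on items in the reviewed queues
            if queue_of[idx] in reviewed:
                model = train_on(model, item)      # placeholder training step
        # Evaluate all queues: promote correctly recalled items, demote the rest.
        for idx, item in enumerate(data):
            if is_correct(model, item):            # placeholder evaluation
                queue_of[idx] = min(queue_of[idx] + 1, num_queues - 1)
            else:
                queue_of[idx] = max(queue_of[idx] - 1, 0)
    return model
```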
Chapter 3
Cross-lingual and Multilingual Few-Shot Learning
Recent advances in pre-trained models have made them extensible to cross-lingual downstream tasks
through fine-tuning. This raises doubts about their generalization capabilities to different languages
and downstream tasks. Meta-learning or learning to learn is a technique that has been leveraged for
multiple applications but is under-explored for cross-lingual and multilingual applications. In this
Chapter, we start by investigating the performance of cross-lingual fine-tuning on top of state-of-the-art pre-trained models (Section 3.1). We analyze their gains and challenges for low-resource
applications in zero-shot, few-shot, and joint multilingual scenarios. Then, we propose meta-learning
algorithms and adapt them to both cross-lingual and multilingual applications in Sections 3.2 and 3.3,
respectively. We evaluate their ability to learn a fast adaptation and generalization to low-resource
languages with less data.
3.1 Cross-lingual Direct Fine-Tuning: Contextualized Cross-Lingual Event Trigger Extraction Case Study
Event trigger extraction, as defined by the Automatic Content Extraction multilingual evaluation
benchmark (ACE2005) (Walker, 2006), is a subtask of event extraction that requires systems to
detect and label the lexical instantiation of an event, known as a trigger. As an example, in the
sentence "John traveled to NYC for a meeting", traveled is a trigger of a Movement-Transport
event. Trigger detection is typically the first step in extracting the structured information about
an event (e.g., the time, place, and participant arguments; distinguishing between past, habitual,
and future events). This definition of the task restricts it to events that can be triggered explicitly
by actual words and makes it context-vulnerable: the same event might be expressed by different
triggers, and a specific trigger can represent different event types depending on the context.
Event trigger extraction is challenging as it involves understanding the context in order to
be able to identify the event to which the trigger refers. Figure 3.1 shows two examples where
context plays a crucial role in disambiguating the word sense of leaving, which is a trigger for a
Movement-Transport event in the first sentence and for an End-Position event in the second sentence.
Due to the complexity of the task and the difficulty in constructing a standard annotation scheme,
there exists limited data for only a few languages.

Figure 3.1: Examples showing the context-dependent nature of trigger extraction: different events are triggered by the same trigger word depending on the context. Event triggers are circled and their corresponding articles are highlighted.

The earliest work has focused mainly on English, for which there are relatively many annotated sentences, and relies extensively on language-specific
linguistic tools to extract the lexical and syntactic features that need to be computed as a pre-requisite
for the task (Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Li et al., 2013).
Simply generating annotated corpora for each language of interest is not only costly and time-consuming, but it is also not necessarily guaranteed to address the "Great NLP Divide", where
performance depends on the language, the ability to generate language-specific features, and the
quality tools, in particular, syntactic parsers, available for each language. In an attempt to reduce the
great NLP divide, we observe a tendency drifting away from linguistic features and more towards
continuous distributed features that can be obtained without hand-engineering, based simply on
publicly available corpora. Recently, approaches have tried to overcome the limitation of traditional
lexical features, which can suffer from the data sparsity problem and inability to fully capture
the semantics of the words, by making use of sequential modeling methods including variants of
convolutional neural networks (Chen et al., 2015), probabilistic soft logic (Liu et al., 2016), recurrent
neural networks (Nguyen et al., 2016; Sha et al., 2018), and attention-based graph convolutional
network (Liu et al., 2018b).
Existing approaches that take into consideration the cross-lingual aspect of event trigger extraction tend to take inspiration from machine translation, distant supervision, or multitasking. Machine
translation is used by Liu et al. (2018a) to project monolingual text to parallel multilingual texts to
model the confidence of clues provided by other languages. However, this approach suffers from
error propagation of machine translation and adds a latency overhead caused by additional API
calls.
Another approach relies on multilingual embedding, which can be pre-trained beforehand on
large monolingual corpora, using no explicit parallel data and bridging the gap between different
languages by learning a way to align them into a shared vector space. Their ability to learn a
common representation of words across languages makes them attractive to numerous downstream
NLP applications. Multilingual Unsupervised and Supervised embedding (MUSE) comes up with
a framework for training cross-lingual embedding in an unsupervised manner, which leads to
competitive results, even compared to supervised approaches (Lample et al., 2018). However, there is no prior work leveraging this kind of representation for cross-lingual event trigger extraction. More recently, BERT, a deep bidirectional representation which jointly conditions on both left and right context (Devlin et al., 2019), was proposed; unlike MUSE, it is a contextualized encoder that has been shown to achieve state-of-the-art performance on many NLP tasks. However, it has not been applied to event trigger extraction.
In this part of the Chapter, we investigate the possibility of automatically learning effective
features from data while relying on zero language-specific linguistic resources. Moreover, we
explore the application of multilingual embedding to the event trigger extraction task in a direct
transfer of annotation scheme where ground truth is only needed for one language and can be used
to predict labels in other languages and other boosted and joint multilingual schemes. We perform
a systematic comparison between training using monolingual versus multilingual embedding and
the difference in gain on performance with respect to different train/test language(s) pairs. We
evaluate our framework against two embedding approaches: type-based unsupervised embedding
(MUSE) and contextualized embedding (BERT). For the latter, we demonstrate that our proposed
model1
achieves better performance for all languages compared to an exhaustive list of benchmarks
for event extraction on the ACE2005 dataset.
Our main contributions2
can be summarized as follows:
(1) We apply different state-of-the-art NN architectures for sequence tagging on trigger extraction
and compare them to feature-based baselines and multilingual projection-based models
(Section 3.1.1).
(2) We evaluate the effectiveness of a multilingual approach using zero-shot transfer learning,
targeted cross-lingual, and joint multilingual training schemes (Section 3.1.2.2).
(3) We achieve better performance using contextualized word representation learning in event
trigger extraction backed up with both quantitative and qualitative analysis (Section 3.1.3).
(4) We are the first to investigate event trigger extraction in Arabic.
3.1.1 Methodology
We treat trigger extraction as a sequence tagging problem for which we start by designing a basic
state-of-the-art approach for sequence tagging based on bidirectional Long Short Term Memory
(bi-LSTM) with word and character embedding and a CRF layer on top of it. Then, we describe
an approach that trains BERT with a CRF layer for the task. In both architectures, the trigger labels are encoded in BIO notation, which differentiates between the beginning (B) and inside (I) of a trigger and tokens with no associated trigger label (O).
3.1.1.1 Bi-LSTM-Char-CRF networks
The Bi-LSTM-Char-CRF model for sequence tagging is a hierarchical neural network model based on three components: character-level using character embedding, word-level using bi-LSTM over word embedding, and sequence-level using CRF. The architecture of the model is depicted in Figure 3.2.
1Our source code is openly available at https://github.com/meryemmhamdi1/CONLL2019-Multi-EvDet-MUSE-BERT.
2This work was published as a long paper in the Proceedings of the 23rd Conference on Computational Natural Language Learning (M'hamdi et al., 2019).
Figure 3.2: Event trigger extraction base model architectures: In Bi-LSTM-Char-CRF architecture,
we feed word-level features to bidirectional-LSTMs (bi-LSTM). We also use bi-LSTM to obtain
character embeddings. For BERT-CRF, we use BERT as an off-the-shelf encoder model which
encapsulates different levels of encodings. Then, CRF is used on top of both architectures to learn
the inter-dependencies between output trigger tags in BIO annotation and find the optimal tag
sequence.
Bi-LSTM networks LSTMs (Hochreiter and Schmidhuber, 1997) are variants of RNNs that help learn long-range dependencies efficiently thanks to their use of memory and forget cells. Those cells help control the amount of the input to be retained/forgotten from previous states. Given an input character or word embedding representation $x_t$ for a given time step $t$, we use bidirectional LSTMs by encoding features in their forward $\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(x_t)$ and backward $\overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(x_t)$ directions and concatenating them $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$ to capture information from both the past and the future.
Character Embedding Character embeddings are used to capture orthographic patterns and to
deal with out-of-vocabulary words, especially in the cross-lingual setting. We follow the same setup
as Lample et al. (2016) to obtain character embedding using bi-LSTM. Specifically, we concatenate
both character and word-level features and use a bi-LSTM on top of that.
CRF Layer The encoded character and word-level features are fed to a CRF layer to learn inter-dependencies between output trigger tags and find the optimal tag sequence. This layer simulates bi-LSTM in its use of past and future tags to predict the current tag. Following Lafferty et al. (2001), CRF layers define a transition matrix $A$ and use a score $A_{i,j}$ to model the transition from the $i$-th state to the $j$-th state for a pair of consecutive time steps. The score $[f_\theta]_{i,t}$ is the score output by the network with parameters $\theta$, for the sentence $[x]_1^T$ and the $i$-th tag, at the $t$-th word. The score of a sequence of tags $[i]_1^T$ for a particular sequence of words $[x]_1^T$ is the sum of transition scores and network scores, which are computed efficiently using dynamic programming:

$s([x]_1^T, [i]_1^T) = \sum_{t=1}^{T} \left( A_{[i]_{t-1},[i]_t} + [f_\theta]_{[i]_t, t} \right) \qquad (3.1)$
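For concreteness, the sequence score of Equation 3.1 can be computed as in the sketch below; a full CRF layer additionally needs the log-partition function for training and Viterbi decoding at inference, both omitted here. The tensors in the usage example are random placeholders.

```python
# Sketch of the CRF sequence score in Equation 3.1: the sum of tag-transition scores and
# per-token emission scores produced by the network. Training additionally requires the
# log-partition function, and decoding uses Viterbi search; both are omitted for brevity.
import torch

def crf_sequence_score(emissions, tags, transitions):
    """emissions: (seq_len, num_tags) network scores; tags: (seq_len,) gold tag indices;
    transitions[i][j]: score of moving from tag i to tag j."""
    score = emissions[0, tags[0]]
    for t in range(1, tags.size(0)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

# Tiny usage example with random scores over 5 tokens and 7 BIO tags:
emissions = torch.randn(5, 7)
tags = torch.tensor([0, 2, 3, 0, 1])
transitions = torch.randn(7, 7)
print(crf_sequence_score(emissions, tags, transitions))
```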
3.1.1.2 BERT-CRF
BERT is a multi-layer bidirectional transformer encoder, an extension to the original Transformer
model (Vaswani et al., 2017). The input representation consists of a concatenation of WordPiece
embedding (Wu et al., 2016), positional embedding, and the segment embedding. A special token
([CLS]) is inserted at the beginning of each sentence and another special token ([PAD]) is used
to normalize the length of sentences (no ([SEP]) token is used in this case). The pre-trained
BERT model provides a powerful contextualized representation which gives the state-of-the-art
performance for many NLP tasks. We use BERT-CRF, which adds a CRF layer on top of BERT’s
contextualized embedding layer.
3.1.2 Experimental Setup
In this Section, we describe the dataset used (Section 3.1.2.1), transfer learning schemes for our
internal evaluation (Section 3.1.2.2), external baselines (Section 3.1.2.3), and the hyperparameters
and other implementation details (Section 3.1.2.4).
3.1.2.1 Dataset
We evaluate our approach on the ACE2005 sentence-level event mention multilingual benchmark.3
This dataset is annotated with 33 event subtypes, which, when represented in BIO annotation, results in a 67-way labeling task. For a sound comparison, we use the same data splits as the English and Chinese baseline systems (as detailed in Section 3.1.2.3). To the best of our knowledge,
there are no Arabic benchmark systems, so we produced our own split. Statistics of the split for
train, validation, and testing for the three languages: English (EN), Chinese (ZH), and Arabic (AR)
are included in Table 3.1.
3.1.2.2 Evaluation
We design different experiments for the evaluation of trigger extraction, where we train several
language-specific and multilingual models using different embedding and sequence labeling architectures. We evaluate the following training schemes:
• Monolingual Baselines: We train and fine-tune on EN, ZH, or AR using monolingual FastText
embedding and testing on the trained language.
3https://catalog.ldc.upenn.edu/LDC2006T06.
| Language | ISO | Train #doc | Train #triggers | Dev #doc | Dev #triggers | Test #doc | Test #triggers |
|---|---|---|---|---|---|---|---|
| English | EN | 529 | 3,334 | 30 | 347 | 40 | 293 |
| Chinese | ZH | 557 | 1,680 | 32 | 88 | 44 | 163 |
| Arabic | AR | 354 | 1,451 | 21 | 86 | 28 | 113 |

Table 3.1: Number of documents and triggers per language and split in the ACE2005 dataset.
• Zero-Shot learning experiments: As depicted in Figure 3.3, we train and fine-tune on EN
using multilingual embedding (MUSE or BERT(multi)) and test on ZH and AR assuming no
resources for those languages. To simplify experiments, we evaluate direct transfer only from
EN since it is a high-resource target language for learning projections needed in multilingual
embedding. We also believe AR and ZH are not good language-pair candidates, so we expect
training on AR and testing on ZH and EN or training on ZH and testing on AR and EN would
not lead to improvements.
• Targeted cross-lingual experiments: For each test language (ZH and AR), we train and fine-tune using multilingual embedding on language pairs involving the test language in addition
to EN to test to what extent adding training instances from the target language boosts the
performance over zero-shot learning from EN only. When testing on EN, we train on EN+AR
and EN+ZH.
• Joint multilingual experiments: We train and fine-tune on all languages (EN, ZH, and AR)
using multilingual embedding and testing on EN, ZH and AR. The hypothesis to be tested is
whether a single language-independent model can work well across languages.
3.1.2.3 Baselines
We compare our methodology against different systems based on:
• Discrete Only: hand-crafted features only: Ji’s Cross-Entity’08 (Ji and Grishman, 2008),
Liao’s Cross-Event’10 (Liao and Grishman, 2010), and Li’s Joint-Beam’13 (Li et al., 2013).
• Discrete + Continuous: using a combination of both linguistic features and trainable continuous features: Chen’s Dynamic CNN (DMCNN’15) (Chen et al., 2015), Nguyen’s Joint RNN
(JRNN’16) (Nguyen et al., 2016), Liu’s Jointly Multiple Events (JMEE’18) (Liu et al., 2018b),
and Zhang’s Generative Adversarial Imitation Learning (GAIL’19) (Zhang and Ji, 2018).
• Continuous Only: language-independent features only: Feng’s Hybrid Neural Network
(HNN)’16 (Feng et al., 2016) and Liu’s Gated Multilingual Attention (GMLATT)’18 (Liu
et al., 2018a).
Figure 3.3: A zero-shot transfer learning architecture for cross-lingual event trigger extraction:
We start by either pre-training a multilingual language model to capture the hierarchical nature of
natural language input by combining word and character embeddings and learning the sequential
interactions between them or reusing an off-the-shelf multilingual language model, namely BERT.
We then fine-tune an event extraction model on source languages and adapt that to other languages
through different scenarios. We show in this figure one such scenario where the model fine-tuned on
English is applied directly to target languages in a zero-shot manner or direct transfer of annotation.
For cross-lingual results, we include a comparison with ZH baselines: Li’s Maximum Entropy
(MaxEnt)’13 (Li et al., 2013), Chen’s Rich-C’12 (Chen and Ng, 2012), Feng’s HNN’16 (Feng et al.,
2016), and Hsi’s Multi’16 (Hsi et al., 2016).
3.1.2.4 Hyper-parameters and Embedding
We describe the hyper-parameters leading to the best attainable performance for each event trigger
extraction architecture. Those parameters are selected based on random search and performance on
the validation dataset. For Bi-LSTM-Char-CRF, we train character embedding using one bi-LSTM
layer with 100 hidden units and use another layer of bi-LSTM with 300 hidden units to train on the
concatenated word and char embedding. We use a dropout rate of 0.5. We optimize using Adam
with a learning rate of 0.01, a weight decay rate of 0.9, $\beta_1 = 0.7$, $\beta_2 = 0.999$, and $\epsilon = 1e{-}8$.
For monolingual embedding, we use 300-dimensional word embedding for EN, ZH, and AR
from fastText (Bojanowski et al., 2017). For multilingual experiments, we use the MUSE library4 to train unsupervised alignments from ZH and AR to EN, resulting in a unified vector space for the
three languages. We use the same training hyper-parameters across monolingual and multilingual
training to ensure a fair comparison.
For BERT-CRF, we train monolingual EN and ZH using the cased BERT-Base and BERT-ZH models5, respectively, and for all multilingual experiments, we use the recommended multi-cased BERT-Base model6. All models were trained using 12 hidden layers with a dimension of 768, 12 self-attention heads, and 110 million parameters. We fine-tune all BERT models with their default parameters. We use the Adam optimizer with a learning rate of 0.01, a weight decay rate of 0.9, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1e{-}6$.
4https://github.com/facebookresearch/MUSE.
5No pre-trained BERT model exists for Arabic.
For all experiments, we use a batch size of 32 and limit the maximum sequence length of
sentences to 128; we pad or cut otherwise. We use early stopping based on the performance of
the validation dataset with patience 10. In the end, we report F1 scores (F1) for both the trigger identification and classification tasks, computed using the seqeval7 framework for sequence labeling evaluation based on the CoNLL-2000 shared task, complying with previous work. Trigger classification does not assume the identification is correct; rather, it is a stricter performance metric that measures whether the trigger is not only identified but also classified correctly.
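For illustration, entity-level F1 with seqeval can be computed as below on toy BIO sequences (these are invented examples, not ACE2005 data):

```python
# Illustrative use of the seqeval entity-level F1 used for trigger classification.
from seqeval.metrics import f1_score

gold = [["O", "B-Movement:Transport", "O", "O"],
        ["O", "B-Personnel:End-Position", "I-Personnel:End-Position", "O"]]
pred = [["O", "B-Movement:Transport", "O", "O"],
        ["O", "B-Conflict:Attack", "I-Conflict:Attack", "O"]]

# A trigger counts as correct only if both its span and its label match the gold annotation.
print(f"Trigger classification F1: {f1_score(gold, pred):.2f}")
```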
3.1.3 Results
In tables 3.2, 3.3, and 3.4, we provide a fine-grained performance analysis on trigger identification
and classification where we report precision, recall, and F1 scores for English, Chinese, and Arabic,
respectively. We show a comparison between baselines and two event architectures, Bi-LSTM-Char-CRF and BERT-CRF, using different embedding and training schemes.
3.1.3.1 Comparison with Feature-Based State-of-the-art
Before digging deeper into the comparison of our results with previous state-of-the-art methodology,
it is worth comparing the different approaches taken by the prior work. For both EN and ZH,
we observe that the best F1 scores over trigger identification and classification are obtained by
Liu’s JMEE in the first place and Feng’s HNN with a close performance (with scores of 73.7% and
73.4% on trigger classification). For the multilingual case (ZH), it is clear that Feng’s HNN is very
competitive, whereas models relying on machine translation, namely Liu’s GMLATT and Hsi’s
multi, lag behind the rest of the models.
It is not surprising that a neural-based system outperforms other hand-crafted architectures since
the former can capture richer sequence information beyond sentence level than traditional NLP
pre-processing, such as dependency parsing, and avoid errors propagated from such tools.
We observe that, in general, our language-independent (monolingual) Bi-LSTM-Char-CRF and
BERT-CRF methods either outperform or are on par with the best attainable results. In particular,
BERT-CRF trained monolingually using BERT(Base) embedding outperforms other baselines
for both EN and ZH, with F1-scores of 79.2 and 75.3 on trigger identification and classification,
respectively, amounting to a 3.3% and 1.6% gain for EN. For ZH, we obtain F1-scores of 84.4% and
79.9%, amounting to an increase of 16.2% and 16.9% over the previous state-of-the-art. On the other
hand, although results using Bi-LSTM-Char-CRF lag behind state-of-the-art for EN, incurring a loss of 10.5% over trigger classification, they are competitive for ZH, with scores of 86.6% and 69.5% and gains of 17.9% and 6.2% over Feng's HNN for trigger identification and classification, respectively.
6https://github.com/google-research/bert.
7https://github.com/chakki-works/seqeval.
| Feature Type | Model | Train Lang | Embeddings | Trigger Identification (P / R / F1) | Trigger Classification (P / R / F1) |
|---|---|---|---|---|---|
| Disc. Only | Ji's Cross-entity'08 | EN | - | N/A | 72.9 / 64.3 / 68.3 |
| Disc. Only | Liao's Cross-event'10 | EN | - | N/A | 68.7 / 68.9 / 68.8 |
| Disc. Only | Li's Joint-Beam'13 | EN | - | 76.9 / 65.0 / 70.4 | 73.7 / 62.3 / 67.5 |
| Disc. + Cont. | Chen's DMCNN'15 | EN | Skip-Gram | 80.4 / 67.7 / 73.5 | 75.6 / 63.6 / 69.1 |
| Disc. + Cont. | Nguyen's JRNN'16 | EN | C-BOW | 68.5 / 75.7 / 71.9 | 66.0 / 73.0 / 69.3 |
| Disc. + Cont. | Liu's JMEE'18 | EN | Glove | 80.2 / 72.1 / 75.9 | 76.3 / 71.3 / 73.7 |
| Disc. + Cont. | Zhang's GAIL'19 | EN | ELMo | 76.8 / 71.2 / 73.9 | 74.8 / 69.4 / 72.0 |
| Cont. Only | Feng's HNN'16 | EN | Skip-Gram | 80.8 / 71.5 / 75.9 | 84.6 / 64.9 / 73.4 |
| Cont. Only | Liu's GMLATT'18 | EN | Skip-Gram | 80.9 / 68.1 / 74.1 | 78.9 / 66.9 / 72.4 |
| Our Method | Bi-LSTM-Char-CRF | EN | FastText | 67.7 / 67.2 / 67.5 | 63.4 / 63.0 / 63.2 |
| Our Method | Bi-LSTM-Char-CRF | EN | MUSE | 69.4 / 68.4 / 68.9 | 62.9 / 62.0 / 62.5 |
| Our Method | Bi-LSTM-Char-CRF | EN+ZH | MUSE | 77.8 / 62.7 / 69.5 | 73.7 / 59.4 / 65.8 |
| Our Method | Bi-LSTM-Char-CRF | EN+AR | MUSE | 74.1 / 67.5 / 70.6 | 70.2 / 63.9 / 66.9 |
| Our Method | Bi-LSTM-Char-CRF | All | MUSE | 63.1 / 70.3 / 66.5 | 58.5 / 65.1 / 61.6 |
| Our Method | BERT-CRF | EN | Base | 77.6 / 80.8 / 79.2 | 73.8 / 76.9 / 75.3 |
| Our Method | BERT-CRF | EN | BERT (multi) | 79.5 / 76.2 / 77.8 | 74.8 / 71.6 / 73.1 |
| Our Method | BERT-CRF | EN+ZH | BERT (multi) | 78.8 / 80.9 / 79.8 | 74.3 / 76.1 / 75.2 |
| Our Method | BERT-CRF | EN+AR | BERT (multi) | 81.4 / 77.4 / 79.3 | 76.4 / 72.6 / 74.5 |
| Our Method | BERT-CRF | All | BERT (multi) | 85.6 / 73.8 / 79.2 | 79.7 / 68.3 / 73.5 |

Table 3.2: Comparison of performance testing on English using prior work baselines in the first half and our method using Bi-LSTM-Char-CRF with MUSE embeddings and BERT-CRF in the second half.
3.1.3.2 Comparison between MUSE and BERT Embedding
We observe a significant difference in performance in favor of BERT-CRF compared to Bi-LSTM-Char-CRF, with gains of 12.1%, 10.4%, and 13.9% on the classification task. The better performance of BERT-CRF compared to Bi-LSTM-Char-CRF can be attributed to the fact that BERT is able to learn contextualized representations and long-range dependencies at different levels of granularity. Table 3.5 provides some examples where the surface form of the trigger is hard to disambiguate without context information. Reconsidering the second example from the introduction, we notice that Bi-LSTM-Char-CRF fails to effectively associate it with position context clues.
| Feature Type | Model | Train Lang | Embeddings | Trigger Identification (P / R / F1) | Trigger Classification (P / R / F1) |
|---|---|---|---|---|---|
| Disc. Only | Li's MaxEnt'13 | ZH | - | 50.0 / 77.0 / 60.6 | 47.5 / 73.1 / 57.6 |
| Disc. Only | Chen's Rich-C'12 | ZH | - | 62.2 / 71.9 / 66.7 | 58.9 / 68.1 / 63.2 |
| Disc. + Cont. | Hsi's multi'16 | ZH | multi_proj | N/A | 44.3 / 20.9 / 39.6 |
| Cont. Only | Feng's HNN'16 | ZH | Skip-Gram | 74.2 / 63.1 / 68.2 | 77.1 / 53.1 / 63.0 |
| Our Method | Bi-LSTM-Char-CRF | ZH | FastText | 89.7 / 83.8 / 86.6 | 68.6 / 64.5 / 69.5 |
| Our Method | Bi-LSTM-Char-CRF | ZH | MUSE | 28.6 / 30.8 / 29.6 | 24.1 / 25.9 / 25.0 |
| Our Method | Bi-LSTM-Char-CRF | EN | MUSE | 71.5 / 53.6 / 61.3 | 56.9 / 42.7 / 48.8 |
| Our Method | Bi-LSTM-Char-CRF | EN+ZH | MUSE | 83.9 / 71.5 / 77.2 | 76.7 / 65.4 / 70.6 |
| Our Method | Bi-LSTM-Char-CRF | All | MUSE | 71.3 / 73.9 / 72.6 | 63.2 / 65.5 / 64.3 |
| Our Method | BERT-CRF | ZH | Base | 76.4 / 94.3 / 84.4 | 72.3 / 89.2 / 79.9 |
| Our Method | BERT-CRF | ZH | BERT (multi) | 76.4 / 92.6 / 83.7 | 72.8 / 88.2 / 79.8 |
| Our Method | BERT-CRF | EN | BERT (multi) | 66.5 / 90.9 / 76.8 | 59.3 / 81.0 / 68.5 |
| Our Method | BERT-CRF | EN+ZH | BERT (multi) | 76.4 / 94.9 / 84.7 | 73.3 / 91.1 / 81.2 |
| Our Method | BERT-CRF | All | BERT (multi) | 80.9 / 95.7 / 87.7 | 76.8 / 90.9 / 83.2 |

Table 3.3: Comparison of performance testing on Chinese using prior work baselines in the first half and our method using Bi-LSTM-Char-CRF with MUSE embeddings and BERT-CRF in the second half.
| Model | Train Lang | Embeddings | Trigger Identification (P / R / F1) | Trigger Classification (P / R / F1) |
|---|---|---|---|---|
| Bi-LSTM-Char-CRF | AR | FastText | 63.1 / 48.5 / 54.9 | 60.8 / 46.8 / 52.8 |
| Bi-LSTM-Char-CRF | AR | MUSE | 47.0 / 12.9 / 20.3 | 43.3 / 11.9 / 18.7 |
| Bi-LSTM-Char-CRF | EN | MUSE | 71.5 / 42.2 / 53.0 | 56.9 / 33.5 / 42.2 |
| Bi-LSTM-Char-CRF | EN+AR | MUSE | 68.1 / 47.7 / 56.1 | 64.6 / 45.3 / 53.2 |
| Bi-LSTM-Char-CRF | All | MUSE | 69.7 / 69.1 / 69.4 | 62.6 / 62.0 / 62.3 |
| BERT-CRF | AR | BERT (multi) | 66.7 / 73.2 / 69.8 | 63.7 / 69.9 / 66.7 |
| BERT-CRF | EN | BERT (multi) | 26.4 / 66.7 / 37.8 | 21.6 / 54.6 / 30.9 |
| BERT-CRF | EN+AR | BERT (multi) | 73.8 / 76.1 / 74.9 | 68.5 / 70.6 / 69.5 |
| BERT-CRF | All | BERT (multi) | 71.9 / 74.5 / 73.2 | 67.7 / 70.2 / 68.9 |

Table 3.4: Comparison of performance testing on Arabic using different training modes comparing Bi-LSTM-Char-CRF with MUSE embeddings to BERT-CRF.
| Example | MUSE | BERT |
|---|---|---|
| "Davies is leaving to become chairman of the London School of Economics" | Movement:Transport | Personnel:End-Position |
| "The EU is set to release 20 million euros (US 21) million in immediate humanitarian aid ..." | Justice:Release-Parole | Transaction:Transfer-Money |
| "Palestinian uprising as Isreal removed all major checkpoints in the coastal territory." | Conflict:Demonstrate | Conflict:Attack |

| Example | BERT(mono) | BERT(multi_all) |
|---|---|---|
| [Arabic sentence] "Arsenal has not been released from the fine ..." | O | B-Justice:Fine |
| [Arabic sentence] "The stone revolution must immediately turn into a fight." | O | Conflict:Attack |
| 由于月之海已经宣布年底前要解散，所以使得... "Since 'the sea of the moon' has been announced to be disbanded before the end of the year, ..." | B-Business:Declare-Bankruptcy | Business:End-Org |

Table 3.5: Examples of trigger extraction mislabelled with MUSE but disambiguated with BERT, and those missed/mislabeled with monolingual training only and corrected with the multilingual BERT model.
3.1.3.3 Cross-lingual Event Trigger Extraction
In general, we observe that multilingual training leveraging multilingual embedding provides a
boost in performance for both event architectures, especially for EN and AR. More precisely, there
is a gain of 3.1% and 4.0% for EN and ZH respectively on the identification performance of
multilingual over monolingual models. We notice that AR benefits the most from multilingual
training with an improvement of 9.5% and 2.8% on the classification score with BERT-CRF and
Bi-LSTM-Char-CRF, respectively. This supports our claim about the effectiveness of multilingual
models, which are efficient to train and more robust.
Although F1-scores for zero-shot transfer learning (train: EN, test: ZH/AR) are not the best
among multilingual experiments, they are still promising and exceed prior published work given the
fact that no data from the target language was used to fine-tune. In particular, training with EN
using BERT-CRF was helpful for ZH with a performance not far from monolingual performance
and making it possible to exceed previous state-of-the-art performance. The same can be observed
in the case of EN→ZH and EN→AR using MUSE. The lower performance of EN→AR using
BERT-CRF raises questions about the quality of BERT(multi) embedding model training for Arabic.
Not surprisingly, training given a reasonable amount of language-specific resources from the test language under a targeted cross-lingual scheme (train: EN+ZH/EN+AR, test: EN/ZH/AR) boosts (with rare exceptions) the performance over both monolingual training and zero-shot learning: EN+AR>EN, EN+ZH>ZH>EN, and EN+AR>EN>AR when testing on ZH, AR, and even on EN, for which we have a strong monolingual baseline.
When all languages are used to train one single joint multilingual model (train: All, test:
EN/AR/ZH), we don’t always notice improvements over monolingual models. To gain more insight
into why multilingual training boosts performance over monolingual models, we include some
examples of when EN is complementary to ZH and AR without which the model fails to identify
some events. In the Chinese example, there are only 12 "nearby" Chinese words to the trigger word
解散 (jiěsàn) in ZH training data, whereas there are 4 times as many nearby words in EN (e.g., disband, dissolve, shut, cease, etc.).
3.2 Cross-lingual Meta-Transfer Learning
As we have seen in Section 3.1, multilingual Transformer-based contextualized encoders don’t boost
the performance over all languages equally. In fact, the transfer quality is reduced for languages that
exhibit different typological characteristics. Numerous approaches have attempted to build stronger
cross-lingual representations on top of those multilingual models; however, most require parallel
corpora (Wang et al., 2019; Conneau and Lample, 2019) and are biased towards high-resource and
balanced setups. This fuels the need for a few-shot learning method that doesn’t require explicit
cross-lingual alignment for faster adaptation to low-resource setups.
Meta-learning, a method for “learning to learn” with less data, has found favor, especially
among the computer vision and speech recognition communities (Nichol et al., 2018; Triantafillou
et al., 2020; Winata et al., 2020). Meta-learning has been used for machine translation (Gu et al.,
2018), few-shot relation classification (Gao et al., 2019), and a variety of GLUE tasks (Dou et al.,
2019). Recently, Nooralahzadeh et al. (2020) apply the MAML (Finn et al., 2017) algorithm to
cross-lingual transfer learning for XNLI (Conneau et al., 2018) and MLQA (Lewis et al., 2020),
NLU tasks that are naturally biased towards machine translation-based solutions. Nooralahzadeh
et al. are able to show improvement over strong multilingual models, including M-BERT. However,
they mainly show the effects of meta-learning as a first step in a framework that relies on supervised
fine-tuning, making it difficult to properly compare and contrast both approaches.
We study cross-lingual meta-transfer learning using MAML from a different perspective. We
distinguish between meta-learning and fine-tuning and design systematic experiments to analyze
the added value of meta-learning compared to naive fine-tuning. We extensively evaluate on more
challenging and typologically diverse NLU tasks: Multilingual Task-Oriented Dialogue System
(MTOD) (Schuster et al., 2019a) and Typologically Diverse Question Answering (TyDiQA) (Clark
et al., 2020). While MTOD fits in the classification rubric used by most other NLP tasks upon
which meta-learning is evaluated, TyDiQA is not a classification task. We show how meta-learning
can be applied usefully to non-classification tasks such as TyDiQA. We also show performance
improvements using meta-transfer learning between typologically diverse languages. In this section, we conduct an extensive analysis applied to MTOD and TyDiQA to evaluate the quality of cross-lingual meta-transfer.
Our main contributions8
can be summarized as follows:
(1) We propose X-METRA-ADA (Section 3.2.1.2), a language-agnostic meta-learning framework
(Figure 3.4), and extensively evaluate it.
(2) We apply X-METRA-ADA to two challenging cross-lingual and typologically diverse task-oriented dialog and QA tasks, which includes recipes for constructing appropriate meta-tasks
(Section 3.2.1.1).
(3) We analyze the scalability of our approach across different k-shot and down-sampling configurations and investigate the importance of different components in cross-lingual transfer
(Section 3.2.3.2).
Figure 3.4: An overview of the X-METRA-ADA framework: we use English as the source and
Spanish as the target language. The meta-train stage transfers from the source to the target languages,
while the meta-adaptation further adapts the model to the target language. The application is few-shot if the test language is seen in any stage of X-METRA-ADA, or zero-shot if the test language is unseen.
8This work was published as a long paper in Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies (M’hamdi et al., 2021).
3.2.1 Methodology
We make use of MAML, the optimization-based meta-learning approach explained at the end of Section 2.3, on top of pre-trained base models with two levels of adaptation to reduce the risk of over-fitting to the target language(s): (i) meta-training from the source language to the target language(s), and (ii) meta-adaptation on the same target language(s) for more language-specific adaptation
(Figure 3.4). We apply this approach to two cross-lingual downstream tasks: MTOD and TyDiQA.
We describe the base architectures used for each task in Section 2.2. In this Section, we explain
how they are incorporated into our meta-learning upstream pipeline. Applying meta-learning to a
task requires the construction of multiple ‘pseudo-tasks’, which are instantiated as pairs of datasets.
We describe this construction for our downstream tasks in Section 3.2.1.1. Finally, we present our
X-METRA-ADA algorithm (Section 3.2.1.2).
3.2.1.1 Pseudo-task Datasets
Meta-learning is distinguished from fine-tuning in that the former seeks an initialization point that is
maximally useful to multiple downstream learning tasks, while the latter seeks to directly optimize
a downstream ‘child’ task from the initialization point of a ‘parent’ task. To apply meta-learning to
data scenarios that more closely fit fine-tuning, we construct multiple ‘pseudo-tasks’ by subsampling
from parent and child task datasets. Let be the overall set of data points that the model is exposed
to during fine-tuning. For every language , we define
, which consists of drawn from that
language. In this section, we describe our approach for constructing pseudo-tasks that are drawn
from . As explained in the task-distribution perspective (Section 2.3), each pseudo-task is defined
as = (, ), where and are the support and query sets. Pseudo-tasks are constructed in such
a way as to make them balanced and non-overlapping. We describe our approach for each task
below. Then, we present how this pseudo-task construction mechanism is extended to fit into our
cross-lingual transfer learning framework.
MTOD Pseudo-task Construction MTOD labeled data consists of a sentence from a dialogue
along with a sentence-level intent label and a sequence of slot labels. From the available data, we
draw a number of task sets T; each = (, ) ∈ T consists of intent and slot-labeled items per
intent class in and items per class in . Although carefully arranged to have the same number
of items per class per task in each of the support and the query sets, the same task splits are used
for slot prediction as well. During meta-training and meta-adaptation, task batches are sampled
randomly from T.
TyDiQA Pseudo-task Construction Unlike MTOD, TyDiQA is not a standard classification
task with fixed classes; thus, it is not directly amenable to class distribution balancing across
pseudo-task query and support sets. To construct pseudo-tasks for TyDiQA from the available
(question, context, answer) span triplet data, we use the following procedure: we draw a task
= (, ) by first randomly drawing triplets, forming . For each triplet in , we draw the /
30
most similar triplets to from the remaining available data, thus forming .
9 For two triplets 1,
2, we define similarity as cos( (1), (2)), where (.) is a representation of the concatenation of
the triplet elements delimited by a space; we use a cross-lingual extension to SBERT’s pre-trained
model (Reimers and Gurevych, 2019, 2020).
Cross-lingual extension In the original MAML (Finn et al., 2017) Algorithm, in every iteration,
we sample a task set T from a single distribution D, and the support and query sets in a single task
would be drawn from a common space. We distinguish between the distributions Dmeta-train and
Dmeta-adapt, which correspond to the two levels of adaptation explained below in Section 3.2.1.2. To
enable cross-lingual transfer, we draw data for the support set of tasks in Dmeta-train from task data
in the high-resource base language (English, in our experiments). For the query set in Dmeta-train and
for both support and query sets in Dmeta-adapt, we sample from task data in the low-resource target
language to be evaluated on.
3.2.1.2 X-METRA-ADA Algorithm
Following the notation described in the above sections, we present our algorithm X-METRA-ADA,
our adaptation of MAML to cross-lingual transfer learning in two stages. In each stage, we use the
procedure outlined in Algorithm 1. We start by sampling a batch of tasks from distribution D. For
every task = (
, ′
), we update over steps using batches drawn from
. At the end of
this inner loop, we compute the gradients with respect to the loss of on
′
. At the end of all
tasks of each batch, we sum over all pre-computed gradients and update , thus completing one
outer loop. The difference between meta-train and meta-adapt stages comes down to the parameters
and hyperparameters passed into Algorithm 1.
• Meta-train: This stage is similar to classical MAML. Task sets are sampled from Dmeta-train,
which uses high-resource (typically English) data in support sets and low-resource data
′ in the query sets. The input model is typically a pre-trained multilingual downstream
base model as described in Section 2.2, and we use hyperparameters = 5, = 1 × 10−3
and
= 1 × 10−2
for MTOD and = = 3 × 10−5
for TyDiQA.
• Meta-adapt: During this stage, we ensure the model knows how to learn from examples
within the target language under a low-resource regime. Task sets are sampled from Dmeta-adapt,
which uses low-resource data in both support and query sets. In other words, = ′.
The input model is the optimization resulting from meta-train, and we use hyperparameters
= 5, = 1 × 10−3
and = 1 × 10−2
for MTOD and = = 3 × 10−5
for TyDiQA.
Figure 3.5 wraps up the descriptions above by providing a visualization of different variants
of meta-transfer learning: X-METRA-ADA and X-METRA (X-METRA-ADA without the metaadaptation step) compared to naive fine-tuning variants.
9Thus is constrained to be a multiple of .
31
9
903
a) PRE: Fine-tune on English. b) MONO: Fine-tune on Thai. c) FT w/EN: PRE + Fine-tune
on Thai + English.
d) FT: PRE + Fine-tune on Thai. e) X-METRA on Thai. f) X-METRA-ADA on Thai.
Figure 3.5: Species of Transfer-Learning: A conceptual comparison between different variants of
naive fine-tuning and meta-learning. Naive fine-tuning can take the form of either PRE, MONO,
FT w/EN, or FT (described in §3.2.2.2 as well). While X-METRA is the closest form of metalearning using the original MAML formulation, X-METRA-ADA has two levels of optimization:
meta-training and meta-adaptation.
ℎ and
ℎ are the datasets in English and Thai from
which batches are sampled and that the model gets exposed to during naive fine-tuning. During the
meta-train stage, {
ℎ
}
=1
are drawn from
ℎ; and {
ℎ
}
=1
are drawn from a percentage
of
ℎ. During the meta-adapt stage, {
′ℎ
}
=1
and {
′ℎ
}
=1
are all drawn from the remaining
1 − percentage
ℎ. Blue denotes the optimization using English data, while green is used to
denote optimization on Thai data. Dashed lines denote the inner loop, whereas dotted line denote
evaluation on the query sets without a backward pass yet, and bold lines denotes the outer loop.
3.2.2 Experimental Setup
In this Section, we describe the benchmarks evaluated on (Section 3.2.2.1), the internal and external
baselines (Section 3.2.2.2), and the hyperparameters used (Section 3.2.2.3).
3.2.2.1 Datasets
We use MTOD dataset as provided by Schuster et al., 2019a. MTOD covers 3 languages (English,
Spanish, and Thai), 3 intent domains (alarm, reminder, and weather), 12 intent types, and 11 slot
32
types.10 We consider English as a high-resource source language and Spanish and Thai as target
low-resource languages. Therefore, whenever we fine-tune on English, we use the training split
( ), but for the other languages, we use the provided development sets () to further our
goals to analyze methods of few-shot transfer. During testing, we evaluate on the provided test sets.
Table 3.6 shows the statistics of MTOD per language and split. Moreover, we fine-tune and evaluate
separately on an in-house intent classification dataset of 7 languages.11
Lang ISO Train Dev Test
English EN 30,521 4,181 8,621
Spanish ES 3,617 1,983 3,043
Thai TH 2,156 1,235 1,692
Table 3.6: Statistics of MTOD dataset (Schuster et al., 2019a) per language and split.
For TyDiQA, we use TyDiQA-GoldP (Clark et al., 2020) dataset. In addition to English,
TyDiQA-GoldP encompasses 7 typologically diverse languages.12 Like Hu et al. (2020), we use a
simplified version of the primary task. Specifically, we discard questions that don’t have an answer
and use only the gold passage as context, keeping only the short answer and its spans. This makes
the task similar to XQuAD and MLQA, although, unlike these tasks, the questions are written
without looking at the answers and without machine translation. As with MTOD, we use the English
training data as . Since development sets are not specified for MTOD, we instead reserve 10%
of the training data in each of the other languages as . Finally, we evaluate on the same provided
test splits. Table 3.7 shows the statistics of TyDiQA per language and split.
3.2.2.2 Evaluation
In order to fairly and consistently evaluate our approach to few-shot transfer learning via metalearning and to ablate components of the method, we design a series of experiments based on both
internal and external baselines. Our internal baselines ablate the effect of the X-METRA-ADA
algorithm vs. conventional fine-tuning from a model trained on a high-resource language by keeping
the data sets used for training constant. As our specific data conditions are not reproduced in any
externally reported results on these tasks, we instead compare them to other reported results using
English-only or entirely zero-shot training data.
Internal Evaluation: We design the following fine-tuning/few-shot schemes:
• PRE: An initial model is fine-tuned on the split of English only and then evaluated on
new languages with no further tuning or adaptation. This strawman baseline has exposure to
English task data only.
10We follow the same pre-processing and evaluation as Liu et al. (2020).
11More details are included in the Appendix A.1.
12Korean is excluded from the evaluation due to some sub-optimal performance.
33
Lang ISO Train Dev Test
English EN 3,326 370 440
Arabic AR 13,324 1,481 921
Bengali BN 2,151 239 113
Finnish FI 6,169 686 782
Indonesian ID 5,131 571 565
Russian RU 5,841 649 812
Swahili SW 2,479 276 499
Telugu TE 5,006 557 669
Table 3.7: Statistics of TyDiQA-GoldP (Hu et al., 2020) dataset per language and split. Korean is
excluded due to some encoding issues.
• MONO: An initial model is fine-tuned on the split of the target language. This baseline
serves as a comparison for standard fine-tuning (FT, below), which shows the value of
combining MONO and PRE.
• FT: We fine-tune the PRE model on the split of the target language. This is a standard
transfer learning approach that combines PRE and MONO.
• FT w/EN: Like FT, except both the split of the target language and the split of
English are used for fine-tuning. This is used for dataset equivalence with X-METRA-ADA
(below).
• X-METRA: We use the PRE model as for meta-train, the split from English to form
support sets in −, and all of the split of the target language to form query sets in
−.
• X-METRA-ADA: We use the PRE model as for meta-train, the split from English
to form support sets in −, and 75% of the split of the target language to form
query sets in −. We use the remaining 25% of the split of the target language
for both the support and query sets of −. For TyDiQA, we use 60% of the query for
meta-train and the remaining for meta-adaptation.
All models are ultimately fine-tuned versions of M-BERT, and all have access to the same task
training data relevant to their variant. That is, X-METRA-ADA and PRE both see the same English
data, whereas MONO, FT, and X-METRA-ADA see the same target language data.
However, since X-METRA-ADA uses both and to improve upon PRE, and FT only uses
, we make an apples-to-apples comparison, data-wise, by including FT w/EN experiments as
well.
34
External Baselines: We focus mainly on transfer learning baselines from contextualized embedding for a coherent external comparison; supervised experiments on target language data such
as those reported in Schuster et al. (2019a) are inappropriate for comparison because they use
much more in-language labeled data to train. The experiments we compare are zero-shot in the
sense that they are not trained directly on the language-specific task data. However, most of these
external baselines involve some strong cross-lingual supervision, either through cross-lingual alignment or mixed-language training. We also include machine translation baselines, which are often
competitive and hard to beat. Our work, by contrast, uses no parallel language data or resources
beyond pre-trained multilingual language models, labeled English data, and few-shot labeled target
language data. To the best of our knowledge, we are the first to explore cross-lingual meta-transfer
learning for those benchmarks, so we only report on our X-METRA-ADA approach in addition to
those baselines.
For MTOD, then, we focus on the following external baselines:
• Cross-lingual alignment-based approaches: We use MCoVe, a multilingual version of contextualized word vectors with an autoencoder objective as reported by Schuster et al. (2019a)
in addition to M-BERT (Liu et al., 2020). We also include XLM trained on Translation
Language Modeling (TLM) + Masked Language Modeling (MLM) (Conneau and Lample,
2019) as enhanced by Transformer and mixed-training as reported by Liu et al. (2020).
• Mixed-language training approaches: We use M-BERT + Transformer + mixed training
using data from the dialogue domain: from (a) human-based word selection (MLT) and (b)
attention-based word selection (MLT), both are reported by Liu et al. (2020).
• Translation-based approaches: We use the zero-shot version of MMTE, the massively
multilingual translation encoder by Siddhant et al. (2020) fine-tuned on intent classification.
We also include Translate Train (TTrain) (Schuster et al., 2019a), which translates English
training data into target languages to train on them in addition to the target language training
data.
For TyDiQA-GoldP, out of the already mentioned baselines, we use M-BERT, XLM, MMTE,
and TTrain (which unlike Schuster et al. (2019a) only translates English to the target language to
train on it without data augmentation). In addition to that, we also include XLM-R as reported
by Hu et al. (2020).
3.2.2.3 Implementation Details
We use M-BERT (bert-base-multilingual-cased)13 with 12 layers as initial models for MTOD and
TyDiQA-GoldP in our internal evaluation. We use xlm-r-distilroberta-base-paraphrase-v114 model
13github.com/huggingface/transformers version 3.4.0 pre-trained on 104 languages, including all
languages evaluated on in this analysis.
14github.com/UKPLab/sentence-transformers which uses XLM-R as the base model.
35
for computing similarities when constructing the QA meta-dataset (Section 3.2.1.1). Our implementation of X-METRA-ADA from scratch uses learn2learn (Arnold et al., 2020) for differentiation
and update rules in the inner loop.15 We use the first-order approximation option in learn2learn for
updating the outer loop, as explained at the end of Section 2.3. For each model, we run at least 3
different random initializations and report the average and standard deviation of the best model for
the few-shot language for each run. We use training loss convergence as a criterion for stopping. For
the FT and MONO baselines, we don’t have the luxury of performance since those baselines
use the dataset for training.16 The set is chosen to simulate a low-resource setup.
For MTOD, we fine-tune PRE on English training data. We use a batch size of 32, a dropout
rate of 0.3, AdamW with a learning rate of 4 × 10−5
, and of 1 × 10−8
. We train for around 2000
steps. Beyond that point more training does not reveal necessary, so we perform early stopping at
that point. For MONO, using a lower learning rate of 4 × 10−5 helped achieve a good convergence
for that model. For all FT experiments, we use the same learning rate of 1 × 10−3
, which gives a
better convergence.
For QA, we use a batch size of 4, doc stride of 128, a fixed maximum sequence length of
384, and a maximum length of questions of 30 words. We use AdamW optimizer throughout all
experiments, which uses weight decay of 1 × 10−3
, learning rate of 3 × 10−5
, and a scheduler of 4
warm-up steps.17 We fine-tune PRE for 2 epochs and observe no more gains in performance. For all
MONO and FT experiments, we use the same learning rate of 3 × 10−5
. This is the same optimizer
and learning rate used for the outer loops in meta-learning as well.
For X-METRA-ADA and X-METRA, we sample 2500 tasks in total for both MTOD and QA.
For each task, we randomly sample = = 6 examples from each intent class to form the support
and query sets, respectively (we consider all classes, not only the intersection across languages). For
QA, we use only one support example per query class and 6 query examples as classes. For the inner
loop, we use learn2learn pre-built optimizer. For the outer loop, we use a standard Adam optimizer.
In splitting the few-shot set, we use 75% for the meta-training and 25% for the meta-adaptation for
MTOD. For QA, we use 60% of the query for meta-train and the remaining for meta-adaptation.
3.2.3 Results and Discussion
In this Section, we present zero-shot and few-shot results on our two benchmarks (Section 3.2.3.1).
Then, in Section 3.2.3.2, we provide more analysis shedding light on the role of meta-adaptation in
addition to the robustness of our approach under different k-shot and downsampling scenarios. We
conclude with an analysis of the role of different layers in BERT in few-shot learning.
15github.com/learnables/learn2learn.
16All experiments are run using Pytorch version 1.6.0, 1 GeForce RTX P8 GPU of 11MB of memory CUDA version
10.1. The runtime depends on the size of the dev data, but most MTOD models take around 3 hours to converge, and
TyDiQA models take a maximum of 10 hours of training (including evaluation at checkpoints).
17Those hyperparameters are chosen based on Hu et al. (2020).
36
3.2.3.1 Zero-shot and Few-shot Cross-Lingual NLU and QA
Table 3.8 shows the results of cross-lingual transfer learning on MTOD comparing different
baselines.18 In general, PRE model performs worse than other baselines. It performs less than
the simplest baseline, MCoVe, when transferring to Thai with a decrease of 25.3% and 23.1% and
an average cross-lingual relative loss of 4.5% and 2.1% for intent classification and slot filling
respectively. This suggests that zero-shot fine-tuning M-BERT on English only is overfitting on
English and its similar languages. Using MLT, which adds more dialogue-specific mixed training,
helps reduce that gap for Thai on intent accuracy mainly, but not with the same degree on slot filling.
Model
Spanish Thai
Intent Acc Slot F1 Intent Acc Slot F1
External Baselines
MCoVe† 53.9 19.3 70.7 35.6
M-BERT‡ 73.7 51.7 28.1 10.6
‡ 82.9 74.9 53.8 26.1
‡ 87.9 73.9 73.5 27.1
XLM‡ 87.5 68.5 72.6 27.9
MMTE+ 93.6 - 89.6 -
TTrain‡ 85.4 72.9 95.9 55.4
Zero-shot Learning
PRE 70.2 38.2 45.4 12.5
Few-shot Learning
MONO 82.4 ±6.0 43.9 ±1.5 79.1 ±4.7 54.1 ±3.9
FT 90.7 ±0.3 67.6 ±1.3 78.9 ±0.2 66.0 ±2.1
FT w/EN 88.7 ±0.4 67.4 ±1.4 73.7 ±0.1 66.0 ±1.6
X-METRA 89.6 ±1.3 63.6 ±0.5 80.2 ±1.2 70.4 ±1.2
X-METRA-ADA 92.9 ±0.6 60.9 ±1.9 86.3 ±1.7 69.6 ±1.9
Table 3.8: Performance evaluation on MTOD between meta-learning approaches, fine-tuning
internal baselines and external baselines. All our internal experiments use = = 6. Zero-shot
learning experiments that train only on English are distinguished from few-shot learning, which
include a fair internal comparison. Models in bold indicate our own internal models. MONO, FT,
FT w/EN, X-METRA, and X-METRA-ADA models include results for each test language when
training on that language. FT w/EN trains jointly on English and only the target language. We
highlight the best scores in bold and underline the second best for each language and sub-task. The
rest are reported from †Schuster et al. (2019a), ‡Liu et al. (2020), and +Siddhant et al. (2020).
18More results on our in-house NLU dataset can be found in Appendix A.1.
37
Model
Test on
Arabic Bengali Finnish Indonesian Russian Swahili Telugu
External Baselines
M-BERT† 62.2 49.3 59.7 64.8 60.0 57.5 49.6
XLM† 59.4 27.2 58.2 62.5 49.2 39.4 15.5
XLM-R† 67.6 64.0 70.5 77.4 67.0 66.1 70.1
MMTE† 63.1 55.8 53.9 60.9 58.9 63.1 54.2
TTrain† 61.5 31.9 62.6 68.6 53.1 61.9 27.4
Zero-shot Learning
PRE 62.4 ±2.2 32.9 ±1.4 57.7 ±4.4 67.8 ±3.8 58.2 ±3.7 55.5 ±2.9 33.0 ±5.9
Few-shot Learning
MONO 74.0 ±1.1 38.9 ±0.8 63.3 ±1.5 67.1 ±1.9 54.4 ±1.3 60.3 ±1.2 61.4 ±1.0
FT 77.0 ±0.3 51.0 ±2.7 70.9 ±0.4 77.0 ±0.4 64.8 ±0.4 70.2 ±1.7 65.4 ±0.6
X-METRA 78.5 ±0.6 53.2 ±0.5 72.7 ±0.4 77.7 ±0.2 66.1 ±0.1 71.7 ±0.2 66.6 ±0.4
X-METRA-ADA 76.6 ±0.1 57.8 ±0.6 73.0 ±0.3 77.3 ±0.1 66.9 ±0.1 70.3 ±0.2 72.8 ±0.1
Table 3.9: F1 comparison on TyDiQA-GoldP between different meta-learning approaches, fine
tuning and external baselines. We highlight the best scores in bold and underline the second best for
each language. Our own models are in bold, whereas the rest are reported from †Hu et al. (2020).
This is using = = 6.
The results confirm the positive effects of cross-lingual fine-tuning; although PRE is not a very
effective cross-lingual learner, fine-tuning with in-language data on top of PRE (i.e., FT) adds
value over the MONO baseline. Adding English data to fine-tuning (FT w/EN) is slightly harmful.
However, the meta-learning approach appears to make the most effective use of this data in almost
all cases (Spanish slot filling is an exception). We perform a pairwise two-sample t-test (assuming
unequal variance) and find the results of X-METRA-ADA compared to FT on intent classification
to be statistically significant with p-values of 1.5% and 2.4% for Spanish and Thai respectively,
rejecting the null hypothesis with 95% confidence.
X-METRA-ADA outperforms all previous external baselines and fine-tuning models for both
Spanish and Thai (except for slot filling in Spanish). We achieve the best overall performance
with an average cross-lingual cross-task increase of 3.2% over the FT baseline, 6.9% over FT w/EN,
and 12.6% over MONO. Among all models, MONO has the least stability, as suggested by a
higher average standard deviation. There is a tendency for X-METRA-ADA to work better for
languages like Thai compared to Spanish, as Thai is a truly low-resource language. This suggests
that fine-tuning on English only learns an unsuitable initialization, impeding its generalization to
other languages. As expected, fine-tuning on small amounts of the data does not help the model
generalize to new languages. MONO baselines exhibit less stability than X-METRA-ADA. On the
38
other hand, X-METRA-ADA learns a more stable and successful adaptation to that language even
on top of a model fine-tuned on English only with less over-fitting.
Table 3.9 shows a comparison of TyDiQA-GoldP models across seven languages, evaluating
using F1 scores.19 The benefits of fine-tuning and improvements from X-METRA-ADA observed
in Table 3.8 are confirmed. We also compare X-METRA-ADA to X-METRA, which is equivalent
to X-METRA-ADA without the meta-adaptation phase. On average, X-METRA increases by
10.8% and 1.5% over the best external and fine-tuning baseline, respectively, whereas MONO results
lag behind. X-METRA-ADA outperforms X-METRA on average and is especially helpful on
languages like Bengali and Telugu. We compare X-METRA and X-METRA-ADA in more depth in
Section 3.2.3.2. Meta-learning consistently outperforms fine-tuning.
# Training Mini-batches
Accuracy Scores
0.94
0.92
0.90
0.88
0.86
0 200 400 600 800 1000 1200
X-METRA-ADA (Spanish)
X-METRA (Spanish)
FT (Spanish)
FT (Spanish + English)
(a) Intent Accuracy on Spanish.
# Training Mini-batches
0.85
0.80
0.75
0.70
0.60
0.65
0.55
0 200 400 600 800 1000 1200
X-METRA-ADA (Thai)
X-METRA (Thai)
FT (Thai)
FT (Thai + English)
Accuracy Scores
(b) Intent Accuracy on Thai.
# Training Mini-batches
F1 Scores
0.68
0.66
0.64
0.62
0.60
0.58
0.56
0 200 400 600 800 1000 1200
X-METRA-ADA (Spanish)
X-METRA (Spanish)
FT (Spanish)
FT (Spanish + English)
0.70
(c) Slot F1 on Spanish.
# Training Mini-batches
0.725
0.700
0.675
0.650
0.625
0.600
0.575
0.550
0.525
0 200 400 600 800 1000 1200
X-METRA-ADA (Thai)
X-METRA (Thai)
FT (Thai)
FT (Thai + English)
F1 Scores
(d) Slot F1 on Thai.
Figure 3.6: Ablation of the role of adaptation in X-METRA-ADA compared to X-METRA
(X-METRA-ADA with the meta-training stage only). X-METRA-ADA converges faster than
X-METRA which in turn is better than FT for both languages.
19Full results using Exact Match scores too can be found in Appendix A.2.
39
In Table A.3, we report all F1 results on TyDiQA-GoldP, including zero-shot experiments. We
notice improvements using X-METRA-ADA over FT for some languages. However, we cannot
claim that there is a direct correlation between the degree to which the language is low-resource
and the gain in performance of X-METRA-ADA over fine-tuning. Other factors like similarities
of grammatical and morphological structure and shared vocabulary, in addition to consistency of
annotation, may play a role in the observed cross-lingual benefits. Studying such correlations is
beyond the scope of this thesis.
3.2.3.2 More Analysis
Meta-Adaptation Role The learning curves in Figure 3.6 compare X-METRA-ADA, X-METRA
(i.e., meta-training but no meta-adaptation), and fine-tuning, both with English and with target
language data only, for both Spanish and Thai intent detection in MTOD. In general, including
English data with in-language fine-tuning data lags behind language-specific training for all models,
languages, and sub-tasks. With the exception of slot filling on Spanish, there is a clear gap between
naive fine-tuning and meta-learning, with a gain in favor of X-METRA-ADA, especially for Thai.
Naive fine-tuning, X-METRA, and X-METRA-ADA all start from the same checkpoint fine-tuned
on English. All model variants are sampled from the same data. For Spanish, continuing to use
English in naive fine-tuning to Spanish reaches better performance than both variants of metalearning for Slot filling on Spanish. This could be due to the typological similarity between Spanish
and English, which makes optimization fairly easy for naive fine-tuning compared to Thai, which is
both typologically distant and low-resource.
82.4
90.7 90.2
92.8 92.2 91.3
75
80
85
90
95
MONO FT k=q=1 k=q=3 k=q=6 k=q=9
Accuracy Scores
X-METRA-ADA
(a) MTOD on Spanish for intent detection.
38.9
63.3 61.4
51
70.9 65.4
49.3
72.9
64.4
57.4
72.6 72.9
57.4
73.5 73.2
30
40
50
60
70
80
Bengali Finnish Telugu
F1 Scores
MONO Finetune k=6 q=3 k=q=6 k=9 q=6
X-METRA-ADA
(b) TyDiQA-GoldP on multiple languages.
Figure 3.7: K-Shot Analysis for different downstream tasks. For MTOD, the number of shots is the
same for both support and query sets (i.e. = ). For TyDiQA-GoldP, we use different shots for
both the support and query sets. The best results across models for each subtask and language are
highlighted in bold.
K-Shot Analysis We perform a k-shot analysis by treating the number of instances seen per
class (i.e., ‘shots’) as a hyperparameter to determine at which level few-shot meta-learning starts to
outperform the fine-tuning and monolingual baselines. As shown in Figure 3.7a, it seems that while
40
even one shot for X-METRA-ADA is better than fine-tuning on intent classification, = = 9
shot and = = 6 shot are at the same level of stability with very slightly better results for 6
shot showing that more shots beyond this level will not improve the performance. While 1 shot
performance is slightly below our monolingual baseline, it starts approaching the same level of
performance as 3 shots upon convergence. Figure 3.7b shows an analysis over both and shots for
TyDiQA-GoldP. In general, increasing helps more than increasing . The gap is bigger between
= 6 = 3 and = 6 = 6, especially for languages like Bengali and Telugu. We can also see that
= 6 = 3 is at the same level of performance as FT for those languages.
91.2 92.3
92.3 91.4 90.7 87.2 91.2
93.0 92.1 92.9
51.4
59.6
64.3 62.4
67.6
52.3
60.3 60.7 62.4 60.9
40
50
60
70
80
90
100
10% 30% 50% 70% 100%
Accuracy/F1 Scores
Percentage of Query Data
FT (Intents) X-METRA-ADA (Intents) FT (Slots) X-METRA-ADA (Slots)
Intents Slots
(a) Spanish.
72.3
77.2
72.7
75 74.2 78.9
80.2 81.3
87.4 86.3
47.6
60.3 63 65.2 66
51.1 58.7
66.3 69.1 69.6
40
50
60
70
80
90
Accuracy/F1 Scores
10% 30% 50% 70% 100%
Percentage of Query Data
FT (Intents) X-METRA-ADA (Intents) FT (Slots) X-METRA-ADA (Slots)
Intents Slots
(b) Thai.
Figure 3.8: Downsampling analysis on MTOD with different percentages of query data. The best
results across models for each subtask are highlighted in bold.
Downsampling Analysis We perform a downsampling analysis, where we gradually decrease
the proportion of the overall set from which the target language is sampled and used for few-shot
learning in X-METRA-ADA and FT. Figures in 3.8 show a comparison between intent accuracies
and slot F1 scores between the main models X-METRA-ADA and FT on Spanish and Thai. We
notice that as the percentage of query data increases, the gap between X-METRA-ADA and FT
increases slightly, whereas the gain effect on slots is steadier. This suggests that X-METRA-ADA
is at the same level of effectiveness even for lower percentages. Due to the typological similarity
between Spanish and English, even lower percentages starting from 50% of the query reach a
maximal performance for both intent classification and slot filling.
BERTology Analysis We analyze the degree of contribution of M-BERT layers by freezing
each pair of layers separately. Our analysis is not conclusive as the performance doesn’t change
significantly between layers. We then proceed to freeze all layers of M-BERT to discover that linear
layers are more important in refining the cross-lingual alignment to the target language, as shown by
the narrow gap between freezing vs non-freezing BERT layers in Figure 3.9. This can be explained
by the challenge of fine-tuning M-BERT alone with many layers and higher dimensionality for such
a low-resource setting.
41
0 1000 2000 3000 4000
0.2
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
Spanish: Fine-tune All
Spanish: Freeze BERT
Thai: Fine-tune All
Thai: Freeze BERT
# Training Mini-batches
Accuracy Scores
Figure 3.9: The effect of freezing BERT layers of X-METRA-ADA during few-shot on intent
classification.
3.3 Multilingual Meta-Transfer Learning
The web offers a wealth of information in multiple languages, presenting a challenge for reliable,
efficient, and accurate information retrieval. Users across the globe may express the need to retrieve
relevant content in a language different from the language of the query or in multiple languages
simultaneously. These observations bolster the strong demand for multilingual semantic search.
Compared to bilingual semantic search, often portrayed as cross-lingual information retrieval (Savoy
and Braschler, 2019; Grefenstette, 1998), multilingual semantic search, which involves retrieving
answers in multiple languages, is under-explored and more challenging. One of the main challenges
of multilingual semantic search is the need to circumvent “language bias”. Language bias is the
tendency of a model to prefer one language over another, making it prone to retrieve answers
from the preferred language more, regardless of how relevant they truly are. For example, a
weakly aligned model that clusters relevant content and queries more by language while poorly
encapsulating their meaning could pick answers that match the language of the query even if they
are incorrect. Circumventing language bias requires stronger alignment to factor out the language
influence so that the most semantically relevant pairs across languages stand out as closest in the
embedding space (Roy et al., 2020).
The majority of approaches used to improve the alignment between languages require parallel
resources across languages (Cao et al., 2020; Zhao et al., 2021), which are expensive to obtain
especially for multilingual tasks and biased towards high-resource language pairs/combinations.
Pre-trained unsupervised multilingual encoders such as M-BERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020) have been employed as off-the-shelf zero-shot tools for cross-lingual
and multilingual downstream applications. However, these models still fail to significantly outperform traditional static cross-lingual embedding (Glavaš et al., 2019) on multilingual semantic
search (Litschko et al., 2021). Simply fine-tuning M-BERT and XLM-R on English data is not
sufficient to produce an embedding space that exhibits strong alignment (Roy et al., 2020). In fact,
fine-tuning such models to largely available monolingual data makes them prone to "monolingual
overfitting" as they are shown to transfer reasonably well to other monolingual semantic search
settings but not necessarily to bilingual and multilingual settings (Litschko et al., 2022).
42
Knowledge distillation and contrastive-distillation learning approaches are used to produce
better-aligned multilingual sentence representations with reduced need for parallel corpora (Reimers
and Gurevych, 2020; Tan et al., 2023). However, they still rely on some supervision in the form of
monolingual corpora and back-translation. Cross-lingual meta-transfer learning (Nooralahzadeh
et al., 2020; M’hamdi et al., 2021) leveraging MAML (Finn et al., 2017) has been shown to reduce
overfitting to high-resource monolingual setups and improve the generalization to new languages
with little to no training. However, meta-learning can also be prone to overfitting when multiple
source domains or task variants are trained on as part of one single model, which undermines its
transferring capabilities (Zhong et al., 2022).
Teacher T-MAML Student S-MAML
Monolingual Bilingual Multilingual
Knowledge Distillation
MAML-Align Application
Few-shot
Zero-shot
Greek
Greek
Greek Greek
Arabic
Arabic
Greek Arabic
Arabic
Greek
Greek Hindi
Arabic
Greek
Greek Hindi
Russian
Thai
Russian Turkish
Figure 3.10: A high-level diagram of our meta-distillation MAML-Align framework for multilingual semantic search and some of its application scenarios. This differs from standard cross-lingual
transfer setups where the focus is on transferring between individual languages. Given the nature of
the downstream task where multiple language combinations could be used in the query and content
to be retrieved, we study the transfer here between different variants of the task. As illustrated above,
we focus on the three most to least resourced variants where the queries and contents are either
from the same language (monolingual), two different languages (bilingual), or multiple languages
(multilingual). We leverage knowledge distillation to align between the teacher T-MAML (Finn
et al., 2017), specialized in transferring from monolingual to bilingual, and the student S-MAML
specialized in transferring from bilingual to multilingual semantic search. We show the merit of
gradually transferring between those variants through few-shot and zero-shot applications involving
different language arrangements in the training and evaluation.
To obtain a stronger alignment while preventing monolingual overfitting and with decreased
reliance on parallel resources, we propose a low-resource adaptation of meta-distillation learning
to multilingual semantic search. We pursue a meta-learning direction based on MAML to allow
us to effectively leverage high-resource monolingual and bilingual variants of semantic search to
effectively transfer to multilingual semantic search. To improve the meta-transferring capabilities
of MAML, we explore the combination of meta-learning and knowledge distillation (Zhou et al.,
2022; Liu et al., 2022; Zhang et al., 2020) and propose a new algorithm for gradually adapting them
to the task of multilingual semantic search MAML-Align (Figure 3.10). We perform MAML-Align
in two stages 1) from monolingual to bilingual and 2) from bilingual to multilingual, to create
a more gradual feedback loop, which makes it easier to generalize to the multilingual case. We
conduct experiments on two different semantic search benchmarks: LAReQA (Roy et al., 2020),
43
a span-based question-answering task reformulated as a retrieval task, and STSBMulti (Cer et al.,
2017), a semantic similarity task. Our experiments show that our multilingual meta-distillation
approach beats vanilla MAML and achieves statistically significant gains of 0.6% and 10.6%
on LAReQA and 1.2% and 2.5% on STSBMulti over an off-the-shelf zero-shot baseline based on
sentence transformers (Reimers and Gurevych, 2019) and naive fine-tuning, respectively. We also
show consistent gains for both benchmarks on different languages, even those kept for zero-shot
evaluation. Our approach is model-agnostic and is extensible to other challenging multilingual and
cross-lingual downstream tasks requiring strong alignment.
Our main contributions20 can be summarized as follows:
(1) We are the first to propose a meta-learning approach for multilingual semantic search (Section 3.3.2) and to curate meta-tasks for that effect (Section 3.3.3.3).
(2) We are the first to propose a meta-distillation approach to transfer semantic search ability
between monolingual, bilingual, and multilingual data (Section 3.3.2.2).
(3) We systematically compare between several few-shot transfer learning methods and show the
gains of our multilingual meta-distillation approach (Section 3.3.4.1).
(4) We also conduct ablation studies involving different language arrangements and sampling
approaches (Section 3.3.4.2).
3.3.1 Multilingual Semantic Search
In this section, we define sentence-level semantic search and its different categories (Section 3.3.1.1),
language variants (Section 3.3.1.2), and supervision degrees (Section 3.3.1.3).
3.3.1.1 Task Formulation
Our base task is sentence-level semantic search. Given a sentence query from a pool of queries Q,
the goal is to find relevant content from a pool of candidate contents R. The queries are sentences,
and retrieved contents are either sentences or small passages of a few sentences.
In terms of the format of the queries and contents, there are two main categories of semantic
search:
• Symmetric Semantic Search. Each query and its corresponding relevant content have
similar length and format.
• Asymmetric Semantic Search. and are not of the same format. For example, and can
be a question and a passage answering that, respectively.
20This work is under review and available in ArXiv (M’hamdi et al., 2023).
44
3.3.1.2 Task Language Variants
In the context of languages, we distinguish between three variants of semantic search at evaluation
time (also shown in Figure 3.10):
• Monolingual Semantic Search (mono). The pools of queries and candidate contents Q and
R are from the same known and fixed language ℓQ = ℓR ∈ L .
• Bilingual Semantic Search (bi). The pools of queries and candidate contents are sampled
from two different languages {ℓQ, ℓR } ∈ L 2
, such that ℓQ ≠ ℓR. This is also termed as
cross-lingual semantic search (Savoy and Braschler, 2019).
• Multilingual Semantic Search (multi). This is the problem of retrieving relevant contents
from a pool of candidates from a subset of multiple languages LR ⊆ L to a query expressed
in a subset of multiple languages LQ ⊆ L . Unlike other variants (monolingual and bilingual),
multilingual semantic search doesn’t restrict which languages can be used in the queries or
the candidate contents.
3.3.1.3 Supervision Degrees
In the absence of enough training data for the task, we distinguish between three degrees of
supervision of semantic search:
• Zero-Shot Learning. This resembles ad-hoc semantic search in that it doesn’t involve any
fine-tuning specific to the task of semantic search. Rather, off-the-shelf pre-trained language
models are used directly to find relevant content to a specific query. This still uses some
supervision in the form of parallel sentences used to pre-train those off-the-shelf models. In
the context of multilingual semantic search, we include in the zero-shot learning case any
evaluation on languages not seen during fine-tuning.
• Few-Shot Learning. Few-shot learning is used in the form of a small fine-tuning dataset.
In the context of multilingual semantic search, few-shot learning on a particular language
implies that that language is seen during fine-tuning or meta-learning either to represent the
query or the contents to be retrieved.
3.3.2 Multilingual Meta-Distillation Learning
In this section, we present the original MAML algorithm (Section 3.3.2.1), then we present our
optimization-based meta-distillation learning algorithm MAML-Align (Section 3.3.2.2) and how it
differs from MAML.
45
3.3.2.1 Original MAML Algorithm
Our first variant is a direct adaptation of MAML to multilingual semantic search. We use the
procedure outlined in Algorithm 1. We start by sampling a batch of meta-tasks from a meta-dataset
distribution DX ′ , which simulates the transfer from to ′. and ′ are different task language
variants of semantic search (Section 3.3.1.2) from which the support and query sets are sampled,
respectively. We start by initializing our meta-learner parameters with the pre-trained base model
parameters . For each batch of meta tasks, we perform an inner loop (Algorithm 2): we go
over each meta-task = (
, ) in T where we update using
. After steps of this update,
we pre-compute the loss of on
′
and save it for later. At the end of all meta-tasks in the
batch, we perform one outer loop by summing over all pre-computed gradients and updating .
Following X-METR-ADA (Section 3.2.1.2), we perform this algorithm in two stages: meta-train
and meta-valid, where meta-valid is a replication of meta-train with the main difference being the
task language variant arrangements used to sample the meta-tasks.
3.3.2.2 MAML-Align Algorithm
The idea behind this extension is to use knowledge distillation to distill T-MAML to S-MAML
where T-MAML and S-MAML are replicates of MAML and T-MAML is more high-resource
than S-MAML. Following Section 3.2, which shows that multiple phases of bi-level optimization
encourage faster adaptation to low-resource languages, we adopt a gradual approach to meta-transfer
across different task language variants with the help of knowledge distillation. Given meta-tasks
from DXY and DYZ, the goal is to use that shared task language variant of transfer to align
different modes of transfer of semantic search. We start by executing the two inner loops of the
two MAMLs (with more inner steps for T-MAML than S-MAML), where the support sets are
sampled from and , respectively. Then, we compute, in the optimization process of the outer
loop, the weighted combination of L , the average over the task-specific losses on the query sets
sampled from and , and L , the mean-squared error on . Figure 3.11 illustrates a conceptual
comparison between MAML and MAML-Align.
3.3.3 Experimental Setup
In this section, we describe the downstream datasets and models used (Section 3.3.3.1), their
formulation as meta-tasks (Section 3.3.3.3), and the different baselines and model variants used in
the evaluation (Section 3.3.3.4).
3.3.3.1 Downstream Benchmarks
We evaluate our proposed approaches over the following multilingual and bilingual sentence-level
semantic search datasets:
• Asymmetric Semantic Search. We use LAReQA (Roy et al., 2020), focusing on XQuADR, which is a retrieval-based task reformulated from the span-based question answering
46
Algorithm 4 MAML-Align: Knowledge distillation to align two different MAMLs (X→Y→Z)
Require: Meta-task set distributions DXY and DYZ sharing the same , pre-trained downstream base
model with parameters , and meta-learners X Y with parameters (, , , ) and Y Z with
parameters (′, , , ′), where ′ < .
1: Initialize ← .
2: Initialize ′ ← .
3: while not done do
4: Sample batch of tasks T ∼ DXY.
5: Sample batch of tasks T ′ ∼ DYZ.
6: Í
L
T
,
Í
L
T
= INNER_LOOP(TXY, , , ).
7: Í
L
T′,
Í
L
T′ = INNER_LOOP (TYZ, ′, , ′).
8: L = (
Í
L
T
+ L
T′ )/2.
9: L = (
Í
L
T
−
Í
L
T′)
2
.
10: Update ← − ∇ (L + L ).
11: end while
Figure 3.11: A conceptual comparison between MAML-Align and the original meta-learning
baseline MAML. A single iteration of MAML involves one inner loop optimizing over a batch of
support sets from a source language variant of the task followed up by an outer loop optimizing
over the batch query sets curated from the target task variant. In MAML-Align, on the other hand,
we curate two support sets and one query set, where the second support set is used as both a query
and support set in T-MAML and S-MAML, respectively. We perform two inner loops. Then, in the
outer loop, we optimize jointly over the distillation and task-specific losses of the query sets.
XQuAD (Artetxe et al., 2020). This dataset covers 11 languages. In this work, we only use
seven languages. Arabic, German, Greek, and Hindi are used for few-shot learning. Russian,
Thai, and Turkish are kept for zero-shot evaluation. There are less than 1200 questions and
1300 candidates for each language.21
• Symmetric Semantic Search. As there is no multilingual parallel benchmark for symmetric
search, we focus, in our few-shot learning experiments, on a small-scale bilingual benchmark.
21We download the data from https://github.com/google-research-datasets/lareqa.
47
We use STSBMulti from SemEval-2017 Task 1 (Cer et al., 2017).22 This is a semantic similarity
benchmark, which consists of a collection of sentence pairs drawn mostly from news headlines.
It covers English-English, Arabic-Arabic, Spanish-Spanish, Arabic-English, Spanish-English,
and Turkish-English. There are only 250 sentence pairs for each language pair.
Tables 3.10 and 3.11 show a summary of the statistics of LAReQA and STSBMulti per language
and split, respectively. XQuAD-R in LAReQA has been distributed under the CC BY-SA 4.0
license, whereas STSBMulti has been released under the Creative Commons Attribution-ShareAlike
4.0 International License. The translated datasets from SQUADEN and STSBEN are shared under
the same license as the original datasets. SQUADEN is shared under XTREME benchmark Apache
License Version 2.0. STSBEN scores are under Creative Commons Attribution-ShareAlike 3.0
Unported (CC BY-SA 3.0) and sentence pairs are shared under Commons Attribution - Share Alike
4.0 International License).
Language ISO
Train Dev Test
#Q #C #Q #C #Q #C
Arabic AR 696 783 220 255 274 184
German DE 696 812 220 256 274 208
Greek EL 696 788 220 254 274 192
Hindi HI 696 808 220 252 274 184
Russian RU 696 774 220 262 274 183
Thai TH 696 528 220 178 274 146
Turkish TR 696 732 220 248 274 187
Table 3.10: Statistics of LAReQA in each 5-fold cross-validation split. #Q denotes the number of
question whereas #C denotes the number of candidates.
3.3.3.2 Base Models
For asymmetric semantic search, we use a Transformer-based triplet-encoder model. In the original
paper on the asymmetric benchmark we evaluate on (Roy et al., 2020), a dual-encoder model is
trained using contrastive loss in the form of an in-batch sampled softmax loss. This format reuses,
for each question, answers from other questions in the same batch (batched randomly) as negative
examples. Instead, we use triplet loss (Schroff et al., 2015), which was also shown to outperform
contrastive loss in general. Triplet loss is shown to surpass contrastive loss in general.23 Its strength
derives not just from the nature of its function but also from its sampling procedure. This sampling
procedure, which merely requires positive instances to be closer to negative instances, doesn’t
22Downloaded from https://alt.qcri.org/semeval2017/task1/index.php?id=data-and-t
ools.
23As posited in https://shorturl.at/ktvx9.
48
Language Pair ISO
# Sentence Pairs
Train Dev Test
English-English EN-EN 150 50 50
Spanish-Spanish ES-ES 150 50 50
Spanish-English ES-EN 150 50 50
Arabic-Arabic AR-AR 150 50 50
Arabic-English AR-EN 150 50 50
Turkish-English* TR-EN 150 50 50
Table 3.11: Statistics of the STSBMulti from SEM-Eval2007 in each 5-fold cross-validation split. *
means that for Turkish-English, there are only 250 ground truth similarity scores, while there are
500 sentence pairs. We assume that the ground truth scores are only for the first 250 sentence pairs.
In addition to that, we use 5749 train, 1500 dev, and 1379 test splits from the STSB original English
benchmark.
require gathering as many positive examples as contrastive loss requires. This makes triplet loss
more practical in our few-shot learning multilingual/cross-lingual scenario, as it provides more
freedom in terms of constructing negative candidates to tweak different sampling techniques from
different languages. We thus define a triplet encoder model (shown in Figure 3.12) with three
towers encoding the question, its answer combined with its context, and the negative candidates and
their contexts. While those towers are encoded separately, they still share the same Transformer
encoder model, which is initialized with pre-trained Sentence Transformers. On top of that, two
dot products (, ) and (, ) are computed. (, ) is the dot product between the question
and its answer , whereas (, ) is between and its non-answer candidate. Triplet loss is
computed as : L = max ( (, ) − (, ) + , 0) where is a tunable hyperparameter
to eventually make each triplet an easy one by pushing the distance (, ) closer to 0 and (, ) to
(, ) + .
Triplets (, , ) can be sampled with different levels of difficulty, as follows:
• Easy triplets: (, ) + < (, ).
• Hard triplets: (, ) < (, ).
• Semi-hard triplets: (, ) < (, ) < (, ) + .
For symmetric search, we use a Transformer-based dual-encoder model (shown in Figure 3.13),
which encodes sentence 1 and sentence 2 in each sentence pair separately using the same shared
encoder. Then, the cosine similarity score is computed for each sentence pair, and the mean squared
error (squared L2 norm) is computed between that and the golden score. This is not a retrieval-based
task but a semantic similarity task.
49
Question Answer + context Other candidate
+ context
Multilingual
Transformer
Encoder
(Shared)
Pooling Layer Pooling Layer Pooling Layer
Dot Product Dot Product
Triplet Loss
Multilingual
Transformer
Encoder
(Shared)
Multilingual
Transformer
Encoder
(Shared)
Figure 3.12: Architecture of Transformer-based triplet encoder for asymmetric semantic search:
We use three towers encoding the question, answer combined with its context, and the negative
candidate and its context. On top of that, two Euclidean distances are computed between the
question and the answer and between the question and the negative candidate, respectively. Then,
triplet loss is optimized to minimize the distance between the question and the answer encodings
and to maximize the distance between the question and the negative candidate encodings.
Sentence 1 Sentence 2
Pooling Layer Pooling Layer
Dot Product
Cosine Similarity Loss
Multilingual
Transformer
Encoder
(Shared)
Multilingual
Transformer
Encoder
(Shared)
Figure 3.13: Architecture of Transformer-based dual-encoder for symmetric semantic search: We
use a dual-encoder model that encodes each sentence pair using the same shared encoder. Then,
we minimize the mean squared error between that similarity score and the golden score for each
sentence pair.
3.3.3.3 Meta-Datasets
Following our formulation of downstream semantic search benchmarks, we independently construct
the support set in each meta-task by sampling a batch of question/answer/negative candidates
triplets and sentence pairs in LAReQA and STSBMulti, respectively. Then, we construct triplets or
sentence pairs in the query set by picking for each triplet or sentence pair in either a similar or
random triplet or sentence pair.24
24Details of transfer modes and their support and query set language arrangements are in Appendix B.1.1.
50
3.3.3.4 Baselines & Model Variants
Since we are the first, to the best of our knowledge, to explore meta-learning for bilingual or
multilingual information retrieval or semantic search, we only compare with respect to our internal
variants and design some external non-meta-learning baselines. We are also the first to explore
fine-tuning and meta-learning on extremely small-scale data using cross-validation splits on both
benchmarks. This makes it hard to compare with existing approaches; therefore, we rely more on
our own internal baselines.
Baselines. We design the following baselines:
• Zero-Shot: This is our initial zero-shot approach based on an off-the-shelf pre-trained language
model. Based on our preliminary performance evaluation of different existing and state-ofthe-art off-the-shelf language models in Table B.2, we use the best model on our 5-fold
cross-validation test splits, which is sentence-BERT (S-BERT) as our zero-shot model.25
• Fine-tune: On top of our off-the-shelf zero-shot baseline S-BERT, we fine-tune jointly and
directly on the support and query sets of each meta-task in both meta-train and meta-valid.
This few-shot baseline makes for a fair comparison with the meta-learning approaches.
Internal Variants. We design the following meta-learning variants:
• MAML: On top of S-BERT, we apply MAML (Algorithm 1). At each episode, we conduct a
meta-train followed by a meta-valid stage.
• MAML-Align: On top of S-BERT, we apply MAML-Align (following Algorithm 4).
External Evaluation. To assess the impact of using machine translation models with or without
meta-learning and the impact of machine translation from higher-resourced data, we explore
Translate-Train (T-Train), where we translate English data in SQUADEN
26 and STSBEN
27 to the
evaluation languages. We then either use translated data in all languages or in each language
separately as a data augmentation technique.
3.3.4 Results & Analysis
This section presents the results obtained using different meta-learning model variants compared to
the baselines. Given the extremely small-scaled dataset we are working with (Tables 3.10 and 3.11),
all experiments are evaluated using 5-fold cross-validation and the mean is reported. Following
XTREME-R (Ruder et al., 2021) and SemEval-2017 (Cer et al., 2017), scores are reported using
25paraphrase-multilingual-mpnet-base-v2 in https://huggingface.co/sentence-transformers.
26We use the translate.pseudo-test provided for XQuAD dataset by XTREME (Hu et al., 2020) https://consol
e.cloud.google.com/storage/browser/xtreme_translations.
27We use the translated dataset from the original English STSB https://github.com/PhilipMay/stsb-m
ulti-mt/.
51
Model LAReQA STSBMulti
Zero-Shot 57.0 81.4
Few-Shot Learning
Fine-tune 47.0 79.9
MAML(*) 57.2 81.3
MAML-Align(*) 57.6 82.4
Machine-Translation
T-Train+Fine-tune 46.1 73.7
T-Train+MAML(*) 57.0 80.9
Table 3.12: This is a comparison of different zero-shot baselines, few-shot learning, and machine
translation-enhanced models. Other zero-shot external models (Table B.2) show sub-optimal results
so we don’t include them. For LAReQA and STSBMulti, we report mAP@20 and Pearson’s r × 100,
respectively. All results are averaged over 5-fold cross-validation and multiple language choices.
Models in (*) are our main contribution. We report the average over many model variants translating
from English to one target language at a time for T-Train model variants. Best and second-best
results for each benchmark are in bold and underlined, respectively.
mean average precision at 20 (mAP@20) and Pearson correlation coefficient percentage (Pearson’s
r × 100) for LAReQA and STSBMulti, respectively.28
3.3.4.1 Multilingual Performance Evaluation
Table 3.12 summarizes the multilingual performances across different baselines and model variants
for both semantic search benchmarks. On average, we notice that MAML-Align achieves better
results than MAML or S-BERT zero-shot base model and significantly better than Fine-tune. It is
worth noting that we report the results for MAML using trans mode, which is trained over a combination of mono→bi and bi→multi in the meta-training and meta-validation stages, respectively.
This suggests that MAML-Align helps more in bridging the gap between those transfer modes.
We perform a paired two-sample for means t-Test and find that the gains using MAML-Align are
statistically significant with p-values of 0.002 13 and 0.002 48 compared to S-BERT and MAML
respectively on LAReQA, rejecting the null hypothesis with 95% confidence.29 We also observe
that fine-tuning baselines are consistently weak compared to different meta-learning model variants,
especially for LAReQA. We conjecture that fine-tuning is overfitting to the small amounts of training
data, unlike meta-learning approaches, which are more robust against that. However, for STSBMulti,
28 More fine-grained results for all languages and for both benchmarks can be found in Tables B.4 and B.5 in
Appendix B.2.
29We obtain p-values results 0.0134 and 7.04 × 10−10 when comparing MAML-Align to S-BERT and MAML using
paired t-test on top of bootstrap sampling on the results of each query before taking the mean. The gains using
MAML-Align are uniformly consistent for different cross-validation splits.
52
56.3 54.6
58.2 57.2 58.7 60.2
54.1
45.8 46.5 48.6
45
48.9 49.4
45
56 54.8
59.1
57 59.1 59.9
54.4 57 55.1
59.2 57.7 59.5 60.2
54.6
40
50
60
70
Arabic German Greek Hindi Russian (*) Thai (*) Turkish (*)
mAP@20
Zero-Shot Fine-tune MAML MAML-Align
(a) On LAReQA.
85.5
77.6
84.6 81.3 83.7
75.7
85
77.2
86.2
77.8 79.6
73.7
85.6
77.6
85.1
80.9
83.5
75.5
90.6
79
86.6
80.6 81.5
76.3
70
75
80
85
90
95
English-English Arabic-Arabic Spanish-Spanish Arabic-English Spanish-English Turkish-English
Pearson's r x 100
Zero-Shot Fine-tune MAML MAML-Align
(b) On STSBMulti.
Figure 3.14: mAP@20 and Pearson’s r × 100 5-fold cross-validated multilingual performance
evaluation evaluated on LAReQA and STSBMulti in the first and second subplots, respectively. There
are consistent gains in favor of MAML and MAML-Align compared to their fine-tuning and ZeroShot counterparts for all languages and language-pairs. Languages in (*) are used for zero-shot
evaluation, whereas other languages are included either during Meta-train and Meta-valid stages or
fine-tuned on. Best results for each language or language pair are highlighted in Bold.
the gap between fine-tuning and meta-learning while still existing and in favor of meta-learning is a
bit reduced. We hypothesize that even meta-learning models are suffering from meta-overfitting to
some degree in this case.
We notice that T-Train+MAML on top of machine-translated data doesn’t necessarily boost
the performance on LAReQA or STSBMulti on average. This suggests that not all languages used in
the machine-translated data provide an equal boost to the performance due to noisy translations
for certain languages. While introducing higher quality machine translations could be beneficial
in general, there is a compromise to be made in terms of translation API calls overheads and
human labor to evaluate the quality of the translations. The purpose of this work is to evaluate in
few-shot learning scenarios rather than using data augmentation for that effect. We conjecture that
based on our observation in this few-shot learning setup, meta-learning on top of higher-quality
machine-translated data could boost the performance even more.
Figure 3.14 highlights a fine-grained comparison between different model categories on all
languages and language pairs for each benchmark. We notice that the gain in favor of meta-learning
approaches is consistent across different languages and language pairs. This confirms our findings
that while MAML improves a bit over Zero-Shot, reducing the impact of overfitting that vanilla Finetune suffers from, MAML-Align boosts the gains of meta-learning on all languages and language
53
pairs except for Arabic-English and Spanish-English. The gain applies to zero-shot languages such
as Russian and Turkish.
3.3.4.2 Ablation Studies
Due to the lack of parallelism in STSBMulti making a multilingual evaluation on it not possible, we
focus hereafter on LAReQA in the remaining analysis and ablation studies. Figure 3.15 shows the
results across different modes of transfer for Fine-tune and MAML. Among all transfer modes,
trans, mono→bi, and mono→mono have the best gains, whereas bi→multi and mixt are the weakest
forms of transfer. Trans, which uses mono→bi during meta-train and bi→multi during meta-valid,
is the best transfer mode for MAML while being one of the weakest for Fine-tune. This not only
shows that curating different transfer modes for different meta-learning processes is beneficial, but
it also suggests that meta-learning is more effective at multi-stage adaptation than fine-tuning on
them jointly. Mixt is weaker than trans, and this implies that jointly optimizing different forms of
transfers of meta-tasks in one stage makes it harder for MAML to learn to generalize. MAML-Align
is shown to be better for combining different optimization objectives.
47
57
47
57
41.9
55.9
35.1
55.5
40.1
55.8
40.7
57.2
30
40
50
60
Fine-tune MAML
mAP@20
mono mono mono bi mono multi
bi multi mixt trans
Figure 3.15: mAP@20 multilingual performance averaged over 5-fold cross-validation splits on
LAReQA comparing between different meta-transfer modes for Fine-tune and MAML models. The
gap is large between Fine-tune and MAML across all meta-transfer modes and is even larger in favor
of MAML when trans mode (uses mono→bi and bi→multi in the meta-training and meta-validation,
respectively) is used.
Figure 3.16 shows a multilingual performance comparison between different sampling modes in
meta-tasks constructions. In each meta-task, we either sample the query set that is the most similar
to its corresponding support set (Similar) or randomly (Random). We hypothesize that the sampling
approach plays a role in stabilizing the convergence and generalization of meta-learning. While we
were expecting that sampling for each support set a query set that is the most similar to it would help
meta-learning converge faster and thus generalize better, it generalized worse on the multilingual
performance in this case. On the other hand, random sampling generalizes better to out-of-sample
test distributions, leading to lower biases between languages in the multilingual evaluation mode.
Figure 3.17 shows the results for different sampling modes of negative examples in the triplet
loss. For each support and query set in each meta-task, we either sample random, hard, or semi-hard
54
57.2
57.6
56.1
55.7
54
55
56
57
58
MAML MAML-Align
mAP@20
Random Similar
Figure 3.16: mAP@20 multilingual 5-fold cross-validated performance on LAReQA between
different query set sampling modes in meta-tasks for MAML and MAML-Align. We notice that
random query sampling has better generalization for both models.
triplets to test the added value of triplet sampling in few-shot learning. We follow the same approach
outlined in Schroff et al. (2015) to sample hard and semi-hard triplets. While we expect training
with more hard triplets to help converge the triplet loss in MAML, the multilingual performance
using this type of sampling falls short of random sampling. This is due to the fact that more
sophisticated ways of triplet loss sampling usually require a more careful hyper-parameter tuning
to pick the right amount of triplets. For few-shot learning applications, this usually results in a
significant reduction in the number of training examples, which could further hurt the generalization
performance. In future work, we plan to investigate hybrid sampling approaches to monitor at which
point in meta-learning the training should focus more on hard or easy triplets. This could be done
by proposing a regime for making the sampling of meta-tasks dynamic and flexible to also combat
meta-over-fitting.
57
55.1 59.2 57.7 59.5 60.2
54.6
55
53.5
57.1 56.2 57.8 58.5
53
54.5
53.7 57 55.6 57.4
58.6
53.1
45
50
55
60
65
Arabic German Greek Hindi Russian Thai Turkish
mAP@20
Random Semi-hard Hard
Figure 3.17: mAP@20 5-fold cross-validated mean multilingual performance over different triplet
negative sampling modes on LAReQA tested on different languages using MAML-Align. Random
sampling seems best on average for few-shot learning, whereas hard sampling is more stable across
cross-validation splits.
55
3.4 Conclusion
In this Chapter, we explore different cross-lingual transfer learning fine-tuning approaches targeting
low-resource applications. We draw some insights into the cross-lingual fine-tuning mechanism and
its challenges. Then, we propose meta-learning algorithms and evaluate their ability to learn a fast
adaptation and generalization to low-resource languages with less data. We show the effectiveness
of our meta-learning adaptation both cross-lingually and multilingually.
In Section 3.1, we propose a cross-lingual approach for event trigger extraction using a direct
transfer of annotation framework based on multilingual embedding. Compared to previous approaches, our approach doesn’t rely on hand-crafted linguistic features or machine translation. We
evaluate this approach using event trigger extraction architectures with type-based unsupervised
embedding (FastText and MUSE) and supervised embedding tuned to the context (BERT). Our
results for both English and Chinese show competitive performance with baselines on the ACE2005
benchmark, even in the zero-shot learning scheme. Although results using MUSE are lower for
English, they are on par with Chinese baselines and better for Arabic compared to BERT. We
observe a generous boost in performance when English is added to the target language and when all
languages are combined together to train one cross-lingual model, especially for Arabic. Our results
are promising compared to both feature-based approaches and cross-lingual approaches based on
machine translation.
In Section 3.2, we adapt a meta-learning approach for cross-lingual transfer learning in natural
language understanding tasks. Our experiments cover two challenging cross-lingual benchmarks:
task-oriented dialog and natural questions, including an extensive set of low-resource and typologically diverse languages. X-METRA-ADA outperforms and reaches better convergence stability
over both fine-tuning baselines on top of pre-trained multilingual encoder models, reaching a new
state of the art for most languages.
In Section 3.3, we adapt multilingual meta-transfer learning to combine MAML and knowledge
distillation to multilingual semantic search. Our experiments show that our multilingual metaknowledge distillation approach outperforms both vanilla MAML and fine-tuning approaches on top
of a strong sentence transformers model. We evaluate comprehensively on two types of multilingual
semantic search and show improvement over the baselines even on unseen languages.
56
Chapter 4
Cross-lingual Lifelong Learning
With more than 7,000 languages spoken around the globe, downstream applications still lack
proper linguistic resources across languages (Joshi et al., 2020), necessitating the use of transfer
learning techniques that take advantage of data that is mismatched to the application. To simplify
architecture complexity and energy consumption, it is desirable to unify multi-lingual performance
into a single, parameter- and memory-constrained model and to allow this model to evolve, learning
on multi-lingual training data as it becomes available without having to pre-train or fine-tune from
scratch. Such is the longstanding goal of language representation learning. Existing multi-lingual
representations such as M-BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) are strong
pillars in cross-lingual transfer learning, but if care is not taken when choosing how to fine-tune
them, they can neglect to maximize transfer (Ruder et al., 2019a) to new tasks or languages and are
subject to forgetting (McCloskey and Cohen, 1989), where performance decreases after exposure to
new task or language.
Most previous work that attempts to deal with the challenge of transfer exploitation and forgetting
mitigation focuses on the problem of sequentially learning over different NLP downstream tasks
or domains (Sun et al., 2020; Han et al., 2020; Madotto et al., 2021), rather than on language
shifts. Indeed, the current literature for learning over sequences of languages is rather scarce and
is mostly reduced to cross-lingual transfer learning between a pair of languages (Liu et al., 2021;
Garcia et al., 2021; Muller et al., 2021; Pfeiffer et al., 2021; Minixhofer et al., 2022). Liu et al.
pre-train a (parent) language model and then fine-tune it on a downstream task in one of several
different (child) languages. This conflates task and language transfer and confuses analysis – the
interference between the pre-trained language model ‘task’ and the fine-tuned task, along with the
parent and child languages, cannot be disentangled. Garcia et al. propose an adaptation scheme
for each new language pair independently while retaining the translation quality on the parent
language pairs. Similarly, Muller et al. (2021) and Pfeiffer et al. (2021) propose lexical and semantic
level techniques to adapt to target languages. However, all these mentioned works still focus on
the ‘one-hop’ case, consisting of two steps: (1) training on initial parent language(s) (pairs), then
(2) adapting to new children language(s) (pairs); the effect of multiple shifts in the datastream is
not trivially generalizable to more than one hop. More recently, Pfeiffer et al. (2022) propose an
approach for language-specific modules based on adapters and evaluate that on sequential streams
of languages. However, they only focus on adapters and two desiderata of continual learning:
57
Figure 4.1: We present here an overview of cross-lingual continual learning, an extension of the
general continual learning paradigm illustrated in Figure 2.7. We use an example of a non-stationary
datastream moving from high to low resource languages. Each bold and dashed box represents
either a training or test data instance being fine-tuned or evaluated on, respectively. To support this
problem setup, we evaluate the cross-lingual capabilities of continual approaches. Those capabilities
include knowledge preservation on old languages, accumulation to the current language, and
generalization to unseen languages at each point of the training. In addition to that, we evaluate
model utility at the end of continual learning.
interference mitigation and transfer maximization. We need a more robust and comprehensive finegrained evaluation that balances the dynamics between different cross-lingual continual learning
desiderata.
In this Chapter, we pave the way for a more comprehensive multi-hop continual learning
evaluation that simulates the sequential learning of a single task over a stream of input from
different languages. This evaluation paradigm requires experimentation over balanced streams
of data scenarios for > 2. Unlike previous work, this thesis concretely defines the following
comprehensive goals along with their evaluation metrics as guidelines for analyzing the crosslingual capabilities of multilingual sequential training: knowledge preservation, accumulation,
generalization, and model utility as shown in Figure 4.1. We apply our test bed to a six-language
task-oriented dialogue benchmark and comprehensively analyze a wide variety of successful
continual learning algorithms from previous literature investigated in continual learning contexts
different from the cross-lingual context, including (a) model-expansion (Pfeiffer et al., 2020b),
(b) regularization (Kirkpatrick et al., 2017), (c) memory replay (Chaudhry et al., 2019b), and (d)
distillation-based approaches (Hinton et al., 2015; Aguilar et al., 2020). Our findings confirm the
need for a multi-hop analysis and the effectiveness of continual learning algorithms in enhancing
knowledge preservation and accumulation of our multilingual language model. We additionally
demonstrate the robustness of different continual learning approaches to variations in individual
data setup choices that would be misleading if presented traditionally.
58
Our main contributions1
are:
(1) We are the first to explore and analyze cross-lingual continual fine-tuning2
across multiple
hops and show the importance of this multi-hop analysis in reaching clearer conclusions with
greater confidence compared to conventional cross-lingual transfer learning (Section 4.4.1).
(2) We demonstrate the aggregated effectiveness of a range of different continual learning approaches (Figure 4.1) at reducing forgetting and improving transfer (Section 4.4.3) compared
to multilingual sequential baselines (Section 4.4.2).
(3) We make concrete recommendations on model design to balance transfer and final model
performance with forgetting (Section 4.4.3).
(4) We show that the order of languages and data set size impacts the knowledge preservation
and accumulation of multi-lingual sequential fine-tuning and identify the continual learning
approaches that are most robust to this variation (Section 4.4.4).
(5) We analyze zero-shot generalization trends and their correlation with forgetting and show that
current continual learning approaches do not substantially improve the generalization (Section 4.4.5).
4.1 Cross-lingual Continual Learning
In this section, we formally define cross-lingual continual learning, describe its goals and challenges,
and introduce the downstream tasks, datastreams, and evaluation protocols used. Although we are
not the first to define or investigate continual learning for languages, we are, to the best of our
knowledge, the first to define and study cross-lingual continual learning where continual learning is
focused on languages only. Thus, we formally define cross-lingual continual learning as learning
over a set of languages seen sequentially in multiple hops, which is truer to the terms of cross-lingual
and continual learning, respectively. We distinguish that from ‘cross-lingual cross-task cross-stage
continual learning’, which continually learns over a set of pretraining and downstream tasks sampled
from different languages (Liu et al., 2021) and ‘cross-lingual one-hop transfer learning’ (Garcia
et al., 2021).
4.1.1 Problem Formulation
We define cross-lingual continual learning as the problem of sequentially fine-tuning a model
for a particular downstream task K over a cross-lingual datastream. In this case, a cross-lingual
data stream is made of labeled and distinct datasets D1··· , each one sampled from a distinct
language and consisting of separate train and test portions. Let ℎ be the stage in cross-lingual
1This work was published as a long paper in the Proceedings of the 61st Annual Meeting of the Association for
Computational Linguistics 2023 (M’hamdi et al., 2023).
2Our code is available at https://github.com/meryemmhamdi1/x-continuous-learning.
59
continual learning where
is optimized to +1 via exposure to D
. Let L = {ℓ1, ℓ2 · · · ℓ } be a set
of labeled languages, let (L ) be the set of all permutations of L , and without loss of generality
let ∈ (L ) be one such permutation and [] ∈ L be the th language in . The language of
D
is []. Therefore, by default, the number of languages used is equal to the number of datasets.
Let D< and D> refer to a sequence of datasets (train or test portions, depending on context) used
in hops from 1 to − 2 and to − 1, respectively; we generalize these terms to D≤ and D≥ by
including hop − 1 as well at the end or, respectively, beginning of the sequence.
4.1.2 Goals
We define the goals,3 necessarily dependent on each other, for our study of cross-lingual continual
learning as follows (also depicted in Figure 4.1):
• Cross-lingual preservation. This is the ability to retain previous knowledge on seen languages.
• Cross-lingual accumulation. This is the ability to accumulate knowledge learned from
previous languages to benefit learning on the current language.
• Cross-lingual generalization. This is the ability to generalize uniformly well to unseen
languages, which goes beyond accumulating knowledge up to the current languages.
• Model utility. This is the ability of the fully trained model to perform equally well on all
languages.
In this Chapter, we wish to understand the relationships between these goals. Our aim is to come
up with a recipe for more systematic cross-lingual continual learning. Thus, we need to understand
if the goals are aligned with each other or if maximizing some goals leads to minimizing other
goals.
4.1.3 Challenges
Learning sequentially from a non-stationary data distribution (i.e., task datasets coming from
different languages) can impose considerable challenges on the goals defined earlier:
• Catastrophic forgetting. This happens when fine-tuning a model on D≥
leads to a decrease
in the performance on D<
.
• Negative transfer. This happens when fine-tuning a model up to D≤
leads to a lower
performance on D
than training on it alone.
• Low zero-shot transfer. This happens when fine-tuning on D≤ gives a lower performance
than random on unseen D>
.
• Low final performance. This is when fine-tuning on all D≤ gives an uneven performance
between languages when tested on D≤ at the end of training.
3To the best of our knowledge, those goals were never synthesized for the context of cross-lingual continual learning.
60
4.1.4 Downstream Tasks and Datastreams
Here, we describe the downstream tasks and multi-lingual sequential datastreams used.
Downstream Tasks. We choose task-oriented dialogue parsing as a use case and consider the
multi-lingual task-oriented parsing (MTOP) benchmark (Li et al., 2021). Task-oriented dialogue
parsing provides a rich testbed for analysis, as it encompasses two subtasks: intent classification
and slot filling, thus allowing us to test different task capabilities in cross-lingual continual learning.
We use the same base model defined in Section 2.2.
Datastream Construction. For a set of languages L , our study considers a permutation subset
⊂ (L ) with the following properties:
• | | = |L | = , i.e. consists of permutations, each of which is a sequence of datasets
in each of the languages in L .
• ∀ℓ ∈ L , ∀ ∈ 1 . . . , there exists some ∈ such that [] = ℓ.
• H2L ∈ , the permutation from most high-resource to most low-resource fine-tuning data sets,
based on the training split dataset size.
• L2H ∈ , the reverse of H2L.
In our experiments, we use MTOP (Li et al., 2021), which is a multi-lingual task-oriented
dialogue dataset that covers six typologically diverse languages and spans over 11 domains and 117
intents. We chose MTOP4
since it is the largest scale dataset available for task-oriented dialogue,
and because it covers languages that have varying amounts of data resources available. We use only
the flat representation of slots (without nesting) to simplify our evaluation. We use the original data
for most experiments. Table 4.1 shows a summary of the number of sentences (dialogue utterances)
per language and split. The list of language permutations used is outlined in Table 4.2.
4.1.5 Evaluation Protocols
For each language permutation, we train on each dataset in sequence but continually evaluate
on all languages. Let be some success metric for evaluating a downstream task and ,≤
be the evaluation on the test set for language ℓ fine-tuning on D≤
. We define the following
meta-metrics (which are inspired, but slightly different from the metrics in Lopez-Paz and Ranzato
(2017) and Chaudhry et al. (2019a)):
4The MTOP dataset has been released by Facebook (Li et al., 2021) under Creative Commons Attribution-ShareAlike
4.0 International Public License.
61
Lang ISO Train Dev Test
English EN 15,667 2,235 4,386
German DE 13,424 1,815 3,549
French FR 11,814 1,577 3,193
Hindi HI 11,330 2,012 2,789
Spanish ES 10,934 1,527 2,998
Thai TH 10,759 1,671 2,765
Table 4.1: Number of sentences in MTOP per language and split.
Order 1 Order 2 Order 3 Order 4 Order 5 Order 6
English Thai Spanish French Hindi German
German Spanish Hindi Thai English French
French Hindi English German Spanish Thai
Hindi French German English Thai Spanish
Spanish German Thai Hindi French English
Thai English French Spanish German Hindi
Table 4.2: Simulated language permutations.
• Forgetting (F ↓). This is the average forgetting over all datasets (excluding the first dataset)
computed as:
=
1
− 1
∑︁
=2
≤
,
≤ =
1
− 1
∑︁
−1
=1
,≤
,
(4.1)
where ≤
is the average forgetting that occurred at the point of training D
. We compute
,≤ = max∈[1,−1] ,≤ − ,≤
. ,≤
is the degree to which performance on D has suffered
by continuing to train on D≤
instead of stopping before covering D
.
• Transfer (T ↑). This is the average forward transfer computed as:
=
1
− 1
∑︁
=2
,
= ,≤ −
,
(4.2)
62
where denotes evaluation of a model fine-tuned only on D
. Then,
is the incremental
impact of sequential training on datasets prior to seeing D
. To measure generalization to
new languages, we add a zero-shot transfer (T0
↑) metric measured as:
0 =
1
− 1
∑︁
=2
0
,
0
=
1
− 1
∑︁
−1
=1
,≤ −
0
,
(4.3)
where
0
is the average performance of a model on the forward transfer to a language ℓ
after training on D< compared to the random performance
0
before even fine-tuning on any
language (i.e., using fixed pre-trained M-BERT weights and randomly initialized weights for
the output layer).
• Final performance (FP ↑). This is the average performance after training on all datasets in
the studied stream, computed as:
=
1
∑︁
=1
,≤ . (4.4)
4.2 Methods
In this Section, we define the baselines and non-continual learning reference models ((Section 4.2.1)
and continual learning algorithms (Section 4.2.2).
4.2.1 Baseline & Reference Models
Before delving into continual learning approaches, we consider a simple lower-bound baseline. In
addition to that, we design reference models trained from scratch for each new language, either in a
joint manner or in a sequential multi-hop manner. Those are upper-bound non-continual learning
models that are used to assess the performance of different models trained with continual learning
methodologies. Those reference models can be, in general, superior to continual learning models
but can also be less efficient and not feasible. For a fair comparison, all models use the same base
model architecture and its loss with no further additions or special optimizations to the architecture.
Lower-bound Baseline. This consists of naive sequential fine-tuning (Naive Seq FT), which
sequentially fine-tunes with no continual learning.
Non-continual Learning Upper-bound Models. These are stronger upper-bound models used as
reference points of performance. However, they are either not efficient or prohibitive in the context
of cross-lingual continual learning. Some of them require training from scratch for each language,
63
which is not efficient. Others require having access to all languages either at the same time or
incrementally. Having such access can be restrictive due to privacy or storage efficiency concerns.
• Language-specific fine-tuning (Lang-Spec FT). This trains independent models on the data
set for each language ℓ using only D
.
• Multi-lingual learning (multi). This trains one single model jointly across all data sets D≤ .
• Incremental joint learning (Inc Joint). This incrementally trains adding the data set for each
language in the stream. This consists of the following hops: 1) D≤1, 2) D≤2, · · · , and N-1)
D≤−1. This is the only sequential reference model.
4.2.2 Continual Learning Approaches
To continually fine-tune on different languages, we establish a representative set of strong approaches
spanning the following categories inspired by previous evaluation paradigms such as Jin et al. (2022)
lifelong language model domain-incremental pertaining. To the best of our knowledge, we are the
first to exhaustively investigate such approaches for the context of cross-lingual continual learning,
whereas different approaches were investigated separately for different problem definitions.
Model Expansion. We consider the following approaches that add hop-specific parameters, as
shown in Figure 4.2. We expand on either the input (i.e., M-BERT representations) or the output side
(i.e., task-specific prediction heads). For the former (Lang-Spec Trans), the transformer layers are
replicated for each hop while sharing the prediction heads. To expand on the output side (Lang-Spec
Task), we use different prediction heads across hops but share the M-BERT layers. We additionally
consider Lang-Spec Enc[0-8], which trains M-BERT encoder layers ∈ 1 . . . 9 in a language-specific
manner while sharing the rest. We also separately add MAD-X adapters (Pfeiffer et al., 2020b). We
either fine-tune the adapter layers and freeze the rest of M-BERT (Lang-Spec Ada(F)) or tune them
both (Lang-Spec Ada(T)).
Model expansion methods, such as Lang-Spec Trans and Lang-Spec Enc[0-8], are fine-tuned
for each language with either an entirely or partially language-specific M-BERT (whole 12 layers in
addition to the embedding or just the top 8 layers in the case of Lang-Spec Trans and Lang-Spec
Enc[0-8] respectively). When fine-tuning them on a new language, the previously tuned parameters
on the old languages are retained unchanged while the rest of the parameters that are not languagespecific are fine-tuned. During the evaluation on a particular language, the tuned parameters for that
language are restored and used if the language has been seen in training. Otherwise, the parameters
initialized from M-BERT (before fine-tuning on any language) are used for zero-shot evaluation.
Adapters consist of downsampling layers followed by upsampling layers inserted between
layers of our Transformer encoder in addition to their invertible components. We do not add taskspecific adapters, as according to our ablation studies, they didn’t prove beneficial. We add adapter
components to every encoder layer following MAD-X configuration and using their pre-trained
weights.5 We either fine-tune the weights for the languages available in AdapterHub or train from
5obtained from AdapterHub (Pfeiffer et al., 2020a) https://adapterhub.ml/explore/text_lang/.
64
scratch for languages for which there are no pre-training adapter weights. At inference time, we use
adapter layers fine-tuned independently for each language in the datastream.
Figure 4.2: A comparison between different variants of model expansion for this problem setup:
either at the side of the input (Lang-Spec Trans), the output (Lang-Spec Task), or using adapters
(Lang-Spec Ada).
Regularization. We focus on elastic weight consolidation (EWC) (Kirkpatrick et al., 2017), which
mitigates catastrophic forgetting by reducing the changes in parameters that are deemed critical
to previously seen languages. We use the online version of EWC (EWC-Online) for efficiency.
To penalize changes in the parameters crucial to previous languages, we use EWC, which adds a
regularization term to the loss applied only after the first data set D
in the language stream is seen.
∀ ∈ 2 . . . , we compute the total loss as follows:
L
= L
+ L
, (4.5)
where L is the usual loss of the downstream task on the current data D and L is the
regularization term, and is a hyperparameter to control the regularization strength (which is fixed
to 20). For efficiency purposes, we use the online version of EWC (EWC-Online). Following that,
our regularization term is computed as, based on the formulation in van de Ven et al. (2022):
L
=
∑︁
=1
˜
(−1)
( −
)
2
, (4.6)
where are the parameters of the Transformer model in addition to the downstream prediction
heads, is the total number of parameters, and ˜
(−1)
is the Fisher information matrix on the
last language just before training on D
. This is computed as the running sum of the
ℎ diagonal
elements of the Fisher Information matrices of D
, for all ∈ 1 . . . ( − 1). ˜
()
= ˜
(−1)
+
and
˜1
=
1
. In practice,
is simply the gradients of all parameters flattened into one single matrix.
Memory Replay. We use experience replay (ER) (Chaudhry et al., 2019b), which alleviates
forgetting by maintaining a fixed-size memory equally balanced between the different languages
and regularly drawing examples from the memory to replay. After training for each D for all
65
∈ 1 . . . , we populate the memory with randomly sampled examples from D
. For each D for
all ∈ 2 . . . , after training for every = 100 mini-batches and optimizing for the current loss
separately, the model randomly samples an equal batch from the memory for each D such that
∈ 1 . . . ( − 1) and replays them using the current model checkpoint used for training on D
. We
retrieve an equal amount of memory items from each language and at each step and hop. The
loss from the current D and the loss on the memory on the D are interleaved as the replay on
the memory only happens every step. This prioritization of the current language helps make the
training more stable without over-fitting on the small memory from previous languages.
Distillation-based. On top of ER, we distill dark knowledge (Kariya, 2018) from previous model
checkpoints. We explore two variants: logit distillation (KD-Logit) (Hinton et al., 2015) and
representation distillation (KD-Rep) (Aguilar et al., 2020), which optimize the minimum squared
error loss on either the output logits or M-BERT representations between the current and previous
models. We use the same strategy explained in Section 4.2.2 to select the memory to be replayed
using a knowledge distillation loss. For each D for all ∈ 2 . . . , after training for every
= 100 mini-batches, we randomly sample an equal batch from the memory for each D such
that ∈ 1 . . . ( − 1). We also load the model checkpoints for each ℎ and use that model and
the memory for D
to compute either the intent and slot logits in the case of KD-Logit or the
multilingual representations of M-BERT in the case of KD-Rep. We do the same thing using the
current model checkpoint this time. Then, we use the minimum square error loss to minimize the
distance between the intent logits obtained using the previous and current model checkpoints and do
the same thing for slot logits for KD-Logit. Then, we take the same over intent and slot distillation
losses across different languages retrieved from the memory. The same is done for computing the
distillation loss over the multilingual representations in KD-Rep.
4.3 Experimental Setup
In this section, we provide implementation details for our evaluation of different baselines, reference
models, and continual learning models (Section 4.3.1). We also describe the bootstrap sampling
procedure used to compute the average and statistical significance results throughout our analysis
(Section 4.3.2).
4.3.1 Implementation Details
For all experiments, we use M-BERT(bert-base-multilingual-cased)6 with 12 layers as our pretrained Transformer model. We use the dev set to pick the hyperparameters of the optimizer to
be used. We perform a manual search for the most optimal learning rate over a range [1 − 4,
3 − 4, 1 − 5, 3 − 5] for Adam optimizer (Kingma and Ba, 2015) and finally based on the dev
6github.com/huggingface/transformers version 3.4.0 pre-trained on 104 languages, including all
languages evaluated on in this paper.
66
performance we fix the learning rate to 3 − 5 for all experiments for a fair comparison. We use
= 1 − 8, 1 = 0.9, 2 = 0.99, batch size of 16, = 0.1 for EWC Online, 6000 memory size for
ER and knowledge distillation. For all experiments, we run for a maximum of 10 epochs and pick
the best model based on dev data. We also fix a seed of 42 for the random initialization of Numpy,
random, and torch over all bootstrap experiments. For additional experiments using multiple seeds,
we fix three seeds. All experiments are run using the same computing infrastructure Pytorch version
1.7.1, using one Tesla P100 GPU of 16280 MiB of memory CUDA version 11.2.
The runtime and the number of parameters depend on the approach and the mode of training used,
as shown in Table 4.3. With the exception of model expansion and language-specific approaches, all
approaches have the same number of parameters coming from the sum of M-BERT and prediction
head parameters. Lang-Spec Trans has the highest number of parameters, which is six times more
than Naive Seq FT but only requires two times more runtime as only one 1
6
part of language-specific
M-BERT is updated at each hop for each whereas the rest is used in evaluation mode only. LangSpec Ada(F) has the smallest number of parameters which is around 24% and takes 2 times less
than the usual runtime of Naive Seq FT (while exhibiting lower forgetting and higher transfer than
Naive Seq FT, as shown in Table C.1). Memory replay and knowledge distillation approaches have
more runtime (slightly more than Lang-Spec Trans) as they store and handle memory and compute
the replay or distillation losses interleaved with the main loss, which makes them time-consuming.
What impacts the runtime of ER is much more than just iterating over a small sampled memory.
Its runtime depends not only on the size of the memory but also on the frequency of interleaving
happening at the fine-tuning schedule. After each k minibatch steps, we sample a minibatch from
the memory and fine-tune on it interleaved with the fine-tuning on the main minibatch. So, that
makes the runtime depend on k and not only the size of the memory. This makes its training more
time-consuming than if we had to sample only after each epoch with the same memory size.
4.3.2 Bootstrap Sampling & Statistical Significance
We run all experiments over one fixed seed of 42. We then use bootstrap sampling (Koehn, 2004) to
compute the mean and confidence intervals for each of the metrics described in Section 4.1.5 over a
single approach. For each language permutation, and for each ,≤
, representing some performance
metric on language ℓ after training on D≤
, we sample with replacement 600 sentences from
the testing data over 600 iterations. By using this number of iterations and sampling sentences,
we ensure and also double-check that all sentences in the test set are covered in the evaluation,
ensuring a uniform evaluation across approaches. Let be the list of results we get for each iteration
independently. Then, we compute the mean and standard deviation ¯ and () respectively and
the 95% confidence interval size using the following equation:
=
1.9639 × ()
√
600
,
() =
√︂Í
( − ¯)
2
600
.
(4.7)
67
Model Runtime # Param
Naive Seq FT 3h16min 178,081,402
Lang-Spec FT 5h02min 1,068,488,412
Inc Joint 1d22h51min 178,081,402
multi 16h45min 178,081,402
Lang-Spec Embed 7h46min 639,123,322
Lang-Spec Enc[0-2] 7h52min 284,399,482
Lang-Spec Enc[3-5] 7h12min 284,399,482
Lang-Spec Enc[6-8] 7h8min 284,399,482
Lang-Spec Enc[9-11] 7h20min 284,399,482
Lang-Spec Enc[0-8] 8h1min 497,035,642
Lang-Spec Trans 7h15min 1,067,348,602
Lang-Spec Enc[0-11] 7h53min 603,353,722
Lang-Spec Enc[0-5] 7h16min 390,717,562
Lang-Spec Enc[6-11] 7h10min 390,717,562
Lang-Spec Task 6h18min 179,221,212
Lang-Spec Ada(T) 4h34min 222,301,402
Lang-Spec Ada(F) 1h57min 44,447,962
EWC-Online 1d3h17min 178,081,402
ER 8h55min 178,081,402
KD-Logit 7h23min 178,081,402
KD-Rep 8h 178,081,402
Table 4.3: Runtime and parameters statistics.
This computes and for each language permutation separately. To aggregate this across
different language permutations, we simply take the average and the standard deviation.
To compute the statistical significance between different approaches, we use ANOVA and
perform a multiple pairwise comparisons analysis using Tukey’s honestly significant difference
(HSD) test7 over different language permutations for each metric.
4.4 Results & Analysis
In this section, we provide an extensive analysis of different ablation studies. We ask critical
analysis questions that revolve around the continual learning goals described in Section 4.1.2. For
Section 4.4.2, scores are reported using accuracy (Acc) and F1-score (F1) for intent classification and
slot filling, respectively. For the remaining sections, all results are reported for intent classification
only, slot filling results, for which the same trends are observed, can be found in Appendix C.1.
Bootstrap sampling (over test data shuffling) is used to compute the average and 95% confidence
intervals (averaged over all language permutations except for Section 4.4.4). We also separately
repeat key experiments over 3 different seeds and obtain similar findings, which can be found in
7We use bioinfokit library https://github.com/reneshbedre/bioinfokit.
68
Appendix C.2. We decided to report the results using bootstrap sampling since they have tighter
confidence intervals.
4.4.1 How is a Multi-Hop Analysis Different from its One-Hop Counterpart?
To motivate our cross-lingual continual learning evaluation paradigm, we start by investigating how
a multi-hop analysis is different from a conventional one-hop transfer learning analysis. Figure 4.3
shows a comparison between the two in terms of forgetting (Eq. 4.1) for different approaches
aggregated over different language permutations. More results for slot filling and other metrics can
be found in Figure C.4 in Appendix C.1.5.
Figure 4.3: Comparison between forgetting trends for intent classification using one-hop (crossed
boxplots on the left) and multi-hop analysis (dotted boxplots on the right), showing the variance
over different language permutations. One-hop analysis exhibits higher variance than its multi-hop
counterpart.
Lang-Spec Trans tends to have the least forgetting and Naive Seq FT the most, but importantly
the variance for the multi-hop analysis is much smaller than that for the one-hop analysis.
Having larger confidence intervals, the one-hop analysis also tends to be misleading in the sense
that certain models are depicted as having a good performance, while it is not truly the case. For
example, Naive Seq FT, according to the one-hop analysis, shows a range of forgetting from very
little (0.5) to a lot (2.0). So in some circumstances, it has little forgetting, thus a good performance
under the one-hop analysis. But according to the multi-hop analysis, it clearly has a lot of forgetting
with more confidence. Therefore, the multi-hop analysis leads to a more conclusive analysis. We
conjecture that averaging over more hops and balanced diversified datastreams is what leads to
narrower confidence intervals. This agrees with the well-known fact that larger sample sizes lead to
narrower confidence intervals (Hazra, 2017).
69
4.4.2 Can a Multi-lingual Language Model Learn to Preserve and Accumulate
Knowledge across Different Languages?
Given the conclusiveness of the multi-hop analysis in Section 4.4.1, we follow that type of analysis
thereafter. In this section, we investigate how well the baseline and different non-continual learning
reference models learn to preserve and accumulate knowledge across different languages by looking
at the average over language permutations. Since not all reference models are sequential, we start
by comparing them to the baseline using their final performances (Eq. 4.4). The final performance
is indicative of how well a single final model can encapsulate the knowledge across all languages at
the end of training. From Table 4.4, we notice that Naive Seq FT and multi have the worst and best
final performances, respectively. This suggests that a multilingual joint model is more beneficial
than sequential models. In practical scenarios, however, we may not have access to all languages
at the same time. Among non-continual learning reference models, Inc Joint is closest to multi
if all data may be preserved. However, this may also not be the case. In that case, Inc Joint is
nearly as good. Training incrementally and sequentially (Inc Joint) is also more beneficial than
fine-tuning on just the language of interest (Lang-Spec FT), as the former exploits cross-lingual
transfer capabilities.
Model Intent Class (Acc) Slot Filling (F1)
Naive Seq FT 91.06 ±1.08 69.37 ±1.06
Lang-Spec FT 93.40 ±0.08 73.90 ±0.83
Inc Joint 94.16 ±0.18 74.88 ±0.38
multi 94.25 ±0.07 76.34 ±0.82
Table 4.4: The average final performance across different language permutations for the baseline
compared to reference models. We highlight the best scores in bold and underline the second best
across models.
We focus, thereafter, on Inc Joint8
and compare its forgetting (Eq. 4.1) and transfer (Eq. 4.2)
trends to the baseline Naive Seq FT, as shown in Table 4.5. Inc Joint exhibits significantly less
forgetting, which also causes its final performance to be higher than Naive Seq FT. This suggests
that recalling previously used training data is helpful in knowledge preservation. However, Naive
Seq FT seems to slightly outperform Inc Joint in terms of transfer. This difference is not statistically
significant.9 We hypothesize that this could be due to exposing Inc Joint to all resources from
previously seen languages, so it is likely that the data distribution between all these languages may
distract the model from learning on the new one.
8We do not use multi since it is non-sequential. Metrics like forgetting are thus always zero, which makes this model
not comparable with other continual learning approaches and sequential reference models.
9We report the p-values from pairwise Tukey’s HSD analysis to gain a reliable unified view that individual t-tests
may fail to convey. More explanation can be found in Appendix 4.3.2.
70
Model
Intent Class (Acc) Slot Filling(F1)
F ↓ T ↑ F ↓ T ↑
Naive Seq FT 2.93 ±1.24 0.68 ±0.14 5.67 ±0.93 1.37 ±0.53
Inc Joint 0.11 ±0.10 0.52 ±0.19 0.91 ±0.34 0.83 ±0.77
Table 4.5: Forgetting (F) and transfer (T) performance averaged across different language permutations for sequential baseline and reference models. We highlight the best models in bold for each
subtask and metric.
4.4.3 Is Continual Learning Effective in Boosting Knowledge Preservation,
Accumulation, and Model Utility?
To study the effectiveness of continual learning approaches, we compare them to the baseline
using the average over language permutations. We show, in Figures 4.4(a) and 4.4(c), the final
performances (Eq. 4.4) and transfer (Eq. 4.2) of different approaches, respectively, versus their
negative forgetting (Eq. 4.1). In general, we observe that continual learning approaches mitigate
forgetting and improve final performance. They also improve transfer, to some degree, though gains
are mostly not significant compared to Naive Seq FT(Appendix 4.3.2).
From Figure 4.4(a), we notice that model expansion approaches10(Lang-Spec Trans and LangSpec Enc[0-8] described previously) are good at mitigating forgetting and improving the final
performance while Lang-Spec Task is not. M-BERT, when trained in a language-specific manner, is
responsible for encapsulating the cross-lingual representations necessary for enabling knowledge
preservation, whereas changes to the downstream task-specific layers do not make much of a
difference. This implies that in cross-lingual continual learning more attention should be paid to
how to train those representations in a language-specific manner efficiently. Lang-Spec Ada(T) is
one way to do this more efficiently, but its performance still lags behind other model expansion
approaches. ER achieves a performance close to Lang-Spec Trans and Lang-Spec Enc[0-8] and this
suggests that using a portion of the memory is beneficial.
11
In the baseline approach, which suffers from the highest forgetting, we also notice the lowest
final performance and transfer in Figures 4.4(a) and 4.4(c). As continual learning approaches reduce
forgetting, they also improve the final performance, and some of them also improve transfer, but not
to the same degree. This suggests that the lower the forgetting a model can achieve, the easier
it gets for it to learn a stronger final model. However, there is no direct correlation between
forgetting and transfer. For example, Lang-Spec Trans is the best model in reducing forgetting but
also the worst in terms of transfer. This could be due to the fact that Lang-Spec Trans exhibits
similar behavior to Lang-Spec FT; thus, the transfer of a model, which is the difference between the
performance of that model and that of Lang-Spec FT, is almost null. On the other hand, although
Lang-Spec Ada(F) has the highest transfer, it has the lowest final performance and is close to average
forgetting. Although the adapter will not be updated anymore after the model has been fine-tuned
10We include a full analysis of the expansion over several subsets of M-BERT components in Appendix C.1.2.
11An ablation study using different sizes of the memory is shown in Appendix C.1.6. It shows that even smaller sizes
up to 5% are still beneficial. We report here the highest memory size as it leads to the best results.
71
on, we think that the forgetting could be due to the shared task-specific layer leading to a forgetting
closer to Lang-Spec Trans more than Lang-Spec Ada(T), which also shares M-BERT and tunes it.
We show in Figure 4.4(b) that there is no direct correlation between final performance and transfer.
This posits that all three metrics need to be studied independently for a more insightful analysis.
Figure 4.4: Correlations between different pairs of metrics: (a) Final performance versus negative
forgetting for the task of intent classification. The lower the forgetting, the higher the final performance. (b) Final performance versus transfer for the task of intent classification. As hypothesized,
there is no direct correlation between final performance and transfer. (c) Transfer versus negative
forgetting for intent classification task. In general, there is no direct correlation between transfer and
forgetting. (d) Zero-shot generalization versus negative forgetting for intent classification. Model
expansion approaches are highlighted in shades of green. We zoom over the rest of the models in
the main graph and show an overview of all approaches in the lower right corner subplot. Mitigating
forgetting leads to higher generalization, with the exception of multi-headed models highlighted in
green.
4.4.4 Which Permutations Impose More Challenges on Knowledge Preservation,
Accumulation, and Model Utility?
So far, our analysis has focused on the average over language permutations, but are the same patterns
observed for different language permutations? To shed light on this, we analyze the performance
of different continual learning algorithms and the baseline in terms of their forgetting (Eq. 4.1),
transfer (Eq. 4.2), and final performance (Eq. 4.4) over H2L and L2H permutations, in Table 4.6.
12
12Full results for slot filling, more language permutations, and a balanced version of data can be found in Appendix C.1.3.
72
In general, we observe that it is more challenging to learn from low to high resource languages.
However, model expansion and memory replay approaches reduce forgetting and final performance
gaps between language permutations. We hypothesize that L2H being more challenging than H2L
could be due to the fine-tuning data size that is different between languages.
Model
F ↓ T ↑ FP ↑
H2L L2H H2L L2H H2L L2H
Naive Seq FT 1.52 5.52 0.93 0.57 92.06 88.80
Lang-Spec Trans 0.40 0.62 0.59 0.03 93.86 93.37
Lang-Spec Enc[0-8] 0.60 1.05 1.00 0.63 93.75 93.15
Lang-Spec Task 1.53 5.53 0.84 0.38 91.93 87.68
Lang-Spec Ada(T) 1.18 4.43 1.29 0.79 92.36 88.66
Lang-Spec Ada(F) 0.84 1.87 3.41 2.43 91.08 89.92
EWC 1.82 5.90 0.74 0.48 91.16 88.28
ER 0.71 2.35 0.95 0.78 93.51 92.58
KD-Logit 1.42 4.07 0.77 0.51 91.60 89.65
KD-Rep 1.49 4.00 0.96 0.53 91.64 90.17
Table 4.6: Comparison of intent classification for two language permutations. We highlight in
bold the best forgetting (F), highest transfer (T), and final performance (FP) of accuracy scores
among H2L and L2H, whereas the best and second best scores across approaches for H2L and L2H
separately are underlined and italicized, respectively. We report mean performances for each metric
and language order. All 95% confidence intervals range from ± 0.01 to ± 0.04.
To verify this hypothesis, we dig deeper to check if the differences among fine-tuning data sizes
between languages are the main factor by performing an ablation study on that. Therefore, we
use the same amount of fine-tuning and evaluation resources for each language (9,219 for train,
1,285 for dev, and 2,299 for test splits) and report the results on Naive Seq FT in Table 4.7. We
notice that there is still a gap between these two language permutations for forgetting and final
performance. This suggests that the difference in fine-tuning data size is not what accounts for the
differences between the two language permutations. There are perhaps biases in the pre-training or
other linguistic artifacts that need to be studied in future work.
Model
F ↓ T ↑ FP ↑
H2L L2H H2L L2H H2L L2H
Original Data 1.52 5.52 0.93 0.57 92.06 88.80
Balanced Data 1.25 5.81 0.89 0.75 89.33 85.81
Table 4.7: Performance on intent classification comparison between two versions of the data:
original data version and balanced data for Naive Seq FT across the same permutations as Table 4.6.
We bold the best among H2L and L2H for each metric.
73
4.4.5 How do Continual Learning Models Generalize to Unseen Languages?
To analyze the zero-shot transfer to languages unseen during fine-tuning, we plot the performance
of zero-shot transfer (Eq. 4.3) as a function of negative forgetting over the average of different
language permutations to investigate any relationships between generalization and preservation. In
Figure 4.4(d), we infer that most continual learning approaches do not substantially improve
generalization compared to Naive Seq FT. In particular, model expansion approaches (in red)
hurt generalization even if they significantly reduce forgetting. This zero-shot transfer versus
interference trade-off is referred to as the stability-plasticity dilemma (Carpenter and Grossberg,
1988; Hadsell et al., 2020; Wolczyk et al., 2021), where the weights responsible for improving
on new tasks are often responsible for forgetting previous tasks. Except for model expansion
approaches, we notice that approaches that reduce forgetting also improve generalization compared
to Naive Seq FT. Better approaches to balance the two can be investigated in future work.
4.5 Conclusion
In this Chapter, we formulate the cross-lingual continual learning problem setup. We show that naive
sequential fine-tuning is prone to catastrophic forgetting and has poor accumulation capabilities
sensitive to different language permutations. We provide the first benchmark to compare the
effectiveness of different continual learning algorithms for the cross-lingual case. We show that
continual learning models improve cross-lingual knowledge preservation, which also contributes to
improving final model performance and, to a lesser degree, accumulation and generalization. We
also discuss the challenges of sequential training for certain language permutations. We hope that
this study will encourage more analyses in the same spirit to cover more benchmarks and datastream
setups to gain more insights that go beyond conventional cross-lingual transfer learning.
74
Chapter 5
Cross-lingual Human Algorithms
In Chapter 4, we have established that memory-based approaches are more robust and efficient
than other approaches in taming the plasticity stability dilemma in cross-lingual continual learning.
This describes the phenomenon where it is inherently hard to preserve acquired knowledge from
previously seen languages (stability) while adapting to new languages (plasticity). Experience
replay (ER) (Chaudhry et al., 2019b) is a cognitively inspired memory-based approach that reinforces previously seen experiences similar to the process of memory consolidation in biological
systems (Isele and Cosgun, 2018). As more languages are incorporated into the datastream, fitting
examples from new languages into a fixed-size memory buffer becomes more challenging. This
invites a critical question: How can we dynamically come up with informative memory examples to
keep for each language?
1
5
1 3
1 1 1 1 2 2 3 1 2 2 3 3
Phase 1: English Phase 2: German Phase 3: Hindi Phase 4: Thai
Main Data
Items
1
5 5 5 5 5
3
2 2 3 3
4
5
2 1
4
1 1 2
4 4 4 5 5
3 3
5
1
3 3 Memory
Items
1
4
1 1 2
4 4 4 5 5
2 3
5
1
Memory
Population
Memory
Population
Memory
Population Figure 5.1: An overview of Leitner-guided memory replay for multi-phase cross-lingual continual
learning: an extension from the cross-lingual continual learning paradigm illustrated in Figure 4.1.
On top of a cross-lingual datastream, we build a skill rating system to continually guide the memory
population and update. Skill ratings are scores from 1 to 5 obtained from Leitner queues; a higher
score reflects greater learnability. At the end of each phase, the skill ratings on the main data items
from the phase language are used to choose what goes in the memory, and the skill ratings of data
items already in the memory are re-evaluated to determine if they can remain.
75
In this Chapter, we propose a human-inspired approach for learning what to replay at each phase
of cross-lingual continual learning. We hypothesize that in such a setup, in the beginning, most data
is difficult, but as training progresses, some data becomes well-learned and informative. We surmise
that reducing the forgetting of previously learned examples requires using a strategy of alternately
learning new difficult examples along with reinforcement of well-learned examples. To design
cross-lingual memory, we leverage Leitner queues, a cognitive technique that has been used for
strategically planning what to review in humans (Leitner, 1974; Reddy et al., 2016) and for determining informative and spurious data in self-training non-continual learning applications (Amiri et al.,
2018; Amiri, 2019). Our Leitner-guided memory sampling policy is a dynamic language-agnostic
skill rating system that selects candidates for inclusion into memory according to how well they are
learned (Figure 5.1). We analyze memory design attributes that contribute to reducing cross-lingual
continual learning forgetting and evaluate on typologically diverse benchmarks ranging in difficulty.
We summarize our main contributions1
as follows:
1. We are the first to formalize a human-inspired solution based on Leitner queues to guide
cross-lingual memory replay (Section 5.1.3).
2. We show that our Leitner-inspired approach for selecting memory replay items reduces
forgetting without sacrificing transfer learning gains (Section 5.3.1).
3. We provide a fine-grained analysis over different language orders and languages showing that
our approach is consistently beneficial (Section 5.3.2).
4. We provide a qualitative analysis that investigates the usefulness of data as a function of its
learnability (Section 5.3.3).
5.1 Methodology
In this section, we start by describing our ER approach adapted to cross-lingual continual learning
(Section 5.1.1). Then, we explain the mechanism for determining skill ratings based on Leitner
queues (Section 5.1.2). After that, we explain how we use this Leitner-based skill rating system to
guide memory storage and update in cross-lingual ER (Section 5.1.3).
5.1.1 Cross-lingual Experience Replay
We follow the same setup for cross-lingual continual learning presented in Section 4.1.1. We describe it
here again for a smooth transition to our adaptation of ER used in this Chapter, which is slightly
different from Section 4.2.2. The learning process consists of sequentially fine-tuning a model on a
cross-lingual datastream in multiple phases. A cross-lingual datastream D_{1···N} is a set of N distinct
labeled datasets sampled from different languages one at a time. Each dataset D_i is drawn from a
single distinct language ℓ_i ∈ L = {ℓ_1, ℓ_2, · · · , ℓ_N}. Each phase P_i ∈ P_{1···N} is a stage in cross-lingual
continual learning where the model gets fine-tuned on the dataset D_i for several epochs. The ER
approach is implemented as follows: at the end of each phase (except the last one) P_i ∈ P_{1···N−1},
we choose some data from D_i to add to a memory buffer M of fixed size |M|. In later phases P_j
after P_i, we replay from M, which contains memory data drawn from D_{<j}, interleaved with the
main loss on data drawn from D_j.
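For concreteness, the following Python sketch illustrates this phase-by-phase schedule; the batching, the replay frequency, and the train_on_batch and select_memory hooks are illustrative placeholders rather than our exact training configuration.

import random
from typing import Callable, Dict, List

def cross_lingual_er(
    datastream: List[Dict],        # one entry per phase: {"language": str, "data": list}
    memory_size: int,
    train_on_batch: Callable[[list], None],            # placeholder: one gradient step on a batch
    select_memory: Callable[[list, list, int], list],  # placeholder: memory population policy
    epochs: int = 10,
    batch_size: int = 16,
) -> list:
    # Sequentially fine-tune on a cross-lingual datastream with experience replay.
    memory: list = []
    for phase_idx, phase in enumerate(datastream):
        data = list(phase["data"])
        for _ in range(epochs):
            random.shuffle(data)
            for start in range(0, len(data), batch_size):
                train_on_batch(data[start:start + batch_size])   # main loss on the current language
                if memory:                                        # interleave a replay batch from M
                    train_on_batch(random.sample(memory, min(batch_size, len(memory))))
        if phase_idx < len(datastream) - 1:
            # End of phase: decide what enters (and stays in) the fixed-size memory buffer.
            memory = select_memory(data, memory, memory_size)
    return memory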
5.1.2 Leitner-based Skill Rating System
We draw inspiration from Leitner queues2
(Leitner, 1974), a method of prioritization originally
conceived of as a strategy for human memorization and later used in machine learning applications (Amiri et al., 2018). The key prioritization insight we leverage is that of demonstrated mastery.
That is, items in a (training) data set may be rated by the degree to which they have been mastered
by the learner. We instantiate this by associating a rating r(d) to each training data item d, and
changing r(d) based on a model θ's ability to correctly classify d during training. Let [a, b] be the
acceptable rating range, let r_θ′(d) be the rating for d according to some previous model θ′, and let
c_θ(d) ∈ {−1, 1} indicate that model θ classified d {incorrectly, correctly}, respectively. Then, the
rating for a data item d is computed as:

r_θ(d) = max(min(r_θ′(d) + c_θ(d), b), a) (5.1)

Thus, r is raised when d is correctly classified and lowered when it is misclassified, subject to the
acceptable range. In this work, we set [a, b] = [1, 5], following established practice (Reddy et al.,
2016; Amiri et al., 2018).
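Equation 5.1 amounts to a clipped increment; the minimal Python sketch below makes the update explicit (the names mirror the rating r, the correctness signal c, and the range [a, b] above, and are otherwise arbitrary).

def update_rating(prev_rating: int, correct: bool, lo: int = 1, hi: int = 5) -> int:
    # Leitner-style skill rating update (Equation 5.1): promote on a correct prediction,
    # demote on a mistake, and clip the result to the acceptable range [lo, hi].
    delta = 1 if correct else -1
    return max(min(prev_rating + delta, hi), lo)

# Example trajectory: an item starting at the minimum rating, classified correctly twice
# and then incorrectly once, ends with a rating of 2.
rating = 1
for correct in (True, True, False):
    rating = update_rating(rating, correct)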
5.1.3 Leitner-Guided Cross-lingual Experience Replay (LER)
We explore the use of r_θ(d) to determine whether or not to include d in M. At the start of phase P_i,
by convention, for all d ∈ D_i ∪ M, we set r_∅(d), the initial rating, to a. At the end of each epoch
within the phase, we update r for each data item in D_i and M according to the model θ at that
point in training. At the end of P_i, we use the r values to form the new M, selecting |M|/i items from
D_i and |M| − |M|/i items from the current M according to one of two strategies:
• LER (Easy): Highest-rated items are prioritized.
• LER (Hard): Lowest-rated items are prioritized.
Our approach of selecting an amount of data from D_i inversely proportional to i enables the fixed
and limited M to contain an even distribution of samples from all the datasets D_{≤i} seen thus far,
modulated by the relative learning difficulty of different phase datasets.
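A sketch of this end-of-phase memory update is shown below; it assumes the per-item skill ratings are available as (item, rating) pairs and that phases are numbered from 1, so that the buffer keeps roughly |M|/i items per language seen so far. The data layout is an assumption rather than an exact description of our implementation.

from operator import itemgetter

def populate_memory(phase_data, current_memory, memory_size, phase_idx, strategy="easy"):
    # End-of-phase LER memory update. phase_data and current_memory are lists of
    # (item, rating) pairs; phase_idx is the 1-based index of the phase that just ended.
    # LER (Easy) prioritizes the highest-rated items; LER (Hard) the lowest-rated ones.
    highest_first = strategy == "easy"
    n_new = min(memory_size // phase_idx, len(phase_data))
    new_items = sorted(phase_data, key=itemgetter(1), reverse=highest_first)[:n_new]
    kept_items = sorted(current_memory, key=itemgetter(1), reverse=highest_first)[:memory_size - n_new]
    return new_items + kept_items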
2The original spaced repetition strategy is formulated in Algorithm 3.
5.2 Experimental Setup
We start by presenting the different baselines and model variants used to compare between different
experimental scenarios (Section 5.2.1). We then describe the benchmark datasets and their base
models (Section 5.2.2) along with the multilingual datastreams (Section 5.2.3) that we focus on in
this evaluation. More implementation details, such as the hyper-parameters, number of parameters
used, and runtime for different models, can be found in Appendix D.
5.2.1 Baselines & Model Variants
Baselines Before delving into different variants of Leitner-guided memory replay, we consider
the following baselines:
• No ER. This is our lower-bound naive sequential fine-tuning baseline. This sequentially
fine-tunes on datasets sampled from one language at a time, D_i ∈ D_{1···N}, without using any
experience replay.
• Balanced. This is an experience replay approach adapted from Lopez-Paz and Ranzato (2017),
which allocates equally sized buffers balanced across languages. At the end of each phase P_i,
|M|/(N − 1) examples are randomly picked from D_i and added to M.
• Random. This is a more realistic experience replay approach, adapted from Riemer et al.
(2019), which randomly samples and updates the |M| memory items from all the data D_{≤i} seen
so far at the end of each phase P_i.
Other techniques have been proposed to produce memory exemplars such as K-Means clustering (Chaudhry et al., 2019b), mean of features (Rebuffi et al., 2017), and prototypical networks (Ho
et al., 2023). However, we don't explore those approaches since they don't lead to clear improvements over Random (reservoir sampling) (Chaudhry et al., 2019b).
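For reference, the Random baseline corresponds to standard reservoir sampling over the stream of examples seen so far; the sketch below shows that uniform-sampling behavior and is not tied to our exact implementation.

import random

def reservoir_update(memory, memory_size, new_examples, n_seen):
    # Reservoir sampling: maintain a uniform random sample of size memory_size over all
    # examples seen so far. n_seen counts the examples seen before this call.
    for example in new_examples:
        n_seen += 1
        if len(memory) < memory_size:
            memory.append(example)
        else:
            j = random.randrange(n_seen)   # uniform index over all n_seen examples
            if j < memory_size:
                memory[j] = example
    return memory, n_seen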
Model Variants We design the following model variants on top of LER. The research question
we analyze here is: Does dynamically prioritizing easy elements help mitigate forgetting more
than hard elements or vice versa? Our analysis evaluates the aggregated effectiveness of different
strategies used for memory construction. This consists of LER (Easy) and LER (Hard), which use
easy and hard examples to fill and update the memory, respectively.
5.2.2 Benchmarks & Base Models
We conduct experiments on two tasks commonly used in natural language understanding literature,
covering different typologically diverse languages and requiring different levels of reasoning:
multilingual task-oriented dialog (MTOD) and multilingual question answering (MQA). For MTOD
evaluation, we use two multilingual task-oriented dialog datasets: MTOP (Li et al., 2021) and
MultiATIS++ (Xu et al., 2020). While MultiATIS++ covers 18 intents and 84 slots on average per
language from one domain, MTOP covers 117 intents and 78 slots from 11 domains. We choose
MTOP and MultiATIS++ since they are among the large-scale datasets available for task-oriented
dialog covering typologically diverse languages. To ensure a challenging and trustworthy evaluation
for MQA, we choose TyDiQA (Clark et al., 2020), which is a translation-free, realistic information-seeking benchmark. For MTOD and MQA, we use the model architectures described in Section 2.2.
Table 5.1 shows the statistics per language and split for MTOP, MultiATIS++, and TyDiQA datasets.
Dataset Language Train Dev Test
MTOP
English 15,667 2235 4386
German 13,424 1815 3549
Hindi 11,330 2012 2789
Thai 10,759 1671 2765
MultiATIS++
English 4488 490 893
French 4488 490 893
Chinese 4488 490 893
Turkish 578 60 715
TyDiQA
Indonesian 5131 571 565
Russian 5841 649 812
Swahili 2479 276 499
Telugu 5006 557 669
Table 5.1: Statistics of MTOP, MultiATIS++, and TyDiQA per language and split.
5.2.3 Datastreams
We design a balanced set of distinct language permutations, following the cross-lingual continual
learning evaluation paradigm established in Section 4.1.4. Formally, for a given set of N = 4
languages, we sample a subset of the language permutations of L, where each language
appears exactly once in each permutation. Table 5.2 shows the language permutations we consider
for different downstream benchmarks.
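One simple way to construct such a balanced design, in which every language also appears exactly once in every position (as is the case for the permutations in Table 5.2), is to permute the rows, columns, and labels of a cyclic Latin square. The Python sketch below illustrates this construction; it is not necessarily the exact procedure used to produce Table 5.2.

import random

def balanced_language_orders(languages):
    # Build one permutation per language such that each language appears exactly once
    # in every position, by shuffling the rows, columns, and labels of a cyclic Latin square.
    n = len(languages)
    base = [[(r + c) % n for c in range(n)] for r in range(n)]   # cyclic Latin square
    rows = random.sample(range(n), n)
    cols = random.sample(range(n), n)
    labels = random.sample(languages, n)
    return [[labels[base[r][c]] for c in cols] for r in rows]

# Example: balanced_language_orders(["English", "German", "Hindi", "Thai"])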
5.3 Results & Analysis
In this section, we provide an extensive analysis to demonstrate the effectiveness of our Leitner-guided cross-lingual experience replay approach. Our primary analytical tool is forgetting, which
measures the degree to which a learned skill is lost when a model is trained on out-of-language data.
Lower forgetting is better, while negative forgetting indicates the model has improved as a result of
out-of-language training. We also show final performance, which is simply a metric’s value after all
phases of continual learning. For forgetting and final performance, we use the same formulation of
Dataset # Order
MTOP
1 English→German→Hindi→Thai
2 German →English→Thai→Hindi
3 Hindi→Thai→English→German
4 Thai→Hindi→German→English
MultiATIS++
1 English→French→Turkish→Chinese
2 French→English→Chinese→Turkish
3 Turkish→Chinese→English→French
4 Chinese→Turkish→French→English
TyDiQA
1 Russian→Indonesian→Telugu→Swahili
2 Indonesian→Russian→Swahili→Telugu
3 Telugu→Swahili→Russian→Indonesian
4 Swahili→Telugu→Indonesian→Russian
Table 5.2: Language permutations for MTOP, MultiATIS++, and TyDiQA.
evaluation protocols as defined in Section 4.1.5. We present the test performance both as a summary
based on the best epoch according to Dev data split performance and per epoch throughout different
training stages (Section 5.3.1). Then, we present a more fine-grained analysis, shedding light on
the language orders and languages for which our Leitner-based skill rating system is particularly helpful
(Section 5.3.2). Last but not least, we present a qualitative analysis of different categories of skill
ratings and what makes ruling out hard examples useful (Section 5.3.3).
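As a reminder of the flavor of these metrics (the precise protocol is the one defined in Section 4.1.5), a common formulation is sketched below: forgetting of a language is the drop from its best score, observed between its own phase and the penultimate phase, to its final score, and final performance is the average score after the last phase. This sketch is illustrative and may differ in detail from the exact protocol of Section 4.1.5.

def forgetting_and_final(perf):
    # perf[p][l]: evaluation score on language l (in datastream order) measured after phase p.
    # Assumes one phase per language and at least two languages.
    n = len(perf)
    final = perf[n - 1]
    drops = [max(perf[p][l] for p in range(l, n - 1)) - final[l] for l in range(n - 1)]
    avg_forgetting = sum(drops) / (n - 1)
    avg_final = sum(final) / n
    return avg_forgetting, avg_final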
5.3.1 Average Performance
In Table 5.3, we compare different Leitner-guided memory selection strategies and baselines for
MTOP, MultiATIS++, and TyDiQA benchmarks in terms of their forgetting. We start by showing
their forgetting on the test data averaged over different language orders based on the best-performing
model on the Dev data split. Compared to the No ER baseline, all ER approaches (Balanced, Random,
and the LER variants) are beneficial in reducing forgetting, irrespective of the strategy followed for
memory storage and update. It is clear that the forgetting gap between No ER and ER approaches
is more pronounced for MTOP and MultiATIS++ tasks than it is for TyDiQA. We conjecture that
this is due to the formulation of TyDiQA as a span-based question-answering task. The latter
employs a simple token classification model, which is less challenging than joint optimization over
classification and sequence modeling objectives in MTOD. The gains are even more pronounced
for MTOP, whose ontology covers more domains, intents, and slots than that of single-domain
MultiATIS++. Among MTOP subtasks, slot filling has a higher overall forgetting than intent
detection. The implication of all of these findings is that forgetting is more pronounced, and our
technique is more crucial when tasks are more difficult.
By keeping a balanced memory across languages, Balanced could have the benefit of making
sure to revisit all languages, assuming knowledge of the total number of languages involved in
the continual learning process. However, using a balanced memory across languages (Balanced) does not
Approach
MTOP MultiATIS++ TyDiQA
Intent Accuracy ↓ Slot F1 ↓ Slot F1 ↓ F1 ↓
No ER 5.84 7.56 2.62 1.52
Balanced† 0.92 1.15 −0.63 0.92
Random‡ 0.68 0.76 −0.56 0.73
LER (Easy) 0.49 0.63 −0.73 0.83
LER (Hard) 0.82 1.09 1.10 1.14
Table 5.3: Average Test forgetting scores based on the Dev data split performance of different
models and baselines. We compare two Leitner-guided memory replay variants LER (Easy) and
LER (Hard) to the baselines. Since no previous work on experience replay in the cross-lingual setup
reports any forgetting results, we implement in addition to No ER our internal baselines: Balanced
and Random adapted from †Lopez-Paz and Ranzato (2017) and ‡Riemer et al. (2019), respectively.
Best (lowest ↓) forgetting scores are highlighted in bold for each task and subtask.
(a) Average forgetting ↓. (b) Average final performance ↑.
Figure 5.2: Average forgetting and final performance of slot filling for different model variants
compared to the Random baseline averaged over different language orders. The lower the forgetting
and the higher the final performance the better.
lead to lower forgetting than picking a random memory across languages (Random). This could be
because Balanced picks a balanced amount of examples per language, exposing the model to less
diversity compared to Random. This could also show the need to continuously update the diversity
of memory to make room for higher-quality examples in continual learning. LER (Easy) stands
out as one of the most successful strategies in reducing forgetting and beating both experience
replay baselines. LER (Easy) reaches the lowest forgetting with reductions of 1.76, 0.64, and 0.46 in
forgetting of F1 score for slot filling compared to LER (Hard), Balanced, and Random respectively.
Results on MultiATIS++ and to some degree TyDiQA confirm the consistent superiority of LER
(Easy) and the inferiority of LER (Hard) approach.
For the remaining analysis, we focus on MTOP, shedding more light on the added value of LER
approaches compared to the best-performing ER baseline Random. Figures 5.2a and 5.2b show
the learning curves of different models on slot filling in terms of forgetting and final performance,
respectively. Throughout the training, LER (Easy) is, in general, more effective than Random and
LER (Hard) in minimizing forgetting while improving final performance in most cases, thus taming
the plasticity-stability dilemma. LER (Easy) can converge and stabilize at a low forgetting score
earlier in training. On the other hand, the LER (Hard) strategy exacerbates the forgetting problem
as training proceeds. This shows that replaying easy examples is consistently more effective than
revisiting hard ones that the model is struggling with. Our Leitner-based skill rating system provides
a dynamic measure that keeps selecting pertinent instances as language exemplars in constructing
the memory replay.
5.3.2 Fine-grained Language Analysis
Figures 5.3 and 5.4 show a fine-grained analysis of forgetting of intent classification and slot filling
between different models across all different language orders and languages, respectively. For
each language order and language, we report Test results for the best-performing model based on
Dev data split. Overall, we observe that LER (Easy) robustly and consistently outperforms LER
(Hard) across different language orders and languages. LER (Easy) outperforms Random for the
majority of the cases. Certain language orders such as Thai→Hindi→German→English (4) and
Hindi→Thai→English→German (3) have more forgetting than others. The languages that benefit
the most compared to Random are German and Thai, whereas the gains for English and Hindi are
more minimal. LER (Easy) manages to bridge that gap in forgetting, keeping it within a low range.
(a) Intent Classification. (b) Slot Filling.
Figure 5.3: Fine-grained analysis of forgetting over different language orders as defined in Table 5.2.
Best (lowest) results for each language order are highlighted in bold.
5.3.3 Discussion
In this part, we conduct a qualitative analysis to complement our conclusion from our quantitative
analysis that choosing training data for memory that is easy to learn is more beneficial than choosing
data that is not easily learned. To dig deeper into why ruling out harder examples from the memory
(a) Intent Classification. (b) Slot Filling.
Figure 5.4: Fine-grained analysis of forgetting over different languages. Best (lowest) results for
each language are highlighted in bold.
is beneficial, we look more closely at the characteristics of those hard cases among training data.
We define an intractable example as an example whose skill rating never gets promoted and stays
1 throughout training. At the other end of the spectrum is a confident example whose skill rating
converges to 5 and never gets demoted after that.
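Both categories can be read off an item's per-epoch rating trajectory; a small helper illustrating one reading of these definitions (ratings recorded after every epoch, range [1, 5]) is given below.

def categorize_trajectory(ratings, lo=1, hi=5):
    # 'intractable': the rating never rises above the minimum;
    # 'confident': the rating reaches the maximum and never drops afterwards;
    # everything else is treated as unstable.
    if all(r == lo for r in ratings):
        return "intractable"
    if hi in ratings and all(r == hi for r in ratings[ratings.index(hi):]):
        return "confident"
    return "unstable"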
In Figure 5.5, we report percentages of intractable and confident training data in MTOP for each
language, averaged over all phases and language orders. We notice that for each language, 70%
or more examples are confident. Thus, the Random approach to memory selection is unlikely to
differ all that much from the LER (Easy) approach, at least in intent detection for MTOP. For other
tasks with lower rates of easy examples, it might not be straightforward to pick easy examples with
a random approach. We also observe a trend where the more high-resource the language is, the less
likely its examples are intractable and the more likely its examples are confident. Thai, which has
the highest percentage of intractable examples, is the most low-resource in MTOP. This explains
why LER (Easy) is much more beneficial than Random for Thai compared to English for intent
classification in Figure 5.4a.
As an exemplar, we focus now on English data, specifically concentrating on training data
analysis from the end of the first phase in language order English→German→Hindi→Thai. To
understand what makes an example particularly intractable, we design the following categories:
• Low-resource (LR): A training instance is considered low-resource if the number of training
instances per its intent label is below 10. For English, there are 137 training instances per
intent on average. This makes low-resource labels fall within the 25th percentile.
• Difficult to disambiguate (DD): This is the case if the true class is among the most similar to
the predicted class. We determine that by computing the [CLS] token representations of all
training sentences. We then compute the centroid of the sentence representations per class
label. For each label, we determine its most similar predicted classes based on the 5 nearest
neighbors (a sketch of this centroid-and-nearest-neighbor procedure is given after this list).
• Poorly-defined (PD): Unlike low-resource and difficult to disambiguate examples, which
are automatically determined by their labels, we inspect here case by case for poorly-defined
sentences. We define a poorly-defined example as any sentence that doesn’t make sense to be
Figure 5.5: Percentages of examples that never get promoted past skill rating 1 (Skill never
promoted) and those that converge to the maximum skill rating 5 (Converged to max skill) per
language averaged over different language orders.
attributed to a certain label. This could be due to a mismatch or lack of commonsense in the
way the ontology was defined for certain labels.
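The DD criterion above can be computed directly from class centroids; the sketch below assumes cosine similarity between centroids (the similarity measure is not fixed in the text) and flags a misclassified example as DD when its true label is among the k most similar classes of its predicted label.

import numpy as np

def class_neighbors(sentence_embeddings, labels, k=5):
    # Compute the centroid of the [CLS] sentence embeddings of each class, then return,
    # for each class, the set of its k most similar classes by cosine similarity.
    classes = sorted(set(labels))
    centroids = np.stack([
        np.mean([e for e, l in zip(sentence_embeddings, labels) if l == c], axis=0)
        for c in classes
    ])
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = normed @ normed.T
    neighbors = {}
    for i, c in enumerate(classes):
        order = np.argsort(-sims[i])                       # most similar first; index 0 is the class itself
        neighbors[c] = {classes[j] for j in order[1:k + 1]}
    return neighbors

# A misclassified example counts as DD if: true_label in class_neighbors(...)[predicted_label]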
Figure 5.6: Distribution of different categories of intractable examples in the English data.
We show in Figure 5.6 some statistics of different categories of intractable examples. Most
intractable examples are either DD, LR, or both. Inspecting confident examples reveals that no LR
or DD examples are encountered among them. This demonstrates that our Leitner-guided approach
LER (Easy) can detect such hard categories and rule them out. By imposing a more fine-grained
skill rating system, our Leitner-guided memory replay approach provides a more reliable way
to determine which labels the model is struggling with than relying on prediction loss (Amiri
et al., 2018). The skill rating system adds information that prediction loss alone does not. In fact,
only 27% of the English examples that have skill ratings between 2 and 4 (neither intractable nor
Type Utterance True Class Prediction Classes
LR
Put this song on repeat. music:LOOP_MUSIC music:REPLAY_MUSIC
What is Tyler studying in school? people:GET_MAJOR people:GET_UNDERGRAD
Merge another call with this one. calling:MERGE_CALL calling:END_CALL
DD
Did Jack get sentenced today? news:GET_DETAILS_NEWS news:QUESTION_NEWS
How to make Arab tahini sauce? recipes:GET_INFO_RECIPES recipes:GET_RECIPES
What time does the sun come up tomorrow? weather:GET_SUNRISE weather:GET_SUNSET
PD
Where does Kade work? people:GET_LOCATION people:GET_EMPLOYER
Pause the current timer and delete. timer:PAUSE_TIMER timer:DELETE_TIMER
Increase my timer to 30 minutes. timer:CREATE_TIMER timer:RESTART_TIMER
Table 5.4: Examples of intractable examples and their golden truth and prediction intent labels from
each category.
-30
-20
-10
0
10
20
30
40
50
-80 -60 -40 -20 0 20 40 60 80
t
-SNE Dimesnion 2
t-SNE Dimension 1
alarm
calling
event
messaging
music
news
people
recipes
reminder
timer
weather
REPLAY_MUSIC
LOOP_MUSIC
GET_STORIES_NEWS
GET_DETAILS_NEWS
GET_
GET_ SUNRISE
SUNSET
GET_MAJOR
GET_UNDERGRAD
SET_AVAILABLE
SET_UNAVAILABLE
Figure 5.7: t-SNE visualization of centroids of different intent labels highlighting some ambiguous
labels indistinguishable in the embeddings space.
highly confident) are wrongly predicted at the end of the first phase. Those unstable examples are
part of the selection of LER (Hard), so not prioritizing such examples is beneficial.
In Table 5.4, we provide some examples of different categories of wrongly predicted labels. We
observe inconsistencies in those examples. Those that are specifically DD are so close to being
picked as representative examples of certain classes, which can only confuse the learner. For example, while "Pause the current timer and delete." is supposed to be classified as timer:PAUSE_TIMER,
this label is far from being comprehensively descriptive of the sentence intent. Its predicted label
timer:DELETE_TIMER is not wrong either, as it detects the intent to delete, which is the second
part of the example. We suspect that reinforcing the learning using difficult cases can only mislead
the learner.
In Figure 5.7, we show a t-SNE projection of the centroids of different intent label representations.
We highlight in that figure the most common DD labels whose representations are indistinguishable
in the vector space. Some of those labels like GET_STORIES_NEWS and GET_DETAILS_NEWS
are to the human eye also DD, which could be an artifact of how the intent ontology was defined.
Our Leitner-guided strategy LER (Easy) rules them out, favoring examples the learner is more
confident about with class labels that correspond to more clearly separable representations.
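A figure like 5.7 can be produced from the per-class centroids with a few lines of scikit-learn; the sketch below is illustrative, and the perplexity and other t-SNE settings are assumptions rather than the exact settings used for Figure 5.7.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_intent_centroids(centroids, intent_names, perplexity=10, seed=0):
    # Project per-intent centroid embeddings to 2-D with t-SNE and label each point.
    coords = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(centroids)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), name in zip(coords, intent_names):
        plt.annotate(name, (x, y), fontsize=6)
    plt.xlabel("t-SNE Dimension 1")
    plt.ylabel("t-SNE Dimension 2")
    plt.show()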
5.4 Conclusion
In this Chapter, we formulate a human-inspired experience replay approach specifically for cross-lingual continual learning. We propose a Leitner-based skill rating system to dynamically populate
and update the memory with high-quality items. Our approach can tame the plasticity-stability
dilemma better than vanilla and random selection, especially for complex tasks. The implications
of this analysis include a recipe for how to incorporate aspects of human learning in the design of
memory replay in cross-lingual continual learning.
Chapter 6
Related Work
This Chapter summarizes previous work related to this thesis and its context. We start by reviewing
work on cross-lingual transfer learning in general, focusing on specific application domains covered
in our evaluation (Section 6.1). We then conclude with related work on the limitations of this
traditional paradigm, setting the stage for the shift to newer paradigms. In Section 6.2, we review
applications of meta-learning algorithms to cross-lingual and multilingual downstream setups.
Section 6.3 reviews related work on lifelong learning for NLP in general before delving into work
that extends lifelong learning to cross-lingual applications. We conclude this Chapter by reviewing
previous work on human-inspired approaches with a focus on Leitner queues (Section 6.4).
6.1 Cross-lingual Transfer Learning
Cross-lingual embeddings are extensions of monolingual word embeddings (Mikolov et al., 2013),
encoding data from multiple languages in one joint space. Early work on cross-lingual embeddings
involved type-based embedding models that produce a single vector representation for each
word. Previous work on such embedding differs in the optimization approach, alignment data
used to train them, and how they are applied to the downstream task. They can either be pretrained from scratch (Gouws et al., 2015; Luong et al., 2015) or fine-tuned on top of monolingual
embedding (Smith et al., 2017; Mrksic et al., 2017). They could also differ in the alignment data used
either at the word (Klementiev et al., 2012), sentence (Hermann and Blunsom, 2014), or comparable
corpus levels (Vulic and Moens ´ , 2013; Søgaard et al., 2015). Later, unsupervised approaches such
as Lample et al. (2018) and Artetxe et al. (2018) are proposed to alleviate the need for parallel
data reaching competitive performance compared to supervised models. The earliest application
of such embedding involves direct transfer of learning where a model is trained on one language
and transferred to unseen languages leveraging cross-lingual embedding. Most work inducing
cross-lingual embedding evaluate on benchmarks like part-of-speech (POS) tagging (Cohen et al.,
2011), cross-lingual document classification (Klementiev et al., 2012; Schwenk and Li, 2018),
named entity recognition (Xie et al., 2018), etc.
Later on, type-based embeddings were superseded by contextualized encoders like ELMo (Peters et al., 2018) and Transformer-based encoders such as BERT (Devlin et al., 2019), which were
shown to achieve significant improvement in performance for many NLP tasks. Those models learn
representations that can generalize better to new tasks and domains. Their success is also due to
universal word piece tokenization, which handles out-of-vocabulary words better than type-based
embeddings. Multilingual extensions of Transformer-based embeddings, such as M-BERT, XLM, and
XLM-R, were proposed by Devlin et al. (2019), Conneau and Lample (2019), and Conneau et al. (2020),
respectively. These multilingual Transformer-based encoders are shown to outperform type-based
embeddings even if they are trained with no supervised alignment objective. This has encouraged
us to investigate the applicability and adaptability of different cross-lingual transfer techniques to
under-explored applications.
Below, we summarize previous work related to the specific application domains and evaluation
benchmarks covered in our thesis: event extraction, natural language understanding, and semantic
search. To understand related work for each application domain, we first describe its previous work
monolingually before describing techniques used to extend it cross-lingually and multilingually.
6.1.1 Event Extraction
In this Section, we highlight previous work on monolingual and cross-lingual event extraction,
focusing on work before our paper on contextualized cross-lingual event trigger extraction M’hamdi
et al. (2019).
Monolingual Event Extraction Previous work in event extraction on ACE2005 benchmark (Walker,
2006) mainly focuses on English. Early work such as cross-document (Ji and Grishman, 2008) and
cross-Event (Liao and Grishman, 2010) leverage hand-crafted linguistic features at the document
level to perform trigger identification followed by argument extraction in a pipelined fashion. Li et al.
propose a more structured framework for joint trigger labeling and argument extraction training.
Other approaches explore neural networks on top of linguistic features employing architectures
using dynamic multi-pooling CNNs (Chen et al., 2015) and bidirectional RNNs with manually
crafted features (Nguyen et al., 2016). A joint approach is proposed by Liu et al. (2018b) to extract
multiple events based on syntactic graph convolution networks. Zhang and Ji (2018) propose an
approach based on inverse reinforcement learning using Generative Adversarial Networks (GAN)
to alleviate mistakes related to ambiguous labels, making the model less vulnerable to biased,
supervised datasets like ACE2005. However, the majority of the described approaches involve,
to some degree, the use of linguistic features; this is labor intensive and requires rich external
resources, which are not necessarily available for low-resource languages.
Cross-lingual Event Extraction Prior efforts to our work (M’hamdi et al., 2019) on cross-lingual
event extraction involve manually crafted linguistic features or machine translation. Hsi et al. use
a mix of linguistic features and language-independent features such as universal dependencies
features and multilingual word embedding aligned using linear regression over a limited bilingual
dictionary. However, their approach doesn’t achieve a strong cross-lingual performance. Feng et al.
(2016) propose an approach based on a hybrid neural network combining bidirectional LSTM with
convolutional neural networks, reducing the need for manually crafted linguistic features. However,
this approach still requires equally abundant labeled data for different languages and implies the
need to independently train a new model for each new language. Liu et al. (2018b) propose gated
cross-lingual attention to exploit the inherent complementarity between different languages, which
helps with data scarcity. However, this approach relies on machine translation, which suffers from
error propagation. Our work on cross-lingual event trigger extraction (M’hamdi et al., 2019) is
the first to adapt a transfer learning approach based on multilingual embedding to event detection
without relying on machine translation or manually-crafted features. To the best of our knowledge,
there was no prior work adopting static or contextualized multilingual cross-lingual embedding
for event trigger detection. Our work inspired more studies using our formulation of event trigger
extraction as a sequence labeling problem, our released data splits for a fair comparison in different
languages, and our transfer learning evaluation schemes (Guzman Nateras et al., 2023; Nguyen
et al., 2023; Guzman Nateras et al., 2022). Our comparison of different multilingual type-based and
contextualized embedding under zero-shot, targeted, and joint learning helped with understanding
the dynamics of multilingual fine-tuning and drawing insights on the robustness and transferability
of current approaches (Lai, 2022).
6.1.2 Natural Language Understanding
In this Section, we first briefly summarize related work on different natural language understanding
tasks. We then focus on related work featuring different base model architectures for task-oriented
dialog system tasks. After that, we present work on extending natural language understanding to
multiple languages.
Monolingual Natural Language Understanding Natural language understanding refers to the
module intended to understand the user’s request in spoken dialog systems. A structured taskoriented dialog system involves two sub-tasks: intent detection and slot filling. Previous work
on such a system addresses each subtask independently, in a pipelined fashion, or trains on them
jointly. Intent detection is often treated as a text classification task (Xia et al., 2018; Casanueva
et al., 2020). Slot filling, on the other hand, is formulated as a sequence labeling task (Mesnil et al.,
2015). Further work shows that training on both subtasks jointly (Liu and Lane, 2016; Castellucci
et al., 2019) reaches better performance than training them in a pipelined way, especially for slot
filling. Later on, the term natural language understanding has evolved to encapsulate more tasks
such as question answering (Seo et al., 2017), sentiment analysis (Socher et al., 2013), and textual
entailment (Williams et al., 2018), with benchmarks like GLUE (Wang et al., 2018).
Cross-lingual Natural Language Understanding Upadhyay et al. (2018) and Schuster et al.
(2019a) propose the first real attempts at cross-lingual task-oriented dialog using transfer learning.
Although they show that cross-lingual joint training outperforms monolingual training, their zero-shot model lags behind machine translation for other languages. To circumvent imperfect alignments
in the cross-lingual representations, Liu et al. (2019) propose a latent variable model combined
with cross-lingual refinement with a small bilingual dictionary related to the dialog domain. Liu
et al. (2020) enhance Transformer-based embedding with mixed language training to learn interlingual semantics across languages. However, although these approaches show promising zero-shot
performance for Spanish, their learned refined alignments are not good enough to surpass machine
translation baselines on Thai. More recently, Hu et al. (2020) and Liang et al. (2020) introduce
XTREME and XGLUE benchmarks for the large-scale evaluation of cross-lingual capabilities
of pre-trained models across a diverse set of understanding and generation tasks. In addition to
M-BERT, they analyze models like XLM (Conneau and Lample, 2019) and Unicoder (Huang et al.,
2019). Although the latter two models slightly outperform M-BERT, they need a large amount of
parallel data to be pre-trained. The extent to which massive cross-lingual supervision helps bridge
the gap to linguistically distant languages is also unclear.
6.1.3 Semantic Search
In this Section, we highlight previous work on semantic search applications. Due to the scarcity of
cross-lingual semantic search and the popularity of cross-lingual information retrieval, we focus on
related work on cross-lingual information retrieval in the second part.
Monolingual Semantic Search Textual semantic search is the task of retrieving semantically
relevant content for a given query. Unlike traditional keyword-matching information retrieval,
semantic search seeks to improve search accuracy by understanding the searcher’s intent and
disambiguating the contextual meaning of the terms in the query (Muennighoff, 2022). Semantic
search has broad applications in search engines such as Google (Nayak, 2019), Bing (Zhu et al.,
2021), etc. They rely on Transformers (Vaswani et al., 2017) as their dominant architecture, going
beyond non-semantic models such as BM25 (Robertson and Zaragoza, 2009).
Cross-lingual Information Retrieval Progress in cross-lingual information retrieval (CLIR) has
seen multiple waves (Grefenstette, 1998). Traditionally, machine translation (MT) has been leveraged for CLIR. Most approaches in this category translate either the queries into the language
of the documents (Lu et al., 2008; Nguyen et al., 2008; Jones et al., 2008), the documents into the
language of the queries, or both into an intermediary language, and then perform a monolingual search.
Those MT pipeline approaches suffer from error propagation of the machine translation component
into the downstream semantic search, especially for low-resource languages. Moreover, the number
of language combinations in the query and content to be retrieved can get prohibitively large (Savoy
and Braschler, 2019). More prominent approaches include cross-lingual transfer learning, where
queries and documents or sentences are encoded into a shared space. The cross-lingual ability
of models like M-BERT and XLM has been analyzed for different retrieval-based downstream
applications including question-answer retrieval (Yang et al., 2020), bitext mining (Ziemski et al.,
2016; Zweigenbaum et al., 2017), and semantic textual similarity (Hoogeveen et al., 2015; Lei
et al., 2016). Litschko et al. (2022) benchmark the performance of Transformer-based multilingual
encoders in unsupervised ad-hoc (setup with no relevance judgments for IR-specific fine-tuning)
and supervised sentence and document-level CLIR. They show that those encoders need semantic
specialization through pre-fine-tuning on other auxiliary tasks such as XNLI (Conneau et al., 2018)
to outperform type-based embeddings. In this thesis, we investigate ways to further improve the
transfer of these off-purpose sentence encoders on top of semantic specialization in a data-efficient manner.
6.1.4 Limitations
Given the surprising capabilities of M-BERT in zero-shot cross-lingual applications, Wu and Dredze
(2019) and Pires et al. (2019) dig deeper to analyze the true cross-lingual capabilities of M-BERT.
They show that although M-BERT allows transfer even to languages with no lexical overlap, that
transfer exhibits systemic deficiencies between specific language pairs. Specifically, M-BERT does
not learn systematic transformations of those structures to accommodate a typologically diverse
target language. Hu et al. (2020) and Ruder et al. (2021) perform a more comprehensive analysis
probing the capabilities of different multilingual Transformers-based encoders. They find that there
is still a sizable gap between those encoders and human performance, namely for syntactic and
sentence retrieval tasks. Better alignment approaches on top of M-BERT are explored (Schuster
et al., 2019b; Wang et al., 2019). However, they often require parallel corpora, which is expensive
to obtain, especially for low-resource languages. In addition, due to the contextualized nature of
those embeddings, they usually involve a complex alignment objective over parallel data. This encourages
us to explore more resource-lean approaches involving few-shot learning.
6.2 Few-shot Learning
In this Section, we give a brief overview of various applications of meta-learning techniques with
a particular focus on MAML (Finn et al., 2017). Our discussion of related work is divided into
cross-lingual and multilingual applications.
Cross-lingual Meta-Learning Previous work in meta-learning for cross-lingual NLP is relatively
scarce and focused on applying first-order MAML. Earlier work by Gu et al. (2018) extends
MAML to improve low-resource languages for neural machine translation. Dou et al. (2019)
apply MAML to NLU tasks in the GLUE benchmark. They show that meta-learning is a better
alternative to multi-task learning, but they only validate their approach on English. Wu et al.
(2020) also use MAML for cross-lingual NER with a slight enhancement to the loss function.
More recently, Nooralahzadeh et al. (2020) also directly leverage MAML on top of M-BERT and
XLM-R for zero-shot and few-shot XNLI and MLQA datasets. Although their attempt shows
that cross-lingual transfer using MAML outperforms other baselines, the degree of typological
commonalities among languages plays a significant role in that effect. In addition, their approach
oversimplifies the n-way k-shot setup, with a one-size-fits-all sampling of data points for support and
query and additional supervised fine-tuning. Other applications include speech recognition (Hsu
et al., 2020; Winata et al., 2020; Chen et al., 2020; Xiao et al., 2021), Natural Language Inference
(XNLI) (Conneau et al., 2018) and Multilingual Question Answering (MLQA) (Lewis et al., 2020)
using X-MAML (Nooralahzadeh et al., 2020), task-oriented dialog (Schuster et al., 2019a) and
TyDiQA (Clark et al., 2020), dependency parsing (Langedijk et al., 2022), etc. This encourages us
to investigate cross-lingual transfer learning from a different perspective, where we more explicitly
distinguish between meta-learning and naive fine-tuning for transfer learning. Our paper on X-METRA-ADA (M'hamdi et al., 2021) has been featured in several follow-up works, including Lee
et al. (2022) and Hupkes et al. (2022), which survey state-of-the-art meta-learning approaches for
natural language processing.
Multilingual Meta-Learning Most recent work adapting meta-learning to applications involving
different languages focuses on cross-lingual meta-learning. Multilingual meta-learning differs
from cross-lingual meta-transfer learning in its support for multiple languages jointly. M’hamdi
et al. (2021), for example, propose X-METRA-ADA, which performs few-shot learning on a single
target language at a time and also enables zero-shot learning on target languages not seen during
meta-training or meta-adaptation. Their approach shows gains compared to naive fine-tuning in the
few-shot more than the zero-shot learning scenario. Tarunesh et al. (2021) propose a meta-learning
framework for both multi-task and multilingual transfer leveraging heuristic sampling approaches.
They show that a joint approach to multi-task and multilingual learning using meta-learning enables
effective sharing of parameters across multiple tasks and multiple languages, thus benefiting deeper
semantic analysis tasks such as QA, PAWS, NLI, etc. van der Heijden et al. (2021) propose a
meta-learning framework and show its effectiveness in both the cross-lingual and multilingual
training adaptation settings of document classification. However, their multilingual evaluation is
focused on the scenario where the same target languages during meta-testing can also be used as
auxiliary languages during meta-training. This motivates us to investigate, in this thesis, more
in the direction of multilingual meta-transfer learning, where we test the generalizability of our
meta-learning model when it is learned by taking into consideration multiple languages jointly for
semantic search.
6.3 Lifelong Learning
Given the scarcity of work on cross-lingual lifelong learning, we start by reviewing lifelong learning
for NLP applications in general. Then, we discuss how previous work formulated the problem of
lifelong learning for cross-lingual applications, the different families of approaches proposed, and
their limitations.
Lifelong Learning for NLP Continual learning approaches have found favor, especially among
the computer vision community, including regularization-based approaches (Kirkpatrick et al., 2017;
Zenke et al., 2017; Li and Hoiem, 2016; Ritter et al., 2018) and memory-based approaches (Shin
et al., 2017; Chaudhry et al., 2019b,a). Only recently has continual learning started gaining more
interest in the NLP community. Most efforts on continual learning for NLP have focused on
classification tasks and fall into the category of domain or class incremental continual learning (Han
et al., 2020). Current approaches often fail to effectively retain previous knowledge and adapt to
new information simultaneously (Biesialska et al., 2020; de Masson d’Autume et al., 2019). New
challenges are formulated to study the problem of continual learning from different perspectives. Jin
et al. (2022) frame the lifelong pre-training challenge, where pre-trained language models
continually adapt to emerging data from new corpora.
Cross-lingual Lifelong Learning Continual learning for cross-lingual NLP, on the other hand, is
relatively fresh ground. Some work focuses on proposing cross-lingual approaches that indirectly support
continual learning, such as Artetxe et al. (2020), who study the transferability of monolingual models. Other
approaches derive a cross-lingual continual learning problem directly from cross-lingual transfer
learning, such as Garcia et al. (2021), who propose a lexical approach to adapt to new low-resource
languages for machine translation. Similarly, Pfeiffer et al. (2021) propose lexical-level adaptation
schemes that can be applied to models relying on subword-based tokenization to adapt them to
low-resource languages not covered or whose scripts are unseen during pre-training. Minixhofer
et al. (2022) also propose adaptations beyond the lexical level, which facilitates the creation of
monolingual language models that are transferable for new languages. Liu et al. (2021) explore
continual techniques to fine-tune on downstream applications for new languages while preserving
the original cross-lingual ability of the pre-trained model. However, unlike our work, they all focus
on a two-hop analysis from high to low-resource language pairs or pre-training to fine-tuning tasks,
which analyzes across multiple hops. Muller et al. (2021) analyze the adaptability and usability
of large language models to unseen and under-studied low-resource languages. Based on that and
depending on the degree of additional pre-training and fine-tuning required, they categorize the
low-resource languages into easy, intermediate, and hard. Although this work paves the way for
a better understanding of the mechanics of transferability to low-resource scenarios, they don’t
study the scenario where the transferability needs to be performed in multiple hops following a
sequential stream of data. More recently, Pfeiffer et al. (2022) propose a new methodology for
language-specific modules to add additional capacity and deal with the curse of multilinguality, and
show that their approach mitigates negative interference between languages while enabling positive
transfer. They use a continual learning multi-hop evaluation paradigm closer to our setup. Still,
they only evaluate using interference and transfer and only one approach based on adapters. They
don’t analyze other aspects of cross-lingual continual learning capabilities using a holistic approach
like our work (M’hamdi et al., 2023). Our work inspired follow-up studies including Praharaj
and Matveeva (2022), which build on our findings and evaluation protocols, enforcing constraints on the
data stored from previous languages and considering dozens of hops instead of six
hops. Badola et al. (2023) also use our evaluation protocols, especially the design of language
permutations. They extend our continual learning setup for model-based approaches by proposing a
parameter-efficient fine-tuning strategy enforcing the constraint of a fixed model capacity. Marchisio
et al. (2023) explore more parameter-efficient alternatives for continually adapting cross-lingual
models to new languages in a post-hoc manner.
6.4 Human-Like Learning
In this Section, we summarize some previous work on human learning techniques in general and
how they inspired human-like machine learning in the context of continual learning. Then, we focus
more on related work on Leitner queues (Leitner, 1974), which is explored in this thesis.
Human-Like Learning Techniques Humans are continual learning systems by nature. They
differ from machines in their ability to memorize experiences quickly, with less exposure, and in
their gradual rather than catastrophic forgetting. Continual learning work inspired by human-like
learning can be divided into the following categories: mechanisms of sleep (Ball et al., 2020;
Mallya and Lazebnik, 2018; Schwarz et al., 2018), reactivation of memories (Hayes et al., 2020;
van de Ven et al., 2020), spaced-repetition (Smolen et al., 2016; Amiri et al., 2017; Amiri, 2019;
Feng et al., 2019; Klasson et al., 2023), etc. Sleep mechanism is a human-like approach involving
dual processes. One famous model in this category is progress and compress (Schwarz et al.,
2018): 1) during the day, the information is received in short-term memories and consolidated to
be moved to long-term memories; 2) during sleep, short-term memories are freed up for future
learning. Another way to simulate human sleep mechanisms is by iterative pruning and retraining
the model, inspired by the multiple non-rapid eye movement (NREM)/ rapid eye movement (REM)
cycles that humans experience during sleep (Ball et al., 2020). Reactivation of memories is another
human-inspired continual learning direction. One of the most famous approaches is experience
replay (Chaudhry et al., 2019b), which interleaves replaying from memory of previously seen tasks
while visiting newer ones. Numerous approaches looked into different considerations of memory
reactivation, including memory size, format, replaying frequency, etc. To comply with memory
budget limitations, some work stores and replays compressed old information (Hayes et al., 2020)
or hidden representations (van de Ven et al., 2020) instead of raw information. Spaced repetition
is a human-like technique used to strategically plan review schedules where the time intervals
between active recalls get increasingly longer as the memory strengthens following the Ebbinghaus
memory model (Ebbinghaus, 1885). Leitner queues are among the most famous approaches to
spaced repetition, as we shall describe next.
Leitner queues Leitner queues were first introduced by Leitner (1974) as a heuristic for humans
to schedule when to repeat information. Reddy et al. (2016) studied their theoretical bounds
and provided the first mathematical model for spaced repetition systems, which is empirically
tested on flashcard reviewing. They show that it leads to better long-term retention for humans
than arbitrarily designed revision plans. Leitner queues have recently started garnering attention
for machine learning. However, most of the work is focused on scheduling when to review data in
non-continual learning setups. Amiri et al. (2017) show the sample efficiency of a human-inspired
memory model to determine when to review each item as a function of the difficulty of the item
and the strength of the network. Klasson et al. (2023) propose a Monte Carlo tree search approach
for memory replay. More work such as Amiri et al. (2018) and Amiri (2019) demonstrate the
effectiveness of Leitner queues at determining spurious data and confident labels for self-training
applications. In this thesis, we are the first to test the effectiveness of Leitner queues in mitigating
forgetting in cross-lingual continual learning. We leverage Leitner queues as a skill rating system to
determine informative items dynamically in the cross-lingual memory.
Chapter 7
Discussion
This Chapter summarizes our different contributions and findings for the three parts (Section 7.1).
Then, we outline some future work directions that are interesting to explore but beyond this thesis’s
scope (Section 7.2).
7.1 Summary
Despite their promising breakthroughs and surprising zero-shot performance, current cross-lingual
models do not generalize equally well to low-resource, typologically diverse languages or to continuously
evolving language shifts, and do not balance well between different directions and desiderata
of transfer learning. In this thesis, we tackled three aspects of improving cross-lingual transfer
learning towards more human-like generalization. In Chapter 3, we started by comprehensively
applying multilingual contextualized representations to event trigger detection under zero-shot,
language-specific, targeted, and multilingual transfer schemes. Our analysis shed light on the gains
and limitations of multilingual contextualized representations. Then, on top of those representations,
we proposed cross-lingual and multilingual data-efficient meta-learning approaches and showed
their capabilities to ease the optimization process both cross-lingually and multilingually, leading
to faster generalization to low-resource and typologically diverse languages. In Chapter 4, we
proposed a new formulation of cross-lingual continual learning focusing on analyzing language
shifts using multiple hops, which helps us gain more insights into the interactions between different
types and directions of transfer. We showed the effectiveness of different approaches to continual
learning in surmounting the challenges of sequential cross-lingual fine-tuning. In Chapter 5, we
proposed a cognitively inspired memory replay approach based on Leitner queues and showed
its effectiveness at reducing forgetting on previously seen languages while maintaining the final
performance thus taming the plasticity-stability dilemma.
7.2 Future Directions
We outline below the future work directions for each part. Those include work directions that are
either infeasible or that we foresee studying could benefit the NLP community:
Few-shot Learning In Chapter 3, we have focused more on conventional human-spoken languages. However, what about less explored yet even more low-resource paradigms like code-switching and sign languages? One key future work direction is investigating and advancing the
capabilities of conventional cross-lingual transfer learning for such low-resource paradigms. We
focus on natural language understanding and semantic search applications in our evaluation of
different meta-learning approaches proposed. A larger scale evaluation of the effects of cross-lingual
and multilingual meta-transfer on other benchmark datasets (Hu et al., 2020; Ruder et al., 2021) is
out of our scope but is worth exploring. Investigating the feasibility of meta-learning without any
language-related pre-training would also be interesting. We focus on MAML, an optimization-based
approach, which is model-agnostic and thus easily adaptable to newer tasks. We leave the investigation of the merits of other families of meta-learning approaches, including metric-based methods,
and the curation of meta-tasks for that effect for future work. We conjecture that a hybrid system
that combines an optimization-based with a metric-based approach can boost the performance
in cross-lingual transfer learning. Last but not least, another direction in zero-shot cross-lingual
transfer learning could include studying unsupervised meta-transfer learning or the feasibility of
meta-learning without the manual curation of labeled meta-tasks.
Continual Learning In Chapter 4, we provide a systematic evaluation paradigm for cross-lingual
continual learning. For that purpose, we pick certain model expansion variants to analyze
the effect of individual model components; these choices are not meant to be comprehensive. The NLP community is
encouraged to build on this work to extensively study the impact of the scale of data and model
size. It is interesting to analyze the robustness of different approaches with different supervision
levels, such as using different proportions of the data for each language and simulating few-shot
scenarios. We believe that for low-resource scenarios, we need to investigate specific approaches
to continual learning, like meta-learning. We plan to explore that in future work. In our work,
we focus on analyzing data streams where each step focuses on one language at a time without
repetition. Evaluating our paradigm using more realistic setups of continual learning is another
interesting future work direction. This could involve working on scenarios where the data from
different languages are integrated as soon as they are annotated. Another realistic scenario is
different versions of the data for one language that could be part of different hops. The hardest part
is developing a logic or pattern to model the annotation process for different languages. Another
way to complement our evaluation paradigm is to control the shifts in similarity between different
languages and to define metrics to measure that. It would be interesting to evaluate different
continual learning algorithms for the robustness or sensitivity to other language orders with varying
degrees of similarity. Code-switching is a natural domain of application of continual learning.
Nowadays, due to globalization, code-switching patterns are among the fastest-growing language
patterns. In future work, it would be interesting to analyze the capabilities of current language
models and propose approaches to adapt them when exposed to emerging code-switching shifts.
Human-like Cross-lingual Learning In Chapter 5, we have focused on Leitner queues as an
approach to guide the process of memory replay. Future work could explore other variants of Leitner
queues or other approaches grounded in theories of human learning. For example, future work could
investigate more fine-grained methods, based on cognitive theories of how languages are forgotten,
that model the retention curve as a function of task difficulty, review periods, and the strength of
the model. This could help the NLP community understand how forgetting unfolds and when to schedule
reviews to counteract it. Our evaluation focuses on a representative set of natural language
understanding tasks. We show that our approach benefits challenging tasks more consistently.
However, we do not closely investigate whether there is a correlation between the difficulty of a
task and the effectiveness of our Leitner-based skill-rating approach on it. Such an analysis would
require a principled way to define what makes a task naturally more difficult or to simulate
difficulty synthetically. We leave a systematic, fine-grained analysis over more downstream tasks of
varying difficulty for future work.
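For readers unfamiliar with the mechanism, the sketch below shows the basic Leitner-queue
bookkeeping that can guide memory replay: examples the model answers correctly are promoted to
higher decks that are reviewed less often, while missed examples are demoted and reviewed sooner.
The number of decks, the review intervals, and all names are illustrative assumptions, not the
configuration used in Chapter 5.

    class LeitnerScheduler:
        # Items start in deck 0; deck d is reviewed every intervals[d] epochs.
        def __init__(self, items, num_decks=5):
            self.num_decks = num_decks
            self.deck = {item: 0 for item in items}
            self.intervals = [2 ** d for d in range(num_decks)]   # 1, 2, 4, 8, 16 epochs

        def promote(self, item):
            # Correctly answered items move to a less frequently reviewed deck.
            self.deck[item] = min(self.deck[item] + 1, self.num_decks - 1)

        def demote(self, item):
            # Missed items fall back to the most frequently reviewed deck.
            self.deck[item] = 0

        def due_for_replay(self, epoch):
            return [item for item, d in self.deck.items() if epoch % self.intervals[d] == 0]

        def update(self, item, model_was_correct):
            if model_was_correct:
                self.promote(item)
            else:
                self.demote(item)

    # Toy usage: replay whatever is due this epoch, then update decks from model feedback.
    scheduler = LeitnerScheduler(items=range(10))
    for epoch in range(1, 4):
        for item in scheduler.due_for_replay(epoch):
            # In practice, the model would be evaluated (or re-trained) on the replayed item here.
            scheduler.update(item, model_was_correct=(item % 2 == 0))

More cognitively grounded schedulers could replace the fixed doubling intervals with intervals
derived from an estimated retention curve, which is the kind of extension discussed above.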
Human-like directions for advancing the field of cross-lingual transfer learning are boundless.
One direction is building compositional cross-lingual models. Humans are known for their systematic
compositionality, that is, the ability to compose new knowledge from existing components. In
cross-lingual transfer learning, the main question is: if we know of an existing compositional
structure in a particular language, can we learn to generalize it across languages? It would be
interesting to investigate the cases in which systematic compositionality is beneficial, as well as
unsupervised or semi-supervised approaches for building compositional cross-lingual models whose
components are unknown. Another direction is building models that mimic humans in their learning
awareness and their ability to embrace the unknown: preparing cross-lingual models to recognize and
acknowledge what they do not know rather than producing wrong answers that could mislead humans.
Bibliography
Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. 2020. Knowledge distillation from internal representations. In The Thirty-Fourth AAAI Conference on Artificial
Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence
Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 7350–7357. AAAI Press.
URL: https://aaai.org/ojs/index.php/AAAI/article/view/6229.
Hadi Amiri. 2019. Neural self-training through spaced repetition. In Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long and Short Papers), pages 21–31, Minneapolis,
Minnesota. Association for Computational Linguistics. URL: https://aclanthology.o
rg/N19-1003.
Hadi Amiri, Timothy Miller, and Guergana Savova. 2017. Repeat before forgetting: Spaced repetition for efficient and effective training of neural networks. In Proceedings of the 2017 Conference
on Empirical Methods in Natural Language Processing, pages 2401–2410, Copenhagen, Denmark. Association for Computational Linguistics. URL: https://aclanthology.org/D
17-1255.
Hadi Amiri, Timothy Miller, and Guergana Savova. 2018. Spotting spurious data with neural
networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers),
pages 2006–2016, New Orleans, Louisiana. Association for Computational Linguistics. URL:
https://aclanthology.org/N18-1182.
Sébastien M. R. Arnold, Praateek Mahajan, Debajyoti Datta, Ian Bunner, and Konstantinos Saitas
Zarkias. 2020. learn2learn: A library for meta-learning research. arXiv:2008.12284.
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully
unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
789–798, Melbourne, Australia. Association for Computational Linguistics. URL: https:
//aclanthology.org/P18-1073.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of
monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.
URL: https://aclanthology.org/2020.acl-main.421.
Kartikeya Badola, Shachi Dave, and Partha Talukdar. 2023. Parameter-efficient finetuning for robust
continual multilingual learning. In Findings of the Association for Computational Linguistics:
ACL 2023, pages 9763–9780, Toronto, Canada. Association for Computational Linguistics. URL:
https://aclanthology.org/2023.findings-acl.619.
Philip J. Ball, Yingzhen Li, Angus Lamb, and Cheng Zhang. 2020. A study on efficiency in
continual learning inspired by human learning. ArXiv preprint, abs/2010.15187. URL: https:
//arxiv.org/abs/2010.15187.
Magdalena Biesialska, Katarzyna Biesialska, and Marta R. Costa-jussà. 2020. Continual lifelong
learning in natural language processing: A survey. In Proceedings of the 28th International
Conference on Computational Linguistics, pages 6523–6541, Barcelona, Spain (Online). International Committee on Computational Linguistics. URL: https://aclanthology.org/2
020.coling-main.574.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word
vectors with subword information. Transactions of the Association for Computational Linguistics,
5:135–146. URL: https://aclanthology.org/Q17-1010.
Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Multilingual alignment of contextual word
representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. URL: https://openreview.net
/forum?id=r1xCMyBtPS.
Gail A. Carpenter and Stephen Grossberg. 1988. ART 2: Self-organization of stable category
recognition codes for analog input patterns. In Other Conferences. URL: https://api.se
manticscholar.org/CorpusID:60534041.
Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020.
Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop
on Natural Language Processing for Conversational AI, pages 38–45, Online. Association for
Computational Linguistics. URL: https://aclanthology.org/2020.nlp4convai
-1.5.
Giuseppe Castellucci, Valentina Bellomaria, Andrea Favalli, and Raniero Romagnoli. 2019. Multi-lingual intent detection and slot filling in a joint BERT-based model. ArXiv preprint, abs/1907.02884.
URL: https://arxiv.org/abs/1907.02884.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages
1–14, Vancouver, Canada. Association for Computational Linguistics. URL: https://acla
nthology.org/S17-2001.
Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. 2019a.
Efficient lifelong learning with A-GEM. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. URL:
https://openreview.net/forum?id=Hkf2_sC5FX.
Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet Kumar
Dokania, Philip H. S. Torr, and Marc’Aurelio Ranzato. 2019b. Continual learning with tiny
episodic memories. ArXiv preprint, abs/1902.10486. URL: https://arxiv.org/abs/19
02.10486.
Chen Chen and Vincent Ng. 2012. Joint modeling for Chinese event extraction with rich linguistic
features. In Proceedings of COLING 2012, pages 529–544, Mumbai, India. The COLING 2012
Organizing Committee. URL: https://aclanthology.org/C12-1033.
Yi-Chen Chen, Jui-Yang Hsu, Cheng-Kuang Lee, and Hung-yi Lee. 2020. DARTS-ASR: differentiable architecture search for multilingual speech recognition and adaptation. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 1803–1807. ISCA. URL:
https://doi.org/10.21437/Interspeech.2020-1315.
Yubo Chen, Liheng Xu, Kang Liu, Daojian Zeng, and Jun Zhao. 2015. Event extraction via dynamic
multi-pooling convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), pages 167–176, Beijing, China. Association for
Computational Linguistics. URL: https://aclanthology.org/P15-1017.
Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev,
and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question
answering in typologically diverse languages. Transactions of the Association for Computational
Linguistics, 8:454–470. URL: https://aclanthology.org/2020.tacl-1.30.
Shay B. Cohen, Dipanjan Das, and Noah A. Smith. 2011. Unsupervised structure prediction
with non-parallel multilingual guidance. In Proceedings of the 2011 Conference on Empirical
Methods in Natural Language Processing, pages 50–61, Edinburgh, Scotland, UK. Association
for Computational Linguistics. URL: https://aclanthology.org/D11-1005.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek,
Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020.
Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association
for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.
747.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information
Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages
7057–7067. URL: https://proceedings.neurips.cc/paper/2019/hash/c04
c19c2c2474dbf5f7ac4372c5b9af1-Abstract.html.
Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger
Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations.
In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics. URL: https:
//aclanthology.org/D18-1269.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of
the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–
4186, Minneapolis, Minnesota. Association for Computational Linguistics. URL: https:
//aclanthology.org/N19-1423.
Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms
for low-resource natural language understanding tasks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), pages 1192–1197, Hong Kong, China.
Association for Computational Linguistics. URL: https://aclanthology.org/D19-1
112.
Hermann Ebbinghaus. 1885. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie.
Duncker & Humblot.
Kanyin Feng, Xiao Zhao, Jing Liu, Ying Cai, Zhifang Ye, Chuansheng Chen, and Gui Xue.
2019. Spaced learning enhances episodic memory by increasing neural pattern similarity across
repetitions. The Journal of Neuroscience, 39:2741–18.
Xiaocheng Feng, Lifu Huang, Duyu Tang, Heng Ji, Bing Qin, and Ting Liu. 2016. A language-independent neural network for event detection. In Proceedings of the 54th Annual Meeting of
the Association for Computational Linguistics (Volume 2: Short Papers), pages 66–71, Berlin,
Germany. Association for Computational Linguistics. URL: https://aclanthology.o
rg/P16-2011.
Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast
adaptation of deep networks. In Proceedings of the 34th International Conference on Machine
Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of
Machine Learning Research, pages 1126–1135. PMLR. URL: http://proceedings.mlr.
press/v70/finn17a.html.
Robert M. French. 1993. Catastrophic interference in connectionist networks: Can it be predicted,
can it be prevented? In Advances in Neural Information Processing Systems 6, [7th NIPS
Conference, Denver, Colorado, USA, 1993], pages 1176–1177. Morgan Kaufmann. URL:
http://papers.nips.cc/paper/799-catastrophic-interference-in-c
onnectionist-networks-can-it-be-predicted-can-it-be-prevented.
Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019. FewRel
2.0: Towards more challenging few-shot relation classification. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6250–6255, Hong
Kong, China. Association for Computational Linguistics. URL: https://aclanthology
.org/D19-1649.
Xavier Garcia, Noah Constant, Ankur Parikh, and Orhan Firat. 2021. Towards continual learning
for multilingual machine translation via vocabulary substitution. In Proceedings of the 2021
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, pages 1184–1192, Online. Association for Computational
Linguistics. URL: https://aclanthology.org/2021.naacl-main.93.
Goran Glavaš, Robert Litschko, Sebastian Ruder, and Ivan Vulić. 2019. How to (properly) evaluate
cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 710–721, Florence, Italy. Association for Computational Linguistics. URL:
https://aclanthology.org/P19-1070.
Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. Bilbowa: Fast bilingual distributed
representations without word alignments. In Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37, ICML’15, page 748–756. JMLR.org.
Gregory Grefenstette. 1998. Cross-language information retrieval. In Proceedings of the Third
Conference of the Association for Machine Translation in the Americas: Tutorial Descriptions,
Langhorne, PA, USA. Springer. URL: https://aclanthology.org/1998.amta-tut
orials.5.
Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for
low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 3622–3631, Brussels, Belgium. Association for
Computational Linguistics. URL: https://aclanthology.org/D18-1398.
Luis Guzman Nateras, Franck Dernoncourt, and Thien Nguyen. 2023. Hybrid knowledge transfer
for improved cross-lingual event detection via hierarchical sample selection. In Proceedings
of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 5414–5427, Toronto, Canada. Association for Computational Linguistics. URL:
https://aclanthology.org/2023.acl-long.296.
Luis Guzman Nateras, Viet Lai, Franck Dernoncourt, and Thien Nguyen. 2022. Few-shot cross-lingual learning for event detection. In Proceedings of the The 2nd Workshop on Multi-lingual
Representation Learning (MRL), pages 16–27, Abu Dhabi, United Arab Emirates (Hybrid).
Association for Computational Linguistics. URL: https://aclanthology.org/2022.
mrl-1.2.
Raia Hadsell, Dushyant Rao, Andrei Rusu, and Razvan Pascanu. 2020. Embracing change: Continual learning in deep neural networks. Trends in Cognitive Sciences, 24:1028–1040.
Xu Han, Tianyu Gao, Yankai Lin, Hao Peng, Yaoliang Yang, Chaojun Xiao, Zhiyuan Liu, Peng Li,
Jie Zhou, and Maosong Sun. 2020. More data, more relations, more context and more openness:
A review and outlook for relation extraction. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint
Conference on Natural Language Processing, pages 745–758, Suzhou, China. Association for
Computational Linguistics. URL: https://aclanthology.org/2020.aacl-main.
75.
Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, and Christopher Kanan. 2020.
REMIND your neural network to prevent catastrophic forgetting. In Computer Vision - ECCV
2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII,
volume 12353 of Lecture Notes in Computer Science, pages 466–483. Springer. URL: https:
//doi.org/10.1007/978-3-030-58598-3_28.
Avijit Hazra. 2017. Using the confidence interval confidently. Journal of Thoracic Disease,
9(10):4124–4129.
Niels van der Heijden, Helen Yannakoudakis, Pushkar Mishra, and Ekaterina Shutova. 2021.
Multilingual and cross-lingual document classification: A meta-learning approach. In Proceedings
of the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, pages 1966–1976, Online. Association for Computational Linguistics. URL:
https://aclanthology.org/2021.eacl-main.168.
K Hermann and P Blunsom. 2014. Multilingual distributed representations without word alignment.
International Conference on Learning Representations 2014.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural
network. ArXiv preprint, abs/1503.02531. URL: https://arxiv.org/abs/1503.025
31.
Stella Ho, Ming Liu, Lan Du, Longxiang Gao, and Yong Xiang. 2023. Prototype-guided memory
replay for continual learning. IEEE Transactions on Neural Networks and Learning Systems,
pages 1–11.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput.,
9(8):1735–1780. URL: http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Yu Hong, Jianfeng Zhang, Bin Ma, Jianmin Yao, Guodong Zhou, and Qiaoming Zhu. 2011.
Using cross-entity inference to improve event extraction. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics: Human Language Technologies,
pages 1127–1136, Portland, Oregon, USA. Association for Computational Linguistics. URL:
https://aclanthology.org/P11-1113.
Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. 2015. Cqadupstack: A benchmark
data set for community question-answering research. In Proceedings of the 20th Australasian
Document Computing Symposium, ADCS 2015, Parramatta, NSW, Australia, December 8-9,
2015, pages 3:1–3:8. ACM. URL: https://doi.org/10.1145/2838931.2838934.
Timothy M. Hospedales, Antreas Antoniou, Paul Micaelli, and Amos J. Storkey. 2020. Meta-learning in neural networks: A survey. ArXiv preprint, abs/2004.05439. URL: https://arxi
v.org/abs/2004.05439.
Andrew Hsi, Yiming Yang, Jaime Carbonell, and Ruochen Xu. 2016. Leveraging multilingual
training for limited resource event extraction. In Proceedings of COLING 2016, the 26th
International Conference on Computational Linguistics: Technical Papers, pages 1201–1210,
Osaka, Japan. The COLING 2016 Organizing Committee. URL: https://aclanthology
.org/C16-1114.
Jui-Yang Hsu, Yuan-Jui Chen, and Hung-yi Lee. 2020. Meta learning for end-to-end low-resource
speech recognition. In 2020 IEEE International Conference on Acoustics, Speech and Signal
Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020, pages 7844–7848. IEEE. URL:
https://doi.org/10.1109/ICASSP40776.2020.9053112.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson.
2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual
generalisation. In Proceedings of the 37th International Conference on Machine Learning, ICML
2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research,
pages 4411–4421. PMLR. URL: http://proceedings.mlr.press/v119/hu20b.h
tml.
Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming
Zhou. 2019. Unicoder: A universal language encoder by pre-training with multiple cross-lingual
tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2485–2494, Hong Kong, China. Association for Computational Linguistics.
URL: https://aclanthology.org/D19-1252.
Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel,
Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer,
Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari,
Maria Ryskina, Rita Frieske, Ryan Cotterell, and Zhijing Jin. 2022. A taxonomy and review of
generalization research in NLP. Nature Machine Intelligence.
David Isele and Akansel Cosgun. 2018. Selective experience replay for lifelong learning. In
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the
30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium
on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA,
February 2-7, 2018, pages 3302–3309. AAAI Press. URL: https://www.aaai.org/ocs
/index.php/AAAI/AAAI18/paper/view/16054.
Ivan Pavlov. 2010. Conditioned reflexes: An investigation of the physiological activity of the
cerebral cortex. Annals of Neurosciences.
Heng Ji and Ralph Grishman. 2008. Refining event extraction through cross-document inference. In
Proceedings of ACL-08: HLT, pages 254–262, Columbus, Ohio. Association for Computational
Linguistics. URL: https://aclanthology.org/P08-1030.
Xisen Jin, Dejiao Zhang, Henghui Zhu, Wei Xiao, Shang-Wen Li, Xiaokai Wei, Andrew Arnold,
and Xiang Ren. 2022. Lifelong pretraining: Continually adapting language models to emerging
corpora. In Proceedings of the 2022 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pages 4764–4780, Seattle, United
States. Association for Computational Linguistics. URL: https://aclanthology.org
/2022.naacl-main.351.
Gareth Jones, Fabio Fantino, Eamonn Newman, and Ying Zhang. 2008. Domain-specific query
translation for multilingual information access using machine translation augmented with dictionaries mined from Wikipedia. In Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies. URL:
https://aclanthology.org/I08-6005.
Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state
and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association
for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.
560.
Mahendra Kariya. 2018. Dark knowledge in neural networks. Accessed on March 7th, 2023. URL:
https://medium.com/@mahendrakariya/dark-knowledge-in-neural-net
works-467e5d699181.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May
7-9, 2015, Conference Track Proceedings. URL: http://arxiv.org/abs/1412.6980.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A.
Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis,
Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting
in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526. URL:
https://www.pnas.org/content/114/13/3521, arXiv:https://www.pnas
.org/content/114/13/3521.full.pdf.
Marcus Klasson, Hedvig Kjellström, and Cheng Zhang. 2023. Learn the time to learn: Replay
scheduling in continual learning. Transactions on Machine Learning Research. URL: https:
//research.aalto.fi/en/publications/learn-the-time-to-learn-rep
lay-scheduling-in-continual-learning.
Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012, pages 1459–1474, Mumbai, India.
The COLING 2012 Organizing Committee. URL: https://aclanthology.org/C12-1
089.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings
of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395,
Barcelona, Spain. Association for Computational Linguistics. URL: https://aclantholo
gy.org/W04-3250.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random
fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the
Eighteenth International Conference on Machine Learning (ICML 2001), Williams College,
Williamstown, MA, USA, June 28 - July 1, 2001, pages 282–289. Morgan Kaufmann. URL:
https://dl.acm.org/doi/10.5555/645530.655813.
Viet Dac Lai. 2022. Event extraction: A survey. CoRR, abs/2210.03419. URL: https://doi.
org/10.48550/arXiv.2210.03419, arXiv:2210.03419.
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer.
2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 260–270, San Diego, California. Association for Computational Linguistics.
URL: https://aclanthology.org/N16-1030.
Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou.
2018. Word translation without parallel data. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track
Proceedings. OpenReview.net. URL: https://openreview.net/forum?id=H196sa
inb.
Anna Langedijk, Verna Dankers, Phillip Lippe, Sander Bos, Bryan Cardenas Guevara, Helen Yannakoudakis, and Ekaterina Shutova. 2022. Meta-learning for fast cross-lingual adaptation in dependency parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 8503–8520, Dublin, Ireland. Association for Computational Linguistics. URL: https://aclanthology.org/2022.acl-long.582.
Pat Langley. 2022. The computational gauntlet of human-like learning. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications
of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 12268–12273.
AAAI Press. URL: https://doi.org/10.1609/aaai.v36i11.21489.
Hung-yi Lee, Shang-Wen Li, and Thang Vu. 2022. Meta learning for natural language processing: A
survey. In Proceedings of the 2022 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pages 666–684, Seattle, United
States. Association for Computational Linguistics. URL: https://aclanthology.org
/2022.naacl-main.49.
Tao Lei, Hrishikesh Joshi, Regina Barzilay, Tommi Jaakkola, Kateryna Tymoshenko, Alessandro
Moschitti, and Lluís Màrquez. 2016. Semi-supervised question retrieval with gated convolutions.
In Proceedings of the 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 1279–1289, San Diego,
California. Association for Computational Linguistics. URL: https://aclanthology.o
rg/N16-1153.
S. Leitner. 1974. So lernt man lernen. Herder. URL: https://books.google.com/books
?id=opWFRAAACAAJ.
Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2020. MLQA:
Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 7315–7330, Online. Association
for Computational Linguistics. URL: https://aclanthology.org/2020.acl-main.
653.
Haoran Li, Abhinav Arora, Shuohui Chen, Anchit Gupta, Sonal Gupta, and Yashar Mehdad. 2021.
MTOP: A comprehensive multilingual task-oriented semantic parsing benchmark. In Proceedings
of the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, pages 2950–2962, Online. Association for Computational Linguistics. URL:
https://aclanthology.org/2021.eacl-main.257.
Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global
features. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 73–82, Sofia, Bulgaria. Association for Computational
Linguistics. URL: https://aclanthology.org/P13-1008.
Zhizhong Li and Derek Hoiem. 2016. Learning without forgetting. In Computer Vision - ECCV 2016
- 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings,
Part IV, volume 9908 of Lecture Notes in Computer Science, pages 614–629. Springer. URL:
https://doi.org/10.1007/978-3-319-46493-0_37.
Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou,
Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Zhang, Rahul Agrawal, Edward Cui, Sining
Wei, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel
Campos, Rangan Majumder, and Ming Zhou. 2020. XGLUE: A new benchmark dataset for
cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pages 6008–6018, Online.
Association for Computational Linguistics. URL: https://aclanthology.org/2020.
emnlp-main.484.
Shasha Liao and Ralph Grishman. 2010. Using document level cross-event inference to improve
event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, pages 789–797, Uppsala, Sweden. Association for Computational Linguistics. URL:
https://aclanthology.org/P10-1081.
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, and Goran Glavaš. 2021. Evaluating multilingual text encoders for unsupervised cross-lingual retrieval. In Advances in Information Retrieval
- 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1,
2021, Proceedings, Part I, volume 12656 of Lecture Notes in Computer Science, pages 342–358.
Springer. URL: https://doi.org/10.1007/978-3-030-72113-8_23.
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, and Goran Glavaš. 2022. On cross-lingual
retrieval with multilingual text encoders. Inf. Retr. J., 25(2):149–183. URL: https://doi.
org/10.1007/s10791-022-09406-x.
Bing Liu and Ian Lane. 2016. Joint online spoken language understanding and language modeling
with recurrent neural networks. In Proceedings of the 17th Annual Meeting of the Special Interest
Group on Discourse and Dialogue, pages 22–30, Los Angeles. Association for Computational
Linguistics. URL: https://aclanthology.org/W16-3603.
Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2018a. Event detection via gated multilingual
attention mechanism. In Proceedings of the Thirty-Second AAAI Conference on Artificial
Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18),
and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18),
New Orleans, Louisiana, USA, February 2-7, 2018, pages 4865–4872. AAAI Press. URL:
https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16371.
Jihao Liu, Boxiao Liu, Hongsheng Li, and Yu Liu. 2022. Meta knowledge distillation. ArXiv
preprint, abs/2202.07940. URL: https://arxiv.org/abs/2202.07940.
Shulin Liu, Kang Liu, Shizhu He, and Jun Zhao. 2016. A probabilistic soft logic based approach to
exploiting latent and global information in event classification. In Proceedings of the Thirtieth
AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages
2993–2999. AAAI Press. URL: http://www.aaai.org/ocs/index.php/AAAI/AA
AI16/paper/view/11990.
Xiao Liu, Zhunchen Luo, and Heyan Huang. 2018b. Jointly multiple events extraction via attention-based graph information aggregation. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 1247–1256, Brussels, Belgium. Association for
Computational Linguistics. URL: https://aclanthology.org/D18-1156.
Zihan Liu, Jamin Shin, Yan Xu, Genta Indra Winata, Peng Xu, Andrea Madotto, and Pascale
Fung. 2019. Zero-shot cross-lingual dialogue systems with transferable latent variables. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
pages 1297–1303, Hong Kong, China. Association for Computational Linguistics. URL: https:
//aclanthology.org/D19-1129.
Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, and Pascale Fung. 2020. Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. In
The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second
Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA,
February 7-12, 2020, pages 8433–8440. AAAI Press. URL: https://aaai.org/ojs/ind
ex.php/AAAI/article/view/6362.
Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung. 2021. Preserving cross-linguality of pre-trained models via continual learning. In Proceedings of the 6th Workshop
on Representation Learning for NLP (RepL4NLP-2021), pages 64–71, Online. Association for
Computational Linguistics. URL: https://aclanthology.org/2021.repl4nlp-1
.8.
David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual
learning. In Advances in Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages
6467–6476. URL: https://proceedings.neurips.cc/paper/2017/hash/f87
522788a2be2d171666752f97ddebb-Abstract.html.
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International
Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net. URL: https://openreview.net/forum?id=Bkg6RiCqY7.
Chengye Lu, Yue Xu, and Shlomo Geva. 2008. Web-based query translation for English-Chinese
CLIR. In International Journal of Computational Linguistics & Chinese Language Processing,
Volume 13, Number 1, March 2008: Special Issue on Cross-Lingual Information Retrieval and
Question Answering, pages 61–90. URL: https://aclanthology.org/O08-3004.
Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with
monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for
Natural Language Processing, pages 151–159, Denver, Colorado. Association for Computational
Linguistics. URL: https://aclanthology.org/W15-1521.
Andrea Madotto, Zhaojiang Lin, Zhenpeng Zhou, Seungwhan Moon, Paul Crook, Bing Liu,
Zhou Yu, Eunjoon Cho, Pascale Fung, and Zhiguang Wang. 2021. Continual learning in task-oriented dialogue systems. In Proceedings of the 2021 Conference on Empirical Methods in
Natural Language Processing, pages 7452–7467, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics. URL: https://aclanthology.org/2021.
emnlp-main.590.
Arun Mallya and Svetlana Lazebnik. 2018. Packnet: Adding multiple tasks to a single network by
iterative pruning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR
2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 7765–7773. IEEE Computer Society.
URL: http://openaccess.thecvf.com/content_cvpr_2018/html/Mallya_
PackNet_Adding_Multiple_CVPR_2018_paper.html.
Kelly Marchisio, Patrick Lewis, Yihong Chen, and Mikel Artetxe. 2023. Mini-model adaptation:
Efficiently extending pretrained models to new languages via aligned shallow training. In Findings
of the Association for Computational Linguistics: ACL 2023, pages 5474–5490, Toronto, Canada.
Association for Computational Linguistics. URL: https://aclanthology.org/2023.
findings-acl.338.
Cyprien de Masson d’Autume, Sebastian Ruder, Lingpeng Kong, and Dani Yogatama. 2019.
Episodic memory in lifelong language learning. In Advances in Neural Information Processing
Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS
2019, December 8-14, 2019, Vancouver, BC, Canada, pages 13122–13131. URL: https:
//proceedings.neurips.cc/paper/2019/hash/f8d2e80c1458ea2501f98
a2cafadb397-Abstract.html.
Michael McCloskey and Neal J. Cohen. 1989. Catastrophic interference in connectionist networks:
The sequential learning problem. In Gordon H. Bower, editor, Psychology of Learning and
Motivation, volume 24, pages 109–165. Academic Press. URL: https://www.scienced
irect.com/science/article/pii/S0079742108605368.
Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur,
Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, and Geoffrey Zweig. 2015. Using recurrent
neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 23(3):530–539.
Meryem M’hamdi, Marjorie Freedman, and Jonathan May. 2019. Contextualized cross-lingual event
trigger extraction with minimal resources. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 656–665, Hong Kong, China. Association for
Computational Linguistics. URL: https://aclanthology.org/K19-1061.
Meryem M’hamdi, Doo Soon Kim, Franck Dernoncourt, Trung Bui, Xiang Ren, and Jonathan
May. 2021. X-METRA-ADA: Cross-lingual meta-transfer learning adaptation to natural language understanding and question answering. In Proceedings of the 2021 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 3617–3632, Online. Association for Computational Linguistics. URL:
https://aclanthology.org/2021.naacl-main.283.
Meryem M’hamdi, Jonathan May, Franck Dernoncourt, Trung Bui, and Seunghyun Yoon. 2023.
Multilingual sentence-level semantic search using meta-distillation learning. arXiv:2309.0
8185.
Meryem M’hamdi, Xiang Ren, and Jonathan May. 2023. Cross-lingual continual learning. In
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 3908–3943, Toronto, Canada. Association for Computational Linguistics.
URL: https://aclanthology.org/2023.acl-long.217.
Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems
2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States,
pages 3111–3119. URL: https://proceedings.neurips.cc/paper/2013/hash
/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
Benjamin Minixhofer, Fabian Paischer, and Navid Rekabsaz. 2022. WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
In Proceedings of the 2022 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, pages 3992–4006, Seattle, United
States. Association for Computational Linguistics. URL: https://aclanthology.org
/2022.naacl-main.293.
Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna
Korhonen, and Steve J. Young. 2017. Semantic specialisation of distributional word vector
spaces using monolingual and cross-lingual constraints. CoRR, abs/1706.00374. URL: http:
//arxiv.org/abs/1706.00374, arXiv:1706.00374.
Niklas Muennighoff. 2022. SGPT: GPT sentence embeddings for semantic search. ArXiv preprint,
abs/2202.08904. URL: https://arxiv.org/abs/2202.08904.
Benjamin Muller, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. 2021. When
being unseen from mBERT is just the beginning: Handling new languages with multilingual
language models. In Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pages 448–462,
Online. Association for Computational Linguistics. URL: https://aclanthology.org
/2021.naacl-main.38.
Pandu Nayak. 2019. Understanding searches better than ever before. URL: https://blog.g
oogle/products/search/search-language-understanding-bert/.
Chien Nguyen, Linh Ngo, and Thien Nguyen. 2023. Retrieving relevant context to align representations for cross-lingual event detection. In Findings of the Association for Computational
Linguistics: ACL 2023, pages 2157–2170, Toronto, Canada. Association for Computational
Linguistics. URL: https://aclanthology.org/2023.findings-acl.135.
Dong Nguyen, Arnold Overwijk, Claudia Hauff, Dolf Trieschnigg, Djoerd Hiemstra, and Franciska
de Jong. 2008. Wikitranslate: Query translation for cross-lingual information retrieval using
only wikipedia. In Evaluating Systems for Multilingual and Multimodal Information Access, 9th
Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September
17-19, 2008, Revised Selected Papers, volume 5706 of Lecture Notes in Computer Science, pages
58–65. Springer. URL: https://doi.org/10.1007/978-3-642-04447-2_6.
Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via
recurrent neural networks. In Proceedings of the 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pages
300–309, San Diego, California. Association for Computational Linguistics. URL: https:
//aclanthology.org/N16-1034.
Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms.
ArXiv preprint, abs/1803.02999. URL: https://arxiv.org/abs/1803.02999.
Farhad Nooralahzadeh, Giannis Bekoulis, Johannes Bjerva, and Isabelle Augenstein. 2020. Zero-shot cross-lingual transfer with meta learning. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pages 4547–4562, Online. Association for Computational Linguistics. URL: https://aclanthology.org/2020.em
nlp-main.368.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New
Orleans, Louisiana. Association for Computational Linguistics. URL: https://aclantho
logy.org/N18-1202.
Jonas Pfeiffer, Naman Goyal, Xi Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe.
2022. Lifting the curse of multilinguality by pre-training modular transformers. In Proceedings
of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3479–3495, Seattle, United States. Association
for Computational Linguistics. URL: https://aclanthology.org/2022.naacl-m
ain.255.
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder,
Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics. URL: https://aclanthology.org/2020.emnlp-demos.7.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An Adapter-
Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online.
Association for Computational Linguistics. URL: https://aclanthology.org/2020.
emnlp-main.617.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2021. UNKs everywhere:
Adapting multilingual language models to new scripts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10186–10203, Online
and Punta Cana, Dominican Republic. Association for Computational Linguistics. URL:
https://aclanthology.org/2021.emnlp-main.800.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT?
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 4996–5001, Florence, Italy. Association for Computational Linguistics. URL: https:
//aclanthology.org/P19-1493.
Karan Praharaj and Irina Matveeva. 2022. On robust incremental learning over many multilingual
steps. In IEEE International Conference on Data Mining Workshops, ICDM 2022 - Workshops,
Orlando, FL, USA, November 28 - Dec. 1, 2022, pages 852–859. IEEE. URL: https://doi.
org/10.1109/ICDMW58026.2022.00114.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. 2017.
iCaRL: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5533–
5542. IEEE Computer Society. URL: https://doi.org/10.1109/CVPR.2017.587.
Siddharth Reddy, Igor Labutov, Siddhartha Banerjee, and Thorsten Joachims. 2016. Unbounded
human learning: Optimal scheduling for spaced repetition. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco,
CA, USA, August 13-17, 2016, pages 1815–1824. ACM. URL: https://doi.org/10.114
5/2939672.2939850.
Ibraheem Rehman, Navid Mahabadi, Terrence Sanvictores, and Chaudhry I. Rehman. 2023. Classical conditioning. Treasure Island (FL): StatPearls Publishing.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese
BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational
Linguistics. URL: https://aclanthology.org/D19-1410.
Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual
using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 4512–4525, Online. Association for Computational Linguistics. URL: https://aclanthology.org/2020.emnlp-main.365.
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald
Tesauro. 2019. Learning to learn without forgetting by maximizing transfer and minimizing
interference. In 7th International Conference on Learning Representations, ICLR 2019, New
Orleans, LA, USA, May 6-9, 2019. OpenReview.net. URL: https://openreview.net/f
orum?id=B1gTShAct7.
Mark B. Ring. 1995. Continual learning in reinforcement environments. Ph.D. thesis, University of
Texas at Austin, TX, USA. URL: https://d-nb.info/945690320.
Hippolyt Ritter, Aleksandar Botev, and David Barber. 2018. Online structured Laplace approximations for overcoming catastrophic forgetting. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018,
NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 3742–3752. URL: https:
//proceedings.neurips.cc/paper/2018/hash/f31b20466ae89669f9741
e047487eb37-Abstract.html.
Stephen E. Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25
and beyond. Found. Trends Inf. Retr., 3(4):333–389. URL: https://doi.org/10.1561/
1500000019.
Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Barua, Aaron Phillips, and Yinfei Yang. 2020.
LAReQA: Language-agnostic answer retrieval from a multilingual pool. In Proceedings of
the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
5919–5930, Online. Association for Computational Linguistics. URL: https://aclantho
logy.org/2020.emnlp-main.477.
Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu,
Junjie Hu, Dan Garrette, Graham Neubig, and Melvin Johnson. 2021. XTREME-R: Towards
more challenging and nuanced multilingual evaluation. In Proceedings of the 2021 Conference
on Empirical Methods in Natural Language Processing, pages 10215–10245, Online and Punta
Cana, Dominican Republic. Association for Computational Linguistics. URL: https://acla
nthology.org/2021.emnlp-main.802.
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019a. Transfer
learning in natural language processing. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,
Minneapolis, Minnesota. Association for Computational Linguistics. URL: https://aclant
hology.org/N19-5004.
Sebastian Ruder, Anders Søgaard, and Ivan Vulić. 2019b. Unsupervised cross-lingual representation
learning. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics: Tutorial Abstracts, pages 31–38, Florence, Italy. Association for Computational
Linguistics. URL: https://aclanthology.org/P19-4007.
Jacques Savoy and Martin Braschler. 2019. Lessons Learnt from Experiments on the Ad Hoc
Multilingual Test Collections at CLEF, pages 177–200. Springer International Publishing, Cham.
URL: https://doi.org/10.1007/978-3-030-22948-1_7.
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for
face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 815–823. IEEE Computer Society. URL:
https://doi.org/10.1109/CVPR.2015.7298682.
Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2019a. Cross-lingual transfer
learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pages 3795–3805, Minneapolis, Minnesota.
Association for Computational Linguistics. URL: https://aclanthology.org/N19-1
380.
Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019b. Cross-lingual alignment of
contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–
1613, Minneapolis, Minnesota. Association for Computational Linguistics. URL: https:
//aclanthology.org/N19-1162.
Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye
Teh, Razvan Pascanu, and Raia Hadsell. 2018. Progress & compress: A scalable framework for
continual learning. In Proceedings of the 35th International Conference on Machine Learning,
ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings
of Machine Learning Research, pages 4535–4544. PMLR. URL: http://proceedings.
mlr.press/v80/schwarz18a.html.
Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight
languages. In Proceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
URL: https://aclanthology.org/L18-1560.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional
attention flow for machine comprehension. In International Conference on Learning Representations. URL: https://openreview.net/forum?id=HJ0UKP9ge.
Lei Sha, Feng Qian, Baobao Chang, and Zhifang Sui. 2018. Jointly extracting event triggers and
arguments by dependency-bridge RNN and tensor-based argument interaction. In Proceedings of
the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative
Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational
Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018,
pages 5916–5923. AAAI Press. URL: https://www.aaai.org/ocs/index.php/AAA
I/AAAI18/paper/view/16222.
Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. 2017. Continual learning with deep
generative replay. In Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA,
pages 2990–2999. URL: https://proceedings.neurips.cc/paper/2017/hash
/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html.
Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Ari, Jason Riesa, Ankur Bapna, Orhan Firat,
and Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual
neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence,
AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference,
IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence,
EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8854–8861. AAAI Press. URL:
https://aaai.org/ojs/index.php/AAAI/article/view/6414.
Samuel L. Smith, David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline
bilingual word vectors, orthogonal transformations and the inverted softmax. In 5th International
Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. URL: https://openreview.net/forum?i
d=r1Aab85gg.
Paul Smolen, Yili Zhang, and John Byrne. 2016. The right time to learn: Mechanisms and
optimization of spaced learning. Nature Reviews Neuroscience, 17:77–88.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and
Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment
treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
URL: https://aclanthology.org/D13-1170.
Anders Søgaard, Željko Agić, Héctor Martínez Alonso, Barbara Plank, Bernd Bohnet, and Anders
Johannsen. 2015. Inverted indexing for cross-lingual NLP. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pages 1713–1722, Beijing, China. Association for Computational Linguistics. URL: https://aclanthology.org/P15-1165.
Anders Søgaard, Ivan Vulić, Sebastian Ruder, and Manaal Faruqui. 2019. Cross-Lingual Word
Embeddings. Synthesis Lectures on Human Language Technologies. Morgan & Claypool
Publishers. URL: https://doi.org/10.2200/S00920ED2V01Y201904HLT042.
Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. 2020. LAMOL: language modeling for lifelong
language learning. In 8th International Conference on Learning Representations, ICLR 2020,
Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. URL: https://openreview
.net/forum?id=Skgxcn4YDS.
Weiting Tan, Kevin Heffernan, Holger Schwenk, and Philipp Koehn. 2023. Multilingual representation distillation with contrastive learning. In Proceedings of the 17th Conference of the European
Chapter of the Association for Computational Linguistics, pages 1477–1490, Dubrovnik, Croatia.
Association for Computational Linguistics. URL: https://aclanthology.org/2023.
eacl-main.108.
Ishan Tarunesh, Sushil Khyalia, Vishwajeet Kumar, Ganesh Ramakrishnan, and Preethi Jyothi.
2021. Meta-learning for effective multi-task and multilingual modelling. In Proceedings of
the 16th Conference of the European Chapter of the Association for Computational Linguistics:
Main Volume, pages 3600–3612, Online. Association for Computational Linguistics. URL:
https://aclanthology.org/2021.eacl-main.314.
Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Utku Evci, Kelvin Xu, Ross
Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. 2020.
Meta-dataset: A dataset of datasets for learning to learn from few examples. In 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
OpenReview.net. URL: https://openreview.net/forum?id=rkgAGAVKPr.
Shyam Upadhyay, Manaal Faruqui, Gökhan Tür, Dilek Hakkani-Tür, and Larry P. Heck. 2018.
(almost) zero-shot cross-lingual spoken language understanding. In 2018 IEEE International
Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada,
April 15-20, 2018, pages 6034–6038. IEEE. URL: https://doi.org/10.1109/ICASSP
.2018.8461905.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems 30: Annual Conference on Neural Information Processing
Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008. URL: https:
//proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd05
3c1c4a845aa-Abstract.html.
Gido van de Ven, Hava Siegelmann, and Andreas Tolias. 2020. Brain-inspired replay for continual
learning with artificial neural networks. Nature Communications, 11:4069.
Gido M. van de Ven, Tinne Tuytelaars, and Andreas S. Tolias. 2022. Three types of incremental
learning. Nature Machine Intelligence, 4(12):1185–1197. URL: https://doi.org/10.1
038/s42256-022-00568-3.
Ivan Vulić and Marie-Francine Moens. 2013. Cross-lingual semantic similarity of words as the
similarity of their semantic word responses. In Proceedings of the 2013 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 106–116, Atlanta, Georgia. Association for Computational Linguistics. URL:
https://aclanthology.org/N13-1011.
Christopher Walker. 2006. ACE 2005 multilingual training corpus LDC2006T06. Linguistic Data Consortium, Philadelphia, United States of America. URL: https://catalog.ldc.upenn.edu/LDC2006T06.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018.
GLUE: A multi-task benchmark and analysis platform for natural language understanding. In
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
URL: https://aclanthology.org/W18-5446.
Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. 2019. Cross-lingual BERT
transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP), pages 5721–5727, Hong Kong, China.
Association for Computational Linguistics. URL: https://aclanthology.org/D19-1
575.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus
for sentence understanding through inference. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association
for Computational Linguistics. URL: https://aclanthology.org/N18-1101.
Genta Indra Winata, Samuel Cahyawijaya, Zhaojiang Lin, Zihan Liu, Peng Xu, and Pascale Fung.
2020. Meta-transfer learning for code-switched speech recognition. In Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pages 3770–3776, Online.
Association for Computational Linguistics. URL: https://aclanthology.org/2020.
acl-main.348.
Maciej Wolczyk, Michal Zajac, Razvan Pascanu, Lukasz Kucinski, and Piotr Milos. 2021. Continual
world: A robotic benchmark for continual reinforcement learning. In Advances in Neural
Information Processing Systems 34: Annual Conference on Neural Information Processing
Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28496–28510. URL: https:
//proceedings.neurips.cc/paper/2021/hash/ef8446f35513a8d6aa230
8357a268a7e-Abstract.html.
Qianhui Wu, Zijia Lin, Guoxin Wang, Hui Chen, Börje F. Karlsson, Biqing Huang, and Chin-Yew
Lin. 2020. Enhanced meta-learning for cross-lingual named entity recognition with minimal
resources. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The
Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth
AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY,
USA, February 7-12, 2020, pages 9274–9281. AAAI Press. URL: https://aaai.org/ojs
/index.php/AAAI/article/view/6466.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness
of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Computational Linguistics. URL:
https://aclanthology.org/D19-1077.
Yonghui Wu, Mike Schuster, Zhifeng Chen, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. ArXiv preprint, abs/1609.08144.
URL: https://arxiv.org/abs/1609.08144.
Congying Xia, Chenwei Zhang, Xiaohui Yan, Yi Chang, and Philip Yu. 2018. Zero-shot user intent
detection via capsule neural networks. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 3090–3099, Brussels, Belgium. Association for
Computational Linguistics. URL: https://aclanthology.org/D18-1348.
Yubei Xiao, Ke Gong, Pan Zhou, Guolin Zheng, Xiaodan Liang, and Liang Lin. 2021. Adversarial
meta sampling for multilingual low-resource speech recognition. In Thirty-Fifth AAAI Conference
on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of
Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial
Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 14112–14120. AAAI Press.
URL: https://ojs.aaai.org/index.php/AAAI/article/view/17661.
Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. 2018. Neural
cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 369–379, Brussels,
Belgium. Association for Computational Linguistics. URL: https://aclanthology.org
/D18-1034.
Weijia Xu, Batool Haider, and Saab Mansour. 2020. End-to-end slot alignment and recognition for
cross-lingual NLU. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 5052–5063, Online. Association for Computational
Linguistics. URL: https://aclanthology.org/2020.emnlp-main.410.
Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil.
2020. Multilingual universal sentence encoder for semantic retrieval. In Proceedings of
the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94, Online. Association for Computational Linguistics. URL: https:
//aclanthology.org/2020.acl-demos.12.
Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual learning through synaptic
intelligence. In Proceedings of the 34th International Conference on Machine Learning, ICML
2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning
Research, pages 3987–3995. PMLR. URL: http://proceedings.mlr.press/v70/ze
nke17a.html.
Min Zhang, Donglin Wang, and Sibo Gai. 2020. Knowledge distillation for model-agnostic meta-learning. In ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including
10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020), volume 325
of Frontiers in Artificial Intelligence and Applications, pages 1355–1362. IOS Press. URL:
https://doi.org/10.3233/FAIA200239.
Tongtao Zhang and Heng Ji. 2018. Event extraction with generative adversarial imitation learning.
ArXiv preprint, abs/1804.07881. URL: https://arxiv.org/abs/1804.07881.
Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2021. Inducing language-agnostic multilingual representations. In Proceedings of *SEM 2021: The Tenth Joint Conference
on Lexical and Computational Semantics, pages 229–240, Online. Association for Computational
Linguistics. URL: https://aclanthology.org/2021.starsem-1.22.
Tao Zhong, Zhixiang Chi, Li Gu, Yang Wang, Yuanhao Yu, and Jin Tang. 2022. Meta-dmoe:
Adapting to domain shift by meta-distillation from mixture-of-experts. In NeurIPS. URL:
http://papers.nips.cc/paper_files/paper/2022/hash/8bd4f1dbc7a70
c6b80ce81b8b4fdc0b2-Abstract-Conference.html.
Wangchunshu Zhou, Canwen Xu, and Julian McAuley. 2022. BERT learns to teach: Knowledge
distillation with meta learning. In Proceedings of the 60th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 7037–7049, Dublin, Ireland.
Association for Computational Linguistics. URL: https://aclanthology.org/2022.
acl-long.485.
Jeffrey Zhu, Mingqin Li, Jason Li, and Cassandra Odoula. 2021. Bing delivers more contextualized
search using quantized transformer inference on nvidia gpus in azure. URL: https://blogs.
bing.com/Engineering-Blog/october-2021/Bing-delivers-more-conte
xtualized-search-using-quantized-transformer-inference-on-NVIDI
A-GPUs-in-Azu.
Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations
parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources
and Evaluation (LREC’16), pages 3530–3534, Portorož, Slovenia. European Language Resources
Association (ELRA). URL: https://aclanthology.org/L16-1561.
Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second BUCC
shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th
Workshop on Building and Using Comparable Corpora, pages 60–67, Vancouver, Canada.
Association for Computational Linguistics. URL: https://aclanthology.org/W17-2
512.
Appendix A
Cross-lingual Meta-Learning
In this Appendix, we present the full evaluation results on our in-house intent classification dataset and on TyDiQA in Sections A.1 and A.2, respectively.
A.1 Results on In-House Intent Classification Dataset
We perform an extensive evaluation including additional languages for intent classification, using an in-house Adobe dataset covering 6 target languages in addition to English in the Jarvis tool.1
Statistics of the train/dev/test splits are shown in Table A.1. Table A.2 shows better performance for X-METRA, with an average cross-lingual gain of 13.5% in accuracy over PRE. We notice that few-shot learning on the language of interest leads to the best performance, as indicated by the higher numbers on the diagonal of the confusion matrix. Evaluation on more languages also reveals affinities between languages from the same family. In addition, we notice that languages like Japanese and Korean help each other: few-shot learning on one improves zero-shot performance on the other, by margins of 15.6 points on Korean and 5.6 points on Japanese.
Lang Train Dev Test
English 5,438 1,814 1,814
German 1,570 526 526
French 1,082 362 362
Italian 1,082 362 362
Portuguese 1,150 386 386
Japanese 1,070 358 358
Korean 938 314 314
Table A.1: Statistics of In-House multilingual intent classification Dataset per language and split.
Model Train on
Test on
DE FR IT PT JA KR
PRE EN 19.1 30.0 30.1 26.1 14.6 5.1
X-METRA
DE 34.3 33.3 30.0 30.2 13.5 8.9
FR 19.2 34.1 29.9 29.1 5.8 9.0
IT 18.3 32.2 44.4 30.2 6.7 10.2
PT 19.1 27.7 30.1 31.4 5.8 9.0
JA 24.1 25.7 33.2 26.1 30.9 20.7
KR 24.4 25.6 34.4 25.0 20.2 30.7
Table A.2: X-METRA results on the in-house multilingual intent classification dataset.
Best results are highlighted in bold for each test language.
1Access to this dataset was granted as part of an internship at Adobe Research.
A.2 Full Results for QA
Tables A.3 and A.4 show the full results for F1 and Exact Match (EM) metrics for QA respectively.
Model
Test on
AR BN FI ID RU SW TE
MONO
AR 74.0 ±1.1 30.1 ±2.4 50.0 ±0.8 59.5 ±1.3 48.4 ±0.8 50.8 ±1.7 24.1 ±2.7
BN 32.2 ±2.6 38.9 ±0.8 33.9 ±1.4 36.3 ±1.5 31.8 ±1.4 37.2 ±1.8 34.7 ±4.2
FI 54.2 ±2.5 30.7 ±1.3 63.3 ±1.5 52.5 ±1.7 43.0 ±2.1 48.6 ±1.7 28.7 ±2.8
ID 58.0 ±1.8 31.8 ±0.5 48.2 ±2.0 67.1 ±1.9 45.1 ±1.8 50.3 ±1.8 29.4 ±2.7
RU 50.9 ±2.3 34.5 ±2.1 45.2 ±4.2 52.0 ±4.0 54.4 ±1.3 47.1 ±2.1 30.7 ±2.5
SW 35.8 ±1.5 27.6 ±1.5 33.6 ±2.1 37.4 ±1.9 25.7 ±1.7 60.3 ±1.2 13.2 ±2.3
TE 34.0 ±0.9 38.0 ±2.2 39.5 ±0.6 35.3 ±1.1 35.9 ±1.1 43.5 ±1.0 61.4 ±1.0
FT
AR 77.0 ±0.3 36.8 ±2.9 58.8 ±0.6 67.0 ±2.7 60.9 ±0.8 52.4 ±3.6 32.0 ±1.0
BN 60.7 ±0.4 51.0 ±2.7 59.2 ±0.6 67.1 ±1.6 59.2 ±0.3 56.2 ±0.8 43.7 ±0.9
FI 60.3 ±1.9 36.7 ±1.3 70.9 ±0.4 65.7 ±1.4 62.1 ±0.5 50.9 ±1.3 36.4 ±3.6
ID 65.7 ±1.4 37.0 ±1.1 60.8 ±0.2 77.0 ±0.4 61.1 ±0.5 56.8 ±1.0 36.7 ±0.4
RU 60.9 ±2.5 37.2 ±2.0 59.0 ±2.1 66.8 ±1.3 64.8 ±0.4 55.2 ±1.8 36.8 ±1.3
SW 57.4 ±0.5 35.2 ±1.5 56.2 ±1.0 65.4 ±1.8 58.8 ±0.8 70.2 ±1.7 33.1 ±2.8
TE 54.0 ±3.2 39.1 ±2.1 54.8 ±2.3 63.5 ±2.6 58.1 ±0.9 56.9 ±1.8 65.4 ±0.6
X-METRA
AR 78.4 ±0.6 33.0 ±0.8 58.2 ±0.2 66.4 ±1.4 59.9 ±0.1 53.2 ±3.8 31.4 ±3.0
BN 56.9 ±3.2 53.2 ±0.5 56.7 ±1.4 67.4 ±1.2 56.7 ±1.3 56.0 ±0.9 41.7 ±0.6
FI 58.9 ±0.6 33.6 ±1.1 72.8 ±0.3 61.9 ±2.0 60.7 ±0.9 46.5 ±1.2 36.6 ±1.7
ID 65.8 ±0.3 35.0 ±2.2 60.5 ±0.9 77.7 ±0.2 60.4 ±1.3 57.4 ±1.1 35.3 ±0.3
RU 60.3 ±1.6 37.2 ±0.7 59.1 ±0.3 66.8 ±0.8 66.2 ±0.1 53.7 ±0.8 33.2 ±3.1
SW 58.5 ±0.0 36.9 ±1.2 56.0 ±0.2 64.8 ±0.7 58.4 ±0.4 71.9 ±0.2 33.7 ±1.5
TE 56.0 ±3.0 38.8 ±0.1 53.6 ±1.7 61.1 ±1.9 58.6 ±0.6 55.8 ±0.2 66.4 ±0.5
X-METRA-ADA
AR 76.6 ±0.1 49.6 ±1.3 63.4 ±0.4 70.9 ±0.1 60.1 ±1.0 56.8 ±0.4 42.4 ±2.5
BN 59.4 ±0.3 57.8 ±0.6 59.2 ±0.2 63.1 ±0.2 56.5 ±0.2 56.1 ±0.3 44.1 ±0.4
FI 62.8 ±1.3 50.8 ±1.3 73.0 ±0.3 65.5 ±1.2 60.1 ±0.4 54.9 ±0.3 42.5 ±0.5
ID 66.7 ±0.3 49.9 ±0.5 62.6 ±0.7 77.3 ±0.1 58.3 ±0.9 58.1 ±0.6 42.6 ±0.4
RU 62.2 ±0.7 47.6 ±1.6 63.1 ±0.2 63.4 ±0.9 66.9 ±0.1 56.0 ±1.1 43.3 ±1.2
SW 59.1 ±0.7 49.1 ±1.1 58.1 ±0.2 62.1 ±1.0 54.6 ±0.6 70.3 ±0.2 43.2 ±0.7
TE 58.2 ±2.8 52.1 ±1.7 61.5 ±1.0 62.0 ±0.5 58.2 ±0.5 59.7 ±1.4 72.8 ±0.1
Table A.3: Full F1 results on TyDiQA-GoldP comparing X-METRA and X-METRA-ADA to
monolingual (MONO) and fine-tuning (FT) baselines. Best results for each language across all
models are in bold whereas the second best results are underlined.
Model
Test on
AR BN FI ID RU SW TE
MONO
AR 57.5 ±1.5 19.7 ±2.9 35.1 ±1.0 44.2 ±1.3 25.2 ±0.9 33.8 ±1.4 14.9 ±1.7
BN 17.1 ±1.4 24.5 ±2.9 17.5 ±0.4 20.8 ±2.0 14.4 ±0.5 20.5 ±1.4 19.9 ±5.0
FI 33.7 ±4.0 15.6 ±1.6 49.8 ±1.3 35.3 ±2.3 21.4 ±1.4 26.1 ±9.9 16.5 ±3.9
ID 39.7 ±1.4 18.6 ±1.3 32.7 ±1.9 54.9 ±0.1 23.8 ±0.6 34.4 ±1.2 16.9 ±4.9
RU 30.8 ±1.9 26.3 ±4.9 29.7 ±2.4 34.9 ±4.0 37.9 ±1.6 30.7 ±3.1 19.9 ±1.9
SW 16.0 ±1.3 16.5 ±1.5 15.6 ±1.0 21.1 ±1.3 10.5 ±1.3 48.6 ±1.2 5.3 ±1.7
TE 18.8 ±2.0 26.3 ±1.5 23.8 ±2.6 21.6 ±2.5 20.4 ±1.2 26.7 ±1.7 46.3 ±1.1
FT
AR 61.3 ±1.0 26.5 ±4.4 43.1 ±1.0 52.2 ±2.0 37.9 ±2.5 35.6 ±3.3 21.0 ±3.0
BN 42.2 ±0.9 38.0 ±4.4 44.8 ±1.2 51.5 ±2.2 36.8 ±1.6 37.2 ±1.7 27.3 ±0.2
FI 43.2 ±1.8 23.6 ±1.1 56.5 ±0.6 50.8 ±2.1 40.5 ±0.8 33.5 ±1.2 20.7 ±3.3
ID 49.4 ±1.6 23.3 ±2.4 46.4 ±0.4 63.8 ±0.5 40.5 ±0.1 38.1 ±2.1 24.1 ±0.5
RU 42.6 ±2.6 24.8 ±3.3 43.5 ±2.0 52.4 ±2.3 46.5 ±0.4 37.6 ±1.5 24.5 ±1.3
SW 38.9 ±0.6 23.0 ±1.4 40.1 ±1.4 50.0 ±1.7 38.0 ±0.8 59.0 ±3.1 23.5 ±1.4
TE 36.1 ±2.2 30.0 ±2.3 40.0 ±2.5 49.4 ±2.1 38.6 ±0.9 39.0 ±1.7 49.2 ±0.5
X-METRA
AR 63.3 ±0.8 21.2 ±1.9 42.6 ±1.0 51.8 ±1.2 34.9 ±1.1 36.0 ±3.5 20.9 ±1.7
BN 29.2 ±16.5 39.0 ±1.9 41.9 ±1.6 51.1 ±1.7 34.1 ±0.4 37.1 ±1.4 25.6 ±0.2
FI 42.0 ±1.0 20.4 ±0.7 59.1 ±1.1 46.0 ±2.7 36.8 ±1.3 30.9 ±0.6 22.5 ±0.9
ID 54.8 ±7.9 20.1 ±1.5 46.1 ±1.2 65.2 ±0.5 38.5 ±1.9 39.6 ±0.8 23.1 ±1.4
RU 42.9 ±1.3 26.5 ±1.2 43.0 ±0.6 53.0 ±0.1 48.9 ±0.4 35.3 ±1.0 21.6 ±2.4
SW 39.9 ±0.4 26.0 ±1.1 40.0 ±0.7 50.3 ±0.4 38.0 ±0.9 61.4 ±0.4 23.9 ±0.7
TE 38.0 ±3.9 28.3 ±0.0 37.0 ±2.3 47.6 ±3.4 36.3 ±0.5 36.9 ±1.2 49.7 ±0.5
X-METRA-ADA
AR 55.0 ±0.3 36.0 ±3.0 43.8 ±0.5 55.2 ±0.5 35.4 ±2.6 40.0 ±0.2 31.9 ±2.2
BN 38.4 ±0.3 41.0 ±0.8 43.5 ±0.4 46.7 ±0.1 32.4 ±0.4 37.9 ±0.4 33.8 ±0.7
FI 40.9 ±1.1 34.2 ±1.1 57.9 ±1.0 49.0 ±1.4 35.3 ±0.3 38.0 ±0.7 30.0 ±1.0
ID 45.4 ±0.4 33.9 ±1.1 47.6 ±0.4 63.4 ±0.4 36.3 ±0.9 43.4 ±0.8 31.9 ±0.2
RU 39.4 ±0.1 34.8 ±1.5 45.1 ±0.5 48.6 ±0.9 47.5 ±0.3 39.3 ±1.3 33.8 ±1.3
SW 36.7 ±0.6 36.3 ±1.4 42.5 ±0.5 45.8 ±1.4 32.4 ±0.7 59.6 ±0.5 33.8 ±1.0
TE 37.9 ±1.9 38.1 ±2.6 44.9 ±1.4 48.0 ±0.3 38.8 ±0.4 43.5 ±1.6 56.4 ±0.4
Table A.4: Full EM results on TyDiQA-GoldP comparing X-METRA and X-METRA-ADA to
monolingual (MONO) and fine-tuning (FT) baselines. Best results for each language across all
models are in bold whereas the second best results are underlined.
Appendix B
Multi-lingual Meta-Learning
In this Appendix, we include more experimental setup details (Section B.1) in addition to extensive
results on all language arrangements (Section B.2).
B.1 More Experimental Setup Details
In this Section, we include more experimental setup details including the language arrangements
and hyperparameters used for reproduction purposes.
B.1.1 Upstream Meta-Tasks
We detail in Table B.1 the arrangements of languages for the different meta-tasks used in the
meta-training Dmeta-train, meta-validation Dmeta-valid, and meta-testing Dmeta-test datasets. To make the
comparison fair and consistent across different transfer modes, we use the same combination of
languages and tweak them to fit the transfer mode. By picking a high number of meta-tasks during
meta-training, meta-validation, and meta-testing, we make sure that all transfer modes are exposed
to the same number of questions and candidates. The Train and Dev splits are used to sample Dmeta-train and Dmeta-valid, respectively.
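To make the sampling concrete, the following is a minimal sketch of how a single meta-task could be drawn for a given transfer mode; the mode table, the `data` dictionary keyed by (question language, candidate language) pairs, and the triplet format are illustrative assumptions rather than the exact implementation.

```python
import random

# Illustrative subset of the arrangements in Table B.1 for LAReQA.
MODES = {
    "mono->mono":  {"support": [("EL", "EL")], "query": [("AR", "AR")]},
    "mono->bi":    {"support": [("EL", "EL")], "query": [("EL", "AR")]},
    "mono->multi": {"support": [("EL", "EL")], "query": [("EL", "AR"), ("EL", "EL")]},
}

def sample_meta_task(data, mode, n_support=8, n_query=4):
    """Draw one meta-task: a support set and a query set of triplets.

    `data[(x, y)]` is assumed to hold triplets whose question is in language x
    and whose candidate answers are in language y.
    """
    spec = MODES[mode]
    support_pair = random.choice(spec["support"])
    support = random.sample(data[support_pair], n_support)
    # The query set may mix several language pairs (curly-brace notation in Table B.1).
    query = [random.choice(data[random.choice(spec["query"])]) for _ in range(n_query)]
    return support, query
```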
B.1.2 Hyperparameters
Based on our prior investigation of different sentence-transformer models in Table B.2, we notice
that paraphrase-multilingual-mpnet-base-v21
, which maps sentences and paragraphs to a 768-
dimensional dense vector space, performs the best for LAReQA, so we use it in our S-BERT
experiments on that dataset. The good initial performance of this pre-trained model is not surprising
since it was trained on parallel data and is recommended for use in tasks like clustering or semantic
search. For pre-processing LAReQA and SQUADEN, we truncate/pad all questions to length 96
and all answer or negative candidates concatenated with their contexts to 256. For pre-processing
STSBMulti and STSBEN, we pad or truncate each sentence to fit the maximum length of 100.
1https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet
-base-v2.
Transfer Mode Phase
Support→Query/Support1→Support2→Query
LAReQA STSBMulti
mono→mono All
EL_EL→AR_AR
HI_HI→DE_DE
(EN_EN,AR_AR,ES_ES)→(EN_EN,AR_AR,ES_ES)
mono→bi All
EL_EL→EL_AR
HI_HI→HI_DE
[EN_EN,AR_AR,ES_ES]→[AR_EN,ES_EN,TR_EN]
mono→multi All
EL_EL→EL_{AR,EL}
HI_HI→HI_{DE,HI}
Not Applicable
bi→multi All
EL_AR→EL_{AR,EL}
HI_DE→HI_{DE,HI}
Not Applicable
mixt All
mono→mono
mono→bi
mono→multi
bi→multi
Not Applicable
trans
Meta-train mono→bi
Not Applicable
Meta-valid bi→multi
mono→bi→multi All
EL_EL→EL_AR→EL_{AR,EL,HI}
HI_HI→HI_DE→HI_{AR,DE,HI}
EN_EN→AR_EN→EN_{AR,EN,ES}
AR_AR→AR_ES→AR_{AR,EN,ES}
ES_ES→ES_AR→ES_{AR,EN,ES}
Table B.1: Arrangements of languages for the different modes of transfer and meta-learning stages
for two standard benchmark datasets LAReQA and STSBMulti. X→Y denotes transfer from an X
model (for example a monolingual model) used to sample the support set to a Y model (for example
bilingual model) used to sample the query set. We denote a support or query set in LAReQA by
x_y where x and y are the ISO language codes of the question and the candidate answers and x_y
in STSBMulti where x and y are the ISO language codes of sentence 1 and 2 respectively. We use
parenthesis to mean that the same language pairs cannot be used in both support and query sets,
brackets to denote non-exclusivity (or in other words the language pairs used as a support can also
be used as a query), and curled braces to mean the query set may be sampled from more than one
language. We do not experiment with mono→multi, bi→multi, mixt, and trans for STSBMulti, since
it is not a multilingual parallel benchmark, but we still experiment with mono→bi→multi using
machine-translated data in that case.
For both benchmarks, for the Fine-tune baselines, following XTREME-R, we use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 5e-5, an epsilon of 1e-8, and a weight decay of 0, with no decay on the bias and LayerNorm weights. We use a batch size of 8 triplets or
sentence pairs. For LAReQA, we sample 3 negative examples per anchor and then project those to
3 triplets with one negative example and use a margin of 1. In STSBMulti, we use just sets of sentence
pairs composed of one source and one target sentence each, where we don’t have negative examples
so we don’t need to flatten the dimensions of the negative examples. We sample 7,000, 2,000, and
1,000 meta-tasks in the meta-training, meta-validation, and meta-testing phases, respectively.
Sentence Transformers Model mAP@20
LASER 13.5 ± 0.7
LaBSE 48.7 ± 2.6
M-BERT+SQUADEN 37.9 ± 3.4
distilbert-multilingual-nli-stsb-quora-ranking 44.1 ± 0.9
use-cmlm-multilingual 36.8 ± 2.6
distiluse-base-multilingual-cased-v2 46.9 ± 2.5
paraphrase-multilingual-MiniLM-L12-v2 49.6 ± 2.7
multi-qa-distilbert-dot-v1 6.4 ± 0.3
paraphrase-multilingual-mpnet-base-v2 57.0 ± 2.9
Table B.2: mAP@20 multilingual 5-fold cross-validation evaluation of different S-BERT models compared to the M-BERT model. Best results are highlighted in bold.
We use meta-batches of size 4. In each meta-task, we randomly sample 8 support and 4 query triplets, respectively. We use the same meta-tasks and sampling regime for Fine-tune as well.
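As a minimal sketch of the Fine-tune optimization setup just described (the parameter grouping and the margin follow the description above; the encoder variable and the use of PyTorch's generic triplet loss are assumptions, not the exact training code):

```python
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Weight decay of 0 overall, and explicitly none on bias and LayerNorm weights.
no_decay = ("bias", "LayerNorm.weight")
param_groups = [
    {"params": [p for n, p in encoder.named_parameters()
                if not any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    {"params": [p for n, p in encoder.named_parameters()
                if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = torch.optim.AdamW(param_groups, lr=5e-5, eps=1e-8)

# LAReQA triplets: anchor question, positive answer (+context), negative candidate (+context).
triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)
```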
For MAML and MAML-Align in both benchmarks, we use the learn2learn (Arnold et al., 2020) implementation to handle gradient updates, especially in the inner loop. For the inner loop, we use the learn2learn pre-built optimizer with a learning rate of 1e-3. The inner loop is repeated 5 times for meta-training, meta-validation, and meta-testing. For the outer loop, we use the same optimizer as in the Fine-tune model with a learning rate of 1e-5. At the end of each epoch, we perform meta-validation similarly to meta-training with the same hyperparameters described before. We use the same hyperparameters for MAML-Align for both T-MAML and S-MAML, except that we run the inner-loop gradient update in S-MAML just once, whereas for T-MAML we perform 5 inner-loop gradient updates. We jointly optimize the outer-loop losses, weighting the knowledge distillation term by 0.5. For a consistent comparison, we do not use meta-testing in our main evaluation, since we rely on the standard cross-validation test splits, but we keep the meta-testing datasets to encourage future work on few-shot learning.
encourage future work on few-shot learning. All experiments are run for one fixed initialization
seed using a 5-fold cross-validation. We observe a variance with respect to different seeds smaller
than the variance with respect to 5-fold cross-validation, so we report the latter to have a better
upper bound of the variance.
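For reference, below is a simplified sketch of the inner/outer loop with learn2learn using the hyperparameters above; the task iterator and the loss helper are placeholders (for example, the encoder and triplet loss from the previous sketch), and the real implementation additionally handles MAML-Align's teacher/student losses.

```python
import torch
import learn2learn as l2l

def meta_train(encoder, meta_batches, loss_on,
               inner_lr=1e-3, outer_lr=1e-5, inner_steps=5):
    """`meta_batches` yields meta-batches (size 4) of (support, query) sets and
    `loss_on(model, batch)` returns a scalar loss (e.g., the triplet loss above)."""
    maml = l2l.algorithms.MAML(encoder, lr=inner_lr, first_order=False)
    outer_opt = torch.optim.AdamW(maml.parameters(), lr=outer_lr)
    for meta_batch in meta_batches:
        outer_opt.zero_grad()
        for support, query in meta_batch:
            learner = maml.clone()                  # task-specific copy of the shared init
            for _ in range(inner_steps):            # 5 inner-loop updates
                learner.adapt(loss_on(learner, support))
            loss_on(learner, query).backward()      # accumulate outer-loop gradients
        outer_opt.step()
    return maml
```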
All experiments are conducted on the same computing infrastructure using one NVIDIA A40 GPU with 46068 MiB of memory and one Tesla P100-PCIE with 16384 MiB of memory, each with CUDA version 11.6. We use PyTorch version 1.11.1, Python version 3.8.13, learn2learn version 0.1.7, Hugging Face transformers version 4.21.3, and Sentence-Transformers 2.2.2. The paraphrase-multilingual-mpnet-base-v2 model used in the experiments in Section 3.2 has 278,043,648 parameters. The asymmetric and symmetric semantic search benchmarks use three and two encoding towers, respectively, and therefore 834,130,944 and 556,087,296 parameters, respectively.
For all experiments and model variants, we train for at most 20 epochs and implement early stopping: we run the experiment for as long as there is an improvement on the Dev set performance, and stop after 50 mini meta-task batches without improvement. We use the multilingual Dev performance averaged over all languages of the query set as the early stopping criterion. Based on this early stopping policy, we report in Table B.3 the typical runtime for each upstream model variant and baseline.
Model Runtime
Fine-tune 2 h 18 min
MAML 3 h 19 min
MAML-Align 19 h 29 min
Table B.3: Runtime per model variant excluding evaluation.
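The early-stopping policy described above amounts to a patience counter over mini meta-task batches; a minimal sketch, where the training and Dev-evaluation callbacks are assumptions:

```python
def train_with_early_stopping(meta_batches, run_meta_batch, eval_dev, patience=50):
    """Stop after `patience` meta-batches without improvement of the Dev
    performance averaged over all query-set languages."""
    best, stale = float("-inf"), 0
    for batch in meta_batches:        # at most 20 epochs' worth of meta-batches
        run_meta_batch(batch)
        score = eval_dev()            # multilingual Dev performance (e.g., mAP@20)
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```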
B.2 More Results
Tables B.4 and B.5 show full fine-grained results for all languages and language pairs for both
semantic search benchmarks.
Model
Train Language(s)
Configuration
Testing Languages
Few-Shot Languages Zero-Shot Languages
Arabic German Greek Hindi Russian Thai Turkish
Mean
AR DE EL HI RU TH TR
Zero-Shot Baselines
LASER - 13.2 15.1 14.6 9.4 14.9 13.0 14.1 13.5
LaBSE - 44.7 47.9 53.0 53.4 53.1 49.8 48.1 50.0
S-BERT - 56.3 54.6 58.2 57.2 58.7 60.2 54.1 57.0
+Few-Shot Learning
S-BERT+Fine-tune
mono→mono 45.9 46.3 47.9 45.4 48.9 49.7 45.1 47.0
mono→bi 45.8 46.5 48.6 45.0 48.9 49.4 45.0 47.0
mono→multi 40.4 42.5 43.1 37.8 44.1 44.3 41.1 41.9
bi→multi 33.8 35.6 35.2 32.4 37.1 37.2 34.4 35.1
mixt 38.3 39.8 40.7 39.3 41.9 41.7 38.7 40.1
trans 38.7 39.9 41.8 40.1 42.6 42.6 39.4 40.7
S-BERT+MAML
mono→mono 56.3 54.5 58.5 57.0 59.3 59.6 53.8 57.0
mono→bi 55.9 55.0 58.4 56.9 58.8 59.9 54.2 57.0
mono→multi 54.9 53.6 57.0 55.8 57.7 58.7 53.1 55.9
bi→multi 54.5 53.6 56.6 55.5 57.3 58.5 52.8 55.5
mixt 55.0 53.9 57.2 55.3 57.6 58.7 52.9 55.8
trans 56.0 54.8 59.1 57.0 59.1 59.9 54.4 57.2
S-BERT+MAML-Align mono→bi→multi 57.0 55.1 59.2 57.7 59.5 60.2 54.6 57.6
+Machine Translation
S-BERT+T-Train+Fine-tune
AR_AR→AR_AR 46.6 45.8 48.8 46.8 49.3 48.6 44.9 47.3
DE_DE→DE_DE 45.9 45.1 48.2 45.8 49.0 48.8 44.5 46.8
EL_EL→EL_EL 43.5 43.1 43.8 43.4 46.5 45.0 41.7 43.8
HI_HI→HI_HI 46.5 44.8 47.1 45.9 48.4 49.6 43.7 46.6
All test languages 44.8 43.5 46.9 44.0 47.0 46.4 42.1 45.0
S-BERT+T-Train+MAML
AR_AR→AR_AR 57.3 55.3 59.3 58.3 60.2 60.7 54.8 58.0
DE_DE→DE_DE 56.1 54.4 58.3 57.1 58.8 59.8 54.1 56.9
EL_EL→EL_EL 55.9 53.1 57.4 56.3 58.5 59.2 52.8 56.2
HI_HI→HI_HI 56.7 54.0 58.5 57.1 58.9 60.3 53.7 57.0
All test languages 55.9 53.8 58.0 56.6 58.1 59.2 53.4 56.4
Table B.4: mAP@20 multilingual 5-fold cross-validated performance tested for different languages.
Best and second-best results for each language are highlighted in bold and italicized respectively,
whereas best results across categories of models are underlined. Gains from meta-learning approaches are consistent across few-shot and zero-shot languages.
Model
Train Language(s)
Configuration
Testing Languages
AR-AR AR-EN ES-ES ES-EN EN-EN TR-EN Mean
Zero-Shot Learning
LASER - 22.5 ± 8.5 21.6 33.1 15.3 31.1 21.2 24.1
LaBSE - 71.6 73.2 83.2 68.7 76.3 74.9 74.6
S-BERT - 77.6 81.3 84.6 83.7 85.5 75.7 81.4
+Few-Shot learning
S-BERT+Fine-tune mono→bi 77.2 77.8 86.2 79.6 85.0 73.7 79.9
S-BERT+MAML mono→bi 77.6 80.9 85.1 83.5 85.6 75.5 81.3
S-BERT+MAML-Align mono→bi→multi 79.0 80.6 86.6 81.5 90.6 76.3 82.4
+Machine Translation
S-BERT+T-Train+Fine-tune
AR_AR→AR_AR 59.5 50.6 82.7 70.1 82.4 62.5 68.0
EN_EN→EN_EN 72.6 73.1 82.4 72.2 80.3 68.8 74.9
ES_ES→ES_ES 74.2 72.3 82.3 66.8 79.7 68.5 73.9
TR_TR→TR_TR 73.9 74.6 85.9 79.6 84.3 68.5 77.8
All test languages 65.8 63.0 82.5 75.8 83.0 67.8 73.0
S-BERT+T-Train+MAML
AR_AR→AR_AR 75.5 80.5 85.8 83.1 85.6 75.0 80.9
EN_EN→EN_EN 77.8 81.7 85.1 83.8 85.7 75.8 81.6
ES_ES→ES_ES 76.4 79.4 86.9 80.4 84.7 74.1 80.3
TR_TR→TR_TR 77.2 79.8 87.3 81.6 84.5 74.2 80.8
All test languages 77.6 81.8 84.7 83.6 85.6 75.9 81.5
Table B.5: Pearson correlation (Pearson's r × 100) 5-fold cross-validated performance on the STSBMulti benchmark using different models few-shot learned on STSBMulti or its translation. Best and second-best results for each language are highlighted in bold and italicized respectively, whereas best results across categories of models are underlined.
Appendix C
Cross-lingual Continual Learning
In this Appendix, we include exhaustive results using Bootstrap sampling (Section C.1) and different initialization seeds (Section C.2). These results serve as the basis for the statistical significance tests.
C.1 More Results & Analysis using Boostrap Sampling
In this Section, we present more results on the main experiments using Bootstrap sampling. We
also include more ablation studies for M-BERT components, memory size in ER, and fine-grained
analysis over different languages and language permutations.
C.1.1 Full Average Results
Table C.1 shows the full results and confidence intervals for different continual learning approaches. Compared to intent classification, we observe higher forgetting and slightly higher transfer but lower zero-shot transfer and final performance in the case of slot filling. This could be due to the nature of slot filling, which is more challenging to learn. In general, we observe the same forgetting, transfer, zero-shot transfer, and final performance trends for intent classification and slot filling. In other words, if a model has higher forgetting than another model on intent classification, the same holds for slot filling. This could be due to the transfer between intent classification and slot filling, which is maximized when training them jointly. The best model for transfer is Lang-Spec Ada(F), which we hypothesize is due to its lightweight adaptation to the current language, which makes it overfit to that language at the cost of a lower average and final performance overall.
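To make the reported metrics easier to read, here is a minimal sketch of one standard way (GEM-style) to compute forgetting, zero-shot (forward) transfer, and final performance from a matrix of per-hop scores; Chapter 4's exact formulas may differ in detail, so this is only an illustration.

```python
import numpy as np

def cl_metrics(R, base):
    """R[i, j]: test score on language j right after training on the i-th language
    of the stream (0-indexed); base[j]: score of the model before any training.
    Returns (forgetting, zero-shot transfer, final performance)."""
    N = R.shape[0]
    final_perf = R[N - 1].mean()
    forgetting = np.mean([R[:N - 1, j].max() - R[N - 1, j] for j in range(N - 1)])
    # Forward transfer: how much earlier hops help a language not yet trained on.
    zero_shot = np.mean([R[j - 1, j] - base[j] for j in range(1, N)])
    return forgetting, zero_shot, final_perf
```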
C.1.2 Per M-BERT Components Analysis
Table C.2 shows ablation studies over M-BERT components in four categories: all 12 layers with or without the embeddings, and groups of 3, 6, or 9 layers at a time trained in a language-specific manner while the rest are shared between languages.
Model | F ↓ (Acc, F1) | T ↑ (Acc, F1) | 0 ↑ (Acc, F1) | FP ↑ (Acc, F1)
Shared {Trans, Task} Baselines
Naive Seq FT 2.93 ±1.24 5.67 ±0.93 0.68 ±0.14 1.37 ±0.53 50.24 ±3.43 36.32 ±1.91 91.06 ±1.08 69.37 ±1.06
Lang-Spec FT 93.40 ±0.08 73.90 ±0.83
Lang-Spec FT + Ada(T) 93.04 ±0.09 72.90 ±0.80
Lang-Spec FT + Ada(F) 88.79 ±0.13 67.46 ±0.89
Inc Joint 0.11 ±0.10 0.91 ±0.34 0.52 ±0.19 0.83 ±0.77 50.07 ±2.48 36.39 ±2.60 94.16 ±0.18 74.88 ±0.38
multi 94.25 ±0.07 76.34 ±0.82
Model Expansion Baselines
Lang-Spec Trans 0.49 ±0.08 1.32 ±0.23 0.23 ±0.21 0.95 ±0.21 -0.43 ±0.16 0.42 ±0.06 93.51 ±0.18 74.74 ±0.20
Lang-Spec Enc[0-8] 0.78 ±0.15 1.95 ±0.51 0.80 ±0.19 1.44 ±0.71 24.23 ±1.73 12.32 ±1.24 93.50 ±0.21 74.19 ±0.92
Lang-Spec Task 2.91 ±1.26 5.26 ±1.01 0.66 ±0.18 1.15 ±1.15 0.10 ±0.25 0.07 ±0.02 90.86 ±1.46 69.41 ±1.57
Lang-Spec Ada(T) 2.19 ±1.12 4.23 ±1.26 0.98 ±0.18 2.04 ±0.92 49.35 ±3.64 33.60 ±2.98 91.75 ±1.39 71.13 ±1.68
Lang-Spec Ada(F) 1.20 ±0.35 3.35 ±0.85 2.82 ±0.33 3.93 ±0.68 6.52 ±2.16 2.80 ±0.59 90.36 ±0.37 68.55 ±1.10
Other Continuous Learning Algorithms
EWC 3.07 ±1.32 5.78 ±1.00 0.73 ±0.12 1.46 ±0.65 50.16 ±3.48 36.31 ±1.94 91.03 ±1.26 69.63 ±1.52
ER 1.29 ±0.51 3.06 ±0.59 0.75 ±0.17 1.47 ±0.85 50.71 ±3.55 36.91 ±2.14 93.09 ±0.29 73.00 ±0.52
KD-Logit 2.37 ±0.83 5.53 ±0.96 0.62 ±0.15 1.40 ±0.68 50.18 ±3.14 36.25 ±1.91 91.46 ±0.87 69.64 ±1.58
KD-Rep 2.29 ±0.80 5.35 ±0.69 0.69 ±0.20 1.43 ±0.59 50.41 ±2.92 36.26 ±1.96 91.69 ±0.71 70.03 ±1.09
Table C.1: A summary of results for different continual learning approaches, averaged across language orders. For each metric and score, we highlight the best score in bold and underline the second-best score.
We notice that the fully language-specific Lang-Spec Trans and Lang-Spec Enc[0-11] achieve the best forgetting and final performance. Training only the first 8 encoder layers (Lang-Spec Enc[0-8], excluding the embeddings) in a language-specific manner comes next, with low forgetting and a comparable final performance, together with relatively better transfer and zero-shot transfer. Other models that reach a good compromise between transfer, zero-shot transfer, and forgetting with fewer language-specific layers are Lang-Spec Enc[0-2] and Lang-Spec Enc[0-5]. Naive Seq FT is comparable to those model-expansion approaches in terms of zero-shot performance, but has a lower final performance and significantly higher forgetting. We also notice the same trend for the language-specific embeddings (Lang-Spec Embed), which reach the second-best zero-shot transfer performance but also high forgetting. This suggests that language-specific knowledge is encoded less in the embeddings and more in the encoder layers. It also shows that there is a real plasticity-stability tradeoff between zero-shot transfer and knowledge preservation (which we explain in more detail in Section 4.4.5).
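To illustrate what a language-specific group of layers means operationally, the sketch below keeps a private copy of a chosen range of M-BERT encoder layers per language and shares everything else; it is a simplification of the actual model-expansion implementation, and the class and argument names are assumptions.

```python
import copy
import torch.nn as nn
from transformers import BertModel

class LangSpecEncoderLayers(nn.Module):
    """Shared M-BERT with language-specific copies of encoder layers [lo, hi]."""

    def __init__(self, languages, lo=0, hi=8, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # One private copy of layers lo..hi per language; all other layers stay shared.
        self.private = nn.ModuleDict({
            lang: copy.deepcopy(self.bert.encoder.layer[lo:hi + 1]) for lang in languages
        })
        self.lo, self.hi = lo, hi

    def forward(self, lang, **inputs):
        # Swap the current language's private layers into the shared encoder.
        for offset, layer in enumerate(self.private[lang]):
            self.bert.encoder.layer[self.lo + offset] = layer
        return self.bert(**inputs)
```

In this simplified view, Naive Seq FT corresponds to having no private layers at all, whereas Lang-Spec Trans would make the entire Transformer (including the embeddings) private.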
Model | F ↓ (Acc, F1) | T ↑ (Acc, F1) | 0 ↑ (Acc, F1) | FP ↑ (Acc, F1)
Naive Seq FT 2.93 ±1.24 5.67 ±0.93 0.68 ±0.14 1.37 ±0.53 50.24 ±3.43 36.32 ±1.91 91.06 ±1.08 69.37 ±1.06
Lang-Spec FT 93.40 ±0.08 73.90 ±0.83
Lang-Spec Trans 0.49 ±0.08 1.32 ±0.23 0.23 ±0.21 0.95 ±0.21 -0.43 ±0.16 0.42 ±0.06 93.51 ±0.18 74.74 ±0.20
Lang-Spec Enc[0-11] 0.49 ±0.08 1.30 ±0.16 0.23 ±0.21 0.77 ±0.31 -0.31 ±0.18 0.57 ±0.09 93.52 ±0.12 74.51 ±0.25
Lang-Spec Embed 3.13 ±1.35 5.88 ±0.95 0.74 ±0.20 1.24 ±0.79 50.67 ±2.98 36.62 ±1.89 90.69 ±1.28 69.59 ±1.23
Lang-Spec Enc[0-2] 1.88 ±0.77 4.32 ±0.69 0.77 ±0.19 1.37 ±0.64 52.20 ±3.23 37.42 ±1.99 92.25 ±0.76 71.59 ±1.52
Lang-Spec Enc[3-5] 1.47 ±0.65 2.87 ±0.36 0.78 ±0.23 1.61 ±0.45 47.83 ±3.00 34.66 ±1.79 92.71 ±0.65 73.06 ±0.97
Lang-Spec Enc[6-8] 1.45 ±0.56 3.02 ±0.52 0.70 ±0.16 1.32 ±0.52 38.33 ±3.00 23.68 ±2.36 92.43 ±0.78 72.28 ±1.05
Lang-Spec Enc[9-11] 2.21 ±0.86 4.14 ±0.84 0.47 ±0.24 1.35 ±0.56 41.38 ±2.13 20.04 ±1.89 91.41 ±1.08 71.14 ±1.13
Lang-Spec Enc[0-5] 1.27 ±0.67 2.99 ±0.62 0.87 ±0.17 1.64 ±0.65 45.23 ±2.56 31.21 ±2.17 92.92 ±0.52 73.33 ±1.09
Lang-Spec Enc[6-11] 1.66 ±0.36 3.37 ±0.69 0.31 ±0.33 0.65 ±0.73 6.04 ±1.13 4.53 ±0.96 91.97 ±0.38 71.63 ±1.15
Lang-Spec Enc[0-8] 0.78 ±0.15 1.95 ±0.51 0.80 ±0.19 1.44 ±0.71 24.23 ±1.73 12.32 ±1.24 93.50 ±0.21 74.19 ±0.92
Lang-Spec Enc[9-11] 2.21 ±0.86 4.14 ±0.84 0.47 ±0.24 1.35 ±0.56 41.38 ±2.13 20.04 ±1.89 91.41 ±1.08 71.14 ±1.13
Table C.2: Per-layer-group analysis: ablation studies of different M-BERT components. Best, second-best, and third-best scores for each metric are in bold, underlined, and italicized, respectively.
C.1.3 Full Results on Language Permutations
Full results for all language permutations can be found in Tables C.3, C.4, and C.5. By looking at
additional language permutations, L2H (Thai → Spanish → Hindi → French → German → English)
is still the most challenging one in terms of knowledge preservation, accumulation, generalization,
and model utility. H2L (English → German → French → Hindi → Spanish → Thai) is still
the easiest to learn. Order 5(Hindi → English → Spanish → Thai → French → German) is the
second most challenging language permutation to train. In general, the same trends regarding
the more challenging nature of training for certain language permutations are observed for both
intent classification and slot filling uniformly. Table C.6 includes the results for more language
permutations for the balanced data.
C.1.4 Per Language Analysis
Tables C.7, C.8, and C.9 show the full results for forgetting, transfer, and zero-shot transfer, respectively, across different languages, averaged over language permutations. For Naive Seq FT, languages like English, German, French, and Spanish consistently show lower forgetting and higher zero-shot transfer than languages like Hindi and Thai, for both intent classification and slot filling, whereas for the reference model Inc Joint forgetting is low and nearly equal across languages. Approaches like Lang-Spec Trans, Lang-Spec Enc[0-8], Lang-Spec Ada(F), and to a certain degree ER also reduce that gap.
Model | H2L: F ↓, T ↑, 0 ↑, FP ↑ | L2H: F ↓, T ↑, 0 ↑, FP ↑
Test Intent Accuracy On
Shared {Trans, Task} Baselines
Naive Seq FT 1.52 ±0.02 0.93 ±0.02 50.68 ±0.03 92.06 ±0.02 5.52 ±0.04 0.57 ±0.01 44.66 ±0.02 88.80 ±0.02
Lang-Spec FT 93.40 ±0.08 93.40 ±0.08
Lang-Spec FT + Ada(T) 93.04 ±0.09 93.04 ±0.09
Lang-Spec FT + Ada(F) 88.79 ±0.13 88.79 ±0.13
Inc Joint -0.01 ±0.01 0.15 ±0.02 50.32 ±0.03 93.91 ±0.01 0.12 ±0.01 0.63 ±0.01 45.87 ±0.03 94.30 ±0.01
multi 94.25 ±0.07 94.25 ±0.07
Model Expansion Baselines
Lang-Spec Trans 0.40 ±0.01 0.59 ±0.02 -0.48 ±0.00 93.86 ±0.01 0.62 ±0.02 0.03 ±0.01 -0.54 ±0.00 93.37 ±0.01
Lang-Spec Enc[0-8] 0.60 ±0.01 1.00 ±0.01 22.02 ±0.02 93.75 ±0.02 1.05 ±0.02 0.63 ±0.01 22.50 ±0.01 93.15 ±0.01
Lang-Spec Task 1.53 ±0.02 0.84 ±0.01 0.17 ±0.00 91.93 ±0.01 5.53 ±0.04 0.38 ±0.02 -0.11 ±0.00 87.68 ±0.02
Lang-Spec Ada(T) 1.18 ±0.01 1.29 ±0.01 50.25 ±0.03 92.36 ±0.02 4.43 ±0.04 0.79 ±0.02 42.35 ±0.02 88.66 ±0.02
Lang-Spec Ada(F) 0.84 ±0.02 3.41 ±0.02 3.80 ±0.00 91.08 ±0.02 1.87 ±0.05 2.43 ±0.02 9.68 ±0.01 89.92 ±0.02
Other Continuous Learning Algorithms
EWC 1.82 ±0.02 0.74 ±0.01 51.13 ±0.03 91.16 ±0.02 5.9 ±0.04 0.48 ±0.02 44.73 ±0.03 88.28 ±0.02
ER 0.71 ±0.01 0.95 ±0.02 49.59 ±0.03 93.51 ±0.01 2.35 ±0.03 0.78 ±0.01 44.87 ±0.03 92.58 ±0.02
KD-Logit 1.42 ±0.01 0.77 ±0.02 50.79 ±0.03 91.60 ±0.02 4.07 ±0.04 0.51 ±0.01 44.38 ±0.03 89.65 ±0.02
KD-Rep 1.49 ±0.01 0.96 ±0.01 51.17 ±0.03 91.64 ±0.02 4.00 ±0.04 0.53 ±0.01 45.11 ±0.02 90.17 ±0.02
Test Slot Filling On
Shared {Trans, Task} Baselines
Naive Seq FT 4.15 ±0.18 0.77 ±0.20 37.03 ±0.05 67.80 ±0.13 7.06 ±0.23 0.77 ±0.17 33.29 ±0.03 68.37 ±0.13
Lang-Spec FT 73.90 ±0.83 73.90 ±0.83
Lang-Spec FT + Ada(T) 72.90 ±0.80 72.90 ±0.80
Lang-Spec FT + Ada(F) 67.46 ±0.89 67.46 ±0.89
Inc Joint 0.78 ±0.11 0.69 ±0.16 37.92 ±0.05 75.14 ±0.13 0.37 ±0.14 -0.47 ±0.19 32.75 ±0.03 75.14 ±0.14
multi 76.34 ±0.82 76.34 ±0.82
Model Expansion Baselines
Lang-Spec Trans 0.99 ±0.11 0.92 ±0.18 0.33 ±0.00 74.88 ±0.13 1.23 ±0.14 0.89 ±0.17 0.39 ±0.00 74.85 ±0.14
Lang-Spec Enc[0-8] 2.35 ±0.15 1.79 ±0.18 10.57 ±0.01 72.51 ±0.13 2.03 ±0.15 0.74 ±0.19 12.63 ±0.01 74.01 ±0.14
Lang-Spec Task 4.08 ±0.17 1.91 ±0.16 0.06 ±0.00 68.88 ±0.15 7.23 ±0.24 -0.67 ±0.19 0.06 ±0.00 66.28 ±0.13
Lang-Spec Ada(T) 2.46 ±0.14 2.75 ±0.16 35.05 ±0.05 71.79 ±0.15 6.42 ±0.23 0.40 ±0.17 29.89 ±0.03 67.70 ±0.12
Lang-Spec Ada(F) 2.57 ±0.20 4.77 ±0.17 3.34 ±0.00 70.33 ±0.15 5.01 ±0.24 2.70 ±0.20 2.59 ±0.00 67.07 ±0.12
Other Continuous Learning Algorithms
EWC 4.22 ±0.20 1.19 ±0.17 37.39 ±0.05 68.33 ±0.13 7.53 ±0.25 0.52 ±0.16 33.25 ±0.03 66.91 ±0.14
ER 2.32 ±0.15 1.83 ±0.16 37.50 ±0.05 73.31 ±0.14 3.48 ±0.20 0.44 ±0.19 32.97 ±0.04 72.00 ±0.15
KD-Logit 4.42 ±0.18 1.79 ±0.15 37.50 ±0.05 68.13 ±0.14 7.36 ±0.27 0.13 ±0.19 32.86 ±0.04 67.13 ±0.14
KD-Rep 4.56 ±0.18 1.61 ±0.15 37.42 ±0.05 68.28 ±0.13 6.65 ±0.28 1.03 ±0.17 32.57 ±0.03 69.03 ±0.13
Table C.3: Per language permutation view: a pairwise comparison between H2L (English →
German → French → Hindi → Spanish → Thai) and L2H (Thai → Spanish → Hindi → French →
German → English). We highlight the best forgetting (lowest), transfer (highest), zero-shot transfer
(highest), and final performance (highest) of accuracy and f1 scores among those two orders for
each approach in bold, whereas the best scores across approaches for the two orders separately are
underlined.
Model | Order 3 (Spanish → Hindi → English → German → Thai → French): F ↓, T ↑, 0 ↑, FP ↑ | Order 4 (French → Thai → German → English → Hindi → Spanish): F ↓, T ↑, 0 ↑, FP ↑
Test Intent Accuracy On
Shared {Trans, Task} Baselines
Naive Seq FT 2.62 ±0.03 0.59 ±0.01 52.07 ±0.03 91.49 ±0.02 2.63 ±0.03 0.52 ±0.02 55.0 ±0.02 90.74 ±0.02
Lang-Spec FT 93.40 ±0.08 93.40 ±0.08
Lang-Spec FT + Ada(T) 93.04 ±0.09 93.04 ±0.09
Lang-Spec FT + Ada(F) 88.79 ±0.13 88.79 ±0.13
Inc Joint 0.11 ±0.01 0.47 ±0.01 53.86 ±0.02 94.01 ±0.01 0.25 ±0.01 0.61 ±0.01 50.51 ±0.02 94.09 ±0.01
multi 94.25 ±0.07 94.25 ±0.07
Model Expansion Baselines
Lang-Spec Trans 0.45 ±0.01 0.05 ±0.02 -0.37 ±0.00 93.43 ±0.01 0.51 ±0.01 0.39 ±0.02 -0.5 ±0.0 93.63 ±0.01
Lang-Spec Enc[0-8] 0.64 ±0.02 0.54 ±0.01 26.32 ±0.02 93.68 ±0.01 0.81 ±0.02 0.82 ±0.02 25.26 ±0.02 93.59 ±0.01
Lang-Spec Task 2.23 ±0.03 0.46 ±0.02 0.47 ±0.00 91.73 ±0.02 3.02 ±0.03 0.85 ±0.02 -0.07 ±0.0 90.91 ±0.02
Lang-Spec Ada(T) 1.36 ±0.02 1.07 ±0.01 50.06 ±0.02 92.70 ±0.02 2.33 ±0.03 0.78 ±0.02 51.96 ±0.02 92.15 ±0.02
Lang-Spec Ada(F) 0.82 ±0.02 2.61 ±0.02 5.68 ±0.01 90.34 ±0.02 1.21 ±0.03 2.75 ±0.02 8.84 ±0.01 90.17 ±0.02
Other Continuous Learning Algorithms
EWC 2.55 ±0.02 0.87 ±0.01 52.29 ±0.03 92.04 ±0.02 2.57 ±0.03 0.71 ±0.02 54.84 ±0.02 91.67 ±0.02
ER 1.27 ±0.02 0.70 ±0.02 54.29 ±0.02 93.08 ±0.01 1.33 ±0.02 0.44 ±0.02 55.05 ±0.03 93.05 ±0.02
KD-Logit 2.16 ±0.02 0.54 ±0.02 52.32 ±0.02 92.23 ±0.02 2.18 ±0.03 0.45 ±0.02 53.73 ±0.03 91.84 ±0.02
KD-Rep 2.04 ±0.03 0.36 ±0.02 52.06 ±0.03 92.25 ±0.02 2.13 ±0.03 0.65 ±0.01 53.55 ±0.03 92.06 ±0.02
Test Slot Filling On
Shared {Trans, Task} Baselines
Naive Seq FT 5.40 ±0.25 1.95 ±0.17 36.2 ±0.04 70.61 ±0.14 5.5 ±0.19 1.81 ±0.16 38.41 ±0.04 70.30 ±0.15
Lang-Spec FT 73.90 ±0.83 73.90 ±0.83
Lang-Spec FT + Ada(T) 72.90 ±0.80 72.90 ±0.80
Lang-Spec FT + Ada(F) 67.46 ±0.89 67.46 ±0.89
Inc Joint 0.81 ±0.14 1.57 ±0.16 37.46 ±0.05 74.9 ±0.16 1.03 ±0.15 1.72 ±0.17 37.54 ±0.04 75.34 ±0.15
multi 76.34 ±0.82 76.34 ±0.82
Model Expansion Baselines
Lang-Spec Trans 1.57 ±0.18 1.29 ±0.15 0.49 ±0.00 74.56 ±0.13 1.29 ±0.13 0.60 ±0.17 0.47 ±0.0 74.57 ±0.15
Lang-Spec Enc[0-8] 1.80 ±0.19 2.05 ±0.17 13.24 ±0.01 75.2 ±0.16 1.25 ±0.17 0.23 ±0.17 13.57 ±0.01 74.67 ±0.14
Lang-Spec Task 4.94 ±0.24 2.20 ±0.16 0.11 ±0.00 71.06 ±0.14 4.77 ±0.22 1.14 ±0.18 0.05 ±0.0 70.63 ±0.14
Lang-Spec Ada(T) 3.25 ±0.18 3.26 ±0.16 34.88 ±0.04 72.38 ±0.16 4.31 ±0.21 1.75 ±0.14 35.48 ±0.03 70.39 ±0.13
Lang-Spec Ada(F) 2.52 ±0.2 4.03 ±0.18 3.10 ±0.0 68.22 ±0.14 3.06 ±0.2 4.03 ±0.19 3.57 ±0.0 68.67 ±0.14
Other Continuous Learning Algorithms
EWC 5.54 ±0.24 1.99 ±0.16 36.34 ±0.04 70.69 ±0.13 5.46 ±0.23 1.07 ±0.18 38.14 ±0.04 70.05 ±0.15
ER 3.01 ±0.18 1.98 ±0.16 37.54 ±0.04 72.92 ±0.13 2.77 ±0.18 0.81 ±0.17 38.66 ±0.04 72.82 ±0.14
KD-Logit 5.00 ±0.25 2.00 ±0.17 35.67 ±0.04 71.82 ±0.14 5.46 ±0.23 1.52 ±0.17 37.78 ±0.04 69.76 ±0.14
KD-Rep 4.84 ±0.22 1.46 ±0.17 35.96 ±0.04 70.71 ±0.16 5.01 ±0.22 0.9 ±0.16 37.25 ±0.04 70.37 ±0.14
Table C.4: Per language permutation view: a pairwise comparison between Order 3 (Spanish →
Hindi → English → German → Thai → French) and Order 4 (French → Thai → German →
English → Hindi → Spanish). We highlight the best forgetting (lowest), transfer (highest), zero-shot
transfer (highest), and final performance (highest) of accuracy and f1 scores among those two orders
for each approach in bold, whereas the best scores across approaches for the two orders separately
are underlined.
Model | Order 5 (Hindi → English → Spanish → Thai → French → German): F ↓, T ↑, 0 ↑, FP ↑ | Order 6 (German → French → Thai → Spanish → English → Hindi): F ↓, T ↑, 0 ↑, FP ↑
Test Intent Accuracy On
Shared {Trans, Task} Baselines
Naive Seq FT 2.97 ±0.03 0.75 ±0.01 47.04 ±0.03 91.63 ±0.02 2.32 ±0.02 0.71 ±0.02 51.97 ±0.03 91.63 ±0.02
Lang-Spec FT 93.4 ±0.08 93.4 ±0.08
Lang-Spec FT + Ada(T) 93.04 ±0.09 93.04 ±0.09
Lang-Spec FT + Ada(F) 88.79 ±0.13 88.79 ±0.13
Inc Joint 0.21 ±0.01 0.74 ±0.01 48.41 ±0.03 94.44 ±0.01 -0.02 ±0.01 0.54 ±0.02 51.49 ±0.03 94.23 ±0.01
multi 94.25 ±0.07 94.25 ±0.07
Model Expansion Baselines
Lang-Spec Trans 0.41 ±0.02 0.03 ±0.02 -0.57 ±0.0 93.39 ±0.01 0.52 ±0.02 0.29 ±0.02 -0.11 ±0.0 93.38 ±0.01
Lang-Spec Enc[0-8] 0.80 ±0.02 0.74 ±0.01 23.18 ±0.02 93.35 ±0.01 0.76 ±0.01 1.05 ±0.01 26.12 ±0.02 93.46 ±0.01
Lang-Spec Task 2.84 ±0.03 0.67 ±0.01 -0.2 ±0.0 91.17 ±0.02 2.32 ±0.02 0.76 ±0.01 0.36 ±0.0 91.7 ±0.02
Lang-Spec Ada(T) 2.49 ±0.03 1.05 ±0.01 47.67 ±0.03 92.34 ±0.01 1.35 ±0.02 0.89 ±0.01 53.77 ±0.02 92.30 ±0.02
Lang-Spec Ada(F) 1.13 ±0.03 3.09 ±0.02 4.40 ±0.01 90.50 ±0.02 1.32 ±0.02 2.64 ±0.02 6.73 ±0.01 90.15 ±0.02
Other Continuous Learning Algorithms
EWC 3.07 ±0.03 0.79 ±0.01 46.44 ±0.03 91.45 ±0.02 2.51 ±0.02 0.81 ±0.01 51.54 ±0.02 91.54 ±0.02
ER 1.11 ±0.02 0.72 ±0.01 48.23 ±0.03 93.00 ±0.02 0.98 ±0.02 0.92 ±0.01 52.23 ±0.03 93.32 ±0.01
KD-Logit 2.50 ±0.03 0.86 ±0.01 47.96 ±0.03 91.27 ±0.02 1.89 ±0.02 0.59 ±0.02 51.88 ±0.03 92.16 ±0.02
KD-Rep 2.24 ±0.03 0.81 ±0.01 48.08 ±0.03 91.89 ±0.02 1.86 ±0.02 0.83 ±0.02 52.51 ±0.03 92.16 ±0.02
Test Slot Filling On
Shared {Trans, Task} Baselines
Naive Seq FT 6.51 ±0.22 1.90 ±0.15 34.53 ±0.04 68.93 ±0.13 5.38 ±0.25 1.00 ±0.18 38.47 ±0.05 70.22 ±0.14
Lang-Spec FT 73.9 ±0.83 73.9 ±0.83
Lang-Spec FT + Ada(T) 72.9 ±0.8 72.9 ±0.8
Lang-Spec FT + Ada(F) 67.46 ±0.89 67.46 ±0.89
Inc Joint 0.99 ±0.15 1.21 ±0.15 32.99 ±0.03 74.45 ±0.16 1.52 ±0.15 0.27 ±0.18 39.69 ±0.05 74.31 ±0.14
multi 76.34 ±0.82 76.34 ±0.82
Model Expansion Baselines
Lang-Spec Trans 1.65 ±0.17 1.04 ±0.17 0.37 ±0.00 74.51 ±0.14 1.17 ±0.14 0.97 ±0.16 0.47 ±0.00 75.04 ±0.14
Lang-Spec Enc[0-8] 1.48 ±0.13 2.18 ±0.17 10.66 ±0.01 75.03 ±0.14 2.77 ±0.18 1.67 ±0.18 13.25 ±0.01 73.73 ±0.14
Lang-Spec Task 5.72 ±0.21 2.4 ±0.17 0.06 ±0.00 70.08 ±0.13 4.80 ±0.24 -0.04 ±0.18 0.06 ±0.00 69.54 ±0.13
Lang-Spec Ada(T) 4.96 ±0.25 2.39 ±0.15 29.17 ±0.03 72.28 ±0.13 3.98 ±0.21 1.69 ±0.16 37.14 ±0.05 72.27 ±0.13
Lang-Spec Ada(F) 3.15 ±0.21 4.51 ±0.18 1.90 ±0.00 69.47 ±0.14 3.77 ±0.22 3.54 ±0.16 2.31 ±0.00 67.57 ±0.14
Other Continuous Learning Algorithms
EWC 6.38 ±0.23 2.54 ±0.17 34.29 ±0.04 71.25 ±0.14 5.56 ±0.27 1.46 ±0.17 38.44 ±0.05 70.57 ±0.16
ER 4.12 ±0.22 2.90 ±0.16 35.45 ±0.04 73.39 ±0.14 2.65 ±0.18 0.83 ±0.17 39.34 ±0.05 73.56 ±0.15
KD-Logit 6.03 ±0.27 2.02 ±0.16 35.2 ±0.04 70.70 ±0.14 4.91 ±0.21 0.92 ±0.17 38.49 ±0.05 70.31 ±0.14
KD-Rep 5.72 ±0.27 2.6 ±0.15 35.54 ±0.04 71.61 ±0.15 5.35 ±0.21 0.97 ±0.15 38.8 ±0.05 70.15 ±0.13
Table C.5: Per language permutation view: a pairwise comparison between Order 5(Hindi →
English → Spanish → Thai → French → German) and Order 6 (German → French → Thai →
Spanish → English → Hindi). We highlight the best forgetting (lowest), transfer (highest), zero-shot
transfer (highest), and final performance (highest) of accuracy and f1 scores among those two orders
for each approach in bold, whereas the best scores across approaches for the two orders separately
are underlined.
Model | F ↓ (Acc, F1) | T ↑ (Acc, F1) | FP ↑ (Acc, F1)
Order 1 1.25 ±0.02 3.60 ±0.18 0.89 ±0.02 1.76 ±0.17 89.33 ±0.02 65.59 ±0.13
Order 2 5.81 ±0.05 7.89 ±0.28 0.75 ±0.02 0.11 ±0.17 85.81 ±0.02 64.18 ±0.14
Order 3 1.68 ±0.02 4.43 ±0.21 0.77 ±0.02 2.20 ±0.17 89.57 ±0.02 68.88 ±0.14
Order 4 2.70 ±0.04 4.62 ±0.23 0.71 ±0.02 1.22 ±0.17 88.59 ±0.02 68.07 ±0.14
Order 5 1.83 ±0.01 5.74 ±0.24 6.64 ±0.01 4.89 ±0.15 96.00 ±0.01 71.75 ±0.13
Order 6 1.08 ±0.01 4.44 ±0.20 7.09 ±0.01 4.86 ±0.15 96.40 ±0.01 71.81 ±0.13
Table C.6: Impact of language order across the balanced dataset for Naive Seq FT. Best and
second best scores for each language for intent classification and slot filling independently across
approaches are highlighted in bold and underlined, respectively.
We also notice that approaches that lower forgetting for a particular language do so uniformly across all languages. Zero-shot transfer performance is significantly lower in the case of Thai.
C.1.5 More Analysis
Figure C.1 plots final performance versus negative forgetting, final performance versus transfer, transfer versus negative forgetting, and zero-shot transfer versus negative forgetting for the subtask of slot filling. The same trends observed for intent classification also hold for slot filling. Figures C.2a and C.2b show how the Naive Seq FT intent classification accuracy and slot filling F1 score, respectively, change for each language after different hops of training. Although performance increases as more hops are seen for high-resource Latin-script languages like English, Spanish, and to some degree French, the same cannot be said for the low-resource languages Thai and Hindi, which also suffer from being script isolates.
To analyze zero-shot generalization to unseen languages, we track the performance of each model across different hops. In other words, we consider the average performance after seeing from 1 to 5 languages, enabled by the balanced datastreams carefully curated in Section 4.1.4. This lets us check the performance after training on a given language (or set of languages) from exactly one datastream. Figures C.3a and C.3b show a comparison between different approaches across different hops of training using the zero-shot transfer metric for intent classification and slot filling, respectively. In general, the average zero-shot transfer decreases as the number of languages seen grows from 1 to 5. After seeing one language, the performance is equivalent to conventional transfer learning involving two hops, whereas the performance after seeing two or more languages corresponds to multi-hop continual learning. As we increase the number of hops, the transfer capabilities decrease nearly uniformly across most approaches, making the problem more challenging than, and different from, conventional transfer learning. Figures C.3c and C.3d show the generalization trends for different continual learning approaches compared to the baselines for intent classification and slot filling, respectively.
Figure C.1: Correlations between different pairs of metrics: (a) Final performance versus negative
forgetting for the task of slot filling. The lower the forgetting the higher the final performance. (b)
Final performance versus transfer for the task of slot filling. (c) Transfer versus negative forgetting
for slot filling task. (d) Zero-shot generalization versus negative forgetting for slot filling. Model
expansion approaches are highlighted in shades of green. We zoom in on the rest of the models in the main graph and show an overview of all approaches in the lower-right corner subplot. The same
trends observed for intent classification in Figure 4.4 can be observed here.
We can see that most continual learning approaches improve over Naive Seq FT in terms of both intent accuracy and slot filling F1 scores, and the gap grows mainly as more languages are seen (except at hop 4). After 5 hops, there is a clear gap between Naive Seq FT and the continual learning approaches, chief among them Lang-Spec Ada(T) and KD-Logit. Figure C.4 shows more results for the multi-hop versus one-hop analysis for more metrics and tasks. In general, we observe the same trend, whereby the multi-hop analysis (dotted boxplots) has smaller confidence intervals than the one-hop analysis (crossed boxplots).
C.1.6 Experience Replay Ablation Studies
Table C.10 compares the performance of experience replay variants with memory sizes ranging from 750 to 6,000 instances, which account for 5% to 60% of the training data for each language. Although forgetting is lowest and final performance is highest with a memory of 6,000 instances, the gap remains small as the memory is scaled down. Moreover, differences in transfer are not correlated with the size of the memory.
(a) Accuracy for intent classification. (b) F1 score for slot filling.
Figure C.2: Comparing cross-lingual generalization of Naive Seq FT across many hops and different
languages for intent classification and slot filling.
We notice that ER surpasses Naive Seq FT even with the smallest memory size. This suggests that even a small replay memory is helpful.
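For concreteness, a minimal sketch of the replay memory assumed in these ablations: a fixed per-language budget filled after training on each language and sampled from in later hops. The budget, uniform sampling, and interleaving schedule are simplifications of the setup in Chapter 4.

```python
import random

class ReplayMemory:
    """Fixed-budget memory of examples from previously seen languages."""

    def __init__(self, per_language_budget=750):   # 750-6,000 instances in the ablation
        self.budget = per_language_budget
        self.store = {}                            # language -> list of examples

    def add_language(self, language, examples):
        # Keep a random subset of the just-finished language's training data.
        self.store[language] = random.sample(examples, min(self.budget, len(examples)))

    def sample(self, batch_size):
        # Replay uniformly from all previously seen languages.
        pool = [ex for exs in self.store.values() for ex in exs]
        return random.sample(pool, min(batch_size, len(pool))) if pool else []
```

During later hops, optimization steps on the current language are interleaved with steps on `memory.sample(batch_size)`, which is what keeps forgetting low even at the smallest budget.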
C.2 More Results using Multiple Seeds
In this section, we show the results obtained with different seeds for the key experiments in Chapter 4. Tables C.11 and C.12 report the average final performance, forgetting, and transfer averaged across different language permutations for the baseline model compared to the reference models. Table C.13 compares intent classification performance between the baseline and different continual learning algorithms across H2L and L2H. Overall, we observe the same trends and findings as in Tables 4.4, 4.5, and 4.6.
C.3 Statistical Significance
We show in Figures C.5 and C.6 the results for different approaches whose p-value is lower than 0.05 at a 95% confidence level, thus rejecting the null hypothesis that they are drawn from the same distribution. The subfigures of Figures C.5 and C.6 show confusion plots of the statistical significance p-values for the different metrics (forgetting, transfer, zero-shot transfer, and final performance) for intent classification and slot filling, respectively. For example, for forgetting, improvements or losses from approaches are statistically significant with 95% confidence more than 49% and 61% of the time for intent classification and slot filling, respectively. For zero-shot transfer, 60% and 56% of pairwise comparisons are statistically significant for intent classification and slot filling, respectively. For final performance, 47% and 49% of pairwise comparisons are statistically significant for intent classification and slot filling, respectively.
(a) Zero-shot transfer of accuracy for intent classification. (b) Zero-shot transfer of f1 score for slot filling.
(c) Accuracy for intent classification. (d) F1 score for slot filling.
Figure C.3: Measuring cross-lingual generalization to new languages across many hops for intent
classification and slot filling. This is both in terms of zero-shot transfer metric and plain accuracy
and f1 scores.
For transfer, improvements or degradations in intent classification are not statistically significant, with the exceptions of Lang-Spec Trans, which is the lowest in terms of transfer, and Lang-Spec Ada(F), which exhibits high transfer. The same can be said for Lang-Spec Ada(F) in slot filling. Overall, model expansion approaches exhibit the highest statistical significance, whereas EWC-Online and knowledge distillation are among the lowest. Figures C.7 and C.8 show the corresponding statistical significance p-value confusion plots using multiple seeds. With a few exceptions, such as Lang-Spec FT + Ada(T) and Lang-Spec FT + Ada(F), most pairwise p-values that indicate statistical significance between two models under the bootstrap sampling analysis are consistent with the statistical significance computed using multiple seeds.
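The pairwise p-values above come from a bootstrap comparison between two models; below is a minimal sketch of one common paired-bootstrap recipe, where the resampling unit (e.g., test example or language order) and the number of draws are assumptions rather than the exact procedure.

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired bootstrap test of whether models A and B perform the same.
    scores_a and scores_b are per-unit scores aligned index by index."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    observed = a.mean() - b.mean()
    n, flips = len(a), 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)         # resample units with replacement
        diff = a[idx].mean() - b[idx].mean()
        if (diff <= 0) == (observed > 0):        # resampled difference crosses zero
            flips += 1
    return min(1.0, 2 * flips / n_resamples)     # two-sided p-value
```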
(a) Forgetting for slot filling. (b) Transfer for intent classification.
(c) Transfer for slot filling. (d) Final performance for intent classification.
(e) Final performance for slot filling.
Figure C.4: Comparison between different metrics using one-hop (crossed boxplots) and multi-hop
analysis (dotted boxplots), on the left and right respectively for each approach.
Model
Test Intent Accuracy On
German English French Spanish Hindi Thai
Shared {Trans, Task} Baselines
Naive Seq FT 1.52 ±0.12 1.06 ±0.08 1.30 ±0.14 1.49 ±0.13 2.90 ±0.38 5.51 ±1.35
Inc Joint 0.31 ±0.05 0.12 ±0.04 0.19 ±0.05 0.15 ±0.04 0.04 ±0.07 0.28 ±0.08
Model Expansion Baselines
Lang-Spec Trans 0.36 ±0.06 0.33 ±0.04 0.44 ±0.07 0.34 ±0.06 0.42 ±0.08 0.46 ±0.08
Lang-Spec Enc[0-8] 0.54 ±0.07 0.45 ±0.05 0.51 ±0.08 0.59 ±0.06 0.66 ±0.10 0.90 ±0.15
Lang-Spec Task 1.22 ±0.12 0.95 ±0.09 1.49 ±0.14 1.37 ±0.12 3.20 ±0.40 5.44 ±1.67
Lang-Spec Ada(T) 0.88 ±0.08 0.81 ±0.08 1.16 ±0.12 1.00 ±0.09 1.85 ±0.24 4.23 ±1.15
Lang-Spec Ada(F) 0.58 ±0.08 0.61 ±0.08 0.81 ±0.11 0.54 ±0.10 0.86 ±0.11 1.88 ±0.33
Other Continuous Learning Algorithms
EWC 1.40 ±0.15 1.00 ±0.08 1.74 ±0.15 1.56 ±0.13 3.26 ±0.37 5.62 ±1.75
ER 0.76 ±0.07 0.53 ±0.05 0.87 ±0.08 0.71 ±0.08 1.13 ±0.12 2.19 ±0.22
KD-Logit 1.23 ±0.12 0.97 ±0.08 1.47 ±0.12 1.27 ±0.12 2.19 ±0.27 4.41 ±0.75
KD-Rep 1.20 ±0.11 0.80 ±0.07 1.45 ±0.11 1.42 ±0.12 2.29 ±0.27 4.02 ±0.63
Test Slot Filling On
German English French Spanish Hindi Thai
Shared {Trans, Task} Baselines
Naive Seq FT 3.64 ±1.31 3.91 ±1.14 2.80 ±0.94 2.94 ±0.94 6.48 ±1.85 8.85 ±3.19
Inc Joint 1.21 ±0.85 1.12 ±0.70 0.64 ±0.71 0.96 ±0.62 1.13 ±0.70 0.77 ±0.57
Model Expansion Baselines
Lang-Spec Trans 0.90 ±0.71 1.02 ±0.62 1.03 ±0.65 1.21 ±0.74 1.28 ±0.75 1.06 ±0.64
Lang-Spec Enc[0-8] 2.03 ±0.93 1.83 ±0.81 1.03 ±0.77 1.31 ±0.69 1.76 ±0.81 2.00 ±0.76
Lang-Spec Task 3.32 ±1.29 2.96 ±0.97 2.74 ±0.93 2.76 ±0.89 6.89 ±2.01 8.17 ±3.05
Lang-Spec Ada(T) 2.96 ±1.12 3.05 ±0.88 1.49 ±0.76 1.52 ±0.82 4.34 ±1.17 6.84 ±2.26
Lang-Spec Ada(F) 1.82 ±0.97 1.85 ±0.88 1.33 ±0.83 1.89 ±0.96 2.72 ±0.99 5.81 ±1.98
Other Continuous Learning Algorithms
EWC 3.41 ±1.25 3.90 ±1.24 3.08 ±0.95 3.32 ±0.96 6.29 ±1.86 8.74 ±3.22
ER 1.94 ±0.82 2.01 ±0.96 1.60 ±0.76 1.82 ±0.80 3.65 ±1.04 4.73 ±1.18
KD-Logit 3.69 ±1.31 3.70 ±1.03 3.10 ±1.01 3.55 ±1.11 5.66 ±1.68 8.05 ±2.68
KD-Rep 3.49 ±1.18 3.85 ±1.09 3.13 ±0.95 2.99 ±0.92 5.81 ±1.66 7.93 ±2.18
Table C.7: CCL per-language analysis of forgetting. The best and second best scores for each language are highlighted in bold and underlined, respectively.
Model | Test Intent Accuracy On: German, English, French, Hindi, Spanish, Thai
Shared {Trans, Task} Baselines
Naive Seq FT 0.37 ±0.07 0.30 ±0.06 0.77 ±0.08 1.14 ±0.07 0.64 ±0.09 0.85 ±0.11
Inc Joint 0.25 ±0.07 0.04 ±0.06 0.74 ±0.09 1.25 ±0.06 0.27 ±0.12 0.57 ±0.11
Model Expansion Baselines
Lang-Spec Trans -0.36 ±0.08 -0.07 ±0.06 0.29 ±0.10 0.93 ±0.08 0.12 ±0.10 0.47 ±0.11
Lang-Spec Enc[0-8] 0.39 ±0.07 0.28 ±0.05 0.96 ±0.08 1.09 ±0.07 0.80 ±0.11 1.25 ±0.10
Lang-Spec Task 0.22 ±0.07 0.12 ±0.06 0.99 ±0.08 1.11 ±0.07 0.69 ±0.10 0.84 ±0.09
Lang-Spec Ada(T) 1.38 ±0.07 0.41 ±0.06 1.30 ±0.09 0.93 ±0.11 1.20 ±0.09 0.65 ±0.10
Lang-Spec Ada(F) 2.47 ±0.10 1.43 ±0.08 3.03 ±0.11 3.17 ±0.11 2.00 ±0.15 4.84 ±0.33
Other Continuous Learning Algorithms
EWC 0.26 ±0.08 0.12 ±0.05 1.13 ±0.07 1.10 ±0.07 0.85 ±0.09 0.92 ±0.11
ER 0.27 ±0.08 0.07 ±0.06 1.01 ±0.08 1.16 ±0.07 0.96 ±0.10 1.04 ±0.11
KD-Logit 0.16 ±0.08 0.13 ±0.06 0.96 ±0.09 0.96 ±0.07 0.68 ±0.10 0.82 ±0.11
KD-Rep 0.12 ±0.08 0.09 ±0.06 0.82 ±0.09 1.30 ±0.07 0.74 ±0.10 1.06 ±0.10
Test Slot Filling On: German, English, French, Spanish, Hindi, Thai
Shared {Trans, Task} Baselines
Naive Seq FT 1.71 ±0.98 1.24 ±0.71 2.01 ±0.90 0.54 ±0.97 0.20 ±0.91 2.50 ±0.75
Inc Joint 1.59 ±0.90 -0.17 ±0.89 1.22 ±0.84 1.08 ±0.94 -1.10 ±1.04 2.36 ±0.79
Model Expansion Baselines
Lang-Spec Trans 1.75 ±0.95 1.37 ±0.80 1.85 ±0.83 -0.25 ±0.91 -0.67 ±0.93 1.67 ±0.74
Lang-Spec Enc[0-8] 1.80 ±0.92 0.45 ±1.05 2.11 ±0.86 0.67 ±0.98 0.51 ±0.88 3.12 ±0.88
Lang-Spec Task 2.28 ±1.07 -0.27 ±0.86 1.55 ±1.07 0.56 ±1.26 0.44 ±0.94 2.36 ±0.86
Lang-Spec Ada(T) 3.24 ±0.94 -0.54 ±0.72 1.04 ±0.95 1.59 ±0.94 3.37 ±0.98 3.53 ±0.82
Lang-Spec Ada(F) 3.48 ±1.00 3.38 ±0.87 1.46 ±1.00 4.68 ±1.04 2.11 ±1.06 8.48 ±1.27
Other Continuous Learning Algorithms
EWC 1.58 ±1.02 0.39 ±0.82 2.11 ±0.87 1.58 ±1.05 -0.09 ±0.93 3.19 ±0.73
ER 1.97 ±0.93 0.29 ±0.89 2.05 ±0.94 1.38 ±1.04 0.23 ±0.87 2.87 ±0.93
KD-Logit 2.20 ±0.98 0.50 ±0.83 2.00 ±0.84 1.35 ±1.00 -0.64 ±0.94 2.97 ±0.76
KD-Rep 1.90 ±0.88 0.90 ±0.75 2.54 ±0.88 1.01 ±0.91 -0.23 ±0.96 2.45 ±0.75
Table C.8: CCL per-language analysis of transfer. The best and second best scores for each language are highlighted in bold and underlined, respectively.
Model | Test Intent Accuracy On: German, English, French, Hindi, Spanish, Thai
Shared {Trans, Task} Baselines
Naive Seq FT 58.53 ±1.49 69.09 ±12.56 60.83 ±3.24 59.42 ±24.92 33.38 ±1.35 20.17 ±1.10
Inc Joint 58.48 ±2.13 70.13 ±12.56 61.17 ±2.62 61.18 ±19.86 32.28 ±2.56 17.20 ±0.19
Model Expansion Baselines
Lang-Spec Trans -1.42 ±0.00 0.44 ±0.01 -0.01 ±0.01 -0.95 ±0.01 -0.15 ±0.00 -0.47 ±0.00
Lang-Spec Enc[0-8] 26.17 ±7.44 33.16 ±10.88 25.56 ±7.00 27.21 ±18.32 21.79 ±2.33 11.51 ±0.77
Lang-Spec Task -0.25 ±0.12 0.38 ±0.01 0.63 ±0.06 -0.66 ±0.02 0.60 ±0.03 -0.09 ±0.01
Lang-Spec Ada(T) 55.95 ±0.91 67.93 ±14.89 60.21 ±4.16 58.14 ±33.89 36.44 ±4.20 17.40 ±1.10
Lang-Spec Ada(F) 5.08 ±0.51 14.37 ±1.06 7.61 ±0.49 6.87 ±1.00 5.50 ±0.90 -0.30 ±0.04
Other Continuous Learning Algorithms
EWC 58.57 ±1.77 69.39 ±12.59 60.71 ±3.48 58.99 ±24.22 33.59 ±1.40 19.71 ±1.30
ER 59.70 ±1.68 70.20 ±13.83 61.32 ±4.05 60.09 ±24.40 33.38 ±1.24 19.57 ±1.48
KD-Logit 58.12 ±1.32 68.87 ±12.38 60.85 ±3.45 59.69 ±24.27 33.55 ±1.46 19.99 ±1.19
KD-Rep 58.47 ±1.20 68.64 ±12.23 60.96 ±3.56 59.69 ±24.54 34.22 ±1.07 20.49 ±1.00
Test Slot Filling On: German, English, French, Spanish, Hindi, Thai
Shared {Trans, Task} Baselines
Naive Seq FT 44.25 ±1.16 48.42 ±8.10 47.58 ±1.63 46.60 ±15.31 18.97 ±0.44 12.09 ±0.33
Inc Joint 44.73 ±1.68 48.74 ±10.90 47.67 ±2.19 46.98 ±18.10 18.05 ±0.31 12.20 ±0.22
Model Expansion Baselines
Lang-Spec Trans 0.45 ±0.00 0.76 ±0.01 0.33 ±0.00 0.83 ±0.01 0.00 ±0.00 0.15 ±0.00
Lang-Spec Enc[0-8] 14.81 ±3.81 15.50 ±6.12 16.09 ±4.03 16.11 ±8.84 6.62 ±1.29 4.80 ±0.35
Lang-Spec Task 0.07 ±0.00 0.15 ±0.00 0.08 ±0.00 0.04 ±0.00 -0.02 ±0.00 0.09 ±0.00
Lang-Spec Ada(T) 41.08 ±1.24 44.36 ±18.19 45.26 ±2.44 42.56 ±21.09 17.62 ±1.27 10.72 ±0.13
Lang-Spec Ada(F) 4.42 ±0.10 1.12 ±0.04 4.51 ±0.32 4.86 ±0.93 1.80 ±0.03 0.09 ±0.00
Other Continuous Learning Algorithms
EWC 44.17 ±1.16 48.52 ±8.21 47.51 ±1.62 46.38 ±15.32 18.94 ±0.42 12.32 ±0.30
ER 44.73 ±1.45 49.60 ±9.35 48.17 ±2.22 47.26 ±15.85 19.06 ±0.44 12.62 ±0.24
KD-Logit 43.79 ±1.04 48.30 ±8.21 47.31 ±2.05 46.77 ±15.51 18.85 ±0.37 12.49 ±0.22
KD-Rep 43.81 ±1.35 48.10 ±7.99 47.38 ±1.85 46.60 ±15.21 18.83 ±0.45 12.82 ±0.26
Table C.9: CCL per-language zero-shot forward transfer. The best and second best scores for each language, computed independently for intent classification and slot filling across approaches, are highlighted in bold and underlined, respectively.
Model | F ↓ (Acc, F1) | T ↑ (Acc, F1) | T⁰ (zero-shot transfer) ↑ (Acc, F1) | FP ↑ (Acc, F1)
Naive Seq FT 2.93 ±1.24 5.67 ±0.93 0.68 ±0.14 1.37 ±0.53 50.24 ±3.43 36.32 ±1.91 91.06 ±1.08 69.37 ±1.06
ER-750 1.97 ±0.73 4.28 ±0.63 0.65 ±0.19 1.46 ±0.59 50.41 ±3.19 36.53 ±1.91 92.10 ±0.68 71.65 ±1.02
ER-1500 1.55 ±0.44 3.88 ±0.42 0.68 ±0.26 1.55 ±0.69 50.83 ±3.38 36.59 ±1.93 92.65 ±0.35 71.68 ±0.71
ER-3000 1.40 ±0.44 3.36 ±0.47 0.70 ±0.25 1.48 ±0.71 51.03 ±3.60 36.77 ±2.06 92.93 ±0.37 72.71 ±0.56
ER-4500 1.43 ±0.58 3.39 ±0.75 0.59 ±0.11 1.44 ±0.38 50.46 ±3.68 36.91 ±2.19 92.73 ±0.72 72.46 ±1.05
ER-6000 1.29 ±0.51 3.06 ±0.59 0.75 ±0.17 1.47 ±0.85 50.71 ±3.55 36.91 ±2.14 93.09 ±0.29 73.00 ±0.52
Table C.10: Ablation studies of experience replay with different memory sizes per language. For each metric and score, the best score is highlighted in bold and the second best is underlined.
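For concreteness, the experience replay loop that these ablations vary can be sketched as follows. This is an illustrative outline only, not the exact training code: `train_step` is a placeholder for one gradient update, and the default memory size of 6,000 examples per language and replay frequency of every 10 minibatches simply mirror values reported in these appendices.

```python
import random

# Illustrative experience-replay sketch (not the exact training code).
# `train_batches` is a list of minibatches (each a list of examples) for the
# current language; `memory` is a flat list of examples stored from previously
# seen languages; `train_step(model, examples)` performs one gradient update.
def train_language_with_er(model, train_batches, memory, train_step,
                           mem_size=6000, replay_every=10):
    for step, batch in enumerate(train_batches, start=1):
        train_step(model, batch)                         # update on current language
        if memory and step % replay_every == 0:
            k = min(len(memory), len(batch))
            train_step(model, random.sample(memory, k))  # replay stored examples
    # After finishing this language, store up to `mem_size` of its examples.
    seen = [ex for batch in train_batches for ex in batch]
    memory.extend(random.sample(seen, min(mem_size, len(seen))))
    return memory
```

Varying `mem_size` corresponds to the ER-750 through ER-6000 rows in Table C.10.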
Model Acc F1
Naive Seq FT 90.40 ±1.53 65.01 ±1.25
Lang-Spec FT 93.28 ±0.31 68.93 ±1.17
Inc Joint 94.14 ±0.08 71.70 ±0.43
multi 94.20 ±0.21 72.23 ±0.99
Table C.11: The average final performance across different language permutations for the baseline compared to the reference models, using multiple seeds. We highlight the best scores in bold and underline the second best across models. We observe the same findings as when using bootstrap sampling (Table 4.4), but with tighter confidence intervals.
Model | F ↓ (Acc, F1) | T ↑ (Acc, F1)
Naive Seq FT 3.2 ±1.66 5.47 ±0.87 0.73 ±0.16 2.75 ±0.63
Inc Joint -0.1 ±0.01 -0.38 ±0.45 0.57 ±0.14 1.73 ±1.05
Table C.12: Forgetting (F) and transfer (T) performance averaged across different language permutations for the sequential baseline and reference models, using different seeds. We highlight the best models in bold. We observe exactly the same trends as when using bootstrap sampling in our analysis in Table 4.5.
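For reference, one common way of computing forgetting and transfer from a matrix of sequential scores is sketched below; the score matrix, the baseline scores, and the averaging are illustrative and may differ in detail from the definitions used in this thesis.

```python
import numpy as np

def forgetting_and_transfer(R, zero_shot_baseline):
    """Illustrative continual-learning metrics (definitions may differ from the
    thesis's exact ones). R[i, j] is the test score on language j after training
    through the i-th language in the sequence; zero_shot_baseline[j] is the
    pretrained model's score on language j before any sequential fine-tuning."""
    N = R.shape[0]
    # Forgetting: drop from a language's best score during training to its final score.
    forgetting = float(np.mean([R[:N - 1, j].max() - R[N - 1, j]
                                for j in range(N - 1)]))
    # Forward transfer: zero-shot score on language j just before training on it,
    # relative to the pretrained baseline.
    transfer = float(np.mean([R[j - 1, j] - zero_shot_baseline[j]
                              for j in range(1, N)]))
    return forgetting, transfer
```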
Model | F ↓ (H2L, L2H) | T ↑ (H2L, L2H) | FP ↑ (H2L, L2H)
Naive Seq FT 1.37 ±0.14 5.38 ±0.34 0.95 ±0.03 0.56 ±0.07 91.83 ±0.55 88.28 ±0.55
Lang-Spec Trans 0.01 ±0.01 0.17 ±0.08 0.57 ±0.06 0.09 ±0.01 93.81 ±0.06 93.27 ±0.10
Lang-Spec Task 1.29 ±0.08 5.52 ±0.87 0.88 ±0.12 0.43 ±0.19 92.12 ±0.18 87.20 ±1.76
Lang-Spec Ada(T) 0.81 ±0.08 4.17 ±0.30 1.16 ±0.09 0.65 ±0.06 92.53 ±0.22 88.61 ±0.44
Lang-Spec Ada(F) 0.38 ±0.09 1.04 ±0.61 3.54 ±0.15 2.34 ±0.11 91.15 ±0.04 90.0 ±0.39
EWC 1.35 ±0.24 5.42 ±0.60 0.87 ±0.11 0.71 ±0.12 91.86 ±0.52 88.09 ±0.20
ER-6000 0.69 ±0.14 1.93 ±0.28 0.93 ±0.07 0.72 ±0.14 93.43 ±0.08 92.50 ±0.25
KD-Logit 1.33 ±0.11 3.82 ±0.23 0.81 ±0.11 0.54 ±0.07 91.86 ±0.31 89.85 ±0.4
KD-Rep 1.37 ±0.1 3.7 ±0.25 0.85 ±0.23 0.52 ±0.13 91.64 ±0.49 89.73 ±0.8
Table C.13: Intent classification performance comparison between the baseline and continual learning algorithms across two language permutations (H2L and L2H), using multiple seeds. We highlight in bold the lowest forgetting (F), highest transfer (T), and highest final performance (FP) of accuracy scores between H2L and L2H, whereas the best scores across approaches for H2L and L2H separately are underlined. We observe the same trends and findings as in Table 4.6, where only bootstrap sampling is used to compute the confidence intervals.
Figure C.5: P-values for pairwise comparisons of the continual learning approaches using Tukey's honestly significant difference (HSD) test with bootstrap sampling. Panels: (a) forgetting of intent accuracy; (b) forgetting of slot filling; (c) final performance of intent accuracy; (d) final performance of slot filling; (e) zero-shot transfer of intent accuracy; (f) zero-shot transfer of slot filling.
Figure C.6: P-values for pairwise comparisons of the continual learning approaches using Tukey's honestly significant difference (HSD) test with bootstrap sampling (cont.). Panels: (a) transfer of intent accuracy; (b) transfer of slot filling.
Figure C.7: P-values for pairwise comparisons of the continual learning approaches using Tukey's honestly significant difference (HSD) test across different seeds. Panels: (a) forgetting of intent accuracy; (b) forgetting of slot filling; (c) final performance of intent accuracy; (d) final performance of slot filling; (e) zero-shot transfer of intent accuracy; (f) zero-shot transfer of slot filling.
Figure C.8: P-values for pairwise comparisons of the continual learning approaches using Tukey's honestly significant difference (HSD) test across different seeds (cont.). Panels: (a) transfer of intent accuracy; (b) transfer of slot filling.
Appendix D
Human-like Cross-lingual Continual Learning
In this appendix, we provide additional implementation details on the hyperparameters and datasets used, as well as the runtime and number of parameters of the different models.
D.1 Hyperparameters
For all experiments, we use M-BERT (bert-base-multilingual-cased)¹ with 12 layers as our pre-trained multilingual Transformer-based encoder model. Consistent with Section 4.3.1 and Hu et al. (2020) for MTOP and TyDiQA, respectively, we use the Adam optimizer (Kingma and Ba, 2015), fixing the learning rate to 3 × 10⁻⁵ for all experiments for a fair comparison. M'hamdi et al. perform a manual hyperparameter search over the range [1 × 10⁻⁴, 3 × 10⁻⁴, 1 × 10⁻⁵, 3 × 10⁻⁵] to choose the optimal learning rate based on Dev data split performance. For TyDiQA, those hyperparameters are chosen based on Hu et al. (2020). For MultiATIS++, we perform a manual search over the same learning rate range and find that 3 × 10⁻⁵ performs comparably to other learning rates. So, we fix a learning rate of 3 × 10⁻⁵, with ϵ = 1 × 10⁻⁸, β₁ = 0.9, and β₂ = 0.99 in the optimizer for a fair comparison across all experiments. For the TyDiQA experiments, we find it helpful to use a scheduler with a linearly decaying learning rate. We use batch sizes of 4, 16, and 4 for MTOP, MultiATIS++, and TyDiQA, respectively. For the baseline models (Balanced and Random) and the Leitner-guided ER (LER) model variants, we fix the memory proportion to 20% of the training data from each benchmark. Based on that, we fix the memory size |M| to 10,105, 500, and 500 for the MTOP, MultiATIS++, and TyDiQA experiments, respectively. We also fix the sampling frequency from the memory to every 10 minibatches. For all experiments, we run for a maximum of 10 epochs and pick the best model based on the Dev data split. We use the same seed across all experiments to report the mean results, fixing a seed of 42 for the random initialization of the NumPy, Random, and Torch libraries. All experiments are run on the same computing infrastructure, using one NVIDIA A40 GPU with 46,068 MiB of memory, CUDA version 11.6, and PyTorch version 1.13.1.
¹ github.com/huggingface/transformers, version 3.4.0, pre-trained on 104 languages, including all languages covered in our evaluation.
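A minimal sketch of this optimizer and scheduler configuration is shown below; the model loading call and the number of training steps are placeholders, not the exact training script, and the scheduler is only applied in the TyDiQA experiments as noted above.

```python
# Minimal sketch of the optimizer/scheduler setup described above
# (placeholder training-step count; not the exact training script).
import torch
from transformers import BertModel, get_linear_schedule_with_warmup

model = BertModel.from_pretrained("bert-base-multilingual-cased")

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-5,            # fixed learning rate across experiments
    eps=1e-8,
    betas=(0.9, 0.99),
)

# Linearly decaying learning rate, used for the TyDiQA experiments.
num_training_steps = 10_000  # placeholder; depends on dataset size and epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
```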
D.2 Dataset License
The MTOP dataset has been released by Facebook under the Creative Commons Attribution-ShareAlike 4.0 International Public License. The MultiATIS++ and TyDiQA datasets have been released under the Apache License, which allows use, modification, and distribution of the datasets.
D.3 Runtime
Table D.1 shows the runtime of the different approaches and baselines for a single language order on MTOP. This runtime includes the costs of both training and evaluation. Our LER incurs only about 3 hours more than the No ER approach, with most of the overhead spent calculating the skill rating at the end of each epoch. Table D.2 compares the number of parameters of the models used for the different downstream benchmarks. The task-oriented dialog benchmarks (MTOP and MultiATIS++) require more parameters and are thus more challenging than span-based question answering (TyDiQA).
Model Total Runtime
No ER 6 hrs 23 min 20 sec
Balanced 6 hrs 45 min 22 sec
Random 6 hrs 25 min 25 sec
LER(easy/hard) 9 hrs 23 min 13 sec
Table D.1: Fine-grained runtime analysis per model for a single language order on MTOP.
Model # Parameters
MTOP 178,081,402
MultiATIS++ 178,036,139
TyDiQA 177,264,386
Table D.2: Fine-grained parameter analysis per benchmark.
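To give a sense of where the skill-rating overhead comes from, a generic Leitner-queue update is sketched below. This is a textbook-style illustration under our own simplifying assumptions (the deck count, promotion rule, and review schedule are hypothetical), not the exact LER scheduling used in this thesis.

```python
# Generic Leitner-queue sketch (illustrative; not the exact LER implementation).
NUM_DECKS = 5

def update_decks(deck_of, eval_results):
    """deck_of: item id -> current deck index (0 = reviewed most often).
    eval_results: item id -> True if the model predicted the item correctly
    at the end of the epoch (the skill-rating pass)."""
    for item_id, correct in eval_results.items():
        if correct:
            deck_of[item_id] = min(deck_of[item_id] + 1, NUM_DECKS - 1)  # promote
        else:
            deck_of[item_id] = 0  # demote to the most frequently reviewed deck
    return deck_of

def items_to_replay(deck_of, epoch):
    """Deck d is reviewed every 2**d epochs, so easier items are replayed less often."""
    return [item for item, deck in deck_of.items() if epoch % (2 ** deck) == 0]
```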
Abstract
Cross-lingual transfer learning comprises a set of techniques used to adapt a model trained on (a) source language(s), enabling it to generalize to new target languages. With the emergence of Transformer-based contextualized encoders, there has been a surge in multilingual representations that adapt these encoders to various cross-lingual downstream applications. The surprising zero-shot capabilities of these encoders make them promising substitutes for other fully-supervised techniques, bypassing the need for large-scale annotation. However, these representations are still far from solving the long tail of NLP phenomena, where models are biased towards high-resource languages and languages typologically similar to the source language. This bias can be attributed to the over-reliance of current transfer learning pipelines on what we define as the 'Data-Intensive Identically-Distributed Minimally-Evaluated' paradigm. In other words, current cross-lingual models often need a lot of training data to perform well, lack robustness to different language distribution shifts, and are minimally evaluated, overlooking critical human-like generalization capabilities.
In this thesis, we analyze and propose techniques to advance the capabilities of multilingual language models beyond this traditional paradigm and more toward human-like cross-lingual transfer learning. We achieve that through 1) human-inspired input requirements by using data-efficient few-shot techniques, 2) human-inspired outcomes by defining a cross-lingual learning evaluation paradigm for learning over a continuously evolving data stream of languages, and 3) human-inspired approaches through devising cognitive strategies to consolidate retention of knowledge learned across languages and balance between different cross-lingual capabilities.
Our contributions to advancing the current transfer learning paradigm towards human-like learning are four-fold: 1) We explore cross-lingual fine-tuning on low-resource multilingual applications such as event trigger extraction and semantic search, shedding light on the strengths and limitations of existing cross-lingual transfer learning techniques. 2) We propose language-agnostic meta-learning approaches to bridge the gap between source and target typologically diverse languages. We show the merits of our approaches in reaching quicker and smoother generalization compared to naive fine-tuning, especially under low-resource scenarios. 3) We are the first to define a lifelong learning paradigm that analyzes language shifts. We show the merits and challenges of a multi-phase analysis where the system continually learns over several languages one at a time in multiple phases. 4) We are the first to adapt a cognitively inspired technique based on Leitner-queues to choose what to repeat in a cross-lingual continual learning setup and investigate its impact on reducing the forgetting of previously learned languages while maintaining transfer to new languages.