TOWARDS UNDERSTANDING LANGUAGE
IN PERCEPTION AND EMBODIMENT
by
Hexiang Hu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2021
Copyright 2021 Hexiang Hu
Acknowledgements
This thesis concludes a five-year adventure and a tremendous learning experience in my Ph.D.
study. At the end of this process, I would like to express my gratitude to my fantastic advisor,
mentors, colleagues, friends, and family for their support over the years.
First, I would like to express my sincere thanks to Prof. Fei Sha for his patience, guidance, and
continuous support throughout my Ph.D. training. Fei is intelligent, knowledgeable, and inspiring.
Among all pieces of his wisdom and advice, what I treasure most are the constant striving for
improvement and growth, planning, hard work, clear communication, and the courage to be honest
with ourselves and our work. I started at UCLA facing the critical decision between staying there
and transferring with him. His enthusiasm persuaded me to follow his move, which proved to be
the best decision and made me the very person I am today. Throughout my study afterward, Fei has
graciously helped me expand my knowledge and develop my skills in machine learning from the
ground up. I also appreciate his immense effort to create an excellent environment for open-ended
fundamental research, where research in a broad range of fields is encouraged, including machine
learning, computer vision, and natural language processing. Within such an environment, I am
privileged to focus on the research I am passionate about instead of worrying about anything else,
because Fei always has my back.
I would also like to thank Prof. C.-C. Jay Kuo, Prof. Jesse Thomason, Prof. Robin Jia, and
Prof. Joseph Lim for kindly serving on my dissertation proposal and defense committees and for
providing valuable suggestions for improving this dissertation.
I thank Prof. Greg Mori, Prof. Zicheng Liao, Dr. Guang-Tong Zhou, and Dr. Zhiwei Deng for
their inspirational mentorship. They prepared an aspiring undergraduate student like me with the
knowledge, skills, and mindset necessary for an artificial intelligence researcher. Without their
help on my first CVPR project, I would not have started pursuing graduate study.
During my Ph.D. study, I was incredibly fortunate to have many internship opportunities
working with top AI researchers: Dr. R. Manmatha and Dr. Deva Ramanan at Amazon AI
Research, Palo Alto; Dr. Laurens van der Maaten and Dr. Ishan Misra at Facebook AI Research,
New York City; Prof. Sergey Levine at Berkeley; Dr. Vladlen Koltun and Dr. Ozan Sener at Intel
Labs, Santa Clara; Dr. Eugene Ie, Vihan Jain, Dr. Joonseok Lee, Dr. Ming Zhao, and Peter Shaw
at Google. These precious academic and industrial internship experiences combined to help me
grow into a mature researcher.
I thank both past and present ShaLab members for their generous help and support in research
and life. In particular, I would like to thank Harry Chao, Soravit (Beer) Changpinyo, and Zhiyun Lu
for the vital help in surviving my first year and making a smooth transition from UCLA to USC. My
collaboration with Harry and Beer taught me how to efficiently define research problems, study
methods, manage experiments, and communicate effectively via oral and written presentations.
During these experiences, I learned the Trojan spirit of never giving up and always fighting on. I
thank Zhiyun for her candid advice about work and life, and later about the job search. Moreover,
I would like to give special thanks to all the junior collaborators, i.e., Bowen Zhang, Liyu Chen,
Han-Jia Ye, Jiacheng Chen, Wang Zhu, and Linlu Qiu, for their hard work during our collaborations.
Additionally, I want to thank Bowen Zhang for his extensive help with the lab's computing resource
management. I would also like to thank Aaron Chan, Boqing Gong, Chin-Cheng Hsu,
Chao-Kai Chiang, Ivy Xiao, Ke Zhang, Melissa Ailem, Michiel de Jong, Robby Costales, Séb
Arnold, Shariq Iqbal, Yiming Yan, Yury Zemlyanskiy, and Prof. Marius Kloft for their discussions,
inspiration, encouragement, and care. I am astonishingly grateful to have such gifted and kind
people around me.
Last but most importantly, this thesis is dedicated to my family for their unconditional love
and support, given while expecting nothing in return. My parents are lifelong role models who always
stay humble, work hard, support me, and care about me. I could never give back enough. Finally, I
thank my beloved significant other, Tianqi, for many things that words cannot faithfully convey. I
hope this thesis has made them proud.
Table of Contents
Acknowledgements ii
List of Tables viii
List of Figures xiii
Abstract xviii
Chapter 1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Learning Grounded Concept in Perception . . . . . . . . . . . . . . . . . 3
1.2.2 Learning Intent of Language with Embodied Experiences . . . . . . . . . 4
1.2.3 Learning with Limited and Growing Data . . . . . . . . . . . . . . . . . 5
1.3 Published Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
I LEARNING LANGUAGE IN PERCEPTION 7
Chapter 2 Visual Grounded Concept Learning with Denotation Graph 8
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Denotation Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Learning with Denotation Graphs . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Matching Texts with Images . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Learning to Be More Specific . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.3 Learning to Predict Structures . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.4 The Final Learning Objective . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.3 Zero/Few-Shot and Transfer Learning . . . . . . . . . . . . . . . . . . . 17
2.5.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.5 Image Retrieval from Abstract Concepts . . . . . . . . . . . . . . . . . . 20
Chapter 3 Hierarchical Modeling of Video and Text 21
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Settings and notations . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Flat sequence modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Hierarchical sequence modeling . . . . . . . . . . . . . . . . . . . . . . 25
3.3.4 Final learning objective and its extensions . . . . . . . . . . . . . . . . . 27
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Experiment Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.2 Results on Video-Paragraph Retrieval . . . . . . . . . . . . . . . . . . . 30
3.4.3 Results on Video Captioning . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.4 Results on Action Recognition . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.5 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
II LEARNING LANGUAGE IN EMBODIED EXPERIENCES 38
Chapter 4 Synthesized Policy for Compositional Generalization 39
4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Synthesized Policies with Better Generalization . . . . . . . . . . . . . . . . . . 41
4.3.1 Problem Statement and Main Idea . . . . . . . . . . . . . . . . . . . . . 41
4.3.2 Policy Factorization and Composition . . . . . . . . . . . . . . . . . . . 43
4.3.3 Disentanglement of the Embeddings for Environments and Tasks . . . . 44
4.3.4 Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.5 Transfer to Unseen Environments and Tasks . . . . . . . . . . . . . . . . 45
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4.2 Experimental Results on GRIDWORLD . . . . . . . . . . . . . . . . . . . 47
4.4.3 Experimental Results on THOR . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 5 Babywalk Agent for Generalization across Task Horizons 52
5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.4 Learning Policy that Takes Babystep . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4.1 The BABYWALK Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4.2 Learning of the BABYWALK Agent . . . . . . . . . . . . . . . . . . . . 58
5.4.3 New Datasets for Evaluation & Learning . . . . . . . . . . . . . . . . . 59
5.4.4 Key Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . 60
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.5.1 Experimental Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5.2 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.5.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.5.4 Revisiting ROOM2ROOM . . . . . . . . . . . . . . . . . . . . . . . . . 65
III LEARNING WITH LIMITED AND GROWING DATA 67
Chapter 6 Few-shot and Generalized Few-shot Learning 68
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 7 Few-shot Learning by Embedding Adaptation 72
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.3 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
7.4 Embedding Adaptation for Task-specific FSL . . . . . . . . . . . . . . . . . . . 76
7.4.1 Adapting to Task-Specific Embeddings . . . . . . . . . . . . . . . . . . 76
7.4.2 Embedding Adaptation using Neural Networks . . . . . . . . . . . . . . 78
7.4.3 Contrastive Learning of Intra-Class and Inter-Class Relation . . . . . . . 79
7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.5.1 Experimental Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.5.2 Standard Few-Shot Image Classification . . . . . . . . . . . . . . . . . . 82
7.5.3 Extended Few-Shot Learning Tasks . . . . . . . . . . . . . . . . . . . . 87
Chapter 8 Generalized Few-shot Learning with Classifier Synthesis 89
8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.3 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.4 Learning Adaptive Classifier Synthesis . . . . . . . . . . . . . . . . . . . . . . . 94
8.4.1 Classifier Composition with a Neural Dictionary . . . . . . . . . . . . . 94
8.4.2 Unified Learning of Few-Shot and Many-Shot Classifiers . . . . . . . . . 96
8.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.5.1 Experimental Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.5.2 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
8.5.3 Pivot Study on Multi-Domain GFSL . . . . . . . . . . . . . . . . . . . . 101
8.5.4 Experiments on GFSL . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
IV CONCLUSION 112
Chapter 9 Conclusion 113
9.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.2.1 Leverage Structure of Language for Modeling Visual Concepts. . . . . . 114
9.2.2 Learning Language Representation with Visual Supervision. . . . . . . . 115
Bibliography 116
V APPENDICES 134
Appendix A
Details and Additional Experiments for Chapter 2 . . . . . . . . . . . . . . . . . . . . 135
A.1 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 135
A.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Appendix B
Details and Additional Experiments for Chapter 4 . . . . . . . . . . . . . . . . . . . . 141
B.1 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.1.1 Details on simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
B.1.2 Imitation Learning Algorithm and Optimization Details . . . . . . . . . . 143
B.1.3 Reinforcement Learning Algorithm and Optimization Details . . . . . . 143
B.1.4 Detailed Configuration of Methods . . . . . . . . . . . . . . . . . . . . 144
B.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Appendix C
Details and Additional Experiments for Chapter 5 . . . . . . . . . . . . . . . . . . . . 154
C.1 Details on BABY-STEP Identification and Trajectory Alignments . . . . . . . . . 154
C.1.1 Identify BABY-STEPs . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
C.1.2 Align Expert Trajectories with identified BABY-STEPs . . . . . . . . . . 155
C.2 Additional Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . 157
C.2.1 Navigation Agent Configurations . . . . . . . . . . . . . . . . . . . . . 157
C.2.2 Details of Reward Shaping for RL . . . . . . . . . . . . . . . . . . . . . 158
C.2.3 Optimization Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . 158
C.3 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Appendix D
Details and Additional Experiments for Chapter 7 . . . . . . . . . . . . . . . . . . . . 166
D.1 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 166
D.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Appendix E
Details and Additional Experiments for Chapter 8 . . . . . . . . . . . . . . . . . . . . 176
E.1 Additional Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . 176
E.2 Additional Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 178
List of Tables
2.1 Key statistics of the two DGs: averaged over all nodes in the graph, internal
nodes, and leaf nodes (formatted as all/internal/leaf) . . . . . . . . . . . . . . . . 12
2.2 Image-Text Retrieval Results on Different Datasets . . . . . . . . . . . . . . . . 17
2.3 Transfer Learning Result for Text-based Image Retrieval . . . . . . . . . . . . . 18
2.4 Zero/Few-shot Learning for Referring Expression . . . . . . . . . . . . . . . . . 18
2.5 Image Recognition on UNSEEN Attribute-Object Pairs on the Mit-State Dataset 18
2.6 Ablation Studies of Learning from DG . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Video paragraph retrieval on ActivityNet (val1). Standard deviations from 3 randomly
seeded experiments are also reported. . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Video paragraph retrieval on the DiDeMo dataset. The S2VT method is re-implemented
for the retrieval task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Ablation studies on the weak alignment learning objective. . . . . . . . . . . . . 31
3.4 Performance of using proposal instead of ground truth on ActivityNet dataset . . 32
3.5 Ablation study on the learning objectives. . . . . . . . . . . . . . . . . . . . . . 32
3.6 Ablation study on ActivityNet (val2). . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 ActivityNet video captioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 ActivityNet action recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Comparison of methods on GRIDWORLD with SEEN/UNSEEN ratio of 144/256. . 48
4.2 Performance of transfer learning in the settings 2 and 3 on GRIDWORLD . . . . . 49
4.3 Comparison of methods on THOR with SEEN/UNSEEN ratio of 144/199 . . . . . 50
5.1 Statistics of datasets used for VLN learning and evaluation . . . . . . . . . . . . 60
5.2 VLN agents trained on the R4R dataset and evaluated on the unseen portion of
the R4R (in-domain) and the other 3 out-of-domain datasets: R2R, R6R and
R8R with different distributions in instruction length (+: pre-trained with data
augmentation; ⋆: reimplemented or adapted from the original authors' public
codes). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 The memory buffer is beneficial for generalizing to tasks different from the one on
which the agent is trained. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4 BABYWALK's performance with curriculum-based reinforcement learning (CRL),
which improves upon imitation learning both without and with reinforcement learning (IL+RL). 65
5.5 (Top) BABYWALK trained on R2R is nearly as effective as the agent trained on
R4R when generalizing to longer tasks. (Bottom) BABYWALK trained on R2R
adapts to R4R better than the agent trained in the reverse direction. . . . . . . . . 65
7.1 Few-shot classification accuracy and 95% confidence interval on MiniImageNet with
ConvNet and ResNet backbones. Our implementation methods are measured over
10,000 test trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.2 Few-shot classification accuracy and 95% confidence interval on TieredImageNet
with the ResNet backbone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.3 Few-shot classification performance with ConvNet backbone on the CUB dataset
(mean accuracy ± 95% confidence interval). Our implementation methods are
measured over 10,000 test trials. . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.4 Number of parameters introduced by each set-to-set function in addition to the
backbone's parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.5 We evaluate our model on three additional few-shot learning tasks: (a) multi-
domain few-shot learning, (b) transductive few-shot learning, and (c) generalized few-
shot learning. We observe that FEAT consistently outperforms all previous methods
and baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.1 Generalized 1-shot classification performance (mean accuracy and harmonic mean
accuracy) on (a) the Heterogeneous dataset with 100 Head and 5 Tail categories
and (b) the Office-Home dataset with 25 Head and 5 Tail categories. S→S∪U
and U→S∪U denote the joint classification accuracy for SEEN-class and
UNSEEN-class instances, respectively. The CASTLE variant without the neural
dictionary is also reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.2 Generalized few-shot classification performance (mean accuracy, Δ-value, and
harmonic mean accuracy) on MiniImageNet when there are 64 Head and 5 Tail
categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3 Generalized Few-shot classification accuracies on MiniImageNet with 64 head
categories and 20 tail categories. . . . . . . . . . . . . . . . . . . . . . . . . . 107
8.4 Generalized Few-shot classification accuracy on TieredImageNet with 351 head
categories and 160 tail categories. . . . . . . . . . . . . . . . . . . . . . . . . 107
8.5 Few-shot classification accuracy on MiniImageNet with different types of back-
bones. Our methods are evaluated with 10,000 few-shot tasks. . . . . . . . . . . 110
8.6 Few-shot classification accuracy on TieredImageNet with different types of back-
bones. Our methods are evaluated with 10,000 few-shot tasks. . . . . . . . . . . 110
A.1 Text-based Image Retrieval Performance of ViLBERT trained with different num-
ber of DG levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
A.2 Results on Cross-Modal Retrieval on COCO dataset 1K test split (Higher is better) 138
A.3 Results on Cross-Modal Retrieval on COCO dataset 5K test split (Higher is better) 139
A.4 Results on Text-based Image Retrieval on Flickr test split (Higher is better) . . . 139
A.5 Transferrability of the learned representations . . . . . . . . . . . . . . . . . . . 140
B.1 interactable objects, receptacles and environment indexes in THOR . . . . . . . . 142
B.2 Structure of the state feature function in GRIDWORLD . . . . . . . . . . . . . . 144
B.3 Structure of the state feature function in THOR . . . . . . . . . . . . . . . . . . 145
B.4 Performance of the best model for each method on GRIDWORLD (Seen/Un-
seen=144/256). All algorithms are trained using three random seeds and reported
with mean and std. For each (environment, task) pair, we sample the locations of the
agent and treasures 100 times to evaluate the performance. . . . . . . . . . . . . 145
B.5 Performance of SynPo, MTL and MLP on GRIDWORLD (SEEN/UNSEEN=144/256)
with window size = 0. All algorithms are trained using three random seeds
and reported with mean and std. . . . . . . . . . . . . . . . . . . . . . . . . . . 153
C.1 Sanity check of models trained on R2R and evaluated on its validation unseen split
(+: pre-trained with data augmentation; ⋆: reimplemented or readapted from the
original authors' released code). . . . . . . . . . . . . . . . . . . . . . . . . . . 160
C.2 Ablation on BABYWALK after each learning stage (trained on R4R). . . . . . . . 161
C.3 BABYWALK Agent performances between different segmentation rules (trained
on R4R). Refer to text for more details. . . . . . . . . . . . . . . . . . . . . . . 162
C.4 In-domain results. Each model is trained on the training set of the R2R, R4R, R6R
and R8R datasets, and evaluated on the corresponding unseen validation set (+:
pre-trained with data augmentation). . . . . . . . . . . . . . . . . . . . . . . . . 163
C.5 Transfer results of R2R- and R4R-trained models evaluated on their complementary un-
seen validation datasets (+: pre-trained with data augmentation; ⋆: reimplemented
or readapted from the original authors' released code). . . . . . . . . . . . . . . 164
C.6 Transfer results of R6R- and R8R-trained models evaluated on their complementary un-
seen validation datasets (+: pre-trained with data augmentation; ⋆: reimplemented
or readapted from the original authors' released code). . . . . . . . . . . . . . . 165
D.1 Few-shot classification accuracy and 95% confidence interval on MiniImageNet with
ConvNet and ResNet backbones. Our implementation methods are measured over
10,000 test trials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
D.2 Few-shot classification performance with the Wide ResNet (WRN)-28-10 backbone
on the MiniImageNet dataset (mean accuracy ± 95% confidence interval). Our imple-
mentation methods are measured over 10,000 test trials. . . . . . . . . . . . . . . 168
D.3 Few-shot classification performance with the Wide ResNet (WRN)-28-10 backbone
on the TieredImageNet dataset (mean accuracy ± 95% confidence interval). Our imple-
mentation methods are measured over 10,000 test trials. . . . . . . . . . . . . . . 169
D.4 Few-shot classification performance with the ConvNet backbone on the CUB dataset
(mean accuracy ± 95% confidence interval). Our implementation methods are
measured over 10,000 test trials. . . . . . . . . . . . . . . . . . . . . . . . . . . 169
D.5 Ablation studies on whether the embedding adaptation improves the discerning
quality of the embeddings. After embedding adaptation, FEAT improves substantially
over the pre-adaptation embeddings for few-shot classification. . . . . . . . . . . 170
D.6 Ablation studies on the position to average the same-class embeddings when there
are multiple shots per class in FEAT (tested on the 5-Way tasks with different
numbers of shots). “Pre-Avg” and “Post-Avg” means we get the embedding center
for each class before or after the set-to-set transformation, respectively. . . . . . . 170
D.7 Ablation studies on the number of heads in the Transformer of FEAT (with the number
of layers fixed to one). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
D.8 Ablation studies on the number of layers in the Transformer of FEAT (with the number
of heads fixed to one). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
D.9 Ablation studies on effects of the contrastive learning of the set-to-set function on
FEAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
D.10 Ablation studies on the prediction strategy (with cosine similarity or euclidean
distance) of FEAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
D.11 Cross-Domain 1-shot 5-way classification results of the FEAT approach. . . . . . 173
D.12 Results of models for transductive FSL with ConvNet backbone on MiniImageNet.
We cite the results of Semi-ProtoNet and TPN from [181] and [173], respectively.
For TEAM [173], the authors do not report the confidence intervals, so we set
them to 0.00 in the table. FEAT† and FEAT‡ adapt embeddings with the joint set of
labeled training and unlabeled test instances, and make predictions via ProtoNet
and Semi-ProtoNet, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 174
D.13 The top-5 low-shot learning accuracy over all classes on the large scale Ima-
geNet [184] dataset (w/ ResNet-50). . . . . . . . . . . . . . . . . . . . . . . . . 175
E.1 The difference between training with a pre-trained backbone or from scratch
with 1-Shot 5-Way Tasks on MiniImageNet. “MA” and “HM” denote the Mean
Accuracy and Harmonic Mean Accuracy, respectively. . . . . . . . . . . . . . . 179
E.2 Comparison between CASTLE variants and the incremental learning methods on
MiniImageNet. The harmonic mean accuracy in different evaluation scenarios are
recorded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
E.3 The light-weight model adaptation by fine-tuning the scale and bias weights based
on the classifier initialization from CASTLE variants. The harmonic mean accuracies
in different evaluation scenarios on MiniImageNet are recorded. The superscript †
denotes the method with another light-weight update step. . . . . . . . . . . . . 180
E.4 The performance with different choices of classifier synthesis strategies when
tested with 5-Shot 5-Way UNSEEN tasks on MiniImageNet. We denote the options of
computing the embedding prototype first and of averaging the synthesized classifiers as
"Pre-AVG" and "Post-AVG", respectively. . . . . . . . . . . . . . . . . . . . . . 180
E.5 The GFSL performance (harmonic mean accuracy) change with different num-
bers of classifiers (# of CLS) when tested with 1-Shot 5-Way UNSEEN tasks on
MiniImageNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
E.6 The performance gap between CASTLE variants and a kind of “many-shot” upper
bound (denoted as “UB”) on MiniImageNet. The ability of FSL classification is
measured by the mean accuracy, while the harmonic mean accuracy is used as a
criterion for GFSL. 5-Shot classification performance of CASTLE and ACASTLE
are listed for a comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
List of Figures
1.1 Illustrative photo of the black sheep population in the real world (photo credit [69]). . 2
1.2 Different types of Generalization in Instruction-conditioned Embodied Agents. . 4
2.1 (Left) A schematic example of a denotation graph showing the hierarchical organi-
zation of linguistic expressions. (Right) A random subgraph of the denotation
graph extracted from the Flickr30K dataset, with images attached to concepts at
different levels of the hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Image Retrieval using Mid-level Linguistic Expression on Flickr Denotation Graph. The
results are reported in Mean Average Precision (Mean AP). . . . . . . . . . . . . . . 20
3.1 Conceptual diagram of our approach for cross-modal modeling of videos and texts. The
main idea is to embed both the low-level units (clips and sentences) and the high-level
units (videos and paragraphs) in their own semantic spaces coherently. As shown in the figure, the 3
sentences (and the corresponding 3 clips) are mapped into a local embedding space
where the corresponding pairs of clips and sentences are placed close to each other. As a
whole, the videos and the paragraphs are mapped into a global semantic space where their
embeddings are close. See Fig. 3.3 and the text for details. . . . . . . . . . . . . . 22
3.2 Flat sequence modeling of videos and texts, ignoring the hierarchical structures in either
and regarding the video (paragraph) as a sequence of frames (words). . . . . . . . . . 24
3.3 Hierarchical cross-modal modeling of videos and texts. We differ from previous works [128,
165] in two aspects (components in red color): layer-wise reconstruction through decoders,
and matching at both global and local levels. See texts for details. . . . . . . . . . . . 25
3.4 Recall vs Rank curves of Video to Paragraph and Paragraph to Video retrieval of
both HSE[=0] and HSE. All results are collected from models based on InceptionV3
feature on ActivityNet validation set 1. . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Retrieval performance improves given more observed clips/sentences. . . . . . . 34
3.6 T-SNE visualization of off-the-shelf video embeddings of HSE on the ActivityNet v1.3
training and validation sets. Points are marked with their action classes. . . . . . . 37
4.1 We consider a transfer learning scenario in reinforcement learning that involves
transfer in both task and environment. Three different settings are presented here
(see text for details). The red dots denote SEEN combinations, gray dots denote
UNSEEN combinations, and arrows (→) denote transfer directions. . . . . . . . . 40
4.2 Overview of our proposed model. Given a task and an environment, the corre-
sponding environment and task embeddings are retrieved to compose the policy coefficients
and reward coefficients. Such coefficients then linearly combine the shared basis
and synthesize a policy (and a reward prediction) for the agent. . . . . . . . . . . 43
4.3 From left to right: (a) Some sample mazes of our GRIDWORLD dataset. They
are similar in appearance but different in topology. Demonstrations of an agent’s
egocentric views of (b) GRIDWORLD and (c) THOR. . . . . . . . . . . . . . . . . 46
4.4 On GRIDWORLD. Averaged success rate (AvgSR) on SEEN pairs and UNSEEN
pairs, respectively. Results are reported with |E| = 20 and |T| = 20. We report
mean and std based on 3 training random seeds. . . . . . . . . . . . . . . . . . . 48
4.5 (a) Transfer learning performance (in AvgSR) with respect to the ratio # SEEN
pairs / # TOTAL pairs, with |E| = 10 and |T| = 10. (b) Reinforcement learning
performance on unseen pairs of different approaches (with PPO [192]). MLP
overfits, MTL improves slightly, and SYNPO achieves 96.16% AvgSR. . . . . . . 48
4.6 Transfer results of settings 2 and 3. AvgSRs are marked in the grid (see Suppl.
Materials for more visually discernible plots). The tasks and environments in the
purple cells are from the unseen Q set and the red cells correspond to the rest.
Darker color means better performance. It shows that cross-task transfer is easier
than cross-environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1 Performance of various VLN agents on generalizing from shorter navigation tasks
to longer ones. The vertical axis is the newly proposed path-following metric
SDTW [148]; the higher the better. BABYWALK generalizes better than other
approaches across different lengths of navigation tasks. Meanwhile, it gets very
close to the performance of the in-domain agents (the dashed line). Please refer
to the text for details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 The BABYWALK agent has a memory buffer storing its past experiences of
instructions x_m and its trajectories ŷ_m. When a new BABY-STEP x_m is presented,
the agent retrieves from the memory a summary of its experiences as the history
context. It takes actions conditioned on the context (as well as its state s_t and the
previous action â_t). Upon finishing following the instruction, the trajectory ŷ_m is
then sent to the memory to be remembered. . . . . . . . . . . . . . . . . . . . . 56
5.3 Two-phase learning by BABYWALK. (Left) An example instruction-trajectory
pair from the R4R dataset is shown. The long instruction is segmented into
four BABY-STEP instructions. We use those BABY-STEPs for imitation learning
(§5.4.2.1) (Right) Curriculum-based RL. The BABYWALK agent warm-starts from
the imitation learning policy, and incrementally learns to handle longer tasks by
executing consecutive BABY-STEPs and getting feedback from external rewards
(c.f . §5.4.2.2). We illustrate two initial RL lectures using the left example. . . . . 58
5.4 The distribution of lengths of the instructions and trajectories. . . . . . . . . . . 60
5.5 Performance by various agents on navigation tasks in different lengths. See texts
for details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Trajectories by human experts and VLN agents on two navigation tasks. . . . . . 64
7.1 Qualitative visualization of the model-based embedding adaptation procedure (im-
plemented using FEAT) on test tasks (refer to §7.5.2.2 for more details). Each
figure shows the locations of PCA-projected support embeddings (class prototypes)
before and after the adaptation of FEAT. Values below are the 1-shot 5-way classi-
fication accuracy before and after the adaptation. Interestingly, the embedding
adaptation step of FEAT pushes the support embeddings apart from the clutter and
toward their own clusters, such that they better fit the test data of their categories. . 73
7.2 Illustration of the proposed Few-Shot Embedding Adaptation Transformer (FEAT).
Existing methods usually use the same embedding function E for all tasks. We
propose to adapt the embeddings to each target few-shot learning task with a
set-to-set function such as Transformer, BiLSTM, DeepSets, and GCN. . . . . . 74
7.3 Interpolation and extrapolation of few-shot tasks from the "way" perspective.
First, we train various embedding adaptation models on 1-shot 20-way (a) or
5-way (b) classification tasks and evaluate the models on unseen tasks with different
numbers of classes (N = {5, 10, 15, 20}). It shows that FEAT is superior in terms of
way interpolation and extrapolation ability. . . . . . . . . . . . . . . . . . . . . 85
7.4 Qualitative results of few-shot domain generalization for FEAT. Correctly clas-
sified examples are shown in red boxes and incorrectly classified ones are shown in blue
boxes. We visualize one task where FEAT succeeds (top) and one where it fails (bottom). 87
8.1 A conceptual diagram comparing Few-Shot Learning (FSL) and Gen-
eralized Few-Shot Learning (GFSL). GFSL requires extracting inductive biases
from SEEN categories to facilitate efficient learning on few-shot UNSEEN tail
categories, while maintaining discernibility on the head classes. . . . . . . . . . . 90
8.2 Illustration of CASTLE and ACASTLE. In CASTLE (left), the synthesized few-shot
classifiers are directly unioned with the many-shot classifiers to make the joint
prediction. In contrast, ACASTLE (right) synthesizes the joint classifiers,
using both the many-shot and few-shot classifiers as queries to the neural dictionary.
This ensures backward knowledge transfer, where many-shot head classifiers
co-adapt to the few-shot tail classifiers. . . . . . . . . . . . . . . . . . . . . . . 93
8.3 The split of data in the generalized few-shot classification scenario. In addition to
the standard dataset like MiniImageNet (blue part), we collect non-overlapping
augmented head class instances from the corresponding categories in the ImageNet
(red part), to measure the classification ability on the SEEN classes. Then in the
generalized few-shot classification task, few-shot instances are sampled from each
of the UNSEEN classes, while the model should have the ability to predict instances
from both the head and tail classes. . . . . . . . . . . . . . . . . . . . . . . . . . 97
8.4 An illustration of the harmonic-mean-based criterion for GFSL evaluation. S and
U denote the SEEN and UNSEEN instances (x) and labels (y), respectively. S∪U
is the joint set of S and U. The notation X→Y, with X, Y ∈ {S, U, S∪U}, means
computing prediction results from instances of X to labels of Y. By computing
a performance measure (like accuracy) on the joint label space prediction of SEEN
and UNSEEN instances separately, a harmonic mean is computed to obtain the final
measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.5 An illustration of the Heterogeneous and Office-Home datasets. Both datasets
contain multiple domains. In the Heterogeneous dataset, each class belongs to
only one domain, while in Office-Home, a class has images from all three domains. . 101
8.6 Calibration's effect on the 1-shot harmonic mean accuracy on MiniImageNet.
Baseline models improve a lot with the help of the calibration factor. . . . . . . . 109
8.7 The 1-shot AUSUC performance with two configurations of UNSEEN classes on
MiniImageNet. The larger the area under the curve, the better the GFSL ability. . 109
8.8 Results of 1-shot GFSL harmonic mean accuracy with incremental number of
UNSEEN classes on MiniImageNet. Note MC+kNN and MC+ProtoNet bias
towards SEEN classes and get nearly zero harmonic mean accuracy. . . . . . . . . 109
8.9 Post-calibrated results of 1-shot GFSL harmonic mean accuracy with incremental
number of UNSEEN classes on MiniImageNet. All methods select their best
calibration factors from the meta-val data split. . . . . . . . . . . . . . . . . . . 109
9.1 Illustration of the graph we envisioned to guide the modeling of visual concepts. . 114
9.2 Illustration of using visual data as supervision for language learning [211]. . . . . 115
A.1 Architecture of (a) ViLBERT, (b) UNITER. The ⊗ denotes element-wise product.
The [CLS] represents the embedding of the [CLS] token in the last UNITER layer. . 137
B.1 Demonstrations of the agent's view in two simulators. On the left, we present the
agent's input state in GRIDWORLD. An agent only has vision of its surround-
ing context and the locations of all treasures (see (a)). Similarly, in THOR, an
agent has access to an egocentric image that represents the first-person viewpoint
(see (b)). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
B.2 Results on GRIDWORLD. (a)-(b): Comparison between average success rate
(ASR.) of algorithms on seen split and unseen split. (c)-(d): Comparison between
average accumulated reward (AvgReward.) of algorithms in each episode on seen
split and unseen split. Results are reported on the setting with |E| = 20 and
|T| = 20. For each intermediate performance, we sample 100 (environment, task) combinations
and test one configuration to evaluate the performances. We evaluate models
trained with 3 random seeds and report results in terms of the mean AvgSR and its
standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
B.3 An ablation study about our learning objectives. We report the results of
the ablated versions without the disentanglement loss (Disentg) on environment
(EnvDisentg) and on task (TaskDisentg). (a)-(b): Comparison between average
success rate (ASR.) of algorithms on SEEN split and UNSEEN split. (c)-(d):
Comparison between average accumulated reward (AvgReward.) of algorithms in
each episode on SEEN split and UNSEEN split. Results are reported on the setting
with |E| = 20 and |T| = 20. Similarly, for each intermediate performance, we
sample 100 (environment, task) combinations to evaluate the performances. . . . . . 147
B.4 Average test success rate on each environment-task combination. Blue grids
represent seen combinations and red grids represent unseen combinations . . . . 148
B.5 Average test success rate on each environment-task combination. Blue grids
represent seen combinations and red grids represent unseen combinations . . . . 149
B.6 Case study for a situation when the ratio of # of combinations seen and the total is
0.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
B.7 Results of “A blind agent scenario” on GRIDWORLD with window size of 0.
(a)-(b): Comparison between average success rate (ASR.) of algorithms on seen
split and unseen split. (c)-(d): Comparison between average accumulated reward
(AvgReward.) of algorithms in each episode on seen split and unseen split. Results
are reported on the setting with |E| = 20 and |T| = 20. For each intermediate
performance, we sample 100 (environment, task) combinations and test one configuration to
evaluate the performances. We evaluate models trained with 3 random seeds and
report results in terms of the mean AvgSR and its standard deviation. . . . . . . 151
B.8 Visualizing the effectiveness of transferring. Average success rates are marked in the grid
(more visually discernible plots are in the Suppl. Materials). The purple cells are from the Q
set and the red cells represent the rest. The darker the color, the better the corresponding
performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
C.1 Our network architecture at the m-th BABY-STEP sub-task. The red line represents the
procedure of encoding the context variable z_m via summarizing the BABY-STEP tra-
jectory f_SUMMARY(v(ŷ_1), ..., v(ŷ_{m-1})) and the corresponding (micro)instruction
f_SUMMARY(u(x_1), ..., u(x_{m-1})) in the memory buffer. The blue line represents the
procedure of encoding the (micro)instruction u(x_m) of the current BABY-STEP.
The purple line represents the detailed decision-making process of our BABYWALK
policy (A_{s_t} is denoted as the set of navigable directions at s_t as defined by Fried
et al. [55]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
E.1 The 1-shot 5-way accuracy on UNSEEN of MiniImageNet with different size of
dictionaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
E.2 The 64-way multi-class accuracy on SEEN of MiniImageNet with 1-shot trained
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Abstract
The human ability to understand language is general, flexible, and, more importantly, grounded in the
physical world. We digest natural language not by looking at the co-occurrence statistics of
words in sentences, but by associating its meaning with the corresponding situation and interacting
accordingly within the physical environment. Language learning requires going beyond text. In
particular, building intelligent agents that understand the meaning of language requires access to
the multi-modal and physical world.
Towards this goal, this thesis describes techniques to understand visually grounded concepts
and follow instructions in embodied environments. Specifically, we present three primary
research directions. The first part of this thesis proposes learning the concepts described by
language from perception data, by developing models that associate words, phrases, sentences,
and paragraphs with both the static and the temporally extended visual world. Building on that, the
second part focuses on learning the underlying intent of language instructions and proposes models
that execute the instructions faithfully in dynamic visual environments, to achieve substantial
generalization performance. Finally, the third part studies more realistic and challenging learning
situations, developing methods to handle learning from long-tailed and growing data
distributions. In all three parts, we conduct extensive empirical studies on multiple
large-scale datasets and demonstrate the superior performance of the proposed models and learning
algorithms.
Chapter 1
Introduction
1.1 Background
Language is the hallmark of human intelligence, distinguishing us from animals [171]. Developing
intelligent agents that can understand human language is considered one of the most challenging
tasks in the pursuit of Artificial General Intelligence [185]. This motivates the research community to
innovate theories and algorithms for natural language learning. Although still far from
perfect, tremendous progress has to date been made in natural language processing (NLP).
The last decade has witnessed the success of deep learning [71, 113], which has driven a large body
of NLP research to focus on learning distributional representations of words and sentences.
These efforts first led to structured word embedding models [153, 168] that gracefully cap-
ture the co-occurrence statistics of words. Continuing this trajectory, more research [107, 169] was
done to learn contextualized word representations with bidirectional language modeling, using deep
sequence models [75]. Such models show promising transfer learning results on sentence-level
NLP tasks across multiple text domains, such as question answering [116, 179] and natural language
inference [19, 241]. More recently, the advent of large-scale datasets [39, 271] and new hardware
platforms (e.g., high-capacity GPUs & TPUs) further opened up opportunities for scaling up
language learning to an unprecedented level. As a result, massive models [220] trained on massive
data have emerged [44, 132, 177, 178], providing superior text representations that empower super-human
performance over a diverse set of downstream natural language processing tasks [231, 232]. With
various models exceeding human performance on almost every NLP task, a key question
comes to the researcher's plate:
Where is natural language learning going next?
Towards answering this question, we first introduce a phenomenon of language modeling,
which we refer to as the "black sheep problem" [42], initially described as "reporting bias"
by Ben van Durme in his dissertation [219]. Concretely, this problem describes the following
situation: in a language such as English, the phrase "black sheep" outnumbers "white sheep" by
about 25:1 (in other languages such as French and German, this ratio is 3:1 and 12:1, respectively),
which biases the typical language learners [19, 107, 153, 168, 169, 241], who rely completely on
modeling word co-occurrence, to believe that the word "black" better describes the concept "sheep"
than the word "white". However, common sense knowledge reminds us that there are far
more white sheep than black sheep in reality! (see Figure 1.1)
Figure 1.1: Illustrative photo of the black sheep population in the real world (photo credit [69]).
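To make the reporting bias concrete, consider the minimal sketch below (in Python). It contrasts the bigram counts a purely text-driven learner would extract from a toy corpus with hypothetical counts of what a camera would actually observe in a field; both the corpus and the field counts are invented for this illustration and are not taken from the cited studies.

from collections import Counter

# A toy text corpus: people rarely mention a sheep's color unless it is unusual.
corpus = [
    "the black sheep of the family showed up",
    "he is the black sheep among his siblings",
    "a black sheep ruins every party",
    "we saw a white sheep in the field",
]

# What a text-only learner observes: color-"sheep" bigram counts.
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for a, b in zip(tokens, tokens[1:]):
        if b == "sheep" and a in ("black", "white"):
            bigrams[(a, b)] += 1

# What a perception-grounded learner would observe (hypothetical field counts).
world_counts = {("black", "sheep"): 1, ("white", "sheep"): 40}

print("text co-occurrence:", dict(bigrams))   # biased toward "black sheep"
print("real-world counts :", world_counts)    # dominated by "white sheep"

The point of the sketch is simply that the two tables disagree: the corpus statistics favor "black sheep" while the world overwhelmingly contains white sheep, which is exactly the gap that grounded language learning aims to close.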
This "black sheep problem" reveals an important limitation of those state-of-the-art language
learning methods: the co-occurrence frequencies of words do not reflect the co-occurrence frequen-
cies of things in the real world! It provides an excellent example in support of the central argument
of this dissertation, which is that machine learning cannot learn the meaning of language
from text alone [15, 18]. Instead, we need to refer to a broader context, i.e., the visual,
dynamic, and physical world, to produce machine learning models that genuinely understand the
meaning of human language.
Along this direction, this thesis presents our research on building methods and algorithms
to learn the meaning of language. Specifically, we propose to learn language with perception
data and embodied experiences in the long-tailed physical world. Our work illustrates that the
tight coupling of vision, language, and action can lead to intelligent models that better understand
the intent of human language and reflect the underlying truth of the world. This dissertation
examines three complementary questions, in increasing order of difficulty:
• How to learn the meaning of language in the visual, dynamic world?
• How to learn policies that understand the intent of language and generalize faithfully?
• How to learn concepts from the long-tailed and ever-growing data of the real world?
To address these questions, this thesis work draws on theories and methods from machine
learning, computer vision, and natural language processing to develop a road map towards the
ultimate goal: building agents that understand human language in the real world.
1.2 Thesis Outline
This dissertation is structured into three main parts, in response to the three challenges mentioned
above. Each part consists of two to three chapters. PART I primarily focus on learning the
meaning of concepts described by language with the perception data, including static images
and dynamic videos. PART II investigates how to understand language instructions and convert
them into faithful execution plans for an embodied agent, such that they generalize well under
different circumstances. PART III first gives an overview of two challenging learning situations,
i.e., few-shot learning (FSL) and generalized few-shot learning (GFSL), which learns under a
growing number of concepts with limited training data. Then it describes two methods that tackle
FSL and GFSL, respectively. We expand our discussion of each in greater detail as follows.
1.2.1 Learning Grounded Concept in Perception
Visually grounded language expressions denote the images they describe. These expressions of
visual concepts are naturally organized into a hierarchy of shorter expressions. Such organization
reveals structural relations that do not manifest when the sub-expressions are studied in isolation.
For example, a coherent paragraph that describes a celebration event can contain sentences, phrases,
and words that correspond to the characters, the scene, the spatial relationships among the
objects, and the temporal relationships connecting consecutive human behaviors. Therefore, understanding
such a paragraph requires fine-grained understanding at multiple granularities, from the detailed
grounding between nouns and visual objects to high-level associations of sentences and scenes.
In this part, we highlight two lines of work that build grounding models to perform
cross-modal matching at different granularities. Our work investigates how to establish multi-
level associations between vision and language. Specifically, we learn the structural matching of
words, phrases, sentences, and paragraphs to the images and videos they describe. In particular,
Chapter 2 describes our work that builds fine-grained matching between images and sentences.
The algorithm, built upon previous work that extracts linguistic priors from sentences [254],
learns to preserve the hierarchical structure in the visual-linguistic representation using a tailored
graph database that clusters images around a central concept (in words, phrases, or sentences). As a
consequence, the resulting model provides a strong representation that benefits various downstream
tasks.
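To give a rough sense of the structure involved, here is a minimal sketch in Python, with hypothetical expressions and image IDs; it illustrates the idea only and is not the construction procedure detailed in Chapter 2. A denotation graph maps linguistic expressions to the sets of images they denote, with edges pointing from more generic expressions to more specific ones, so a specific expression denotes a subset of the images denoted by its generic ancestor.

# Each node maps an expression to the set of image IDs it denotes (hypothetical IDs).
denotations = {
    "dog": {"img1", "img2", "img3"},
    "dog running": {"img1", "img2"},
    "brown dog running on grass": {"img1"},
}

# Edges point from a generic expression to a more specific one.
edges = [
    ("dog", "dog running"),
    ("dog running", "brown dog running on grass"),
]

def specificity_holds(denotations, edges):
    """Check that every specific expression denotes a subset of its parent's images."""
    return all(denotations[child] <= denotations[parent] for parent, child in edges)

assert specificity_holds(denotations, edges)

Training objectives in Chapter 2 can then exploit this structure, for example by encouraging an image to be matched more strongly by its more specific descriptions than by the generic ones (cf. §2.4.2).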
Inspired by this result, Chapter 3 extends the idea to videos and paragraphs, learning
hierarchical matching at the levels of sentences to short clips and of paragraphs to long videos. Specif-
ically, it introduces a dual-encoder framework that hierarchically encodes videos and paragraphs,
from frames to clips to the video, and from words to sentences to the paragraph. Without
requiring explicit alignment between clips and sentences, our model demonstrates promising
results in paragraph-to-video retrieval and video captioning.
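As a concrete, though drastically simplified, picture of the two-level encoding, the sketch below uses random features and mean pooling in place of the learned sequence encoders; all shapes and the cosine-similarity readout are assumptions made for the example rather than the actual architecture.

import numpy as np

def encode_sequence(elements):
    # Stand-in encoder: mean-pool element features into a single vector.
    # (The model in Chapter 3 uses learned recurrent encoders instead.)
    return np.mean(elements, axis=0)

rng = np.random.default_rng(0)
clips = [rng.normal(size=(16, 128)) for _ in range(3)]      # 3 clips, 16 frames each
sentences = [rng.normal(size=(10, 128)) for _ in range(3)]  # 3 sentences, 10 words each

# Low level: one embedding per clip and per sentence (local matching space).
clip_emb = [encode_sequence(c) for c in clips]
sent_emb = [encode_sequence(s) for s in sentences]

# High level: encode the sequences of clip / sentence embeddings (global matching space).
video_emb = encode_sequence(np.stack(clip_emb))
para_emb = encode_sequence(np.stack(sent_emb))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("global video-paragraph similarity:", cosine(video_emb, para_emb))
print("local clip-sentence similarities:", [cosine(c, s) for c, s in zip(clip_emb, sent_emb)])

In the actual model of Chapter 3, both levels are trained jointly with matching losses at the local (clip-sentence) and global (video-paragraph) levels, plus layer-wise reconstruction, as summarized in Figure 3.3.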
1.2.2 Learning Intent of Language with Embodied Experiences
Beyond visually grounded language, another kind of knowledge a language learner cannot acquire
purely from text corpora is the intent behind imperative sentences, such as instructions or
commands. For example, when an agent is asked to "navigate straightforward along the corridor",
the ability to perceive the environment and associate a "corridor" with its visual appearance is
insufficient to resolve the instruction. Instead, the agent needs to comprehend the intent of this instruction
and take action to accomplish the encoded task, which is to step forward along the corridor. This
motivates our research on learning agents that understand instructions and execute them accordingly
in dynamic environments.
Figure 1.2: Different types of generalization in instruction-conditioned embodied agents: (a) compositional generalization across (task, environment) combinations and (b) cross-horizon generalization across instructions of different lengths.
A significant challenge stems from the large space of all possible tasks described by instructions,
making it intractable to learn every task for all the situations (i.e., environment setups) the agent
interacts with. This motivates the study of models that handle compositional generalization.
Specifically, Chapter 4 presents the study of such generalization scenarios, where agents are
required to generalize across combinations of (task, environment) pairs. Concretely, this asks
an agent to transfer the skill of solving a specific task 1, learned in environment A, to environment
B, where the setting of B has been explored while learning another task 2 (so exploration in
environment B is not an obstacle to transfer). Figure 1.2 (a) shows an illustrative example of
such a transfer learning situation. In Chapter 4, we develop an agent that decouples
the task and environment into low-dimensional embeddings, enabling successful compositional
generalization [80].
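The sketch below illustrates the compositional idea numerically; it is a simplified stand-in for the model of Chapter 4, with random, untrained embeddings and a linear coefficient map that are assumptions made purely for the example. Policy parameters for an (environment, task) pair are synthesized as a linear combination of shared basis parameters, so an unseen pair can reuse the embeddings learned from seen pairs.

import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_basis, state_dim, n_actions = 8, 5, 32, 6

# A shared basis of policy parameters plus per-environment and per-task embeddings.
# (Everything is randomly initialized here; in Chapter 4 these are learned end to end.)
basis = rng.normal(size=(n_basis, state_dim, n_actions))
env_embeddings = {"env_A": rng.normal(size=emb_dim), "env_B": rng.normal(size=emb_dim)}
task_embeddings = {"task_1": rng.normal(size=emb_dim), "task_2": rng.normal(size=emb_dim)}
coeff_map = rng.normal(size=(2 * emb_dim, n_basis))  # maps the pair of embeddings to coefficients

def synthesize_policy(env, task):
    # Compose policy parameters for an (environment, task) pair from the shared basis.
    pair = np.concatenate([env_embeddings[env], task_embeddings[task]])
    coeffs = pair @ coeff_map                    # one coefficient per basis element
    return np.tensordot(coeffs, basis, axes=1)   # (state_dim, n_actions) policy weights

# An unseen combination reuses embeddings trained on other (environment, task) pairs.
policy = synthesize_policy("env_B", "task_1")
state = rng.normal(size=state_dim)
print("greedy action for unseen (env_B, task_1):", int(np.argmax(state @ policy)))

The appeal of this factorization is that the number of learned embeddings grows with the number of environments plus the number of tasks, rather than with the number of their combinations, which is what makes transfer to unseen pairs possible.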
Moreover, Chapter 5 studies another challenging transfer situation, where a significant discrep-
ancy occurs between the instructions used for training and those used for evaluation. For instance,
Figure 1.2 (b) presents a situation in which an agent trained on medium-horizon instructions is
required to generalize to either longer- or shorter-horizon instructions. Consequently, agents that
memorize the word sequence instead of comprehending the meaning of the instruction's elements
and compounds can fail severely. To overcome such a situation, we propose to learn an agent
that explicitly decomposes instructions into sub-instructions, using curriculum learning [16, 266].
As a result, such decomposition brings successful generalization.
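The following toy sketch conveys the decomposition and curriculum idea; the punctuation-and-conjunction splitting rule and the example instruction are invented for illustration and are much cruder than the BABY-STEP identification procedure actually used in Chapter 5.

import re

def split_into_babysteps(instruction):
    # Naively split a long instruction at sentence boundaries and sequencing conjunctions.
    parts = re.split(r"[.;]| and then | then ", instruction)
    return [p.strip(" ,") for p in parts if p.strip(" ,")]

long_instruction = (
    "Walk down the corridor and then turn right at the kitchen. "
    "Pass the dining table, then stop next to the couch."
)
babysteps = split_into_babysteps(long_instruction)

# A curriculum presents progressively longer sequences of consecutive sub-instructions,
# so the agent first masters single steps before handling the full instruction.
for lecture, steps in enumerate([babysteps[:k + 1] for k in range(len(babysteps))], start=1):
    print(f"lecture {lecture}: {steps}")

Chapter 5 pairs such a decomposition with a memory of completed BABY-STEPs and curriculum-based reinforcement learning, so the agent gradually extends from single steps to full-length instructions.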
1.2.3 Learning with Limited and Growing Data
Finally, we investigate relatively independent but realistic learning setups, which correspond to
learning visual concepts in a long-tailed distribution, without knowing what the tail concepts look
like a priori. An example is to learn a reliable object recognition system to recognize the newly
defined product (e.g., new smartphone models), which only possesses a few data instances. A
naive approach such as directly training a recognition model over all the present concepts is prone
to over-fit or result in a biased model [20, 40, 91, 249, 265]. Therefore, approaches that inductively
transfer the knowledge of learning from the many-shot SEEN categories to the few-shot UNSEEN
classes become the key towards success.
Part III of this dissertation primarily focuses on this important problem. Specifically, Chapter 6
presents an in-depth overview of the problem and discusses the concrete formulation to prepare
the reader with the necessary background. Chapter 7 then studies the standard few-shot learning (FSL) setup,
using a model-based embedding adaptation approach [251] to acquire task-specific few-shot
classifiers that master the recognition of UNSEEN tail concepts. Chapter 8 studies the generalized
few-shot learning (GFSL) setup, and introduces an adaptive classifier synthesis framework [252],
which is used to recognize not only the newly encountered few-shot tail concepts but also the
many-shot head concepts simultaneously.
1.3 Published Works
This dissertation provides a comprehensive set of techniques for modeling and learning the
meaning of language in perception and embodied experiences, published in top
computer science venues. The remainder of this section enumerates the published works and
briefly describes the content of each chapter in this dissertation.
PART I: Learning Language in Perception
• Chapter 2 corresponds to the EMNLP 2020 paper [261]. It introduces an algorithm that enables
the matching between words, phrases, and sentences to the image denotation they describe.
• Chapter 3 corresponds to the ECCV 2018 paper [260]. It introduces a cross-modal and hierarchical
model that associates paragraphs with the videos they describe.
PART II: Learning Language in Embodied Experiences
• Chapter 4 corresponds to the NeurIPS 2018 paper [80]. It introduces three transfer scenarios
in grounded instruction following, which concern transfer and adaptation across tasks and
environments, and designs a multi-linear policy to address these challenges.
• Chapter 5 corresponds to the ACL 2020 paper [266]. It introduces a navigation agent that
decomposes long instructions into short ones and learns to faithfully execute them with
curriculum learning.
PART III: Learning with Limited and Growing Data
• Chapter 6 discusses the setting and challenges of few-shot learning and generalized few-shot
learning, describes the problem formulation, and provides an outline that briefly overviews the
algorithms and models discussed in this part.
• Chapter 7 corresponds to the CVPR 2020 paper [251]. It introduces a few-shot learning
approach that performs embedding adaptation to acquire task-specific classifiers, which is empirically
shown to outperform previous state-of-the-art methods.
• Chapter 8 corresponds to the IJCV 2021 paper [252]. It introduces a model that generates
few-shot classifiers and many-shot classifiers for joint prediction in the task of generalized
few-shot learning, together with a learning algorithm that simultaneously learns them.
Finally, we conclude this thesis and remark on promising future directions in Chapter 9.
Part I
Learning Language in Perception
Chapter 2
Visual Grounded Concept Learning with Denotation Graph
Learning to fuse and represent vision and language information is an important research
problem with many applications. Recent progress has leveraged the ideas of pre-training (from
language modeling) and attention layers in Transformers to learn representations from datasets
learning representations from a set of implied, visually grounded expressions between image and
text, automatically mined from those datasets. In particular, we use denotation graphs to represent
how specific concepts (such as sentences describing images) can be linked to abstract and generic
concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific
relations can be discovered using linguistic analysis tools. We propose methods to incorporate
such relations into learning representation. We show that state-of-the-art multimodal learning
models can be further improved by leveraging automatically harvested structural relations. The
representations lead to stronger empirical results on downstream tasks of cross-modal image
retrieval, referring expression, and compositional attribute-object recognition.
2.1 Motivation
There has been an abundant amount of aligned visual and language data such as text passages
describing images, narrated videos, subtitles in movies, etc. Thus, learning how to represent
visual and language information when they are semantically related has been a very actively
studied topic. There are many V+L applications: image retrieval with descriptive sentences or
captions [12, 13, 76, 254], image captioning [35, 248], visual question answering [9], visual
navigation with language instructions [3], visual objects localization via short text phrases [172],
and others. A recurring theme is to learn the representation of these two streams of information so
that they correspond to each other, highlighting the notion that many language expressions are
visually grounded.
[Figure 2.1 graphics: example denotation-graph nodes such as "dog", "two dogs", "two black dogs", "two dogs running on grass", "child running", and "play percussion instrument", linked from generic to specific phrases, with images attached to nodes at different levels.]
Figure 2.1: (Left) A schematic example of denotation graph showing the hierarchical organization
of linguistic expression (Right) A random-subgraph from the denotation graph extracted from the
Flickr30K dataset, with images attached to concepts at different levels of hierarchy.
A standard approach is to embed the visual and the language information as points in a (joint)
visual-semantic embedding space [49, 56, 105]. One can then infer whether the visual information
is aligned with the text information by checking how these points are distributed.
How do we embed visual and text information? Earlier approaches focus on embedding each
stream of information independently, using models that are tailored to each modality. For example,
for image, the embedding could be the features at the last fully-connected layer from a deep neural
network trained for classifying the dominant objects in the image. For text, the embedding could
be the last hidden outputs from a recurrent neural network.
Recent approaches, however, have introduced several innovations [36, 126, 143]. The first
is to contextualize the embeddings of one modality using information from the other one. This
is achieved by using co-attention or cross-attention (in addition to self-attention) in Transformer
layers. The second is to leverage the power of pre-training [45, 178]: given a large number of
parallel corpora of images and their descriptions, it is beneficial to identify pre-trained embeddings
on these data such that they are useful for downstream V+L tasks.
Despite such progress, there is a missed opportunity of learning stronger representations from
those parallel corpora. As a motivating example, suppose we have two paired examples: one is
an image $x_1$ corresponding to the text $y_1$ of TWO DOGS SAT IN FRONT OF PORCH and the other
is an image $x_2$ corresponding to the text $y_2$ of TWO DOGS RUNNING ON THE GRASS. Existing
approaches treat the two pairs independently and compute the embeddings for each pair without
acknowledging that both texts share the common phrase $y_1 \cap y_2 =$ TWO DOGS and the images
have the same visual categories of two dogs.
We hypothesize that learning the correspondence between the common phrase $y_1 \cap y_2$ and the
set of images $\{x_1, x_2\}$, though not explicitly annotated in the training data, is beneficial. Enforcing
the alignment due to this additionally constructed pair introduces a form of structural constraint:
the embeddings of $x_1$ and $x_2$ have to convey similar visual information that is congruent to the
similar text information in the embeddings of $y_1$ and $y_2$.
We validate this hypothesis and show that extracting additional and implied correspondences
between the texts and the visual information, then using them for learning leads to better represen-
tation, which results in a stronger performance in downstream tasks. The additional alignment
information forms a graph where the edges indicate how visually grounded concepts can be
instantiated at both abstract levels (such as TWO DOGS) and specific levels (such as TWO DOGS
SAT IN FRONT OF THE PORCH). These edges and the nodes that represent the concepts at differ-
ent abstraction levels form a graph, known as denotation graph, previously studied in the NLP
community [117, 172, 254] for grounding language expressions visually.
Our contributions are to propose creating visually-grounded denotation graphs to facilitate
representation learning. Concretely, we apply the technique originally developed for the Flickr
dataset [254] also to COCO dataset [137] to obtain denotation graphs that are grounded in
each domain respectively (§2.3). We then show how the denotation graphs can be used to
augment training samples for aligning text and image (§2.4). Finally, we show empirically that the
representation learned with denotation graphs leads to stronger performance in downstream tasks
(§2.5).
2.2 Related Work
Learning representation for image and text. Single-stream methods learn each modality
separately and align them together with a simple fusion model, often an inner product between
the two representations. Frome et al. [56] learns the joint embedding space for images and labels
and use the learned embeddings for zero-shot learning. Kiros et al. [105] uses bi-directional
LSTMs to encode sentences and then maps images and sentences into a joint embedding space
for cross-modal retrieval and multi-modal language models. Li et al. [129] designs a high-level
visual reasoning module to contextualize image entity features and obtain a more powerful image
representation. Vendrov et al. [223] improves image retrieval performance by exploiting the
hypernym relations among words. There is a large body of work that has been focusing on
improving the visual or text embedding functions [47, 63, 84, 159, 199].
Another line of work, referred to as cross-stream methods infer fine-grained alignments
between local patterns of visual (i.e., local regions) and linguistic inputs (i.e., words) between
a pair of image and text, then use them to derive the similarity between the image and the text.
SCAN [122] uses cross-modal attention mechanism [248] to discover such latent alignments.
Inspired by the success of BERT [45], recent efforts have conducted visual-linguistic pre-training
on large-scale datasets [194], using a powerful sequence model such as deep Transformers [36, 126,
130, 143, 202]. The pre-training strategies of these methods typically involve many self-supervised
learning tasks, including the image-text matching [143], masked language modeling [45, 143] and
masked region modeling [36].
In contrast to those works, we focus on exploiting additional correspondences between image
and text that are not explicitly given in the many image and text datasets. By analyzing the
linguistic structures of the texts in those datasets, we are able to discover more correspondences
that can be used for learning representation. We show the learned representation is more powerful
in downstream tasks.
Vision + Language Tasks. There has been a large collection of tasks combining vision and
language, including image captioning [34, 50, 76, 92, 114], visual QA [9], text-based image
verification [81, 203, 204], visual commonsense reasoning [258], and so on. In the context of this
paper, we focus on studying cross-modality retrieval [12, 13, 62, 76, 254, 260], as well as transfer
learning on downstream tasks, including compositional attribute-object recognition [86, 156] and
referring expressions [41, 96, 110, 157]. Please refer to §2.5 for an explanation of these tasks.
2.3 Denotation Graph
Visually grounded text expressions denote the images (or videos) they describe. When examined
together, these expressions reveal structural relations that do not exhibit when each expression is
studied in isolation. In particular, through linguistic analysis, these expressions can be grouped and
partially ordered and thus form a relation graph, representing how (visually grounded) concepts
are shared among different expressions and how different concepts are related. This insight
was explored by Young et al. [254] and the resulting graph is referred to as a denotation graph,
schematically shown in the top part of Fig. 2.1. In this work, we focus on constructing denotation
graphs from the Flickr and the COCO datasets, where the text expressions are sentences describing
images.
Formally, a denotation graph $\mathcal{G}$ is a polytree where a node $v_i$ in the graph corresponds to a pair
of a linguistic expression $y_i$ and a set of images $X_i = \{x_1, x_2, \ldots, x_{n_i}\}$. A directed edge $e_{ij}$
from a node $v_i$ to its child $v_j$ represents a subsumption relation between $y_i$ and $y_j$. Semantically,
$y_i$ is more abstract (generic) than $y_j$, and the tokens in $y_i$ can be a subset of $y_j$'s. For example,
TWO DOGS describes all the images which TWO DOGS ARE RUNNING describes, though less
specifically. Note that the subsumption relation is defined on the semantics of these expressions.
Thus, the tokens do not have to be exactly matched on their surface forms. For instance, IN FRONT
OF PERSON or IN FRONT OF CROWD are also generic concepts that subsume IN FRONT OF A
CROWD OF PEOPLE; see the right-hand side of Fig. 2.1 for another example.
Table 2.1: Key statistics of the two DGs, averaged over all nodes in the graph, internal nodes,
and leaf nodes (formatted as all/internal/leaf)
Dataset             Flickr-DG           COCO-DG
# of edges          1.94M               4.57M
# of nodes          597K/452K/145K      1.41M/841K/566K
# of tokens/node    6.78/4.45/14.04     5.88/4.07/8.58
# of images/node    4.46/5.57/1.00      5.06/7.79/1.00
More formally, the set of images that correspond to $v_i$ is the union of all the images corresponding
to $v_i$'s children $\mathrm{ch}(v_i)$: $X_i = \bigcup_{v_j \in \mathrm{ch}(v_i)} X_j$. We also use $\mathrm{pa}(v_j)$ to denote the set of
$v_j$'s parents.
Denotation graphs (DG) can be seen as a hierarchical organization of semantic knowledge
among concepts and their visual groundings. In this sense, they generalize the tree-structured
object hierarchies that have been often used in computer vision. The nodes in the DG are composite
phrases that are semantically richer than object names and the relationship among them is also
richer.
Constructing DG. We used the publicly available toolkit¹, following Young et al. [254]. Once the
graph is constructed, we attach the images to the proper nodes by taking the set union of the images
of each node's children, starting from the sentence-level nodes.
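To make the set-union attachment concrete, the following Python sketch propagates image sets from sentence-level leaves up the graph; the Node class and its fields are hypothetical illustrations, not the API of the released toolkit.

```python
# Minimal sketch of attaching images to denotation-graph nodes by propagating
# set unions from sentence-level (leaf) nodes upward.
# The Node class and field names are illustrative, not the toolkit's API.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    phrase: str
    children: List["Node"] = field(default_factory=list)
    images: Set[str] = field(default_factory=set)  # filled in for leaves


def attach_images(node: Node) -> Set[str]:
    """Return X_i as the union of the children's image sets (see §2.3)."""
    if node.children:  # internal node: union over all children
        node.images = set().union(*(attach_images(c) for c in node.children))
    return node.images


# toy example: "two dogs" subsumes two sentence-level descriptions
leaf1 = Node("two dogs sat in front of the porch", images={"img_1"})
leaf2 = Node("two dogs running on the grass", images={"img_2"})
root = Node("two dogs", children=[leaf1, leaf2])
print(attach_images(root))  # {'img_1', 'img_2'}
```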
Flickr-DG and COCO-DG. We regenerate a DG on the Flickr dataset [254] and construct a
new DG on the COCO [137] dataset. The two datasets come from different visual and text domains
where the former contains more iconic social media photos and the latter focuses on photos with
complex scenes and has more objects. Figure 2.1 shows a random sub-graph of Flickr-DG.
Table 2.1 lists the key statistics of the two DGs. We note that in both graphs, a large number of
internal nodes (more abstract concepts or phrases) are introduced. For such concepts, the linguistic
expressions are much shorter and the number of images they correspond to is also larger.
2.4 Learning with Denotation Graphs
The denotation graphs, as described in the previous section, provide rich structures for learning
representations of text and image. In what follows, we describe three learning objectives, starting
from the most obvious one that matches images and their descriptions (§2.4.1), followed by
learning to discriminate between general and specialized concepts (§2.4.2) and learning to predict
concept relatedness (§2.4.3). We perform ablation studies of those objectives in §2.5.4.
¹ The toolkit is publicly available online at https://github.com/aylai/DenotationGraph
2.4.1 Matching Texts with Images
We suppose the image $x$ and the text $y$ are represented by (a set of) vectors $\phi(x)$ and $\psi(y)$
respectively. A common choice for $\phi(\cdot)$ is the last layer of a convolutional neural network [67, 247]
and for $\psi(\cdot)$ the contextualized word embeddings from a Transformer network [221]. The
embedding of the multimodal pair is a vector-valued function over $\phi(x)$ and $\psi(y)$:

v(x, y) = f(\phi(x), \psi(y))    (2.1)

There are many choices of $f(\cdot, \cdot)$. The simplest one is to concatenate the two arguments. We
can also use the element-wise product between the two if they have the same embedding dimension [105],
or complex mappings parameterized by layers of attention networks and convolutions [36, 143];
we experimented with some of them in our empirical studies.
2.4.1.1 Matching Model
We use the following probabilistic model to characterize the joint distribution

p(x, y) \propto \exp\big(\theta^\top v(x, y)\big)    (2.2)

where the exponent $s(x, y) = \theta^\top v$ is referred to as the matching score. To estimate $\theta$, we use
maximum likelihood estimation

\theta = \arg\max_{\theta} \sum_{v_i} \sum_k \log p(x_{ik}, y_i)    (2.3)

where $x_{ik}$ is the $k$th element in the set $X_i$. However, this probability is intractable to compute as
it requires us to enumerate all possible pairs of $(x, y)$. To approximate it, we use negative sampling.
2.4.1.2 Negative Sampling
For each (randomly selected) positive sample $(x_{ik}, y_i)$, we explore 4 types of negative examples
and assemble them into a negative sample set $D_{ik}$:
• Visually mismatched pair. We randomly sample an image $x \notin X_i$ to pair with $y_i$, i.e.,
$(x, y_i)$. Note that we automatically exclude the images from $v_i$'s children.
• Semantically mismatched pair. We randomly sample a text $y_j \neq y_i$ to form the pair $(x_{ik}, y_j)$.
Note that we constrain $y_j$ not to include concepts that could be more abstract than $y_i$, as the
more abstract ones can certainly be used to describe the specific images $x_{ik}$.
• Semantically hard pair. We randomly sample a text $y_j$ that corresponds to an image $x_j$ that is
visually similar to $x_{ik}$, to form $(x_{ik}, y_j)$. See [143] for details.
• DG Hard Negatives. We randomly sample a sibling (but not cousin) node $v_j$ of $v_i$ such that
$x_{ik} \notin X_j$, to form $(x_{ik}, y_j)$.
Note that the last 3 types have increasing degrees of semantic confusability. In particular, the
4th type of negative sampling is only possible with the help of a denotation graph. In that type
of negative samples, $y_j$ is semantically very close to $y_i$ (by construction) yet they denote
different images. The "semantically hard pair", on the other hand, is not as hard as the last type, as
$y_i$ and $y_j$ could be very different despite high visual similarity.
With the negative samples, we estimate $\theta$ as the minimizer of the following negative log-likelihood

\ell_{\text{MATCH}} = -\sum_{v_i} \sum_k \log \frac{e^{s(x_{ik}, y_i)}}{\sum_{(\hat{x}, \hat{y}) \in D_i} e^{s(\hat{x}, \hat{y})}}    (2.4)

where $D_i = D_{ik} \cup \{(x_{ik}, y_i)\}$ contains both the positive and negative examples.
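As a concrete illustration of Eq. (2.4), the PyTorch-style sketch below computes the softmax-style matching loss when, for each positive pair, the scores of its sampled negatives are stacked alongside it; the tensor layout and function name are assumptions for illustration, not the released implementation.

```python
# Hypothetical sketch of the matching loss in Eq. (2.4): each row holds the score
# of the positive pair followed by the scores of its sampled negatives.
import torch
import torch.nn.functional as F


def matching_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: (batch, 1 + num_negatives); column 0 is s(x_ik, y_i)."""
    # -log softmax over {positive pair} U D_ik, picking the positive entry
    log_probs = F.log_softmax(scores, dim=1)
    return -log_probs[:, 0].mean()


# toy usage: 4 positive pairs, each scored against 3 sampled negatives
scores = torch.randn(4, 4)
print(matching_loss(scores))
```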
2.4.2 Learning to Be More Specific
The hierarchy in the denotation graph introduces an opportunity for learning image and text
representations that are sensitive to fine-grained distinctions. Concretely, consider a parent node
$v_i$ with an edge to the child node $v_j$. While the description $y_j$ matches any images in its children
nodes, the parent node's description $y_i$ on a higher level is more abstract. For example, the
concepts INSTRUMENT and PLAY PERCUSSION INSTRUMENT in Fig. 2.1 are a pair of examples
showing that the latter more accurately describes the image(s) at the lower level.
To incorporate this modeling notion, we introduce

\ell_{\text{SPEC}} = \sum_{e_{ij}} \sum_k \big[s(x_{jk}, y_i) - s(x_{jk}, y_j)\big]_+    (2.5)

as a specificity loss, where $[h]_+ = \max(0, h)$ denotes the hinge loss. The loss is to be minimized
such that the matching score for the less specific description $y_i$ is smaller than that for the more
specific description $y_j$.
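A minimal sketch of the specificity loss in Eq. (2.5), assuming the matching scores of a child node's images against the parent phrase and against the child phrase have already been computed; the function and argument names are illustrative.

```python
# Hypothetical sketch of Eq. (2.5): a hinge loss encouraging the more specific
# description y_j to score higher than its parent y_i on the child's images.
import torch


def specificity_loss(parent_scores: torch.Tensor,
                     child_scores: torch.Tensor) -> torch.Tensor:
    """parent_scores[k] = s(x_jk, y_i); child_scores[k] = s(x_jk, y_j)."""
    return torch.clamp(parent_scores - child_scores, min=0).sum()


# toy usage over 5 (image, parent phrase, child phrase) triples
print(specificity_loss(torch.randn(5), torch.randn(5)))
```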
2.4.3 Learning to Predict Structures
Given the graph structure of the denotation graph, we can also improve the accuracy of image and
text representation by modeling high-order relationships. Specifically, for a pair of nodes $v_i$ and
$v_j$, we want to predict whether there is an edge from $v_i$ to $v_j$, based on each node's corresponding
embedding of a pair of image and text. Concretely, this is achieved by minimizing the following
negated likelihood:

\ell_{\text{EDGE}} = -\sum_{e_{ij}} \sum_{k, k'} \log p\big(e_{ij} = 1 \mid v(x_{ik}, y_i), v(x_{jk'}, y_j)\big)    (2.6)
We use a multi-layer perceptron (MLP) with a binary output to parameterize the log-probability.
2.4.4 The Final Learning Objective
We combine the above loss functions as the final learning objective for learning on the DG

\ell_{\text{DG}} = \ell_{\text{MATCH}} + \lambda_1 \ell_{\text{SPEC}} + \lambda_2 \ell_{\text{EDGE}}    (2.7)

where $\lambda_1$, $\lambda_2$ are hyper-parameters that trade off the different losses. Setting them to 1.0 seems to
work well. We study how each component could affect the learning of representation in §2.5.4.
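Putting the terms together, a hedged sketch of the combined objective in Eq. (2.7) is shown below. The text above specifies an MLP with a binary output for Eq. (2.6); feeding it the concatenation of the two fused pair embeddings, as well as all module and argument names, are assumptions made for illustration.

```python
# Hypothetical sketch of the combined DG objective in Eq. (2.7).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgePredictor(nn.Module):
    """MLP with a binary output for Eq. (2.6), over two fused pair embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, v_parent: torch.Tensor, v_child: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(torch.cat([v_parent, v_child], dim=-1)).squeeze(-1)
        # negated log-likelihood of the (existing) edges being present
        return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))


def dg_loss(l_match, l_spec, l_edge, lam1=1.0, lam2=1.0):
    return l_match + lam1 * l_spec + lam2 * l_edge


edge_head = EdgePredictor(dim=8)
l_edge = edge_head(torch.randn(3, 8), torch.randn(3, 8))
print(dg_loss(torch.tensor(0.7), torch.tensor(0.2), l_edge))
```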
2.5 Experiments
We examine the effectiveness of using denotation graphs to learn image and text representations.
We first describe the experimental setup and key implementation details (§2.5.1). We then describe
key image-text matching results in §2.5.2, followed by studies about the transfer capability of our
learned representation (§2.5.3). Next, we present ablation studies over different components of
our model (§2.5.4). Finally, we validate how well abstract concepts can be used to retrieve images,
using our model (§2.5.5).
2.5.1 Experimental Setup
Embeddings and Matching Models. Our aim is to show denotation graphs improve state-of-
the-art methods. To this end, we experiment with two recently proposed state-of-the-art approaches
and their variants for learning from multi-modal data: ViLBERT [143] and UNITER [36].
Both the approaches start with an image encoder, which obtains a set of embeddings of image
patches, and a text encoder which obtains a sequence of word (or word-piece) embeddings. For
ViLBERT, text tokens are processed with Transformer layers and fused with the image information
with 6 layers of co-attention Transformers. The output of each stream is then element-wise
multiplied to give the fused embedding of both streams. For UNITER, both streams are fed into
12 Transformer layers with cross-modal attention. A special token CLS is used, and its embedding
is regarded as the fused embedding of both streams.
For ablation studies, we use a smaller ViLBERT for rapid experimentation: ViLBERT (Re-
duced) where there are 3 Transformer layers and 2 co-attention Transformers for the text stream,
and 1 Transformer layer for the image stream.
Constructing Denotation Graphs. As described in §2.3, we construct denotation graphs Flickr-
DG and COCO-DG from the Flickr [254] and the COCO [137] datasets. Flickr was originally
developed for the tasks of image-based and text-based retrieval. It contains 29,000 images for
training, 1,000 images for validation, and 1,000 images for testing. COCO is a significantly larger
dataset, developed for the image captioning task. It contains 565,515 sentences with 113,103
images. We evaluate on both the 1,000 images testing split and the 5,000 images testing split,
following the setup in [92]. Key characteristics for the two DGs are reported in Table 2.1.
Evaluation Tasks. We evaluate the learned representations on three common V+L tasks. In
text-based image retrieval, we evaluate two settings: the text is either a sentence or a phrase from
the test corpus. In the former setting, the sentence is a leaf node on the denotation graph, and in
the latter case, the phrase is an inner node on the denotation graph, representing more general
concepts. We evaluate on the Flickr and the COCO datasets, respectively. The main evaluation
metrics we use are recall at M (R@M), where M = 1, 5, or 10, and RSUM, which is the sum of
the three recall values [244]. Conversely, we also evaluate the task of image-based text retrieval,
i.e., retrieving the right descriptive text for an image.
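For reference, a small sketch of how R@M and RSUM can be computed from a similarity matrix is given below; it assumes a square matrix whose diagonal entries are the ground-truth matches, which simplifies the actual evaluation protocol (where each image typically has several reference captions).

```python
# Hypothetical sketch of R@{1,5,10} and RSUM from an image-text similarity
# matrix where sim[i, i] is the ground-truth match for query i.
import numpy as np


def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                       # indices sorted by score, descending
        ranks.append(int(np.where(order == i)[0][0]))  # rank of the true match
    ranks = np.array(ranks)
    recalls = {f"R@{k}": float(np.mean(ranks < k) * 100) for k in ks}
    recalls["RSUM"] = sum(recalls[f"R@{k}"] for k in ks)
    return recalls


print(recall_at_k(np.random.rand(100, 100)))
```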
In addition to the above cross-modal retrieval, we also consider two downstream evaluation
tasks, i.e., Referring Expression and Compositional Attribute-Object Recognition. (1) Refer-
ring Expression is a task where the goal is to localize the corresponding object in the image given
an expression [96]. We evaluate on the dataset RefCOCO+, which contains 141,564 expressions
with 19,992 images. We follow the previously established protocol to evaluate on the validation
split, the TestA split, and the TestB split. We are primarily interested in zero-shot/few-shot learning
performance. (2) Compositional Attribute-Object Recognition is a task that requires a model to
learn from images of SEEN (attribute, object) label pairs, such that it can generalize to recog-
nize images of UNSEEN (attribute, object) label pairs. We evaluate this task on the Mit-State
dataset [86], following the protocol by Misra et al. [156]. The training split contains 34,562 images
from 1,262 SEEN labels, and the test split contains 19,191 images from 700 UNSEEN labels. We
report the Top-1, 2, 3 accuracies on the UNSEEN test set as evaluation metrics.
Training Details. Both ViLBERT and UNITER models are pre-trained on the Conceptual Caption
dataset [194], and the pre-trained models are released publicly.²
² The UNITER [36] model performs an additional online hard-negative mining (which we did not) during the training
of image-text matching to improve its results. This is computationally very costly.
Table 2.2: Image-Text Retrieval Results on Different Datasets
Text2Image Image2Text
Method R@1 R@5 R@10 RSUM R@1 R@5 R@10 RSUM
Flickr Dataset
ViLBERT 59.1 85.7 92.0 236.7 76.8 93.7 97.6 268.1
ViLBERT + DG 63.8 87.3 92.2 243.3 77.0 93.0 95.0 265.0
UNITER 62.9 87.2 92.7 242.8 78.3 93.3 96.5 268.1
UNITER + DG 66.4 88.2 92.2 246.8 78.2 93.0 95.9 267.1
COCO 1K Test Split
ViLBERT 62.3 89.5 95.0 246.8 77.0 94.1 97.2 268.3
ViLBERT + DG 65.9 91.4 95.5 252.7 79.0 96.2 98.6 273.8
UNITER 60.7 88.0 93.8 242.5 74.4 93.9 97.1 265.4
UNITER + DG 62.7 88.8 94.4 245.9 77.7 95.0 97.5 270.2
COCO 5K Test Split
ViLBERT 38.6 68.2 79.0 185.7 53.5 79.7 87.9 221.1
ViLBERT + DG 41.8 71.5 81.5 194.8 57.5 84.0 90.1 232.2
UNITER 37.8 67.3 78.0 183.1 52.8 79.7 87.8 220.3
UNITER + DG 39.1 68.0 78.3 185.4 51.4 78.7 87.0 217.1
On the Flickr-DG, both ViLBERT and UNITER are trained with a minibatch size of 64; ViLBERT is trained for 17 epochs and
UNITER for 15 epochs, with a learning rate of 0.00004. On the COCO-DG, ViLBERT is trained
for 17 epochs and UNITER for 15 epochs with a minibatch size of 64 and a learning rate of
0.00004. The hyperparameters in Eq. (2.7) are set to 1.0, unless otherwise specified.
2.5.2 Main Results
Table 2.2 reports the performance on cross-modal retrieval. On both datasets, models trained
with denotation graphs considerably outperform the corresponding ones that are not. For the
image-based text retrieval task, ViLBERT and UNITER on Flickr suffer a small drop in R@10
when DG is used. On the same task, UNITER on the COCO 5K test split decreases more when DG
is used. However, note that on both splits of COCO, ViLBERT is a noticeably stronger model, and
using DG improves its performance.
2.5.3 Zero/Few-Shot and Transfer Learning
Transfer across Datasets. Table 2.3 illustrates that the learned representations assisted by the
DG have better transferability when applied to another dataset (TARGET DOMAIN) that is different
from the SOURCE DOMAIN dataset on which the DG is based. Note that the representations are not
fine-tuned on the TARGET DOMAIN. The improvement in the direction COCO→Flickr is stronger
than in the reverse one, presumably because the COCO dataset is bigger than Flickr.
Table 2.3: Transfer Learning Result for Text-based Image Retrieval
SOURCE→TARGET      FLICKR→COCO          COCO→FLICKR
                   R@1     RSUM         R@1     RSUM
ViLBERT 43.5 199.5 49.0 209.0
+ SOURCE DG 44.9 200.5 52.8 218.2
Table 2.4: Zero/Few-shot Learning for Referring Expression
Setting →    0% (Zero-shot)    25%    50%    100%
Method Val TestA TestB Val TestA TestB Val TestA TestB Val TestA TestB
ViLBERT 35.7 41.8 29.5 67.2 74.0 57.1 68.8 75.6 59.4 71.0 76.8 61.1
+ COCO-DG 36.1 43.3 29.6 67.4 74.5 57.3 69.3 76.6 59.3 71.0 77.0 60.8
Table 2.5: Image Recognition on UNSEEN Attribute-Object Pairs on the Mit-State Dataset
Method Top-1 Top-2 Top-3
VisProd [156] 13.6 16.1 20.6
RedWine [156] 12.1 21.2 27.6
SymNet [134] 19.9 28.2 33.8
ViLBERT pre-trained on
N/A 16.2 26.3 33.3
COCO 17.9 28.8 36.2
COCO-DG 19.4 30.4 37.6
Zero/Few-shot Learning for Referring Expression. We evaluate our model on the task of
referring expression, a supervised learning task, in the setting of zero/few-shot transfer learning.
In zero-shot learning, we didn’t fine-tune the model on the referring expression dataset (i.e.
RefCOCO+). Instead, we performed a “counterfactual” inference, where we measure the drop
in the compatibility score (between a text describing the referring object and the image of all
candidate regions) as we removed individual candidates results. The region that causes the biggest
drop of compatibility score is selected. As a result, the selected region is most likely to correspond
to the description. In the setting of few-shot learning, we fine-tune our COCO-pre-trained model
on the task of referring expression in an end-to-end fashion on the referring expression dataset (i.e.
RefCOCO+).
The results in Table 2.4 suggest that when the amount of labeled data is limited, training with
DG performs better than training without. When the amount of data is sufficient for end-to-end
training, the advantage of training with DG diminishes.
Table 2.6: Ablation Studies of Learning from DG
ViLBERT variants →                              Reduced    Full
w/o DG                                          215.4      236.7
w/ DG
  + ℓ_MATCH                                     221.5      236.5
  DG HARD NEGATIVES
    + ℓ_MATCH                                   228.4      241.7
    + ℓ_MATCH + ℓ_SPEC                          228.8      242.6
    + ℓ_MATCH + ℓ_SPEC + ℓ_EDGE                 231.2      243.3
Compositional Attribute-Object Recognition. We evaluate our model for supervised compo-
sitional attribute-object recognition [156], and report results on recognizing UNSEEN attribute-
object labels on the Mit-State test data [86]. Specifically, we treat the text of image labels (i.e.,
attribute-object pairs as compound phrases) as the sentences to fine-tune the ViLBERT models,
using the $\ell_{\text{MATCH}}$ objective. Table 2.5 reports the results (in top-K accuracies) of both prior methods
and variants of ViLBERT, which are trained from scratch (N/A), pre-trained on COCO, and pre-trained on COCO-DG,
respectively. ViLBERT models pre-trained with parallel pairs of images and texts (i.e., COCO
and COCO-DG) improve significantly over the baseline that is trained on Mit-State from
scratch. The model pre-trained with COCO-DG achieves the best results among the ViLBERT variants.
It performs on par with the previous state-of-the-art method in top-1 accuracy and outperforms
them in top-2 and top-3 accuracies.
2.5.4 Ablation Studies
The rich structures encoded in the DGs give rise to several components that can be incorporated
into learning representations. We study whether they are beneficial to the performances on the
downstream task of text-based image retrieval. In the notation of §2.4, those components are: (1)
removing "DG HARD NEGATIVES" from the $\ell_{\text{MATCH}}$ loss and only using the other 3 types of negative
samples (§2.4.1); (2) aligning images with more specific text descriptions (§2.4.2); (3) predicting the
existence of edges between pairs of nodes (§2.4.3).
Table 2.6 shows the results from the ablation studies. We report results on two versions of
ViLBERT: In ViLBERT (reduced), the number of parameters in the model is significantly reduced
by making the model less deep, and thus faster for development. Instead of being pre-trained, they
are trained on the Flickr dataset directly for 15 epochs with a minibatch size of 96 and a learning
rate of 4e-5. In ViLBERT (Full), we use the aforementioned settings. We report RSUM on the
Flickr dataset for the task of text-based image retrieval.
[Figure 2.2 graphics: Mean AP by level in the DG hierarchy — ViLBERT: 70.8 (LEAF), 42.6 (LEAF-1), 32.0 (LEAF-2); ViLBERT w/ DG: 74.1 (LEAF), 52.4 (LEAF-1), 44.2 (LEAF-2).]
Figure 2.2: Image Retrieval using Mid-level Linguistic Expression on Flickr Denotation Graph. The results
are reported in Mean Average Precision (Mean AP).
All models with DG perform better than the models without DG. Secondly, the components of
DG HARD NEGATIVES, $\ell_{\text{SPEC}}$, and $\ell_{\text{EDGE}}$ contribute positively and their gains are cumulative.
2.5.5 Image Retrieval from Abstract Concepts
The leaf nodes in a DG correspond to complete sentences describing images. The inner nodes are
shorter phrases that describe more abstract concepts and correspond to broader sets of images;
refer to Table 2.1 for some key statistics in this aspect.
Fig. 2.2 contrasts how well abstract concepts can be used to retrieve images. The concepts
are the language expressions corresponding to the leaf nodes, the nodes that are one level above
(LEAF-1), or two levels above (LEAF-2) the leaf nodes from the Flickr-DG. Since abstract concepts
tend to correspond to multiple images, we use mean averaged precision (mAP) to measure the
retrieval results. ViLBERT+DG outperforms ViLBERT significantly. The improvement is also
stronger when the concepts are more abstract.
It is interesting to note that while the $\ell_{\text{MATCH}}$ used in ViLBERT+DG incorporates learning
representations to align images at both specific and abstract levels, such learning benefits all levels.
The improvement of retrieving at abstract levels does not sacrifice the retrieval at specific levels.
Chapter 3
Hierarchical Modeling of Video and Text
Visual data and text data are composed of information at multiple granularities. A video can
describe a complex scene that is composed of multiple clips or shots, where each depicts a
semantically coherent event or action. Similarly, a paragraph may contain sentences with different
topics, which collectively convey a coherent message or story. In this paper, we investigate
the modeling techniques for such hierarchical sequential data where there are correspondences
across multiple modalities. Specifically, we introduce hierarchical sequence embedding (HSE), a
generic model for embedding sequential data of different modalities into hierarchically semantic
spaces, with either explicit or implicit correspondence information. We perform empirical studies
on large-scale video and paragraph retrieval datasets and demonstrate superior performance by
the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings
when applied to downstream tasks. We show its utility in zero-shot action recognition and video
captioning.
3.1 Motivation
Recently, there has been an intensive interest in multi-modal learning of vision and language.
A few challenging tasks have been proposed: visual semantic embedding (VSE) [38, 100, 106],
image captioning [92, 137, 227, 248], and visual question answering (VQA) [9, 28, 267]. To
jointly understand these two modalities of data and make inference over them, the main intuition
is that different types of data can share a common semantic representation space. Examples are
embedding images and the visual categories [56], embedding images and texts for VSE [106],
and embedding images, questions, and answers for VQA [79]. Once embedded into this common
(vector) space, similarity and distances among originally heterogeneous data can be captured by
learning algorithms.
While there has been a rich study on how to discover this shared semantic representation on
structures such as images, noun phrases (visual object or action categories) and sentences (such as
Figure 3.1: Conceptual diagram of our approach for cross-modal modeling of video and texts. The main
idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own
semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are
mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close
to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where
their embeddings are close. See Fig. 3.3 and texts for details.
captions, questions, answers), less is known about how to achieve this on more complex structures
such as videos and paragraphs of texts.¹ There are conceptual challenges: while complex
structured data can be mapped to vector spaces (for instance, using deep architectures [67, 113]), it
is not clear whether the intrinsic structures in those data’s original format, after being transformed
to the vectorial representations, still maintain their correspondence and relevance across modalities.
Take the dense video description task as an example [112]. The task is to describe a video
which is made of short, coherent and meaningful clips. (Note that those clips could overlap
temporally.) Due to its narrowly focused semantic content, each clip is then describable with a
sentence. The description for the whole video is then a paragraph of texts with sentences linearly
arranged in order. Arguably, a corresponding pair of video and its descriptive paragraph can be
embedded into a semantic space where their embeddings are close to each other, using a vanilla
learning model by ignoring the boundaries of clips and sentences and treating as a sequence of
continually flowing visual frames and words. However, for such a modeling strategy, it is unclear
if and how the correspondences at the "lower level" (i.e., clips versus sentences) are useful
in either deriving the embeddings or using the embeddings to perform downstream tasks such as
video or text retrieval.
Addressing these deficiencies, we propose a novel cross-modal learning approach to model
both videos and texts jointly. The main idea is schematically illustrated in Fig. 3.1. Our approach
¹ We use paragraphs and documents interchangeably throughout this work.
is mindful of the intrinsic hierarchical structures of both videos and texts, and models them with
hierarchical sequence learning models such as GRUs [37]. However, as opposed to methods which
disregard low-level correspondences, we exploit them by deriving loss functions to ensure the
embeddings for the clips and sentences are also in accordance in their own (shared) semantic
space. Those low-level embeddings in turn strengthen the desiderata that videos and paragraphs
are embedded coherently. We demonstrate the advantages of the proposed model in a range of
tasks including video and text retrieval, zero-shot action recognition and video description.
3.2 Related Work
Hierarchical Sequence Embedding Models. Embedding images, videos, and textual data has
been very popular with the rise of deep learning. The most related works to ours are [128] and
[165]. The former models the paragraph using a hierarchical auto-encoder for text modeling [128],
and the latter uses a hierarchical RNN for videos and a one-layer RNN for caption generation. In
contrast, our work models both modalities hierarchically and learns the parameters by leveraging
the correspondences across modalities. Works motivated by other application scenarios usually
explore hierarchical modeling in one modality [162, 255, 263].
Cross-modal Embedding Learning. There has been a rich history to learn embeddings for
images and smaller linguistic units (such as words and noun phrases). DeViSE [56] learns to align
the latent embeddings of visual data and names of the visual object categories. ReViSE [218]
uses auto-encoders to derive embeddings for images and words which allow them to leverage
unlabeled data. In contrast to previous methods, our approach models both videos and texts hierar-
chically, bridging the embeddings at different granularities using discriminative loss computed on
corresponded pairs (i.e., videos vs. paragraphs).
Action Recognition in Video. Deep learning has brought significant improvement to video
understanding [51, 196, 216, 233, 243, 259] on large-scale action recognition datasets [68, 95, 201]
in the past decade. Most of them [51, 196, 233] employed deep convolutional neural networks
to learn appearance features and motion information, respectively. Based on the spatial-temporal
feature from these video modeling methods, we learn video semantic embedding to match the
holistic video representation to text representation. To evaluate the generalization of our learned
video semantic representation, we evaluate the model directly on the challenging action recognition
benchmark. (Details in Section 3.4.4)
Figure 3.2: Flat sequence modeling of videos and texts, ignoring the hierarchical structures in either and
regarding the video (paragraph) as a sequence of frames (words).
3.3 Method
We begin by describing the problem settings and introducing necessary notations. We then describe
the standard sequential modeling technique, ignoring the hierarchical structures in the data. Finally,
we describe our approach.
3.3.1 Settings and notations
We are interested in modeling videos and texts that are paired in correspondence. In a later
section, we describe how to generalize this to the setting where there is no one-to-one correspondence.
A video $v$ has $n$ clips (or subshots), where each clip $c_i$ contains $n_i$ frames. Each frame is
represented by a visual feature vector $x_{ij}$. This feature vector can be derived in many ways, for
instance, by feeding the frame (and its contextual frames) to a convolution neural net and using the
outputs from the penultimate layer. Likewise, we assume there is a paragraph of texts describing
the video. The paragraph $p$ contains $n$ sentences, one for each video clip. Let $s_i$ denote the $i$th
sentence and $w_{ij}$ the feature for the $j$th word out of its $n'_i$ words. We denote by $D = \{(v_k, p_k)\}$ a set
of corresponding videos and text descriptions.
We compute a clip embedding $c_i$ from the frame features $\{x_{ij}\}$, and a sentence
embedding $\mathbf{s}_i$ from the word features $\{w_{ij}\}$. From those, we derive $v$ and $p$, the embeddings for
the video and the paragraph, respectively.
3.3.2 Flat sequence modeling
Many sequence-to-sequence (SEQ2SEQ) methods leverage the encoder-decoder structure [144,
207] to model the process of transforming from the input sequence to the output sequence. In partic-
ular, the encoder, which is composed of a layer of long short-term memory units (LSTMs) [75] or
Gated Recurrent Units (GRUs) [37], transforms the input sequence into a vector as the embedding
$h$. The similarly constructed decoder takes $h$ as input and outputs another sequence.
Figure 3.3: Hierarchical cross-modal modeling of videos and texts. We differ from previous works [128,
165] in two aspects (components in red color): layer-wise reconstruction through decoders, and matching at
both global and local levels. See texts for details.
The original SEQ2SEQ methods do not consider the hierarchical structures in videos or texts.
We refer to these embeddings as flat sequence embeddings (FSE):

v = \text{ENC}_v(\{x_{ij}\}), \quad p = \text{ENC}_p(\{w_{ij}\})    (3.1)

Fig. 3.2 schematically illustrates this idea. We measure how well the videos and the texts are
aligned by the following cosine similarity:

\text{MATCH}(v, p) = v^\top p \,/\, (\|v\|\,\|p\|)    (3.2)
3.3.3 Hierarchical sequence modeling
One drawback of flat sequential modeling is that the LSTM/GRU layer needs to have a sufficient
number of units to model well the potential long-range dependency among video frames (or words).
This often complicates learning as the optimization becomes difficult [167].
We leverage the hierarchical structures in those data to overcome this deficiency: a video is
made of clips which are made of frames. In parallel, a paragraph of texts is made of sentences
which in turn are made of words. Similar ideas have been explored in [128, 165] and other previous
works. The basic idea is illustrated in Fig. 3.3, where we also add components in red color to
highlight our extensions.
Hierarchical sequence embedding. Given the hierarchical structures in Fig. 3.3, we can compute
the embeddings using the forward paths

c_i = \text{ENC}^{(1)}_v(\{x_{ij}, j = 1, 2, \ldots, n_i\}), \quad v = \text{ENC}^{(2)}_v(\{c_i\})
\mathbf{s}_i = \text{ENC}^{(1)}_p(\{w_{ij}, j = 1, 2, \ldots, n'_i\}), \quad p = \text{ENC}^{(2)}_p(\{\mathbf{s}_i\})    (3.3)
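A minimal PyTorch-style sketch of the two-level encoding in Eq. (3.3) for the video side is given below; the use of GRUs follows the chapter, but the hidden sizes, the choice of the last hidden state as the clip/video summary, and the module names are illustrative assumptions. The paragraph side would mirror it with word features and sentence-level encoders.

```python
# Hypothetical sketch of hierarchical sequence encoding (Eq. 3.3): a low-level
# GRU encodes frames into clip embeddings, and a high-level GRU encodes the
# clip embeddings into a single video embedding.
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):
    def __init__(self, feat_dim: int, hid_dim: int):
        super().__init__()
        self.low = nn.GRU(feat_dim, hid_dim, batch_first=True)   # ENC^(1)
        self.high = nn.GRU(hid_dim, hid_dim, batch_first=True)   # ENC^(2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (num_clips, num_frames, feat_dim) for one video."""
        _, h_clip = self.low(frames)           # h_clip: (1, num_clips, hid_dim)
        clip_emb = h_clip.squeeze(0)           # c_i for each clip
        _, h_video = self.high(clip_emb.unsqueeze(0))
        return h_video.squeeze(0).squeeze(0)   # v: (hid_dim,)


enc = HierarchicalEncoder(feat_dim=500, hid_dim=256)
video = torch.randn(3, 20, 500)                # 3 clips x 20 frames x 500-d features
print(enc(video).shape)                        # torch.Size([256])
```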
Learning with discriminative loss. When videos and texts have strong correspondences, i.e.,
clips and sentences are paired, we optimize the encoders such that videos and texts are matched.
To this end, we define two loss functions, corresponding to the matching at the low level and the
high level respectively:

\ell^{\text{HIGH}}_{\text{MATCH}} = \sum_k \sum_{k' \neq k} \big[\alpha - \text{MATCH}(v_k, p_k) + \text{MATCH}(v_{k'}, p_k)\big]_+ + \big[\alpha - \text{MATCH}(v_k, p_k) + \text{MATCH}(v_k, p_{k'})\big]_+    (3.4)

\ell^{\text{LOW}}_{\text{MATCH}} = \sum_k \sum_i \sum_{(k', i') \neq (k, i)} \big[\beta - \text{MATCH}(c_{ki}, \mathbf{s}_{ki}) + \text{MATCH}(c_{k'i'}, \mathbf{s}_{ki})\big]_+ + \big[\beta - \text{MATCH}(c_{ki}, \mathbf{s}_{ki}) + \text{MATCH}(c_{ki}, \mathbf{s}_{k'i'})\big]_+    (3.5)

These losses are margin-based losses [190], where $\alpha$ and $\beta$ are positive numbers serving as the margins to
separate matched pairs from unmatched ones. The function $[\cdot]_+$ is the standard hinge loss function.
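To illustrate Eq. (3.4), the sketch below computes the bidirectional margin loss over a batch whose k-th video matches its k-th paragraph; the margin value and tensor layout are assumptions, and the low-level loss of Eq. (3.5) has the same form over clip and sentence embeddings.

```python
# Hypothetical sketch of the high-level margin-based matching loss (Eq. 3.4).
import torch
import torch.nn.functional as F


def high_match_loss(v: torch.Tensor, p: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """v, p: (batch, dim) embeddings; row k of v matches row k of p."""
    sim = F.normalize(v, dim=1) @ F.normalize(p, dim=1).T   # cosine MATCH(v_k, p_k')
    pos = sim.diag().unsqueeze(1)                           # MATCH(v_k, p_k)
    cost_p = torch.clamp(margin - pos + sim, min=0)         # negatives: wrong paragraphs
    cost_v = torch.clamp(margin - pos.T + sim, min=0)       # negatives: wrong videos
    mask = torch.eye(sim.size(0), dtype=torch.bool)         # exclude the k' = k terms
    return cost_p.masked_fill(mask, 0).sum() + cost_v.masked_fill(mask, 0).sum()


print(high_match_loss(torch.randn(8, 256), torch.randn(8, 256)))
```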
Learning with contrastive loss. Assuming videos and texts are well clustered, we use the
following loss to model their clustering in their own space.
\ell^{\text{HIGH}}_{\text{CLUSTER}} = \sum_k \sum_{k' \neq k} \big[\gamma - 1 + \text{MATCH}(v_{k'}, v_k)\big]_+ + \big[\gamma - 1 + \text{MATCH}(p_{k'}, p_k)\big]_+    (3.6)

\ell^{\text{LOW}}_{\text{CLUSTER}} = \sum_k \sum_i \sum_{(k', i') \neq (k, i)} \big[\delta - 1 + \text{MATCH}(c_{k'i'}, c_{ki})\big]_+ + \big[\delta - 1 + \text{MATCH}(\mathbf{s}_{k'i'}, \mathbf{s}_{ki})\big]_+    (3.7)

where $\gamma$ and $\delta$ are again positive margins. Note that the self-matching values $\text{MATCH}(v_k, v_k)$ and $\text{MATCH}(p_k, p_k)$ are 1 by definition. This
loss can be computed on videos and texts alone and does not require them being matched.
Learning with unsupervised layer-wise reconstruction loss. Thus far, the matching loss fo-
cuses on matching across modality. The clustering loss focuses on separating between video/text
data so that they do not overlap. None of them, however, focuses on the quality of the modeling
data itself. In what follows, we propose a layer-wise reconstruction loss – when minimized, this
loss ensures the learned video/text embedding faithfully preserves information in the data.
We first introduce a set of layer-wise decoders for both videos and texts. The key idea is to
pair the encoders with decoders so that each pair of functions is an auto-encoder. Specifically, the
decoder is also a layer of LSTM/GRU units, generating sequences of data. Thus, at the level of
video (or paragraph), we will have a decoder to generate clips (or sentences). And at the level
of clips (or sentences), we will have a decoder to generate frames (or words). Concretely, we
would like to minimize the difference between what are generated by the decoders and what are
computed by encoders on the data. Let
\{\hat{c}_i\} = \text{DEC}^{(2)}_v(v), \quad \{\hat{\mathbf{s}}_i\} = \text{DEC}^{(2)}_p(p)    (3.8)

be the two (high-level) decoders for videos and texts respectively. And similarly, for the decoders
at the low level,

\{\hat{x}_{ij}\} = \text{DEC}^{(1)}_v(\hat{c}_i), \quad \{\hat{w}_{ij}\} = \text{DEC}^{(1)}_p(\hat{\mathbf{s}}_i)    (3.9)

where the low-level decoders take each generated clip and sentence embedding as input and
output sequences of generated frame and word embeddings.
\ell_{\text{RECONSTRUCT}}(v, p) = \sum_i \Big\{ \|\hat{c}_i - c_i\|_2^2 + \frac{1}{n_i} \sum_j \|\hat{x}_{ij} - x_{ij}\|_2^2 \Big\} + \sum_i \Big\{ \|\hat{\mathbf{s}}_i - \mathbf{s}_i\|_2^2 + \frac{1}{n'_i} \sum_j \|\hat{w}_{ij} - w_{ij}\|_2^2 \Big\}    (3.10)
Using those generated embeddings, we can construct a loss function characterizing how well
the encoders encode the data pair $(v, p)$ (see Eq. 3.10).
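A small sketch of the video half of the layer-wise reconstruction loss in Eq. (3.10), assuming the decoded clip and frame embeddings are already available; the text half is symmetric. All names and shapes are illustrative.

```python
# Hypothetical sketch of the video half of the reconstruction loss (Eq. 3.10),
# given decoded clip/frame embeddings and the encoder-side targets.
import torch


def video_reconstruction_loss(clip_hat, clip, frames_hat, frames):
    """clip_hat, clip: (num_clips, dim); frames_hat, frames: (num_clips, num_frames, feat_dim)."""
    clip_term = ((clip_hat - clip) ** 2).sum(dim=1)                    # ||c_hat_i - c_i||^2
    frame_term = ((frames_hat - frames) ** 2).sum(dim=2).mean(dim=1)   # (1/n_i) sum_j ||x_hat_ij - x_ij||^2
    return (clip_term + frame_term).sum()


loss = video_reconstruction_loss(torch.randn(3, 256), torch.randn(3, 256),
                                 torch.randn(3, 20, 500), torch.randn(3, 20, 500))
print(loss)
```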
3.3.4 Final learning objective and its extensions
The final learning objective is to balance all those loss quantities

\ell = \ell^{\text{HIGH}} + \ell^{\text{LOW}} + \rho \sum_k \ell_{\text{RECONSTRUCT}}(v_k, p_k)    (3.11)

where $\rho$ weights the reconstruction term, and the high-level and low-level losses are defined as

\ell^{\text{HIGH}} = \ell^{\text{HIGH}}_{\text{MATCH}} + \ell^{\text{HIGH}}_{\text{CLUSTER}}, \quad \ell^{\text{LOW}} = \ell^{\text{LOW}}_{\text{MATCH}} + \ell^{\text{LOW}}_{\text{CLUSTER}}    (3.12)
In our experiments, we will study the contribution by each term.
Learning under weak correspondences. Our idea can be also extended to the common setting
where only high-level alignments are available. In fact, high-level coarse alignments of data are
easier and more economical to obtain, compared to fine-grained alignments between each sub-level
sentence and video clip.
Since we do not have enough information to define the low-level matching loss $\ell^{\text{LOW}}_{\text{MATCH}}$ exactly,
we resort to approximation. We first define an averaged matching over all pairs of clips and
sentences for a pair of video and paragraph:

\overline{\text{MATCH}}(v, p) = \frac{1}{nm} \sum_{c_i} \sum_{\mathbf{s}_j} \text{MATCH}(c_i, \mathbf{s}_j)    (3.13)

where we relax the assumption that there is precisely the same number of sentences and clips.
We use this averaged quantity to approximate the low-level matching loss

\tilde{\ell}^{\text{LOW}}_{\text{MATCH}} = \sum_k \sum_{k' \neq k} \big[\alpha' - \overline{\text{MATCH}}(v_k, p_k) + \overline{\text{MATCH}}(v_{k'}, p_k)\big]_+ + \big[\alpha' - \overline{\text{MATCH}}(v_k, p_k) + \overline{\text{MATCH}}(v_k, p_{k'})\big]_+    (3.14)
This objective will push a clip embedding closer to the embeddings of the sentences belonging
to the corresponding video (and vice versa for sentences to the corresponding video). A more
refined approximation involving a soft assignment of matching can also be derived, which will be
left for future work.
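A hedged sketch of the averaged matching in Eq. (3.13), which can replace the clip-level scores inside the margin loss of Eq. (3.14) when only video-paragraph alignment is available; the shapes and names are illustrative.

```python
# Hypothetical sketch of the averaged clip-sentence matching in Eq. (3.13).
import torch
import torch.nn.functional as F


def averaged_match(clips: torch.Tensor, sents: torch.Tensor) -> torch.Tensor:
    """clips: (n, dim) clip embeddings of one video; sents: (m, dim) sentence embeddings."""
    sim = F.normalize(clips, dim=1) @ F.normalize(sents, dim=1).T  # cosine MATCH(c_i, s_j)
    return sim.mean()                                              # (1 / nm) * sum_i sum_j


print(averaged_match(torch.randn(4, 256), torch.randn(3, 256)))
```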
3.4 Experiments
We evaluate and demonstrate the advantage of learning hierarchical cross-modal embedding with
our proposed approach on several tasks: (i) large-scale video-paragraph retrieval (Section 3.4.2),
(ii) down-stream tasks such as video captioning (Section 3.4.3), and (iii) action recognition
(Section 3.4.4).
3.4.1 Experiment Setups
Datasets. We evaluate both baseline approaches and our method on three large-scale video
datasets:
(1) ActivityNet Dense Caption [112]. This variant of ActivityNet contains densely labeled
temporal segments for 10,009 training and 4,917 validation videos (i.e., val1 split). Each video
contains multiple clips and a corresponding paragraph with sentences aligned to the clips. In
all our retrieval experiments, we follow the setting in [112] and report retrieval metrics such as
recall@k (k=1,5,50) and median rank (MR). Following [112] we use ground-truth clip proposals
as input for our main results. In addition, we also study our algorithm with a heuristic proposal
method (see Section 3.4.2). In the main text, we report all results on validation set 1 (val1). For
video caption experiment, we follow [112] and evaluate on the validation set (val1 and val2).
Instead of using an action proposal method, ground-truth video segmentation is used for training and
evaluation. Performances are reported in Bleu@K, METEOR and CIDEr.
(2) DiDeMo [8]. The original goal of DiDeMo dataset is to locate the temporal segments that
correspond to unambiguous natural language descriptions in a video. We re-purpose it for the
task of video and paragraph retrieval. It contains 10,464 videos, 26,892 video clips and 40,543
sentences. The training, validation and testing split contain 8,395, 1,065 and 1,004 videos
and corresponding paragraphs, respectively. Each video clip may correspond to one or many
sentences. For the video and paragraph retrieval task, paragraphs are constructed by concatenating
all sentences that correspond to one video. Similar to the setting in ActivityNet, we use the
ground-truth clip proposals as input.
(3) ActivityNet Action Recognition [68]. We use ActivityNet V1.3 for the aforementioned off-the-
shelf action recognition. The dataset contains 14,950 untrimmed videos with 200 action classes,
which is split into training and validation set. Training and validation set have 10,024 and 4,926
videos, respectively. Among all 200 action classes, 189 of the action classes have been covered by
the vocabulary extracted from the paragraph corpus and 11 of the classes are unseen.
Baselines and our methods. We use the FSE method (as described in Section 3.3.2) as a baseline
model. It ignores the clip and sentence structures in the videos and paragraphs. We train a one-layer
GRU directly on the extracted frame/word features and take their outputs as the embedding
representing each modality. Results with C3D features are also included (see Table 3.1).
Our method has two variants: when ρ = 0, the method (HSE[ρ=0]) simplifies to a stacked/hierarchical
sequence model as used in [128, 165], except that those works do not consider cross-modal
learning with a cross-modal matching loss while we do. We consider this a very strong baseline.
When ρ ≠ 0, HSE takes full advantage of layer-wise reconstruction with multiple decoders at
different levels of the hierarchy. In our experiments, this method gives the best results.
Table 3.1: Video paragraph retrieval on ActivityNet (val1). Standard deviations from 3 randomly
seeded experiments are also reported.
                        Paragraph ⇒ Video                       Video ⇒ Paragraph
                        R@1       R@5       R@50      MR        R@1       R@5       R@50      MR
C3D Feature with Dimensionality Reduction [216]
LSTM-YT [225]           0.0       4.0       24.0      102.0     0.0       7.0       38.0      98.0
NO CONTEXT [226]        5.0       14.0      32.0      78.0      7.0       18.0      45.0      56.0
DENSE online [112]      10.0      32.0      60.0      36.0      17.0      34.0      70.0      33.0
DENSE full [112]        14.0      32.0      65.0      34.0      18.0      36.0      74.0      32.0
FSE                     12.6±0.4  33.2±0.3  77.6±0.3  12.0      11.5±0.5  31.8±0.3  77.7±0.3  13.0
HSE[ρ=0]                32.8±0.3  62.3±0.4  90.5±0.1  3.0       32.0±0.6  62.5±0.5  90.5±0.3  3.0
HSE[ρ=5e-4]             32.7±0.7  63.2±0.4  90.8±0.2  3.0       32.8±0.4  63.2±0.2  91.2±0.3  3.0
Inception-V3 pre-trained on Kinetics [235]
FSE                     18.2±0.2  44.8±0.4  89.1±0.3  7.0       16.7±0.8  43.1±1.1  88.4±0.3  7.3
HSE[ρ=0]                43.9±0.6  75.8±0.2  96.9±0.3  2.0       43.3±0.6  75.3±0.6  96.6±0.2  2.0
HSE[ρ=5e-4]             44.4±0.5  76.7±0.3  97.1±0.1  2.0       44.2±0.6  76.7±0.3  97.0±0.3  2.0
Implementation Details. Following the settings of [112], we extract the C3D features [216]
pretrained on Sports-1M dataset [93] for raw videos in ActivityNet. PCA is then used to reduce
the dimensionality of the feature to 500. To verify the generalization of our model across different
sets of visual features, as well as to leverage state-of-the-art video models, we also employ the
recently proposed TSN-Inception V3 network [233] pre-trained on the Kinetics [95] dataset to extract
visual features. Similarly, we extract TSN-Inception V3 features for videos in the DiDeMo dataset. We
do not fine-tune the convolutional neural network on the videos during training, to reduce the
computational cost. For word embeddings, we use 300-dimensional GloVe [168] features pre-trained
on 840B common web-crawl tokens. In all our experiments, we use GRUs as sequence encoders. For
HSE, we choose ρ = 0.0005 by tuning this hyper-parameter on the val2 set of the ActivityNet
retrieval dataset. The same value is used for experiments on DiDeMo, without further tuning.
(More details in the Supp. Material.)
3.4.2 Results on Video-Paragraph Retrieval
In this section, we first compare our proposed approach to the state-of-the-art algorithms, and then
perform ablation studies on variants of our method, to evaluate the proposed learning objectives.
Main Results. We report our results on the ActivityNet Dense Caption val1 set and the DiDeMo
test set in Table 3.1 and Table 3.2, respectively. For both C3D and Inception V3 features, we
observe that our hierarchical models improve the previous state-of-the-art result by
Table 3.2: Video paragraph retrieval on the DiDeMo dataset. The S2VT method is re-implemented for
the retrieval task.
                Paragraph ⇒ Video                       Video ⇒ Paragraph
                R@1       R@5       R@50      MR        R@1       R@5       R@50      MR
S2VT [226]      11.9      33.6      76.5      13.0      13.2      33.6      76.5      15.0
FSE             13.9±0.7  36.0±0.8  78.9±1.6  11.0      13.1±0.5  33.9±0.4  78.0±0.8  12.0
HSE[ρ=0]        30.2±0.8  60.5±1.1  91.8±0.7  3.3       29.4±0.4  58.9±0.7  91.9±0.6  3.7
HSE[ρ=5e-4]     29.7±0.2  60.3±0.9  92.4±0.3  3.3       30.1±1.2  59.2±0.9  92.1±0.5  3.0
Table 3.3: Ablation studies on the weak alignment learning objective.
                                        Paragraph ⇒ Video                 Video ⇒ Paragraph
Dataset       Model         ℓ^LOW       R@1       R@5       R@50          R@1       R@5       R@50
ActivityNet   HSE[ρ=0]      ✗           41.8±0.4  74.1±0.6  96.6±0.1      40.5±0.4  73.9±0.6  96.3±0.1
                            WEAK        42.6±0.4  74.8±0.3  96.7±0.1      41.3±0.2  74.7±0.4  96.5±0.1
                            STRONG      43.9±0.6  75.8±0.2  96.9±0.3      43.3±0.6  75.3±0.6  96.6±0.2
              HSE[ρ=5e-4]   ✗           42.5±0.3  74.8±0.1  96.9±0.0      41.6±0.2  74.7±0.6  96.6±0.1
                            WEAK        43.0±0.6  75.2±0.4  96.9±0.1      41.5±0.1  75.2±0.6  96.8±0.2
                            STRONG      44.4±0.5  76.7±0.3  97.1±0.1      44.2±0.6  76.7±0.3  97.0±0.3
DiDeMo        HSE[ρ=0]      ✗           27.1±1.9  59.1±0.4  92.2±0.3      27.3±1.0  57.6±0.5  91.3±1.2
                            WEAK        28.0±0.8  58.9±0.5  91.4±0.6      28.3±0.3  58.5±0.6  91.2±0.3
                            STRONG      30.2±0.8  60.5±1.1  91.8±0.7      29.4±0.4  58.9±0.7  91.9±0.6
              HSE[ρ=5e-4]   ✗           28.1±0.8  59.5±1.1  91.7±0.7      28.2±0.8  58.1±0.5  90.9±0.5
                            WEAK        28.7±2.1  59.1±0.2  91.6±0.7      28.3±0.8  59.2±0.6  91.1±0.1
                            STRONG      29.7±0.2  60.3±0.9  92.4±0.3      30.1±1.2  59.2±0.9  92.1±0.5
DENSE full [112], which models the flat sequence of clips, outperforms our FSE baseline because it augments each segment embedding with a weighted, aggregated context embedding. However, it fails to model the more complex temporal structure of videos and paragraphs, which leads to inferior performance compared to our HSE models.
Compared to our flat baseline model, both HSE[λ=0] and HSE[λ=5e-4] improve performance on all retrieval metrics. This implies that hierarchical modeling effectively captures the structural information and the relationships among clips and sentences within videos and paragraphs. Moreover, we observe that HSE[λ=5e-4] consistently improves over HSE[λ=0] across most retrieval metrics on both datasets, which we attribute to the layer-wise reconstruction objective and which suggests that it leads to better generalization.
Low-level loss is beneficial. Table 3.1 and Table 3.2 show results obtained by optimizing both the low-level and high-level objectives. In Table 3.3, we further perform ablation studies on these learning objectives.
Table 3.4: Performance of using proposals instead of ground-truth segments on the ActivityNet dataset.
                                        P → V              V → P
Proposal Method        # Segments    R@1     R@5        R@1     R@5      Precision   Recall
HSE + SSN                  -         10.4    31.9       10.8    31.7        1.5       17.1
HSE + UNIFORM              1         18.0    45.5       16.5    44.9       63.2       31.1
                           2         20.0    48.9       18.4    47.6       61.8       46.0
                           3         20.0    48.6       18.2    47.9       55.3       50.6
                           4         20.5    49.3       18.7    48.1       43.2       45.5
HSE + GROUND TRUTH         -         44.4    76.7       44.2    76.7      100.0      100.0
FSE                        -         18.2    44.8       16.7    43.1        -          -
Note that rows marked with ✗ represent learning without the low-level loss ℓ_low. In all scenarios, joint learning with both low-level and high-level correspondences improves the retrieval performance.
Learning with weak correspondences at the low level. As mentioned in Section 3.3, our method can be extended to learn the low-level embeddings with weak correspondence. We evaluate its effectiveness on both the ActivityNet and DiDeMo datasets. Performance is listed in Table 3.3. Note that for the rows marked "WEAK", no auxiliary alignments between sentences and clips are available during training.

Clearly, including the low-level loss with weak correspondence (i.e., correspondence only at the high level) obtains superior performance compared to models that do not include the low-level loss at all. In several cases, it even attains results as competitive as including the low-level loss with strong correspondences at the clip/sentence level.
Ablation study with different learning objectives. We report ablation studies of the different losses on the ActivityNet video-paragraph retrieval task in Table 3.5. We use the Inception-V3 features and follow the same setting for training HSE. Each time, we remove one loss and report the resulting performance. Note that the reconstruction loss and the low-level matching loss are the most useful.
Table 3.5: Ablation study on the learning objectives.
                           Paragraph → Video       Video → Paragraph
Method                     R@1        R@5          R@1        R@5
HSE w/o high-cluster       44.6       76.4         44.2       76.1
HSE w/o low-match          40.9       73.6         39.8       73.6
HSE w/o low-cluster        44.6       76.6         43.9       76.4
HSE w/o reconstruction     43.9       75.8         43.3       75.3
HSE w/ all losses          44.4       76.7         44.2       76.7
Low-level loss is beneficial. As mentioned above (see Table 3.1 and Table 3.2), learning with the low-level objectives is beneficial for our full model. To better understand this, we also plot the recall (in %) with regard to the rank of the video/paragraph for a query as supporting evidence. The results are shown in Fig. 3.4.
[Figure 3.4 plots: recall vs. rank curves for Paragraph → Video and Video → Paragraph retrieval, comparing FSE, HSE without ℓ_low, and HSE; panel (a) HSE[λ=0], panel (b) HSE.]
Figure 3.4: Recall vs. rank curves for video-to-paragraph and paragraph-to-video retrieval with both HSE[λ=0] and HSE. All results are collected from models based on Inception-V3 features on ActivityNet validation set 1.
Ablation study on the reconstruction loss weight. Here we study the influence of the loss balancing term λ by experimenting with multiple choices of λ in a controlled setting. We conduct this study on validation set 2 (val2) of ActivityNet with Inception-V3 visual features as input. Detailed results are shown in Table 3.6. We find that the retrieval performance (R@1 and R@5) peaks at λ = 0.0005. Therefore, as stated above, we set λ to 0.0005 in all our experiments.
Table 3.6: Ablation study of λ on ActivityNet (val2).
                      Paragraph → Video                  Video → Paragraph
                      R@1     R@5     R@50    MR         R@1     R@5     R@50    MR
Inception-V3 pre-trained on Kinetics [235]
HSE[λ=0.05]           25.0    54.9    92.6    5.0        25.1    55.4    92.4    4.0
HSE[λ=0.005]          32.4    62.2    93.8    3.0        32.1    63.0    93.7    3.0
HSE[λ=0.0005]         33.2    62.9    93.6    3.0        32.6    62.8    93.5    3.0
HSE[λ=0.00005]        33.2    62.9    93.8    3.0        32.2    62.5    93.6    3.0
HSE[λ=0]              32.2    61.5    93.6    3.0        31.5    62.0    93.3    3.0
[Figure 3.5 plots: Recall@1 and Recall@5 of HSE and FSE as functions of the number of observed segments/sentences, for video-to-paragraph and paragraph-to-video retrieval.]
Figure 3.5: Retrieval performance improves given more observed clips/sentences.
Learning with video proposal methods. As using ground-truth temporal segments of videos is not a natural assumption, we perform experiments to validate the effectiveness of our method with proposal methods. Specifically, we experiment with two different proposal approaches: SSN [264] pre-trained for ActivityNet action proposals, and a heuristic uniform proposal. For a uniform proposal of K segments, we simply segment a video into K non-overlapping, equal-length temporal segments (see the sketch below).
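As an illustration, here is a minimal sketch of the heuristic uniform proposal; the function name and the frame-index convention are assumptions made for exposition.

```python
import numpy as np

def uniform_proposals(num_frames, k):
    """Split a video of num_frames frames into k non-overlapping, equal-length temporal segments.
    Returns a list of (start, end) frame indices with end exclusive."""
    boundaries = np.linspace(0, num_frames, k + 1, dtype=int)
    return [(int(boundaries[i]), int(boundaries[i + 1])) for i in range(k)]

# Example: a 100-frame video split into K = 3 segments -> [(0, 33), (33, 66), (66, 100)]
print(uniform_proposals(100, 3))
```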
The results are summarized in Table 3.4 (with the precision and recall columns being the performance metrics of the proposal methods). There are two main conclusions from these results: (1) the segments of the ActivityNet Dense Caption dataset deviate significantly from the action proposals, so a pre-trained action proposal algorithm performs poorly; (2) even with heuristic proposal methods, the performance of HSE is mostly better than (or comparable with) that of FSE. We leave identifying stronger proposal methods to future work.
Retrieval with incomplete videos and paragraphs. In this section, we investigate the correlation between the number of observed clips and sentences and the models' video-paragraph retrieval performance. In this experiment, we gradually increase the number of clips and sentences observed by our model during testing and plot the results in Figure 3.5 (on ActivityNet). When a video/paragraph contains fewer clips/sentences than the required number of observations, we take all available clips/sentences to compute the video/paragraph embedding. (On average, there are 3.65 clips/sentences per video/paragraph.)

From Figure 3.5, we note that increasing the number of observed clips and sentences leads to improved retrieval performance. We can see that when observing only one clip and sentence, our model already outperforms the previous state-of-the-art method as well as our baseline FSE, which observes the entire sequence. Observing fewer than the average number of clips and sentences, our learned model already achieves 70% of its final performance.
Table 3.7: ActivityNet video captioning (B@n: BLEU@n; M: METEOR; C: CIDEr).
                  B@1    B@2    B@3    B@4    M      C
LSTM-YT [225]     18.2   7.4    3.2    1.2    6.6    14.9
S2VT [226]        20.4   9.0    4.6    2.6    7.9    21.0
HRNN [255]        19.5   8.8    4.3    2.5    8.0    20.2
DENSE [112]       26.5   13.5   7.1    4.0    9.5    24.6
DVC [133]         19.6   9.9    4.6    1.6    10.3   25.2
FSE               17.9   8.2    3.6    1.7    8.7    32.1
HSE[λ=0]          19.6   9.4    4.2    2.0    9.2    39.5
HSE[λ=5e-4]       19.8   9.4    4.3    2.1    9.2    39.8
Table 3.8: ActivityNet action recognition.
                  Zero-Shot Transfer        Train Classifier
                  Top-1      Top-5          Top-1      Top-5
FV-VAE [175]        -          -            78.6         -
TSN [234]           -          -            88.1         -
FSE               48.3       79.4           74.4       94.1
HSE[λ=0]          50.2       84.4           74.7       94.3
HSE[λ=5e-4]       51.4       83.8           75.3       94.3
RANDOM             0.5        2.5            0.5        2.5
3.4.3 Results on Video Captioning
Setup. In addition to the video paragraph retrieval, we evaluate our learned embeddings for
video captioning. Specifically, we follow [112] and train a caption model [227] on top of the pre-
trained video embeddings. Similar to [112], we concatenate the clip-level feature with contextual
video-level feature, and build a two-layer LSTM as a caption generator. We randomly initialized
the word embedding as well as LSTM and trained the model for 25 epochs with learning rate of
0.001. We use the ground-truth proposal throughout training and evaluation following the setting
of [112, 133]. During testing, beam search is used with beam=5. Results are reported in Table 3.7.
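Below is a minimal sketch of such a two-layer LSTM caption generator conditioned on the concatenated clip-level and contextual video-level features; layer sizes, names, and the exact conditioning scheme are assumptions rather than the implementation used for the reported numbers.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Two-layer LSTM decoder over the vocabulary, conditioned on a clip embedding and its video context."""
    def __init__(self, vocab_size, word_dim, context_dim, hidden_dim):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim + context_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, clip_emb, video_emb, word_ids):
        # clip_emb: (B, D_clip); video_emb: (B, D_video); word_ids: (B, T) previously generated tokens.
        words = self.word_embed(word_ids)                                  # (B, T, word_dim)
        context = torch.cat([clip_emb, video_emb], dim=-1)                 # (B, context_dim)
        context = context.unsqueeze(1).expand(-1, words.size(1), -1)       # repeat the context along time
        hidden, _ = self.lstm(torch.cat([words, context], dim=-1))
        return self.out(hidden)                                            # (B, T, vocab_size) logits
```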
Results. We observe that our proposed model outperforms the FSE baseline on most metrics. Meanwhile, HSE also improves over previous approaches such as LSTM-YT, S2VT, and HRNN on B@2, METEOR, and CIDEr by a clear margin. HSE achieves results comparable to DVC on all criteria. However, both HSE and HSE[λ=0] fail to match the performance of DENSE [112]. This may be because DENSE [112] carefully learns to aggregate the context information of a video clip to produce high-quality captions, whereas our embedding model, optimized for video-paragraph retrieval, is not equipped with such a capability. It is worth noting, however, that our model obtains a higher CIDEr score than all existing methods. We empirically observe that fine-tuning the pre-trained video embeddings does not lead to further performance improvement.
3.4.4 Results on Action Recognition
To evaluate the effectiveness of our model, we take the off-the-shelf clip-level embeddings trained
on video-paragraph retrieval for action recognition (on ActivityNet with non-overlapping training
and validation data). We use two action recognition settings to evaluate, namely zero-shot transfer
and classification.
Setup. In the zero-shot setting, we directly evaluate the low-level embedding model learned in video-text retrieval by treating the action phrases as sentences and using the sentence-level encoder to produce the action embeddings. We take the raw video and apply the clip-level video encoder to extract features for retrieving actions (a sketch of this protocol follows below). No re-training is performed, and no model has access to the actions' data distribution. Note that although actions are not directly used as sentences during training, some are available as verbs in the vocabulary. Meanwhile, since we use pre-trained word vectors (GloVe), transfer to unseen actions is possible. In the classification setting, we discriminatively train a simple classifier and measure the classification accuracy. Concretely, a one-hidden-layer Multi-Layer Perceptron (MLP) is trained on the clip-level embeddings. We do not fine-tune the pre-trained clip-level video embeddings here.
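A minimal sketch of the zero-shot protocol is shown below, assuming frozen clip-level and sentence-level encoders and a tokenizer; all function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def zero_shot_action_recognition(clip_encoder, sentence_encoder, tokenize, clip_features, action_phrases):
    """Rank action classes for each clip by cosine similarity between the clip embedding and the
    embedding of the action phrase encoded as a short sentence. No re-training is involved."""
    with torch.no_grad():
        action_emb = torch.stack([sentence_encoder(tokenize(p)) for p in action_phrases])  # (C, D)
        clip_emb = clip_encoder(clip_features)                                             # (N, D)
        sim = F.normalize(clip_emb, dim=-1) @ F.normalize(action_emb, dim=-1).t()          # (N, C)
    return sim.topk(5, dim=-1).indices  # top-5 predicted action classes per clip
```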
Results. We report results for the above two settings on the ActivityNet validation set (see Table 3.8). We observe that our learned low-level embeddings allow superior zero-shot transfer to action recognition, without accessing any training data. This indicates that the semantics of actions are indeed well preserved in the learned embedding models. More interestingly, we can see that both HSE[λ=0] and HSE improve over FSE. This shows that our hierarchical modeling of video benefits not only the high-level embeddings but also the low-level embeddings. A similar trend is observed in the classification setting. Our method achieves performance comparable to state-of-the-art video modeling approaches such as FV-VAE [175]. Note that TSN [234] is fully supervised and thus not directly comparable.
3.4.5 Qualitative Results
We use t-SNE [147] to visualize our results in the video to paragraph and paragraph to video
retrieval task. Fig 3.6 shows that the proposed method can cluster the embedding of videos with
regard to its action classes.
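For reference, the visualization can be reproduced with a short scikit-learn/matplotlib sketch along the following lines; the argument names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_video_embeddings(video_embeddings, action_labels):
    """Project (N, D) off-the-shelf video embeddings to 2-D with t-SNE and color points by action class."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(video_embeddings)
    for c in np.unique(action_labels):
        mask = action_labels == c
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(c))
    plt.legend(fontsize=6)
    plt.show()
```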
[Figure 3.6 panels: t-SNE projections of video embeddings on ActivityNet training data (left) and validation data (right); points are colored by action class (Arm wrestling, Capoeira, Futsal, Longboarding, Playing congas, Playing drums, Rafting, Snow tubing, Surfing, Using the balance beam).]
Figure 3.6: t-SNE visualization of the off-the-shelf video embeddings of HSE on the ActivityNet v1.3 training and validation sets. Points are marked with their action classes.
Part II
Learning Language in Embodied Experiences
Chapter 4
Synthesized Policy for Compositional Generalization
The ability to transfer in reinforcement learning is key towards building an agent of general artificial intelligence. In this chapter, we consider the problem of learning to simultaneously transfer across both environments (ε) and tasks (τ), and, probably more importantly, doing so by learning from only sparse (ε, τ) pairs out of all the possible combinations of commands. We propose a novel compositional neural network architecture which depicts a meta rule for composing policies from environment and task embeddings. Notably, one of the main challenges is to learn the embeddings jointly with the meta rule. We further propose new training methods to disentangle the embeddings, making them both distinctive signatures of the environments and tasks and effective building blocks for composing the policies. Experiments on GRIDWORLD and THOR, in which the agent takes as input an egocentric view, show that our approach gives rise to high success rates on all the (ε, τ) pairs after learning from only 40% of them.
4.1 Motivation
Remarkable progress has been made in reinforcement learning in the last few years [158, 195, 229].
Among these, an agent learns to discover its best policy of actions to accomplish a instruction
(or command), by interacting with the environment. However, these concrete commands the
agent learns to tackle are often tied for a specific pair of the environment (") and the task ().
Consequently, when the environment changes even slightly, the agent’s performance deteriorates
drastically [83, 262]. Thus, being able to swiftly adapt to new environments and transfer skills to
new tasks is crucial for the agents to act in real-world settings.
How can we achieve swift adaptation and transfer? In this chapter, we consider several
progressively difficult settings. In the first setting, the agent needs to adapt and transfer to a new
pair of environment and task, when the agent has been exposed to the environment and the task
before (but not simultaneously). Our goal is to use as few as possible seen pairs (i.e., a subset out
of all possible (",) combinations, as sparse as possible) to train the agent.
[Figure 4.1 panels: (a) Transfer Setting 1, (b) Transfer Setting 2, (c) Transfer Setting 3; each panel shows an M environments × N tasks grid of seen and unseen (ε, τ) combinations.]
Figure 4.1: We consider a transfer learning scenario in reinforcement learning with transfer across both tasks and environments. Three different settings are presented here (see text for details). The red dots denote SEEN combinations, the gray dots denote UNSEEN combinations, and the arrows (→) denote transfer directions.
In the second setting, the agent needs to adapt and transfer across either environments or tasks, to those previously unseen by the agent. For instance, a home service robot needs to adapt from one home to another but essentially accomplish the same set of tasks, or the robot learns new tasks in the same home. In the third setting, the agent has encountered neither the environment nor the task before. Intuitively, the second and third settings are much more challenging than the first one and appear to be intractable. Thus, the agent is allowed a very limited amount of learning data in the target environment and/or task, for instance from one demonstration, in order to transfer knowledge from its prior learning.
Figure 4.1 schematically illustrates the three settings. Several existing approaches have been proposed to address some of these settings [4, 5, 14, 115, 163, 213, 214]. A common strategy behind these works is to learn jointly through multi-task (reinforcement) learning [72, 166, 214]. Despite much progress, however, adaptation and transfer remain challenging problems in reinforcement learning, where a powerful learning agent easily overfits to the environment or the task it has encountered, leading to poor generalization to new ones [83, 262].
We propose a new approach to tackle this challenge. Our main idea is to learn a meta rule to synthesize policies whenever the agent encounters new environments or tasks. Concretely, the meta rule uses the embeddings of the environment and the task to compose a policy, which is parameterized as a linear combination of a policy basis. On the training data from seen pairs of environments and tasks, our algorithm learns the embeddings as well as the policy basis. For new environments or tasks, the agent learns only the corresponding embeddings while it holds the policy basis fixed. Since the embeddings are low-dimensional, a limited amount of training data in the new environment or task is often adequate to learn them well and compose the desired policy.

While deep reinforcement learning algorithms are capable of memorizing and thus entangling the representations of tasks and environments [262], we propose a disentanglement objective such
that the embeddings for the tasks and the environments can be extracted to maximize the efficacy
of the synthesized policy. Empirical studies demonstrate the importance of disentangling the
representations.
We evaluate our approach on GRIDWORLD, which we created, and on the photo-realistic robotic environment THOR [109]. We compare to several leading methods for transfer learning in a significant number of settings. The proposed approach noticeably outperforms most of them in the effectiveness of transfer and adaptation.
4.2 Related Work
Multi-task [242] and transfer learning [213] for reinforcement learning (RL) have been long
and extensively studied. Teh et al. [214] presented a distillation based method that transfers the
knowledge from task specific agents to a multi-task learning agent. Andreas et al. [5] combined the
option framework [209] and modular network [4], and presented an efficient multi-task learning
approach which shares sub-policies across policy sketches of different tasks. Schaul et al. [188]
encoded the goal state into value functions and showed its generalization to new goals. More
recently, Oh et al. [163] proposed to learn a meta controller along with a set of parameterized
policies to compose a policy that generalizes to unseen instructions. In contrast, we jointly
consider the tasks and environments which can be both atomic, as we learn their embeddings
without resorting to any external knowledge (e.g., text, attributes, etc.).
Several recent works [14, 43, 115, 268] factorize Q value functions with an environment-
agnostic state-action feature encoding function and task-specific embeddings. Our model is related
to this line of work in spirit. However, as opposed to learning the value functions, we directly
learn a factorized policy network with strengthened disentanglement between environments and
tasks. This allows us to easily generalize better to new environments or tasks, as shown in the
empirical studies.
4.3 Synthesized Policies with Better Generalization
We begin by introducing notations and stating the research problem formally. We then describe
the main idea behind our approach, followed by the details of each component of the approach.
4.3.1 Problem Statement and Main Idea
Problem statement. We follow the standard framework for reinforcement learning [208]. An
agent interacts with an environment by sequentially choosing actions over time and aims to
maximize its cumulative rewards. This learning process is abstractly described by a Markov
decision process with the following components: a space of the agent's states $s \in \mathcal{S}$, a space of possible actions $a \in \mathcal{A}$, an initial distribution of states $p_0(s)$, a stationary distribution characterizing how the state at time $t$ transitions to the next state at $(t+1)$: $p(s_{t+1} \mid s_t, a_t)$, and a reward function $r := r(s, a)$.

The agent's actions follow a policy $\pi(a \mid s) : \mathcal{S} \times \mathcal{A} \to [0, 1]$, defined as a conditional distribution $p(a \mid s)$. The goal of learning is to identify the optimal policy that maximizes the discounted cumulative reward $R = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $\gamma \in (0, 1]$ is a discount factor and the expectation is taken with respect to the randomness in state transitions and action selection. We denote by $p(s \mid s_0, t, \pi)$ the probability of being in state $s$ after transitioning $t$ time steps, starting from state $s_0$ and following the policy $\pi$. With it, we define the discounted state distribution as $\rho_\pi(s) = \sum_{s_0} \sum_{t=1}^{\infty} \gamma^{t-1} p_0(s_0)\, p(s \mid s_0, t, \pi)$.
In this chapter, we study how an agent learns to accomplish a variety of tasks in different environments. Let $\mathcal{E}$ and $\mathcal{T}$ denote the sets of environments and tasks, respectively. We assume both sets are finite, but it is possible to extend our approach to infinite ones. While the most basic approach is to learn an optimal policy for each pair $(\varepsilon, \tau)$ of environment and task, we are interested in generalizing to all combinations in $(\mathcal{E}, \mathcal{T})$, with interactive learning from only a limited subset of $(\varepsilon, \tau)$ pairs. Clearly, the smaller this subset is, the more desirable the agent's generalization capability is.

Main idea. In the rest of the chapter, we refer to the limited subset of pairs as seen pairs or training pairs and to the rest as unseen pairs or testing pairs. We assume that the agent does not have access to the unseen pairs to obtain any interaction data for learning the optimal policies directly. In computer vision, such problems have been intensively studied in the frameworks of unsupervised domain adaptation and zero-shot learning, for example, [22, 26, 61, 155]. There are $|\mathcal{E}| \times |\mathcal{T}|$ pairs in total; our goal is to learn from $O(|\mathcal{E}| + |\mathcal{T}|)$ training pairs and generalize to all of them.

Our main idea is to synthesize policies for the unseen pairs of environments and tasks. In particular, our agent learns two sets of embeddings: one for the environments and the other for the tasks. Moreover, the agent also learns how to compose policies using such embeddings. Note that learning both the embeddings and how to compose them happens on the training pairs. For the unseen pairs, the policies are constructed and used right away; if there is interaction data, the policies can be further fine-tuned. However, even without such interaction data, the synthesized policies still perform well.

To this end, we design our approach to jointly supply two components: a compositional structure for Synthesized Policies (SYNPO) built from environment and task embeddings, and a disentanglement learning objective for learning the embeddings. We refer to this entire framework as SYNPO and describe its details in what follows.
[Figure 4.2 diagram: a task descriptor and an environment descriptor are mapped to the task embedding e_τ and the environment embedding e_ε; together with state feature extraction and an action embedding, L2-normalized coefficient functions α(e_ε, e_τ) and β(e_ε, e_τ) combine the shared basis for policy prediction π_z(a|s) and reward prediction r̃_z(s, a).]
Figure 4.2: Overview of our proposed model. Given a task and an environment, the corresponding embeddings e_ε and e_τ are retrieved to compose the policy coefficients and the reward coefficients. These coefficients then linearly combine the shared basis to synthesize a policy (and a reward prediction) for the agent.
4.3.2 Policy Factorization and Composition
Given a pair $z = (\varepsilon, \tau)$ of an environment $\varepsilon$ and a task $\tau$, we denote by $e_\varepsilon$ and $e_\tau$ their embeddings, respectively. The policy is synthesized with a bilinear mapping

$$\pi_z(a \mid s) \propto \exp\left(\phi_s^\top U(e_\varepsilon, e_\tau)\, \psi_a + b\right) \tag{4.1}$$

where $b$ is a scalar bias, and $\phi_s$ and $\psi_a$ are featurized states and actions (for instance, image pixels or the feature representation of an image). The bilinear mapping given by the matrix $U$ is parameterized as a linear combination of $K$ basis matrices $\Theta_k$,

$$U(e_\varepsilon, e_\tau) = \sum_{k=1}^{K} \alpha_k(e_\varepsilon, e_\tau)\, \Theta_k. \tag{4.2}$$

Note that the combination coefficients depend on the specific pair of environment and task while the basis is shared across all pairs. They enable knowledge transfer from the seen pairs to unseen ones.

Analogously, during learning (to be explained in detail in a later section), we predict the rewards by modeling them with the same set of basis matrices but different combination coefficients:

$$\tilde{r}_z(s, a) = \phi_s^\top V(e_\varepsilon, e_\tau)\, \psi_a + b_r = \phi_s^\top \left(\sum_k \beta_k(e_\varepsilon, e_\tau)\, \Theta_k\right) \psi_a + b_r \tag{4.3}$$
where $b_r$ is a scalar bias. Note that similar strategies for learning to predict rewards along with learning the policies have also been studied in recent works [14, 87, 268]. We find this strategy helpful too (cf. details in our empirical studies in Section 4.4).
Figure 4.2 illustrates the model architecture described above. Specifically, we consider agents that take egocentric views of the environment, so a convolutional neural network is used to extract the state features $\phi_s$ (cf. the bottom left panel of Figure 4.2). The action features $\psi_a$ are learned as a look-up table. Other model parameters include the basis matrices $\{\Theta_k\}$, the embeddings $e_\varepsilon$ and $e_\tau$ in the look-up tables for the environments and the tasks, respectively, and the coefficient functions $\alpha_k(\cdot,\cdot)$ and $\beta_k(\cdot,\cdot)$ for synthesizing the policy and the reward predictor, respectively. The coefficient functions $\alpha_k(\cdot,\cdot)$ and $\beta_k(\cdot,\cdot)$ are parameterized with one-hidden-layer MLPs whose inputs are the concatenation of $e_\varepsilon$ and $e_\tau$.
4.3.3 Disentanglement of the Embeddings for Environments and Tasks
In SYNPO, both the embeddings and the bilinear mapping are to be learnt. In an alternative but equivalent form, the policies are formulated as

$$\pi_z(a \mid s) \propto \exp\left(\sum_k \alpha_k(e_\varepsilon, e_\tau)\, \phi_s^\top \Theta_k\, \psi_a + b\right). \tag{4.4}$$

As the defining coefficients $\alpha_k$ are parameterized by a neural network whose inputs and parameters are both optimized, we need to impose additional structure such that the learned embeddings facilitate transfer across environments or tasks. Otherwise, the learning could overfit to the seen pairs and treat each pair as a single unit, thus leading to poor generalization to unseen pairs.
To this end, we introduce discriminative losses to distinguish different environments or tasks through the agent's trajectories. Let $x = \{\phi_s^\top \Theta_k\, \psi_a\}_{k=1}^{K} \in \mathbb{R}^K$ be the state-action representation. For the agent interacting with an environment-task pair $z = (\varepsilon, \tau)$, we denote its trajectory as $\{x_1, x_2, \ldots, x_t, \ldots\}$. We argue that a good embedding (either $e_\varepsilon$ or $e_\tau$) ought to be able to tell from which environment or task the trajectory comes. In particular, we formulate this as a multi-way classification where we desire $x_t$ (on average) to be telltale of its environment $\varepsilon$ or task $\tau$:

$$\ell_\varepsilon := -\sum_t \log P(\varepsilon \mid x_t) \quad \text{with} \quad P(\varepsilon \mid x_t) \propto \exp\left(g(x_t)^\top e_\varepsilon\right) \tag{4.5}$$

$$\ell_\tau := -\sum_t \log P(\tau \mid x_t) \quad \text{with} \quad P(\tau \mid x_t) \propto \exp\left(h(x_t)^\top e_\tau\right) \tag{4.6}$$

where we use two nonlinear mapping functions ($g(\cdot)$ and $h(\cdot)$, parameterized by one-hidden-layer MLPs) to transform the state-action representation $x_t$ such that it retrieves $e_\varepsilon$ and $e_\tau$. These two functions are also learnt using the interaction data from the seen pairs.
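A minimal sketch of the disentanglement losses in Eqs. (4.5)-(4.6) follows, assuming g and h are one-hidden-layer MLPs mapping the state-action representation to the embedding dimension and env_emb/task_emb are the embedding look-up tables of the SynthesizedPolicy sketch above; names are assumptions.

```python
import torch.nn.functional as F

def disentanglement_losses(x_t, env_id, task_id, g, h, env_emb, task_emb):
    """Multi-way classification of which environment/task produced the state-action representations.
    x_t: (B, K) representations from trajectories of pair (env_id, task_id); env_id, task_id: (B,) long tensors."""
    env_logits = g(x_t) @ env_emb.weight.t()    # scores proportional to exp(g(x_t)^T e_eps)
    task_logits = h(x_t) @ task_emb.weight.t()  # scores proportional to exp(h(x_t)^T e_tau)
    return F.cross_entropy(env_logits, env_id), F.cross_entropy(task_logits, task_id)
```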
4.3.4 Policy Learning
Our approach (SYNPO) relies on the modeling assumption that the policies (and the reward prediction functions) are factorized along the axes of the environment and the task. This is a generic assumption and can be integrated with many reinforcement learning algorithms. Here we study its effectiveness on imitation learning (mostly) and also on reinforcement learning.

In imitation learning, we denote by $\pi^e_z$ the expert policy of combination $z$ and apply the simple strategy of "behavior cloning" with random perturbations to learn our model from the expert demonstrations [73]. We employ a cross-entropy loss for the policy as follows:

$$\ell_z := \mathbb{E}_{s \sim \rho_{\pi^e_z},\, a \sim \pi^e_z}\left[-\log \pi_z(a \mid s)\right] \tag{4.7}$$

An $\ell_2$ loss is used for learning the reward prediction function, $\ell_{rz} := \mathbb{E}_{s \sim \rho_{\pi^e_z},\, a \sim \pi^e_z} \|\tilde{r}_z(s, a) - r_z(s, a)\|^2$. Together with the disentanglement losses, they form the overall loss function

$$\mathcal{L} := \mathbb{E}_z\left[\ell_z + \lambda_1 \ell_{rz} + \lambda_2 \ell_\varepsilon + \lambda_3 \ell_\tau\right] \tag{4.8}$$

which is then optimized through experience replay, as shown in Algorithm 1 in Appendix B. We choose the values of the hyper-parameters $\lambda_i$ so that the contributions of the objectives are balanced. More details are presented in the Suppl. Materials.
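Combining the pieces, a single behavior-cloning update following Eq. (4.8) could look like the sketch below, reusing the SynthesizedPolicy and disentanglement_losses sketches above; the batch format, loss weights, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def synpo_training_step(model, g, h, batch, lambdas=(1.0, 1.0, 1.0)):
    """One imitation-learning step on expert (state, action, reward) tuples for a sampled (env, task) pair."""
    phi_s, psi_a, expert_a, reward, env_id, task_id = batch   # psi_a: (A, action_dim) action features
    logits, pred_reward = model(env_id, task_id, phi_s, psi_a)

    loss_pi = F.cross_entropy(logits, expert_a)               # Eq. (4.7): behavior cloning
    taken_reward = pred_reward.gather(1, expert_a.unsqueeze(1)).squeeze(1)
    loss_r = F.mse_loss(taken_reward, reward)                 # l2 reward-prediction loss
    # State-action representation {phi_s^T Theta_k psi_a} for the taken actions.
    x_t = torch.einsum('bd,kda,ba->bk', phi_s, model.basis, psi_a[expert_a])
    loss_env, loss_task = disentanglement_losses(x_t, env_id, task_id, g, h,
                                                 model.env_emb, model.task_emb)
    return loss_pi + lambdas[0] * loss_r + lambdas[1] * loss_env + lambdas[2] * loss_task
```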
4.3.5 Transfer to Unseen Environments and Tasks
Eq. (4.1) is used to synthesize a policy for any (ε, τ) pair, as long as the environment and the task (not necessarily as a pair) have appeared at least once in the training pairs. If, however, a new environment and/or a new task appears (corresponding to transfer setting 2 or 3 in Section 4.1), fine-tuning is required to extract their embeddings. To do so, we keep all components of our model fixed except the look-up tables (i.e., the embeddings) for the environment and/or the task; a minimal sketch of this procedure is given below. This effectively re-uses the policy composition rule and enables fast learning of the environment and/or task embeddings after seeing a small number of demonstrations. In the experiments, we find this works well even with a single demonstration.
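A minimal sketch of the fine-tuning setup for transfer settings 2 and 3, freezing everything except the embedding look-up tables (again building on the SynthesizedPolicy sketch above); the helper name and the optimizer choice are assumptions.

```python
import torch

def prepare_for_transfer(model, lr=1e-3):
    """Freeze all parameters except the environment/task embedding tables and return an optimizer
    that fine-tunes only those on the few available demonstrations."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = list(model.env_emb.parameters()) + list(model.task_emb.parameters())
    for p in trainable:
        p.requires_grad_(True)
    return torch.optim.Adam(trainable, lr=lr)
```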
4.4 Experiments
We validate our approach (SYNPO) with extensive experimental studies, comparing with several
baselines and state-of-the-art transfer learning methods.
Figure 4.3: From left to right: (a) Some sample mazes of our GRIDWORLD dataset. They are
similar in appearance but different in topology. Demonstrations of an agent’s egocentric views of
(b) GRIDWORLD and (c) THOR.
4.4.1 Experimental Setup
We experiment with two simulated environments¹: GRIDWORLD and THOR [109], in both of which the agent takes as input an egocentric view (cf. Figure 4.3). Please refer to the Suppl. Materials for more details about the state feature function φ_s used in these simulators.
GRIDWORLD and tasks. We design twenty 16 × 16 grid-aligned mazes, some of which are visualized in Figure 4.3(a). The mazes are similar in appearance but differ from each other in topology. There are five colored blocks serving as "treasures", and the agent's goal is to collect the treasures in pre-specified orders, e.g., "Pick up Red and then pick up Blue". At each time step, the "egocentric" view observed by the agent consists of the agent's surroundings within a 3 × 3 window and the treasures' locations. At each run, the locations of the agent and the treasures are randomized. We consider twenty tasks in each environment, resulting in |E| × |T| = 400 (ε, τ) pairs in total. In transfer setting 1 (cf. Figure 4.1(a)), we randomly choose 144 pairs as the training set under the constraint that each environment appears at least once, and so does each task. The remaining 256 pairs are used for testing. For transfer settings 2 and 3 (cf. Figure 4.1(b) and (c)), we postpone the detailed setups to Section 4.4.2.2.
THOR [109] and tasks. We also test our method on THOR, a challenging 3D simulator where
the agent is placed in indoor visual scenes. The tasks are to search and act on objects, e.g., “Put
the cabbage to the fridge”. Different from GRIDWORLD, the objects’ locations are unknown so
the agent has to search for the objects of interest by its understanding of the visual scene (cf.
Figure 4.3(c)). There are 7 actions in total (look up, look down, turn left, turn right, move forward,
open/close, pick up/put down). We run experiments with 19 scenes 21 tasks in this simulator.
¹ The implementations of the two simulated environments are available at https://www.github.com/sha-lab/gridworld and https://www.github.com/sha-lab/thor, respectively.
4.4.1.1 Evaluation
We evaluate the agent’s performance by the averaged success rate (AvgSR.) for accomplishing the
tasks, limiting the maximum trajectory length to 300 steps. For the results reported in numbers
(e.g., Tables 4.1), we run 100 rounds of experiments for each (",) pair by randomizing the agent’s
starting point and the treasures’ locations. To plot the convergence curves (e.g., Figure 4.4), we
sample 100 (",) combinations and run one round of experiment for each to save computation
time. We train our algorithms under 3 random seeds and report the mean and standard deviation
(std).
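For clarity, the evaluation amounts to the following sketch, where rollout is an assumed helper that runs one episode and returns whether the task was accomplished within the step limit.

```python
import numpy as np

def average_success_rate(policy, env_task_pairs, rollout, num_rounds=100, max_steps=300):
    """Average binary success over num_rounds randomized episodes per (environment, task) pair."""
    successes = []
    for env, task in env_task_pairs:
        for _ in range(num_rounds):
            successes.append(float(rollout(policy, env, task, max_steps)))
    return float(np.mean(successes))
```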
Competing methods. We compare our approach (SYNPO) with the following baselines and
competing methods. Note that our problem setup is new, so we have to adapt the competing
methods, which were proposed for other scenarios, to fit ours.
• MLP. The policy network is a multilayer perceptron whose input concatenates the state features and the environment and task embeddings. We train this baseline using the proposed losses for our approach, including the disentanglement losses ℓ_ε and ℓ_τ; it performs worse without them.
• Successor Feature (SF). We learn the successor feature model [14] by Q-imitation learning for
fair comparison. We strictly follow [115] to set up the learning objectives. The key difference of
SF from our approach is its lack of capability in capturing the environmental priors.
• Module Network (ModuleNet). We also implement a module network following [44]. Here we train an environment-specific module for each environment and a task-specific module for each task. The policy for a given (ε, τ) pair is assembled by combining the corresponding environment module and task module.
• Multi-Task Reinforcement Learning (MTL). This is a degenerate version of our method, in which we ignore the distinctions among environments. We simply replace the environment embeddings with zeros in the coefficient functions. The disentanglement loss on the task embeddings is still used since it leads to better performance than otherwise.
Please refer to the Appendix for more experimental details, including all twenty GRIDWORLD mazes, how we configure the rewards, the optimization techniques, the feature extraction for the states, and our implementation of the baseline methods.
4.4.2 Experimental Results on GRIDWORLD
We first report results on the adaptation and transfer learning setting 1, as described in Section 4.1 and Figure 4.1(a). There, the agent acts upon a new pair of environment and task, both of which it has encountered during training, but not as the same (ε, τ) pair. The goal is to learn from (ε, τ) pairs that are as sparse as possible among all the combinations and yet still transfer successfully.
[Figure 4.4 plots: average success rate vs. training iteration for MLP, MTL, ModuleNet, SF, and SynPo; panel (a) AvgSR over time on SEEN pairs, panel (b) AvgSR over time on UNSEEN pairs.]
Figure 4.4: On GRIDWORLD: averaged success rate (AvgSR) on SEEN pairs and UNSEEN pairs, respectively. Results are reported with |E| = 20 and |T| = 20. We report the mean and std over 3 training random seeds.
[Figure 4.5 plots: (a) AvgSR on seen and unseen pairs vs. the ratio of seen to total pairs; (b) success rate on the unseen (test) set vs. reinforcement learning steps for MLP, MTL, and SynPo.]
Figure 4.5: (a) Transfer learning performance (in AvgSR.) with respect to the ratio # SEEN pairs / # TOTAL pairs, with |E| = 10 and |T| = 10. (b) Reinforcement learning performance on unseen pairs for different approaches (with PPO [192]). MLP overfits, MTL improves slightly, and SYNPO achieves 96.16% AvgSR.
Table 4.1: Comparison of methods on GRIDWORLD with a SEEN/UNSEEN ratio of 144/256.
Method             SF           ModuleNet       MLP           MTL           SYNPO
AvgSR. (SEEN)      0.0±0.0%     50.9±33.8%      69.0±2.0%     64.1±1.2%     83.3±0.5%
AvgSR. (UNSEEN)    0.0±0.0%     30.4±20.1%      66.1±2.6%     41.5±1.4%     82.1±1.5%
4.4.2.1 Transfer to Previously Encountered Environments and Tasks
Main results. Table 4.1 and Figure 4.4 show the success rates and convergence curves, respectively, of our approach and the competing methods, averaged over the seen and unseen (ε, τ) pairs.
Table 4.2: Performance of transfer learning in settings 2 and 3 on GRIDWORLD.
                         Cross Pair          Cross Pair
Setting      Method      (Q's ε, P's τ)      (P's ε, Q's τ)      Q Pairs
Setting 2    MLP         13.8%               20.7%               6.3%
             SYNPO       50.5%               21.5%               13.5%
Setting 3    MLP         14.6%               18.3%               7.2%
             SYNPO       42.7%               19.4%               12.9%
SYNPO consistently outperforms the others in terms of both convergence and final performance, by a significant margin. On the seen split, MTL and MLP have similar performance, while MTL performs worse than MLP on the unseen split (i.e., in terms of generalization performance), possibly because it treats all environments the same.

We design an extreme scenario to further challenge the environment-agnostic methods (e.g., MTL). We reduce the window size of the agent's view to one, so the agent sees only the cell it resides in and the treasures' locations, and nothing else. As a result, MTL suffers severely, MLP performs moderately well, and SYNPO outperforms both significantly (unseen AvgSR: MTL = 6.1%, MLP = 66.1%, SYNPO = 76.8%). We conjecture that the environment information embodied in the states is crucial for the agent to be aware of, and to generalize across, distinct environments. More discussion is deferred to the Appendix.
How many seen (ε, τ) pairs do we need to transfer well? Figure 4.5(a) shows that, not surprisingly, the transfer learning performance increases as the number of seen pairs increases. The rate of improvement slows down after the seen/total ratio reaches 0.4. In other words, when there is a limited budget, our approach enables the agent to learn from 40% of all possible (ε, τ) pairs and yet generalize well across tasks and environments.
Does reinforcement learning help transfer? Beyond imitation learning, we further study SYNPO with reinforcement learning (RL) under the same transfer learning setting. Specifically, we use PPO [192] to fine-tune the three top-performing algorithms on GRIDWORLD. The results, averaged over 3 random seeds, are shown in Figure 4.5(b). We find that RL fine-tuning improves the transfer performance of all three algorithms. In general, MLP suffers from over-fitting, MTL improves moderately yet with a significant gap to the best result, and SYNPO achieves the best AvgSR of 96.16%.
Ablation studies.
Table 4.3: Comparison of methods on THOR with a SEEN/UNSEEN ratio of 144/199.
Method             ModuleNet    MLP      MTL      SYNPO
AvgSR. (SEEN)      51.5%        47.5%    52.2%    55.6%
AvgSR. (UNSEEN)    14.4%        25.8%    33.3%    35.4%
4.4.2.2 Transfer to Previously Unseen Environments or Tasks
Now we investigate how effectively one can schedule transfer from seen environments and tasks to unseen ones, i.e., settings 2 and 3 described in Section 4.1 and Figure 4.1(b) and (c). The seen pairs (denoted by P) are constructed from ten environments and ten tasks; the remaining ten environments and ten tasks are unseen (denoted by Q). We then have two settings of transfer learning.

One is to transfer to pairs which cross the seen set P and the unseen set Q. This corresponds to setting 2, as the embeddings for either the unseen tasks or the unseen environments need to be learnt, but not both. Once these embeddings are learnt, we use them to synthesize policies for the test (ε, τ) pairs. This mimics the style of "incremental learning of small pieces and integrating knowledge later".

The other is transfer setting 3. The agent learns policies by learning embeddings for the tasks and environments of the unseen set Q and then composing, as described in Section 4.3.5. Using the embeddings from P and Q, we can synthesize policies for any (ε, τ) pair. This mimics the style of "learning in giant jumps and connecting dots".
Main results. Table 4.2 contrasts the results of the two transfer learning settings. Clearly, setting 2 attains stronger performance as it "incrementally learns" the embeddings of either the tasks or the environments but not both, while setting 3 requires learning both simultaneously. It is interesting to see that this result aligns with how effectively humans learn.

Figure 4.6 visualizes the results, whose rows are indexed by tasks and columns by environments. The seen pairs in P are in the upper-left quadrant and the unseen set Q is in the bottom-right. We refer readers to the Suppl. Materials for more details and discussion of the results.
4.4.3 Experimental Results on THOR
Main results. The results on the THOR simulator are shown in Table 4.3, where we report our approach as well as the top-performing methods from GRIDWORLD. Our SYNPO significantly outperforms the three competing methods on both seen pairs and unseen pairs. Moreover, our approach retains the highest success rate when going from seen to unseen pairs, indicating that it is less prone to overfitting than the other methods. More details are included in the Suppl. Materials.
[Figure 4.6 grids: per-pair AvgSR for 20 environments (Env_0 through Env_19) and 20 two-treasure tasks (e.g., ('R', 'B'), ('B', 'G'), ...), shown for (a) transfer setting 2 and (b) transfer setting 3.]
Figure 4.6: Transfer results of settings 2 and 3. AvgSRs are marked in the grid (see Suppl.
Materials for more visually discernible plots). The tasks and environments in the purple cells
are from the unseenQ set and the red cells correspond to the rest. Darker color means better
performance. It shows that cross-task transfer is easier than cross-environment.
51
Chapter 5
Babywalk Agent for Generalization across Task Horizons
We study how an agent can navigate long paths when learning from a corpus that consists of
shorter ones. We show that existing state-of-the-art agents do not generalize well. To this end, we
propose BabyWalk, a new vision-and-language navigation (VLN) agent learned to navigate by
decomposing long instructions into shorter ones (BabySteps) and completing them sequentially.
A special design memory buffer is used by the agent to turn its past experiences into contexts
for future steps. The learning process is composed of two phases. In the first phase, the agent
uses imitation learning from demonstration to accomplish BabySteps. In the second phase, the
agent uses curriculum-based reinforcement learning to maximize rewards on navigation tasks
with increasingly longer instructions. We create two new benchmark datasets (of long navigation
tasks) and use them in conjunction with existing ones to examine BabyWalk’s generalization
ability. Empirical results show that BabyWalk achieves state-of-the-art results on several metrics,
in particular, is able to follow long instructions better. The codes and the datasets are released on
our project pagehttps://github.com/Sha-Lab/babywalk.
5.1 Motivation
Autonomous agents such as household robots need to interact with the physical world in multiple
modalities. As an example, in vision-and-language navigation (VLN) [3], the agent moves around
in a photo-realistic simulated environment [21] by following a sequence of natural language
instructions. To infer its whereabouts so as to decide its moves, the agent infuses its visual
perception, its trajectory and the instructions [3, 55, 145, 146, 236].
Arguably, the ability to understand and follow the instructions is one of the most crucial skills
to acquire by VLN agents. Jain et al. [88] shows that the VLN agents trained on the originally
proposed dataset ROOM2ROOM (i.e. R2R thereafter) do not follow the instructions, despite having
achieved high success rates of reaching the navigation goals. They proposed two remedies: a new
dataset ROOM4ROOM (or R4R) that doubles the path lengths in the R2R, and a new evaluation
52
R2R R6R R8R
Datasets
0
10
20
30
40
SDTW(%)
In-domain RCM
RCM(GOAL)
RCM(FIDELITY)
BabyWalk
Figure 5.1: Performance of various VLN agents on generalizing from shorter navigation tasks
to longer ones. The vertical axis is the newly proposed path-following metric SDTW [148], the
higher the better. BABYWALK generalizes better than other approaches across different lengths of
navigation tasks. Meanwhile, it get very close to the performances of the in-domain agents (the
dashed line). Please refer to the texts for details.
metric Coverage weighted by Length Score (CLS) that measures more closely whether the ground-
truth paths are followed. They showed optimizing the fidelity of following instructions leads to
agents with desirable behavior. Moreover, the long lengths in R4R are informative in identifying
agents who score higher in such fidelity measure.
In this chapter, we investigate another crucial aspect of following the instructions: can a
VLN agent generalize to following longer instructions by learning from shorter ones? This
aspect has important implication to real-world applications as collecting annotated long sequences
of instructions and training on them can be costly. Thus, it is highly desirable to have this
generalization ability. After all, it seems that humans can achieve this effortlessly
1
.
To this end, we have created several datasets of longer navigation tasks, inspired by R4R [88].
We trained VLN agents on R4R and use the agents to navigate in ROOM6ROOM (i.e., R6R) and
ROOM8ROOM (i.e., R8R). We contrast to the performance of the agents which are trained on those
datasets directly (“in-domain”). The results are shown in Fig. 5.1.
Our findings are that the agents trained on R4R (denoted by the purple and the pink solid
lines) perform significantly worse than the in-domain agents (denoted the light blue dashed line).
Also interestingly, when such out-of-domain agents are applied to the dataset R2R with shorter
navigation tasks, they also perform significantly worse than the corresponding in-domain agent
despite R4R containing many navigation paths from R2R. Note that the agent trained to optimize
the aforementioned fidelity measure (RCM(fidelity)) performs better than the agent trained to reach
1
Anecdotally, we do not have to learn from long navigation experiences. Instead, we extrapolate from our experiences
of learning to navigate in shorter distances or smaller spaces (perhaps a skill we learn when we were babies or kids).
53
the goal only (RCM(goal)), supporting the claim by Jain et al. [88] that following instructions is
a more meaningful objective than merely goal-reaching. Yet, the fidelity measure itself is not
enough to enable the agent to transfer well to longer navigation tasks.
To address these deficiencies, we propose a new approach for VLN. The agent follows a long
navigation instruction by decomposing the instruction into shorter ones (“micro-instructions”, i.e.,
BABY-STEPs), each of which corresponds to an intermediate goal/task to be executed sequentially.
To this end, the agent has three components: (a) a memory buffer that summarizes the agent’s
experiences so that the agent can use them to provide the context for executing the next BABY-
STEP. (b) the agent first learns from human experts in “bite-size”. Instead of trying to imitate to
achieve the ground-truth paths as a whole, the agent is given the pairs of a BABY-STEP and the
corresponding human expert path so that it can learn policies of actions from shorter instructions.
(c) In the second stage of learning, the agent refines the policies by curriculum-based reinforcement
learning, where the agent is given increasingly longer navigation tasks to achieve. In particular,
this curriculum design reflects our desiderata that the agent optimized on shorter tasks should
generalize well to slightly longer tasks and then much longer ones.
While we do not claim that our approach faithfully simulates human learning of navigation, the
design is loosely inspired by it. We name our approach BABYWALK and refer to the intermediate
navigation goals in (b) as BABY-STEPs. Fig. 5.1 shows that BABYWALK (the red solid line)
significantly outperforms other approaches and despite being out-of-domain, it even reach the
performance of in-domain agents on R6R and R8R.
The effectiveness of BABYWALK also leads to an interesting twist. As mentioned before, one
of the most important observations by Jain et al. [88] is that the original VLN dataset R2R fails
to reveal the difference between optimizing goal-reaching (thus ignoring the instructions) and
optimizing the fidelity (thus adhering to the instructions). Yet, leaving details to section 5.5, we
have also shown that applying BABYWALK to R2R can lead to equally strong performance on
generalizing from shorter instructions (i.e., R2R) to longer ones.
In summary, we have demonstrated empirically that the current VLN agents are ineffective in
generalizing from learning on shorter navigation tasks to longer ones. We propose a new approach
in addressing this important problem. We validate the approach with extensive benchmarks,
including ablation studies to identify the effectiveness of various components in our approach.
5.2 Related Work
Vision-and-Language Navigation (VLN) Recent works [3, 31, 88, 160, 215] extend the early
works of instruction based navigation [30, 101, 151] to photo-realistic simulated environments.
For instance, Anderson et al. [3] proposed to learn a multi-modal Sequence-to-Sequence agent
54
(Seq2Seq) by imitating expert demonstration. Fried et al. [55] developed a method that augments
the paired instruction and demonstration data using a learned speaker model, to teach the navigation
agent to better understand instructions. Wang et al. [236] further applies reinforcement learning
(RL) and self-imitation learning to improve navigation agents. Ma et al. [145, 146] designed
models that track the execution progress for a sequence of instructions using soft-attention.
Different from them, we focus on transferring an agent’s performances on shorter tasks to
longer ones. This leads to designs and learning schemes that improve generalization across datasets.
We use a memory buffer to prevent mistakes in the distant past from exerting strong influence on
the present. In imitation learning stage, we solve fine-grained subtasks (BABY-STEPs) instead
of asking the agent to learn the navigation trajectory as a whole. We then use curriculum-based
reinforcement learning by asking the agent to follow increasingly longer instructions.
Transfer and Cross-domain Adaptation There have been a large body of works in transfer
learning and generalization across tasks and environments in both computer vision and reinforce-
ment learning [6, 80, 163, 200, 268, 270]. Of particular relevance is the recent work on adapting
VLN agents to changes in visual environments [82, 212]. To our best knowledge, this work is the
first to focus on adapting to a simple aspect of language variability — the length of the instructions.
Curriculum Learning Since proposed in [16], curriculum learning was successfully used in
a range of tasks: training robots for goal reaching [54], visual question answering [150], image
generation [94]. To our best knowledge, this work is the first to apply the idea to learning in VLN.
5.3 Preliminary
In the VLN task, the agent receives a natural language instructionX composed of a sequence of
sentences. We model the agent with an Markov Decision Process (MDP) which is defined as a
tuple of a state spaceS, an action spaceA, an initial states
1
, a stationary transition dynamics
:SA!S, a reward functionr :SA!R, and the discount factor
for weighting future
rewards. The agent acts according to a policy :SA! 0[R
+
. The state and action spaces
are defined the same as in [55] (cf. §5.4.4 for details).
For eachX, the sequence of the pairs (s;a) is called a trajectoryY =
s
1
;a
1
;:::;s
jYj
;a
jYj
wherejj denotes the length of the sequence or the size of a set. We use ^ a to denote an action
taken by the agent according to its policy. Hence,
^
Y denotes the agent’s trajectory, whileY (ora)
denotes the human expert’s trajectory (or action). The agent is given training examples of (X;Y )
to optimize its policy to maximize its expected rewards.
55
Env
⋮
" #
$
%
$
̂ '
(
) *
(
+
(
Memory Buffer
Χ
Instruction
segmentation
(-th BABY-STEP)
+ ) *
+ ) *
+ ) *
(
(
(
,
)
,
)
,
)
1
23((456
7 8
BABYWALK
Policy 9
- 1 − - 1 −
1 1
2
2
Figure 5.2: The BABYWALK agent has a memory buffer storing its past experiences of instruc-
tionsx
m
, and its trajectory ^ y
m
. When a new BABY-STEPx
m
is presented, the agent retrieves
from the memory a summary of its experiences as the history context. It takes actions conditioning
on the context (as well as its states
t
and the previous action ^ a
t
). Upon finishing following the
instruction. the trajectory ^ y
m
is then sent to the memory to be remembered.
In our work, we introduce additional notations in the following. We will segment a (long)
instructionX into multiple shorter sequences of sentencesfx
m
jm = 1; 2; ;Mg, to which
we refer as BABY-STEPs. Eachx
m
is interpreted as a micro-instruction that corresponds to a
trajectory by the agent ^ y
m
and is aligned with a part of the human expert’s trajectory, denoted as
y
m
. While the alignment is not available in existing datasets for VLN, we describe how to obtain
them in a later section. Throughout the paper, we also freely interexchange the term “following
themth micro-instruction”, “executing the BABY-STEPx
m
”, or “complete themth subtask”.
We uset2 [1;jYj] to denote the (discrete) time steps the agent takes actions. Additionally,
when the agent followsx
m
, for convenience, we sometimes uset
m
2 [1;j^ y
m
j] to index the time
steps, instead of the “global time”t =t
m
+
P
m1
i=1
j^ y
i
j.
5.4 Learning Policy that Takes Babystep
We describe in detail the 3 key elements in the design of our navigation agent: (i) a memory buffer
for storing and recalling past experiences to provide contexts for the current navigation instruction
(§5.4.1); (ii) an imitation-learning stage of navigating with short instructions to accomplish a
single BABY-STEP (§5.4.2.1); (iii) a curriculum-based reinforcement learning phase where the
agent learns with increasingly longer instructions (i.e. multiple BABY-STEPs) (§5.4.2.2). We
describe new benchmarks created for learning and evaluation and key implementation details in
§5.4.3 and §5.4.4.
56
5.4.1 The BABYWALK Agent
The basic operating model of our navigation agent BABYWALK is to follow a “micro instruction”
x
m
(i.e., a short sequence of instructions, to which we also refer as BABY-STEP), conditioning
on the context ^ z
m
and to output a trajectory ^ y
m
. A schematic diagram is shown in Fig. 5.2. Of
particularly different from previous approaches is the introduction of a novel memory module. We
assume the BABY-STEPs are given in the training and inference time – §5.4.3 explains how to
obtain them if not given a prior (Readers can directly move to that section and return to this part
afterwards). The left of the Fig. 5.3 gives an example of those micro-instructions.
Context. The context is a summary of the past experiences of the agent, namely the previous
(m 1) mini-instructions and trajectories:
^ z
m
=g
f
SUMMARY
(x
1
; ;x
m1
);
f
SUMMARY
(^ y
1
; ; ^ y
m1
)
(5.1)
where the function g is implemented with a multi-layer perceptron. The summary function
f
SUMMARY
is explained in below.
Summary. To map variable-length sequences (such as the trajectory and the instructions) to a
single vector, we can use various mechanisms such as LSTM. We reported an ablation study on
this in §5.5.3. In the following, we describe the “forgetting” one that weighs more heavily towards
the most recent experiences and performs the best empirically.
f
SUMMARY
(x
1
; ;x
m1
) =
m1
X
i=1
i
u(x
i
) (5.2)
f
SUMMARY
(^ y
1
; ; ^ y
m1
) =
m1
X
i=1
i
v(^ y
i
) (5.3)
where the weights are normalized to 1 and inverse proportional to how fari is fromm,
i
/ exp
!(m 1i)
(5.4)
is a hyper-parameter (we set to 1=2) and!() is a monotonically nondecreasing function and we
simply choose the identity function.
Note that, we summarize over representations of “micro-instructions” (x
m
) and experiences
of executing those micro-instructions ^ y
m
. The two encodersu() andv() are described in §5.4.4.
They are essentially the summaries of “low-level” details, i.e., representations of a sequence of
57
Instruction
of sub-tasks
Warmup: IL
Clone expert’s
behavior to complete
single sub-tasks
RL curriculum
Lecture #1
No expert demo, learn
from external rewards
1 sub-task given
history context
⋮
Lecture #t
t consecutive sub-tasks
given history context
⋮
Lecture #T
The whole task
exit the room then go
straight and turn left.
go straight until you
pass an eye chart
picture frame on the
left wall then wait
there.
go straight. pass the
bar with the stools.
walk straight until
you get to a table
with chairs then
stop.
#
$
%
&
&
%
$
#
Lecture #1
Reward
#
, &
&
% %
$
$
)
#
̂ #
,
,
Baby
Walk
Reward
)
$
) $
)
#
Lecture #2
&
&
% %
$
,
,
,
#
̂ #
$
&
&
%
%
,
,
̂
$
Baby
Walk
Baby
Walk
Decomposition of a navigation task RL curriculum design Pipeline
( #)
(
#
)
( $)
Figure 5.3: Two-phase learning by BABYWALK. (Left) An example instruction-trajectory
pair from the R4R dataset is shown. The long instruction is segmented into four BABY-STEP
instructions. We use those BABY-STEPs for imitation learning (§5.4.2.1) (Right) Curriculum-based
RL. The BABYWALK agent warm-starts from the imitation learning policy, and incrementally
learns to handle longer tasks by executing consecutive BABY-STEPs and getting feedback from
external rewards (c.f . §5.4.2.2). We illustrate two initial RL lectures using the left example.
words, or a sequence of states and actions. While existing work often directly summarizes all the
low-level details, we have found that the current form of “hierarchical” summarizing (i.e., first
summarizing each BABY-STEP, then summarizing all previous BABY-STEPs) performs better.
Policy. The agent takes actions, conditioning on the context ^ z
m
, and the current instructionx
m
:
^ a
t
js
t
; ^ a
t1
;u(x
m
); ^ z
m
(5.5)
where the policy is implemented with a LSTM with the same cross-modal attention between visual
states and languages as in [55].
5.4.2 Learning of the BABYWALK Agent
The agent learns in two phases. In the first one, imitation learning is used where the agent learns
to execute BABY-STEPs accurately. In the second one, the agent learns to execute successively
longer tasks from a designed curriculum.
5.4.2.1 Imitation Learning
BABY-STEPs are shorter navigation tasks. With themth instructionx
m
, the agent is asked to
follow the instruction so that its trajectory matches the human expert’sy
m
. To assist the learning,
58
the context is computed from the human expert trajectory up to themth BABY-STEP (i.e., in
eq. (5.1), ^ ys are replaced withys). We maximize the objective
` =
M
X
m=1
jymj
X
tm=1
log
a
tm
js
tm
;a
tm1
;u(x
m
);z
m
We emphasize here each BABY-STEP is treated independently of the others in this learning regime.
Each time a BABY-STEP is to be executed, we “preset” the agent in the human expert’s context and
the last visited state. We follow existing literature [3, 55] and use student-forcing based imitation
learning, which uses agent’s predicted action instead of the expert action for the trajectory rollout.
5.4.2.2 Curriculum Reinforcement Learning
We want the agent to be able to execute multiple consecutive BABY-STEPs and optimize its
performance on following longer navigation instructions (instead of the cross-entropy losses from
the imitation learning). However, there is a discrepancy between our goal of training the agent
to cope with the uncertainty in a long instruction and the imitation learning agent’s ability in
accomplishing shorter tasks given the human annotated history. Thus it is challenging to directly
optimize the agent with a typical RL learning procedure, even the imitation learning might have
provided a good initialization for the policy, see our ablation study in §5.5.3.
Inspired by the curriculum learning strategy [16], we design an incremental learning process
that the agent is presented with a curriculum of increasingly longer navigation tasks. Fig. 5.3
illustrates this idea with two “lectures”. Given a long navigation instructionX with M BABY-
STEPs, for the kth lecture, the agent is given all the human expert’s trajectory up to but not
including the (Mk + 1)th BABY-STEP, as well as the history contextz
Mk+1
. The agent is then
asked to execute thekth micro-instructions fromx
Mk+1
tox
M
using reinforcement learning to
produce its trajectory that optimizes a task related metric, for instance the fidelity metric measuring
how faithful the agent follows the instructions.
As we increasek from 1 toM, the agent faces the challenge of navigating longer and longer
tasks with reinforcement learning. However, the agent only needs to improve its skills from its
prior exposure to shorter ones. Our ablation studies show this is indeed a highly effective strategy.
5.4.3 New Datasets for Evaluation & Learning
To our best knowledge, this is the first work studying how well VLN agents generalize to long
navigation tasks. To this end, we create the following datasets in the same style as in [88].
59
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
Instruction Length (words)
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
Ratio
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Ground-truth Path Length (steps)
R2R R4R R6R R8R
Figure 5.4: The distribution of lengths of the
instructions and trajectories.
R2R R4R R6R R8R
Train seen instr. 14,039 233,532 89,632 94,731
Val unseen instr. 2,349 45,234 35,777 43,273
Avg instr. length 29.4 58.4 91.2 121.6
Avg # BABY-STEPs 1.8 3.6 5.6 7.4
Table 5.1: Statistics of datasets used for VLN
learning and evaluation
ROOM6ROOM and ROOM8ROOM We concatenate the trajectories in the training as well as
the validation unseen split of the ROOM2ROOM dataset for 3 times and 4 times respectively, thus
extending the lengths of navigation tasks to 6 rooms and 8 rooms. To join, the end of the former
trajectory must be within 0.5 meter with the beginning of the later trajectory. Table 5.1 and Fig. 5.4
contrast the different datasets in the # of instructions, the average length (in words) of instructions
and how the distributions vary.
Table 5.1 summarizes the descriptive statistics of BABY-STEPs across all datasets we used.
The datasets and the segmentation/alignments are made publically available
2
.
5.4.4 Key Implementation Details
In the following, we describe key information for research reproducibility.
States and Actions We follow [55] to set up the states as the visual features (i.e. ResNet-152
features [67]) from the agent-centric panoramic views in 12 headings 3 elevations with 30
degree intervals. Likewise, we use the same panoramic action space.
Identifying BABY-STEPs Our learning approach requires an agent to follow micro-instructions
(i.e., the BABY-STEPs). Existing datasets [3, 31, 88] do not provide fine-grained segmentations
of long instructions. Therefore, we use a template matching approach to aggregate consecutive
sentences into BABY-STEPs. First, we extract the noun phrase using POS tagging. Then, we
employs heuristic rules to chunk a long instruction into shorter segments according to punctuation
and landmark phrase (i.e., words for concrete objects). We document the details in the Appendix.
Aligning BABY-STEPs with Expert Trajectory Without extra annotation, we propose a method
to approximately chunk original expert trajectories into sub-trajectories that align with the BABY-
STEPs. This is important for imitation learning at the micro-instruction level (§5.4.2.1). Specifi-
cally, we learn a multi-label visual landmark classifier to identify concrete objects from the states
2
Available athttps://github.com/Sha-Lab/babywalk
60
along expert trajectories by using the landmark phrases extracted from the their instructions as
weak supervision. For each trajectory-instruction pair, we then extract the visual landmarks of ev-
ery state as well as the landmark phrases in BABY-STEP instructions. Next, we perform a dynamic
programming procedure to segment the expert trajectories by aligning the visual landmarks and
landmark phrases, using the confidence scores of the multi-label visual landmark classifier to form
the function.
Encoders and Embeddings The encoder u() for the (micro)instructions is a LSTM. The
encoder for the trajectoryy contains two separate Bi-LSTMs, one for the states
t
and the other for
the actiona
t
. The outputs of the two Bi-LSTMs are then concatenated to form the embedding
functionv().
Learning Policy with Reinforcement Learning In the second phase of learning, BABYWALK
uses RL to learn a policy that maximizes the fidelity-oriented rewards (CLS) proposed by Jain et al.
[88]. We use policy gradient as the optimizer [210]. Meanwhile, we set the maximum number of
lectures in curriculum RL to be 4, which is studied in Section 5.5.3.
5.5 Experiments
In-domain Generalization to other datasets
Setting R4R! R4R R4R! R2R R4R! R6R R4R! R8R
Metrics SR" CLS" SDTW" SR" CLS" SDTW" SR" CLS" SDTW" SR" CLS" SDTW"
SEQ2SEQ 25.7 20.7 9.0 16.3 27.1 10.6 14.4 17.7 4.6 20.7 15.0 4.7
SF
+
24.9 23.6 9.2 22.5 29.5 14.8 15.5 20.4 5.2 21.6 17.2 5.0
RCM(GOAL)
+
28.7 36.3 13.2 25.9 44.2 20.2 19.3 31.8 7.3 22.8 27.6 5.1
RCM(FIDELITY)
+
24.7 39.2 13.7 29.1 34.3 18.3 20.5 38.3 7.9 20.9 34.6 6.1
REGRETFUL
+?
30.1 34.1 13.5 22.8 32.6 13.4 18.0 31.7 7.5 18.7 29.3 5.6
FAST
+?
36.2 34.0 15.5 25.1 33.9 14.2 22.1 31.5 7.7 27.7 29.6 6.3
BABYWALK 29.6 47.8 18.1 35.2 48.5 27.2 26.4 44.9 13.1 26.3 44.7 11.5
BABYWALK
+
27.3 49.4 17.3 34.1 50.4 27.8 25.5 47.2 13.6 23.1 46.0 11.1
Table 5.2: VLN agents trained on the R4R dataset and evaluated on the unseen portion of the
R4R (in-domain) and the other 3 out-of-the-domain datasets: R2R, R6R and R8R with different
distributions in instruction length. (
+
: pre-trained with data augmentation.
?
: reimplemented or
adapted from the original authors’ public codes).
We describe the experimental setup (§5.5.1),followed by the main results in §5.5.2 where we
show the proposed BABYWALK agent attains competitive results on both the in-domain dataset but
also generalizing to out-of-the-domain datasets with varying lengths of navigation tasks. We report
61
results from various ablation studies in §5.5.3. While we primarily focus on the ROOM4ROOM
dataset, we re-analyze the original ROOM2ROOM dataset in §5.5.4 and were surprised to find out
the agents trained on it can generalize.
5.5.1 Experimental Setups
Datasets We conduct empirical studies on the existing datasets ROOM2ROOM and ROOM4ROOM
[3, 88], and the two newly created benchmark datasets ROOM6ROOM and ROOM8ROOM, de-
scribed in §5.4.3. Table 5.1 and Fig. 5.4 contrast their differences.
Evaluation Metrics We adopt the following metrics: Success Rate (SR) that measures the
average rate of the agent stopping within a specified distance near the goal location [3], Coverage
weighted by Length Score (CLS) [88] that measures the fidelity of the agent’s path to the reference,
weighted by the length score, and the newly proposed Success rate weighted normalized Dynamic
Time Warping (SDTW) that measures in more fine-grained details, the spatio-temporal similarity of
the paths by the agent and the human expert, weighted by the success rate [148]. Both CLS and
SDTW measure explicitly the agent’s ability to follow instructions and in particular, it was shown
that SDTW corresponds to human preferences the most. We report results in other metrics in the
appendix.
Agents to Compare to Whenever possible, for all agents we compare to, we either re-run,
reimplement or adapt publicly available codes from their corresponding authors with their provided
instructions to ensure a fair comparison. We also “sanity check” by ensuring the results from our
implementation and adaptation replicate and are comparable to the reported ones in the literature.
We compare our BABYWALK to the following: (1) the SEQ2SEQ agent [3], being adapted to the
panoramic state and action space used in this work; (2) the Speaker Follower (SF) agent [55]; (3) the
Reinforced Cross-Modal Agent (RCM) [236] that refines the SF agent using reinforcement learning
with either goal-oriented reward (RCM(GOAL)) or fidelity-oriented reward (RCM(FIDELITY)); (4)
the Regretful Agent (REGRETFUL) [146] that uses a progress monitor that records visited path
and a regret module that performs backtracking; (5) the Frontier Aware Search with Backtracking
agent (FAST) [97] that incorporates global and local knowledge to compare partial trajectories in
different lengths.
The last 3 agents are reported having state-of-the art results on the benchmark datasets.
Except the SEQ2SEQ agent, all other agents depend on an additional pre-training stage with data
augmentation [55], which improves cross-board. Thus, we train two BABYWALK agents: one
with and the other without the data augmentation.
62
10 20 30 40 50 60 70 80 90 100 110 120 130 140 >=150
Instruction length (words)
0
5
10
15
20
25
30
35
40
SDTW (%)
5k
10k
15k
20k
25k
30k
# of Data
Seq2Seq
SF
RCM (FIDELITY)
BabyWalk
Figure 5.5: Performance by various agents on navigation tasks in different lengths. See texts for
details.
5.5.2 Main results
In-domain Generalization This is the standard evaluation scenario where a trained agent is
assessed on the unseen split from the same dataset as the training data. The leftmost columns
in Table 5.2 reports the results where the training data is from R4R. The BABYWALK agents
outperform all other agents when evaluated on CLS and SDTW.
When evaluated on SR, FAST performs the best and the BABYWALK agents do not stand
out. This is expected: agents which are trained to reach goal do not necessarily lead to better
instruction-following. Note that RCM(FIDELITY) performs well in path-following.
Out-of-domain Generalization While our primary goal is to train agents to generalize well to
longer navigation tasks, we are also curious how the agents perform on shorter navigation tasks
too. The right columns in Table 5.2 report the comparison. The BABYWALK agents outperform
all other agents in all metrics except SR. In particular, on SDTW, the generalization to R6R
and R8R is especially encouraging, resulting almost twice those of the second-best agent FAST.
Moreover, recalling from Fig. 5.1, BABYWALK’s generalization to R6R and R8R attain even better
performance than the RCM agents that are trained in-domain.
Fig. 5.5 provides additional evidence on the success of BABYWALK, where we have contrasted
to its performance to other agents’ on following instructions in different lengths across all datasets.
Clearly, the BABYWALK agent is able to improve very noticeably on longer instructions.
Qualitative Results Fig. 5.6 contrasts visually several agents in executing two (long) navigation
tasks. BABYWALK’s trajectories are similar to what human experts provide, while other agents’
are not.
63
HUMAN BABYWALK RCM SF SEQ2SEQ
Figure 5.6: Trajectories by human experts and VLN agents on two navigation tasks.
Setting R4R! R4R R4R! others
Metrics SR" CLS" SDTW" SR" CLS" SDTW"
f
SUMMARY
=
NULL 18.9 43.1 9.9 17.1 42.3 9.6
LSTM() 25.8 44.0 14.4 25.7 42.1 14.3
f
SUMMARY
=
P
m1
i=1
i
(), i.e., eqs. (5.2,5.3)
= 5 27.5 46.8 15.8 26.7 44.4 14.9
= 0:5 27.3 49.4 17.3 27.6 47.9 17.5
= 0:05 27.5 47.7 16.2 26.0 45.5 15.2
= 0 26.1 46.6 15.1 25.1 44.3 14.4
Table 5.3: The memory buffer is beneficial to generalizing to different tasks from on which the
agent is trained.
5.5.3 Analysis
Memory Buffer is Beneficial Table 5.3 illustrates the importance of having a memory buffer to
summarize the agent’s past experiences. Without the memory (NULL), generalization to longer
tasks is significantly worse. Using LSTM to summarize is worse than using forgetting to summarize
(eqs. (5.2,5.3)). Meanwhile, ablating
of the forgetting mechanism concludes that
= 0:5 is
the optimal to our hyperparameter search. Note that when
= 0, this mechanism degenerates to
taking average of the memory buffer, and leads to inferior results.
Curriculum-based RL (CRL) is Important Table 5.4 establishes the value of CRL. While
imitation learning (IL) provides a good warm-up for SR, significant improvement on other two
metrics come from the subsequent RL (IL+RL). Furthermore, CRL (with 4 “lectures”) provides
64
Setting R4R! R4R R4R! others
Metrics SR" CLS" SDTW" SR" CLS" SDTW"
IL 24.7 27.9 11.1 24.2 25.8 10.2
IL+RL 25.0 45.5 13.6 25.0 43.8 14.1
IL+ CRL w/ LECTURE #
1st 24.1 44.8 13.5 24.1 43.1 13.6
2nd 26.7 45.9 15.2 26.2 43.7 14.8
3rd 27.9 47.4 17.0 26.7 45.4 16.3
4th 27.3 49.4 17.3 27.6 47.9 17.5
Table 5.4: BABYWALK’s performances with curriculum-based reinforcement learning (CRL),
which improves imitation learning without or with reinforcement learning (IL+RL).
Eval ! R6R ! R8R
Training SR" CLS" SDTW" SR" CLS" SDTW"
R2R 21.7 49.0 11.2 20.7 48.7 9.8
R4R 25.5 47.2 13.6 23.1 46.0 11.1
Eval ! R2R ! R4R
Training SR" CLS" SDTW" SR" CLS" SDTW"
R2R 43.8 54.4 36.9 21.4 51.0 13.8
R4R 34.1 50.4 27.8 27.3 49.4 17.3
Table 5.5: (Top) BABYWALK trained on R2R is nearly as effective as the agent trained on R4R
when generalizing to longer tasks. (Bottom) BABYWALK trained on R2R adapts to R4R better
than the agent trained in the reverse direction.
clear improvements over direct RL on the entire instruction (i.e., learning to execute all BABY-
STEPs at once). Each lecture improves over the previous one, especially in terms of the SDTW
metric.
5.5.4 Revisiting ROOM2ROOM
Our experimental study has been focusing on using R4R as the training dataset as it was established
that as opposed to R2R, R4R distinguishes well an agent who just learns to reach the goal from an
agent who learns to follow instructions.
Given the encouraging results of generalizing to longer tasks, a natural question to ask, how
well can an agent trained on R2R generalize?
Results in Table 5.5 are interesting. Shown in the top panel, the difference in the averaged
performance of generalizing to R6R and R8R is not significant. The agent trained on R4R has a
65
small win on R6R presumably because R4R is closer to R6R than R2R does. But for even longer
tasks in R8R, the win is similar.
In the bottom panel, however, it seems that R2R! R4R is stronger (incurring less loss in
performance when compared to the in-domain setting R4R! R4R) than the reverse direction
(i.e., comparing R4R! R2R to the in-domain R2R! R2R). This might have been caused by the
noisier segmentation of long instructions into BABY-STEPs in R4R. (While R4R is composed of
two navigation paths in R2R, the segmentation algorithm is not aware of the “natural” boundaries
between the two paths.)
66
Part III
Learning with Limited and Growing Data
67
Chapter 6
Few-shot and Generalized Few-shot Learning
Building a high-quality classification system usually requires to have a large scale of annotated
training set with many shots per category. Many large-scale datasets such as ImageNet have an
ample number of instances for popular classes [113, 184]. However, the tail categories of the
distribution matters. For example, a visual search engine needs to deal with the rare object of
interests (e.g., endangered species) or newly defined items (e.g., new smartphone models), which
only possesses a few data instances. Directly training a neural networks over all classes is prone to
over-fit and can be biased towards the data-rich categories [20, 40, 91, 249, 265]. This motivates
the research of learning tail categories with limited training data, while maintaining efficacy on the
head categories, i.e., Few-shot and Generalized Few-shot Learning. This Chapter introduces the
problem setups of these learning situations, and compares them with existing literature and then
discusses their general formulation.
6.1 Overview
We begin by describing the development of settings in the theme of learning with limited data.
Starting with zero-shot learning, we analyze its motivation and limitation, which leads us to the
more practical problem of few-shot and generalized few-shot learning.
Zero-shot learning (ZSL) [1, 25, 121, 246] is a popular problem that addresses learning without
labeled data. It transfers the relationship between images and attributes learned from SEEN classes
to UNSEEN classes, using the semantic descriptions of objects as a bridge. For instance, popular
methods [23, 24] learn a mapping from the semantic descriptions (e.g., word embedings of category
name) to its corresponding visual prototype. As a result, one can infer the visual prototype of
unseen classes by looking at its corresponding semantic descriptions. Both ZSLs are limited
to recognizing objects with well-defined semantic descriptions, which assumes that the visual
68
appearance of novel categories is harder to obtain than knowledge about their attributes, whereas
in the real-world we often get the appearance of objects before learning about their characteristics.
Few-shot learning (FSL) proposes a more realistic setup, where we have access to a limited
number (e.g., one example) of visual exemplars from the tail classes [125, 228] in the deployment
of the visual system, and are required to recognize new instances of these tail categories. It places
the challenge asking for a classification model to rapidly pick up the key characteristic of those
few training examples, and use them to build effective classifiers.
To deal with this challenges, FSL algorithms typically simulates the learning situation they
encountering in the deployment time, by using the training data of SEEN classes. Specifically,
works uses meta-learning algorithms [53, 198] to extract the inductive bias from the SEEN classes,
and transfer it to the learning process of UNSEEN classes with few training data during the
model deployment. For example, one line of works uses meta-learned discriminative feature
embeddings [198, 228] together with the non-parametric nearest neighbor classifiers agnostic to
its context, to recognize novel classes given a few exemplars. Another line of works chooses to
learn the common optimization strategy [17, 180] across few-shot tasks. Such strategy adapts a
pre-specified model initialization to the context of the specific classification task, using gradient
descents over the few-shot UNSEEN training data [10, 53, 124, 135, 161].
Generalized Few-shot learning (GFSL) takes a further step towards real-world usage, where
a model is required to master the recognition of not only UNSEEN tail classes but also SEEN
head classes. As a result, generalized few-shot learning exposes two additional challenges. First,
an algorithm needs to construct classifiers not only for the few-shot tail classes but also for the
many-shot head classes. More importantly, the learning of two types of classifiers needs to be
integrated together such that the predictions are compatible to each other, without sacrificing
recalls on either head or tail classes.
In this part, we focus on studying the problem and methods for few-shot learning and gen-
eralized few-shot learning. Specifically, Chapter 7 introduces a different approach expanding
the embedding based few-shot learning methods with a new capability to adapt itself to the
classification context, while being more stable than stochastic optimization based approaches.
Meanwhile, Chapter 8 introduces two approaches specially designed for constructing the joint
GFSL classifiers, as well as a effective learning framework that simultaneously optimizes the
accuracy of both few-shot and many-shot classification.
69
6.2 Problem Description
We define aK-shotN-way task as a classification task withN classes andK training examples
per class. The training set (i.e., the support set) is represented asD
train
=f(x
i
;y
i
)g
NK
i=1
, where
x
i
2R
D
is an instance andy
i
2f0; 1g
N
(i.e., one-hot vector) is its label. Similarly, the test set
(a.k.a. the query set) isD
test
, which contains i.i.d. samples from the same distribution asD
train
.
Many-shot Learning. In many-shot learning where the K is large (up to hundreds), a clas-
sification model f : R
D
!f0; 1g
N
learns by optimizing over the instances from the head
classes
1
:
E
(x
i
;y
i
)2D
train
`(f(x
i
);y
i
)
Heref is often instantiated as an embedding function() : R
D
! R
d
and a linear classifier
2 R
dN
: f(x
i
) = (x
i
)
>
. We denote the weight vector of then-class as
n
. The loss
function `(;) measures the discrepancy between the prediction and the true label, which is
typically a cross-entropy loss.
Few-shot Learning (FSL). Different from many-shot learning, FSL faces the challenge in
transferring knowledge from head visual concepts to the tail visual concepts. It assumes two
non-overlapping sets of SEEN (S) and UNSEEN (U) classes. The target objective is to minimize
the loss over test examples of UNSEEN classes:
E
D
U
train
E
(x
j
;y
j
)2D
U
test
h
`
f
x
j
;D
U
train
;y
j
i
(6.1)
Here, functionf builds the classifiers of UNSEEN classes using the UNSEEN training setD
U
train
that minimizes loss overD
U
test
, denoted asf
x
j
;D
U
train
. Given that we do not have access to
the UNSEEN classes during the model training, one needs to make effective use of the SEEN classes
to encode the inductive bias into the functionf, which in the end minimizes the objective 6.1.
Generalized Few-shot Learning (GFSL). Different from FSL, GFSL additionally aims at
building a model that simultaneously predicts overS [ U categories. Such a model needs to
deal with many-shot classification fromjSj SEEN classes along side with learningjUj emerging
1
In the chapters of Part III, we assumes that head classes are SEEN classes and tail classes are UNSEEN classes, and
use these notations interchangeably
70
UNSEEN classes.
2
The objective of GFSL is similar to the one in FSL, except that now test
examples come from both SEEN and UNSEEN classes:
E
D
U
train
E
(x
j
;y
j
)2D
S[U
test
h
`
f
x
j
;D
U
train
;
S
;y
j
i
(6.2)
Different from Eq. 6.1, the GFSL classifierf
;D
U
train
;
S
takes both the UNSEEN class few-shot
training setD
U
train
and the set of many-shot classifiers
S
from the SEEN classes as input.
2
jSj andjUj denote the total number of classes from the SEEN and UNSEEN class sets, respectively.
71
Chapter 7
Few-shot Learning by Embedding Adaptation
Learning with limited data is a key challenge for visual recognition. Many few-shot learning
methods address this challenge by learning an instance embedding function from seen classes
and apply the function to instances from unseen classes with limited labels. This style of transfer
learning is task-agnostic: the embedding function is not learned optimally discriminative with
respect to the unseen classes, where discerning among them leads to the target task. In this
chapter, we propose a novel approach to adapt the instance embeddings to the target classification
task with a set-to-set function, yielding embeddings that are task-specific and are discriminative.
We empirically investigated various instantiations of such set-to-set functions and observed the
Transformer is most effective — as it naturally satisfies key properties of our desired model.
We denote this model as FEAT (few-shot embedding adaptation with Transformer) and validate
it on both the standard few-shot classification benchmark and four extended few-shot learning
settings with essential use cases, i.e., cross-domain, transductive, generalized few-shot learning,
and low-shot learning. It archived consistent improvements over baseline models as well as
previous methods, and established the new state-of-the-art performances on two benchmarks.
7.1 Motivation
Few-shot visual recognition [53, 119, 120, 125, 228] emerged as a promising direction in tackling
the challenge of learning new visual concepts with limited annotations. Concretely, it distinguishes
two sets of visual concepts: SEEN and UNSEEN ones. The target task is to construct visual
classifiers to identify classes from the UNSEEN where each class has a very small number of
exemplars (“few-shot”). The main idea is to discover transferable visual knowledge in the SEEN
classes, which have ample labeled instances, and leverage it to construct the desired classifier. For
example, state-of-the-art approaches for few-shot learning [186, 198, 217, 228] usually learn a
discriminative instance embedding model on the SEEN categories, and apply it to visual data in
UNSEEN categories. In this common embedding space, non-parametric classifiers (e.g., nearest
72
Malamute
Ant
School bus
Golden retriever
Theater curtain
Adaptation
Lion
School bus
Hourglass
Vase
Trifle
Adaptation
Trifle
Scoreboard
Golden retriever
Dalmatian
Vase
Adaptation
Golden retriever
Nematode
Lion
Dalmatian
Malamute
Adaptation
(a) Acc": 40:33%! 55:33% (b) Acc": 48:00%! 69:60% (c) Acc": 43:60%! 63:33% (d) Acc#: 56:33%! 47:13%
Figure 7.1: Qualitative visualization of model-based embedding adaptation procedure (imple-
mented using FEAT) on test tasks (refer to §7.5.2.2 for more details). Each figure shows the
locations of PCA projected support embeddings (class prototypes) before and after the adaptation
of FEAT. Values below are the 1-shot 5-way classification accuracy before and after the the
adaptation. Interestingly, the embedding adaptation step of FEAT pushes the support embeddings
apart from the clutter and toward their own clusters, such that they can better fits the test data of its
categories.
neighbors) are then used to avoid learning complicated recognition models from a small number
of examples.
Such approaches suffer from one important limitation. Assuming a common embedding space
implies that the discovered knowledge – discriminative visual features – on the SEEN classes are
equally effective for any classification tasks constructed for an arbitrary set of UNSEEN classes.
In concrete words, suppose we have two different target tasks: discerning “cat” versus “dog”
and discerning “cat” versus “tiger”. Intuitively, each task uses a different set of discriminative
features. Thus, the most desired embedding model first needs to be able to extract discerning
features for either task at the same time. This could be a challenging aspect in its own right as the
current approaches are agnostic to what those “downstream” target tasks are and could accidentally
de-emphasize selecting features for future use. Secondly, even if both sets of discriminative
features are extracted, they do not necessarily lead to the optimal performance for a specific target
task. The most useful features for discerning “cat” versus “tiger” could be irrelevant and noise to
the task of discerning “cat” versus “dog”!
What is missing from the current few-shot learning approaches is an adaptation strategy that
tailors the visual knowledge extracted from the SEEN classes to the UNSEEN ones in a target
task. In other words, we desire separate embedding spaces where each one of them is customized
such that the visual features are most discriminative for a given task. Towards this, we propose a
few-shot model-based embedding adaptation method that adjusts the instance embedding models
derived from the SEEN classes. Such model-based embedding adaptation requires a set-to-set
function: a function mapping that takes all instances from the few-shot support set and outputs the
set of adapted support instance embeddings, with elements in the set co-adapting with each other.
Such output embeddings are then assembled as the prototypes for each visual category and serve
73
Classification
Scores
CNN CNN CNN CNN
Soft Nearest
Neighbor
(a) Instance Embedding
Classification
Scores
Embedding
Adaptation
CNN CNN CNN CNN
Soft Nearest
Neighbor
Train Instance
Test Instance
Task Agnostic
Embedding
Task Specific
Embedding
(b) Embedding Adaptation
Set-to-Set Function
Figure 7.2: Illustration of the proposed Few-Shot Embedding Adaptation Transformer (FEAT).
Existing methods usually use the same embedding function E for all tasks. We propose to adapt
the embeddings to each target few-shot learning task with a set-to-set function such as Transformer,
BiLSTM, DeepSets, and GCN.
as the nearest neighbor classifiers. Figure 7.1 qualitatively illustrates the embedding adaptation
procedure (as results of our best model). These class prototypes spread out in the embedding space
toward the samples cluster of each category, indicating the effectiveness of embedding adaptation.
A variety of function approximators can be useed to implement the set-to-set transformation,
including bidirectional LSTM [74] (Bi-LSTM), deep sets [257], graph convolutional network
(GCN) [104], and Transformer [138, 222]. Our experimental results (refer to §7.5.2.1) suggest
that Transformer is the most parameter efficient choice that at the same time best implements the
key properties of the desired set-to-set transformation, including contextualization, permutation
invariance, interpolation and extrapolation capabilities (see §7.4.1). As a consequence, we choose
the set-to-set function instantiated with Transformer to be our final model and denote it as FEAT
(Few-shot Embedding Adaptation with Transformer). We further conduct comprehensive analysis
on FEAT and evaluate it on many extended tasks, including few-shot domain generalization,
transductive few-shot learning, and generalized few-shot learning.
7.2 Related Work
Methods specifically designed for few-shot learning fall broadly into two categories. The first
is to control how a classifier for the target task should be constructed. One fruitful idea is the
meta-learning framework where the classifiers are optimized in anticipation that a future update
due to data from a new task performs well on that task [7, 10, 53, 64, 124, 161, 180, 186], or the
classifier itself is directly meta-predicted by the new task data [174, 240].
74
Another line of approach has focused on learning generalizable instance embeddings [1, 23,
24, 78, 108, 152, 193, 217, 228] and uses those embeddings on simple classifiers such as nearest
neighbor rules. The key assumption is that the embeddings capture all necessarily discriminative
representations of data such that simple classifiers are sufficed, hence avoiding the danger of
overfitting on a small number of labeled instances. Early work such as [108] first validated the
importance of embedding in one-shot learning, whilst [228] proposes to learn the embedding
with a soft nearest neighbor objective, following a meta-learning routine. Recent advances have
leveraged different objective functions for learning such embedding models, e.g., considering the
class prototypes [198], decision ranking [217], and similarity comparison [206]. Most recently,
[187] utilizes the graph convolution network [104] to unify the embedding learning.
Our work follows the second school of thoughts. The main difference is that we do not
assume the embeddings learned on SEEN classes, being agnostic to the target tasks, are necessarily
discriminative for those tasks. In contrast, we propose to adapt those embeddings for each target
task with a set-to-set function so that the transformed embeddings are better aligned with the
discrimination needed in those tasks. We show empirically that such task-specific embeddings
perform better than task-agnostic ones. We note that MetaOptNet [123] and CTM [127] are two
concurrent works to us following the same spirit of learning task-specific embedding (or classifiers)
via either explicitly optimization of target task or using concentrator and projector to make distance
metric task-specific.
7.3 Preliminary
As aforementioned in §6.2, the ultimate goal of few-shot learning is to find a functionf that
classifies a test instancex
j
D
U
test
as ^ y
j
=f(x
j
;D
U
train
)2f0; 1g
N
for classes sampled from
the UNSEEN categoriesU.
Since we do not have access to the UNSEEN classes during the model training, we learn this
classifiation functionf by simulating theK-shotN-way FSL tasks as meta-learning [53, 198, 228].
In particular, aK-shotN-way taskD
S
train
sampled fromS is constructed by randomly choosing
N classes fromS withK examples in each of them.
1
The main idea is to mimic the future few-shot
learning scenario (over UNSEEN classes) via optimizing a sharedf on theK-shotN-way sampled
tasks drawn from the SEEN class setsS:
E
(D
S
train
;D
S
test
)S
E
(x
j
;y
j
)2D
S
test
h
`
f
x
j
;D
S
train
;y
j
i
(7.1)
1
We use the super-scriptS andU to denote a set or an instance sampled fromS andU, respectively.
75
Eq. 7.1 approximates the Eq. 6.1 with the SEEN class data, and thef are applied to different
few-shot tasks constructed by the data of SEEN classes. We denote tasks and classes related toS
andU as “meta-training” and “meta-test”, respectively.
In this chapter, we consider the hypothesis class off to be the embedding-based classifiers [198,
228] (see Figure 7.2 (a) for an overview). In particular, the classifierf() is composed of two
elements. The first is an embedding function
x
= E(x)2 R
d
that maps an instancex to a
representation space. The second component applies the nearest neighbor classifiers in this space:
^ y
j
=f (x
j
;D
train
)
=
X
(x
i
;y
i
)2D
train
sim ((x
j
);(x
i
))y
i
sim((x
j
);(x
i
)) measures the similarity between the test instance(x
j
) and each training
instance(x
i
). When there is more than one instance per class, i.e.,K > 1, instances in the same
class can be averaged to assist make a final decision [198]. Note that only the embedding function
is learned by optimizing the loss in Eq. 7.1. For reasons to be made clear in below, we refer this
embedding function as task-agnostic.
7.4 Embedding Adaptation for Task-specific FSL
In what follows, we describe our approach for few-shot learning (FSL). We start by describing the
main idea (§7.4.1, also illustrated in Figure 7.2), then introduce the set-to-set adaptation function
(§7.4.2). Last are learning framework of the proposed model (§7.4.3).
7.4.1 Adapting to Task-Specific Embeddings
The key component of our approach is to learn task-specific embeddings, via adapting the
task-agnostic embeddings in the context of the specific classification task. We argue that the
embedding(x) is not ideal. In particular, the embeddings do not necessarily highlight the most
discriminative representation for a specific target task. To this end, we introduce an adaption step
where the embedding function(x) (more precisely, its values on instances) is transformed. This
transformation is a set-to-set function that contextualizes over the image instances of a set, to enable
strong co-adaptation of each item. Instance functions fails to have such co-adaptation property.
Furthermore, the set-to-set-function receives instances as bags, or sets without orders, requiring
76
Algorithm 1 Training strategy of embedding adaptation
Require: Seen class set $\mathcal{S}$
 1: for all iteration = 1, ..., MaxIteration do
 2:   Sample an $N$-way $K$-shot task $(\mathcal{D}^{\mathcal{S}}_{\mathrm{train}}, \mathcal{D}^{\mathcal{S}}_{\mathrm{test}})$ from $\mathcal{S}$
 3:   Compute $\phi(x) = \mathbf{E}(x)$ for $x \in \mathcal{X}^{\mathcal{S}}_{\mathrm{train}} \cup \mathcal{X}^{\mathcal{S}}_{\mathrm{test}}$
 4:   for all $(x^{\mathcal{S}}_j, y^{\mathcal{S}}_j) \in \mathcal{D}^{\mathcal{S}}_{\mathrm{test}}$ do
 5:     Compute $\{\psi(x); \forall x \in \mathcal{X}^{\mathcal{S}}_{\mathrm{train}}\}$ with $\mathbf{T}$ via Eq. 7.2
 6:     Predict $\hat{y}^{\mathcal{S}}_j$ with $\{\psi(x)\}$ as in Eq. 7.3
 7:     Compute $\ell(\hat{y}^{\mathcal{S}}_j, y^{\mathcal{S}}_j)$ as in Eq. 7.1
 8:   end for
 9:   Compute $\nabla_{\mathbf{E}, \mathbf{T}} \sum_{(x^{\mathcal{S}}_j, y^{\mathcal{S}}_j) \in \mathcal{D}^{\mathcal{S}}_{\mathrm{test}}} \ell(\hat{y}^{\mathcal{S}}_j, y^{\mathcal{S}}_j)$
10:   Update $\mathbf{E}$ and $\mathbf{T}$ with $\nabla_{\mathbf{E}, \mathbf{T}}$ using SGD
11: end for
12: return Embedding function $\mathbf{E}$ and set function $\mathbf{T}$.
the function to output the set of refined instance embeddings while being permutation-invariant.
Concretely,

$\{\psi(x); \forall x \in \mathcal{X}_{\mathrm{train}}\} = \mathbf{T}(\{\phi(x); \forall x \in \mathcal{X}_{\mathrm{train}}\}) = \mathbf{T}(\pi(\{\phi(x); \forall x \in \mathcal{X}_{\mathrm{train}}\}))$   (7.2)

where $\mathcal{X}_{\mathrm{train}}$ is the set of all instances in the training set $\mathcal{D}_{\mathrm{train}}$ for the target task, and $\pi(\cdot)$ is a
permutation operator over a set. Thus the set of adapted embeddings will not change if we apply
a permutation to the input embedding set. With the adapted embeddings $\psi(x)$, a test instance
$(x_j, y_j) \sim \mathcal{D}_{\mathrm{test}}$ can be classified by computing nearest neighbors w.r.t. $\mathcal{D}_{\mathrm{train}}$:

$\hat{y}_j = f\big(\phi(x_j), \{\psi(x); \forall (x, y) \in \mathcal{D}_{\mathrm{train}}\}\big)$   (7.3)

Our approach is generally applicable to different types of task-agnostic embedding function
$\mathbf{E}$ and similarity measure $\mathrm{sim}(\cdot, \cdot)$, e.g., the (normalized) cosine similarity [228] or the negative
distance [198]. Both the embedding function $\mathbf{E}$ and the set transformation function $\mathbf{T}$ are optimized
over synthesized FSL tasks sampled from $\mathcal{D}^{\mathcal{S}}$, as sketched in Alg. 1. Its key difference from
conventional FSL is in lines 4 to 8, where the embeddings are transformed.
7.4.2 Embedding Adaptation using Neural Networks
Next, we explain various choices of neural networks to implement the set-to-set embedding
adaptation function, whose input and output are the set of instance embeddings and the set of
contextualized embeddings, respectively.
Bidirectional LSTM (BILSTM) [74, 228] is one of the common choices to instantiate the set-
to-set transformation, where the addition between the input and the hidden-layer outputs of each
BILSTM cell yields the adapted embedding. Notably, the output of the BILSTM depends on the
order of the input set. Using BILSTM as the embedding adaptation model is similar to but different
from the fully conditional embedding [228]: the latter contextualizes both training and test instance
embeddings altogether, which results in a transductive setting.
DeepSets [257] is inherently a permutation-invariant transformation function. It is worth noting
that DEEPSETS aggregates the instances in a set into a holistic set vector. We implement such a
DeepSets transformation with two components: an instance-centric vector combined with a set
context vector. For $x \in \mathcal{X}_{\mathrm{train}}$, we define its complementary set as $\complement_x$. We then implement
DEEPSETS by:

$\psi(x) = \phi(x) + g\Big(\big[\phi(x); \sum_{x_{i'} \in \complement_x} h(\phi(x_{i'}))\big]\Big)$   (7.4)

In Eq. 7.4, $g$ and $h$ are two-layer multi-layer perceptrons (MLPs) with ReLU activations, which map
the embedding into another space and increase the representation ability of the embedding. For
each instance, the embeddings in its complementary set are first combined into a set vector as the
context, and then this vector is concatenated with the input embedding to obtain the residual
component of the adapted embedding. This conditioned embedding takes other instances in the set
into consideration, and keeps the permutation-invariance property. In practice, we find that using the
maximum operator in Eq. 7.4 works better than the sum operator suggested in [257].
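A minimal sketch of this DeepSets-style adaptation is given below, read off Eq. 7.4 with the max operator. The two-layer MLPs `g` and `h` and the hidden width are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class DeepSetsAdapter(nn.Module):
    """Adapt each embedding with a pooled context of its complementary set (cf. Eq. 7.4)."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.g = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, phi):                      # phi: (n, d) set of instance embeddings
        n = phi.size(0)
        h_all = self.h(phi)                      # (n, d) transformed instances
        # For instance i, pool h(.) over its complementary set (all j != i) with max.
        mask = ~torch.eye(n, dtype=torch.bool, device=phi.device)
        ctx = torch.stack([h_all[mask[i]].max(dim=0).values for i in range(n)])  # (n, d)
        # Residual update; the output set stays permutation-invariant.
        return phi + self.g(torch.cat([phi, ctx], dim=1))
```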
Graph Convolutional Networks (GCN) [104, 187] propagate the relationship between instances
in the set. We first construct an adjacency matrix $A$ to represent the similarity between instances in
a set: if two instances come from the same class, we set the corresponding element in $A$ to 1, and
otherwise to 0. Based on $A$, we build the "normalized" adjacency matrix $S$ for a given set with added
self-loops, $S = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}$, where $I$ is the identity matrix and $D$ is the diagonal matrix whose
elements are equal to the sum of the elements in the corresponding row of $A + I$.

Let $\Phi^0 = \{\phi(x); \forall x \in \mathcal{X}_{\mathrm{train}}\}$. The relationship between instances can then be propagated
based on $S$, i.e.,

$\Phi^{t+1} = \mathrm{ReLU}(S \Phi^t W), \quad t = 0, 1, \ldots, T-1$   (7.5)

where $W$ is a projection matrix for feature transformation. In GCN, the embeddings in the set are
transformed based on Eq. 7.5 multiple times, and the final $\Phi^T$ gives rise to $\{\psi(x)\}$.
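The propagation rule of Eq. 7.5 can be sketched as follows. This is a simplified, assumed implementation in which the adjacency is built from the support labels and a single learnable projection $W$ is shared across the $T$ propagation steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNAdapter(nn.Module):
    """Propagate embeddings over the label-induced graph (cf. Eq. 7.5)."""

    def __init__(self, dim, steps=2):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)  # feature transformation W
        self.steps = steps

    def forward(self, phi, labels):               # phi: (n, d), labels: (n,)
        n = phi.size(0)
        eye = torch.eye(n, device=phi.device)
        same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
        A = same * (1.0 - eye)                    # same-class edges, no self-edges
        A_tilde = A + eye                         # add self-loops
        deg = A_tilde.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        S = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # "normalized" adjacency
        out = phi
        for _ in range(self.steps):               # Eq. 7.5, repeated T times
            out = F.relu(S @ self.W(out))
        return out
```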
Transformer [222]. We use the Transformer architecture [222] to implement $\mathbf{T}$. In particular,
we employ the self-attention mechanism [138, 222] to transform each instance embedding with
consideration of its contextual instances. It naturally satisfies the desired properties of $\mathbf{T}$,
because the Transformer is permutation-invariant and has a strong capability for contextualizing a
large set of embeddings. Therefore, we use the Transformer as our primary model to instantiate the
set-to-set function, and denote it as Few-Shot Embedding Adaptation with Transformer (FEAT).

The Transformer operates on triplets in the form of (query $\mathcal{Q}$, key $\mathcal{K}$, and value $\mathcal{V}$). To compute
proximities and return values, those points are first linearly mapped into some space, $K = \{W_K^\top \phi(x_k); \forall x_k \in \mathcal{K}\} \in \mathbb{R}^{d \times |\mathcal{K}|}$,
and likewise for $\mathcal{Q}$ and $\mathcal{V}$ with $W_Q$ and $W_V$, respectively. The Transformer computes the right
value for a query point: the query $x_q \in \mathcal{Q}$ is first matched against the list of keys $K$, where each
key has a corresponding value in $V$. The final value is then returned as the sum of all the values
weighted by the proximity of the key to the query point, i.e., $\psi(x_q) = \phi(x_q) + \sum_k \alpha_{qk} V_{:,k}$, where

$\alpha_{qk} \propto \exp\Big(\frac{\phi(x_q)^\top W_Q \cdot K_{:,k}}{\sqrt{d}}\Big)$   (7.6)

and $V_{:,k}$ is the $k$-th column of $V$. In the standard FSL setup, we have $\mathcal{Q} = \mathcal{K} = \mathcal{V} = \mathcal{X}_{\mathrm{train}}$.
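A single-head version of this self-attention adaptation can be sketched as below. It omits the dropout, layer normalization, and multi-head projections of the full Transformer block in [222], and the module names are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionAdapter(nn.Module):
    """Single-head dot-product attention over the support set (cf. Eq. 7.6)."""

    def __init__(self, dim):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)

    def forward(self, phi):                         # phi: (n, d); query = key = value = X_train
        Q, K, V = self.W_Q(phi), self.W_K(phi), self.W_V(phi)
        attn = torch.softmax(Q @ K.t() / math.sqrt(K.size(-1)), dim=-1)   # (n, n) proximities
        # Residual update: each embedding absorbs an attention-weighted sum of the values.
        return phi + attn @ V
```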
7.4.3 Contrastive Learning of Intra-Class and Inter-Class Relation
To facilitate the learning of embedding adaptation, we apply a contrastive objective in addition
to the general one. It is designed to make sure that instance embeddings after adaptation are
similar to their same-class neighbors and dissimilar to those from different classes. Specifically,
the embedding adaptation function $\mathbf{T}$ is applied to the instances of each of the $N$ classes in
$\mathcal{D}^{\mathcal{S}}_{\mathrm{train}} \cup \mathcal{D}^{\mathcal{S}}_{\mathrm{test}}$, which gives rise to the transformed embeddings $\psi'(x)$ and class centers $\{c_n\}_{n=1}^{N}$.
We then apply the contrastive objective to make sure training instances are closer to their own class
center than to other centers, which augments Eq. 7.1 with an additional regularization term
$\lambda \cdot \ell\big(\mathrm{softmax}(\mathrm{sim}(\psi'(x_j), c_n)), y_j\big)$, where $\lambda$ controls the balance of the two objectives. This
contrastive learning makes the set transformation extract common characteristics for instances of
the same category, so as to preserve the category-wise similarity.
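Read as code, the regularizer is an extra cross-entropy term computed between the adapted instance embeddings and the adapted class centers; the sketch below is one plausible instantiation, with negative squared distance as the similarity and `lam` standing for the balancing weight $\lambda$.

```python
import torch
import torch.nn.functional as F

def contrastive_regularizer(adapted_emb, labels, centers, lam=0.1):
    """Pull adapted instances toward their own (adapted) class center, away from others.

    adapted_emb: (n, d) embeddings after the set-to-set transformation T
    labels:      (n,) class indices in [0, N)
    centers:     (N, d) adapted class centers
    """
    logits = -torch.cdist(adapted_emb, centers) ** 2      # similarity to every center
    return lam * F.cross_entropy(logits, labels)
```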
7.5 Experiments
In this section, we first evaluate a variety of models for embedding adaptation in §7.5.2 under
standard FSL, concluding that FEAT (with the Transformer) is the most effective approach among
the different instantiations. Next, we perform ablation studies in §7.5.2.2 to analyze FEAT in detail.
Finally, we evaluate FEAT on multi-domain few-shot and transductive few-shot learning to study its
general applicability (§7.5.3).
7.5.1 Experimental Setups
Datasets. Four datasets are investigated in this paper: MiniImageNet [228], TieredImageNet [181],
Caltech-UCSD Birds (CUB) 200-2011 [230], and OfficeHome [224]. Each dataset is split into three
parts based on non-overlapping sets of classes, for model training (a.k.a. meta-training in the
literature), model validation (a.k.a. meta-val), and model evaluation (a.k.a. meta-test). MiniImageNet [228]
and TieredImageNet [181] are subsets of ImageNet [184]. MiniImageNet includes a total of 100 classes
and 600 examples per class. We follow the setup provided by [180], and use 64 classes as SEEN
categories, with 16 and 20 classes as two sets of UNSEEN categories for model validation and
evaluation, respectively. TieredImageNet is a large-scale dataset with more categories, containing
351, 97, and 160 categories for model training, validation, and evaluation, respectively. The CUB
dataset was initially designed for fine-grained classification and contains 11,788 images of birds over
200 species. On CUB, we randomly sample 100 species as SEEN classes, and another two sets of 50
species each are used as the UNSEEN sets for model validation and evaluation [217]. For all images in
the CUB dataset, we use the provided bounding box to crop the images as a pre-processing step [217].
Before being input into the backbone network, all images are resized based on the requirement of the
network. In addition, we investigate the OfficeHome [224] dataset to validate the generalization ability
of FEAT across domains. There are four domains in OfficeHome, and two of them ("Clipart" and
"Real World") are selected, which together contain 8,722 images. After randomly splitting all classes,
25 classes serve as the SEEN classes to train the model, and the remaining 15 and 25 classes are used
as two UNSEEN splits for evaluation.
Evaluation protocols. Previous approaches [53, 198, 217] usually follow the original setting
of [228] and evaluate models on 600 sampled target tasks (15 test instances per class). A later
study [186] suggested that such an evaluation process could introduce high variance. Therefore, we
follow the newer and more trustworthy evaluation setting, evaluating both the baseline models and
our approach on 10,000 sampled tasks. We report the mean accuracy (in %) as well as the 95%
confidence interval.
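For concreteness, the mean accuracy and 95% confidence interval over the 10,000 sampled tasks can be computed as below (a standard normal-approximation interval; the task-sampling routine itself is omitted).

```python
import numpy as np

def summarize_accuracy(task_accuracies):
    """task_accuracies: list/array of per-task accuracies (in %), e.g. of length 10,000."""
    acc = np.asarray(task_accuracies, dtype=np.float64)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))   # 95% confidence interval half-width
    return mean, ci95
```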
Implementation details. We consider three different types of convolutional networks as the
backbone for the instance embedding function $\mathbf{E}$: 1) a 4-layer convolutional network (ConvNet) [198,
217, 228], 2) the 12-layer residual network (ResNet) used in [123], and 3) the Wide Residual
Network (WideResNet) [186, 256]. Due to space limits, results on WideResNet are deferred to the
appendix. We apply an additional pre-training stage for the backbones over the SEEN classes, based
on which our re-implemented methods are further optimized. To obtain more precise embeddings,
we average the same-class instances in the training set before the embedding adaptation with the
set-to-set transformation. Adam [103] and SGD are used to optimize the ConvNet and ResNet variants,
respectively. Moreover, we follow the most standard implementations for the four set-to-set
functions: BiLSTM [74], DeepSets [257], Graph Convolutional Networks (GCN) [104], and
Transformer (FEAT) [222]. The code of our models is available at https://github.com/Sha-Lab/FEAT.
Baseline and embedding adaptation methods. We re-implement the prototypical network
(ProtoNet) [198] as a task-agnostic embedding baseline. This is known to be a very strong
approach [33] when the backbone architecture is deep, i.e., residual networks [67]. As suggested
by [164], we carefully tune the scalar temperature to scale the logits of both approaches in our
re-implementation. As mentioned, we implement the embedding adaptation model with four
different function approximators, denoted as BILSTM, DEEPSETS, GCN, and FEAT (i.e.,
Transformer).
Backbone pre-training. Instead of optimizing from scratch, we apply an additional pre-training
strategy as suggested in [174, 186]. The backbone network, appended with a softmax layer, is
trained to classify all SEEN classes with the cross-entropy loss (e.g., 64 classes in MiniImageNet).
The classification performance over the penultimate-layer embeddings on sampled 1-shot tasks
from the model validation split is used to select the best pre-trained model, whose weights are
then used to initialize the embedding function $\mathbf{E}$ for few-shot learning.
Pre-training strategy. As mentioned before, we apply an additional pre-training strategy as
suggested in [174, 186]. The backbone network, appended with a softmax layer, is trained to
classify all classes in the SEEN class split (e.g., 64 classes in MiniImageNet) with the cross-entropy
loss. In this stage, we apply image augmentations such as random crop, color jittering, and random
flip to increase the generalization ability of the model. After each epoch, we validate the pre-trained
weights based on their few-shot classification performance on the model validation split. Specifically,
we randomly sample 200 1-shot $N$-way few-shot learning tasks ($N$ equals the number of classes in
the validation split, e.g., 16 in MiniImageNet), each of which contains 1 instance per class in the
support set and 15 instances per class for evaluation. Based on the penultimate-layer instance
embeddings of the pre-trained weights, we use nearest neighbor classifiers over the few-shot tasks
to evaluate the quality of the backbone. We select the pre-trained weights with the best few-shot
classification accuracy on the validation set. The pre-trained weights are used to initialize the
embedding backbone $\mathbf{E}$, and the weights of the whole model are then optimized together during
model training.
Transformer hyper-parameters. We follow the architecture presented in [222] to build our
FEAT model. The hidden dimension $d'$ for the linear transformation in FEAT is set to 64 for
ConvNet and 640 for ResNet/WRN. The dropout rate in the Transformer is set to 0.5. We
empirically observed that a shallow Transformer (with one set of projections and one stacked layer)
gives the best overall performance (also studied in §7.5.2.2).
Optimization. Following the literature, different optimizers are used for the backbones during
model training. For the ConvNet backbone, the Adam [103] optimizer is employed, with the initial
learning rate set to 0.002. For the ResNet and WRN backbones, vanilla stochastic gradient descent
with Nesterov acceleration is used, with an initial learning rate of 0.001. We fix the weight decay in
SGD to 5e-4 and the momentum to 0.9. The schedule of the optimizers is tuned on the validation
part of each dataset. As the backbone network is initialized with the pre-trained weights, we scale
the learning rate for those parameters by 0.1.
7.5.2 Standard Few-Shot Image Classification
We compare our proposed FEAT method with the instance embedding baselines as well as previous
methods on the standard MiniImageNet [228] and TieredImageNet [181] benchmarks, and then
perform a detailed analysis of the ablated models. We also include additional results on the CUB [230]
dataset, which show a similar trend.
7.5.2.1 Main Results
Comparison to previous state-of-the-art methods. Table 7.1 and Table 7.2 show the results of our
method and others on MiniImageNet and TieredImageNet. First, we observe that the best embedding
adaptation method (FEAT) outperforms the instance embedding baseline on both datasets,
indicating the effectiveness of learning a task-specific embedding space. Meanwhile, the FEAT
model performs significantly better than the current state-of-the-art methods on the MiniImageNet
dataset. On TieredImageNet, we observe that the ProtoNet baseline is already better than
some previous state-of-the-art methods based on the 12-layer ResNet backbone. This might be due to the
Table 7.1: Few-shot classification accuracy ± 95% confidence interval on MiniImageNet with
ConvNet and ResNet backbones. Our implemented methods are measured over 10,000 test trials.

Setups →             1-Shot 5-Way                  5-Shot 5-Way
Backbone →           ConvNet        ResNet         ConvNet        ResNet
MatchNet [228]       43.40 ± 0.78   -              51.09 ± 0.71   -
MAML [53]            48.70 ± 1.84   -              63.11 ± 0.92   -
ProtoNet [198]       49.42 ± 0.78   -              68.20 ± 0.66   -
RelationNet [206]    51.38 ± 0.82   -              67.07 ± 0.69   -
PFA [174]            54.53 ± 0.40   -              67.87 ± 0.20   -
TADAM [164]          -              58.50 ± 0.30   -              76.70 ± 0.30
MetaOptNet [123]     -              62.64 ± 0.61   -              78.63 ± 0.46
Baselines
MAML                 49.24 ± 0.21   58.05 ± 0.10   67.92 ± 0.17   72.41 ± 0.20
MatchNet             52.87 ± 0.20   65.64 ± 0.20   67.49 ± 0.17   78.72 ± 0.15
ProtoNet             52.61 ± 0.20   62.39 ± 0.21   71.33 ± 0.16   80.53 ± 0.14
Embedding Adaptation
BILSTM               52.13 ± 0.20   63.90 ± 0.21   69.15 ± 0.16   80.63 ± 0.14
DEEPSETS             54.41 ± 0.20   64.14 ± 0.22   70.96 ± 0.16   80.93 ± 0.14
GCN                  53.25 ± 0.20   64.50 ± 0.20   70.59 ± 0.16   81.65 ± 0.14
Ours: FEAT           55.15 ± 0.20   66.78 ± 0.20   71.61 ± 0.16   82.05 ± 0.14
effectiveness of the pre-training stage on TieredImageNet, as it is larger than MiniImageNet
and a fully converged model can itself be very effective. On top of this, all embedding adaptation
approaches further improve over ProtoNet in almost all cases, with FEAT achieving the best
performance among all approaches. Note that our pre-training strategy is most similar to the
one used in PFA [174], while we further fine-tune the backbone. Temperature scaling of the logits
strongly influences the performance when fine-tuning over the pre-trained weights.

Comparison among the embedding adaptation models. Among the four embedding adaptation
methods, we observe that BILSTM in most cases achieves the worst performance and
sometimes performs worse than ProtoNet. This is partially due to the fact that BILSTM cannot
easily implement the required permutation-invariance property (also shown in [257]), which confuses
the learning process of embedding adaptation. Secondly, we find that DEEPSETS and GCN have the
ability to adapt discriminative task-specific embeddings but do not achieve consistent performance
improvements over the baseline ProtoNet, especially on MiniImageNet with the ConvNet backbone.
A potential explanation is that such models, when jointly learned with the backbone, can make the
optimization process more difficult, which leads to the varying final performances. In
Table 7.2: Few-shot classification accuracy and 95% confidence interval on TieredImageNet with
the ResNet backbone.

Setups →             1-Shot 5-Way   5-Shot 5-Way
ProtoNet [198]       53.31 ± 0.89   72.69 ± 0.74
RelationNet [206]    54.48 ± 0.93   71.32 ± 0.78
MetaOptNet [123]     65.99 ± 0.72   81.56 ± 0.63
CTM [127]            68.41 ± 0.39   84.28 ± 1.73
SimpleShot [237]     69.09 ± 0.22   84.58 ± 0.16
Instance embedding
ProtoNet             68.23 ± 0.23   84.03 ± 0.16
Embedding adaptation
BILSTM               68.14 ± 0.23   84.23 ± 0.16
DEEPSETS             68.59 ± 0.24   84.36 ± 0.16
GCN                  68.20 ± 0.23   84.64 ± 0.16
FEAT                 70.80 ± 0.23   84.79 ± 0.16
Table 7.3: Few-shot classification performance with the ConvNet backbone on the CUB dataset
(mean accuracy ± 95% confidence interval). Our implemented methods are measured over 10,000
test trials.

Setups →             1-Shot 5-Way   5-Shot 5-Way
MatchNet [228]       61.16 ± 0.89   72.86 ± 0.70
MAML [53]            55.92 ± 0.95   72.09 ± 0.76
ProtoNet [198]       51.31 ± 0.91   70.77 ± 0.69
RelationNet [206]    62.45 ± 0.98   76.11 ± 0.69
Instance Embedding
MatchNet             67.73 ± 0.23   79.00 ± 0.16
ProtoNet             63.72 ± 0.22   81.50 ± 0.15
Embedding Adaptation
BILSTM               62.05 ± 0.23   73.51 ± 0.19
DEEPSETS             67.22 ± 0.23   79.65 ± 0.16
GCN                  67.83 ± 0.23   80.26 ± 0.15
Ours: FEAT           68.87 ± 0.22   82.90 ± 0.15
contrast, we observe that FEAT consistently improves over ProtoNet and the other embedding adaptation
approaches in all cases, without additional bells and whistles. This shows that the Transformer, as
a set-to-set function, can implement rich interactions between instances, which gives it high
expressiveness for modeling the embedding adaptation process.
[Figure 7.3: bar charts of mean accuracy (in %) versus the number of categories per task (5, 10,
15, 20) for BILSTM, DeepSets, GCN, and FEAT; panel (a) shows way interpolation and panel (b)
way extrapolation.]

(a) Way Interpolation (b) Way Extrapolation
Figure 7.3: Interpolation and extrapolation of few-shot tasks from the "way" perspective. First, we
train various embedding adaptation models on 1-shot 20-way (a) or 5-way (b) classification
tasks and evaluate the models on unseen tasks with different numbers of classes (N = {5, 10, 15, 20}).
It shows that FEAT is superior in terms of way interpolation and extrapolation ability.
Table 7.3 shows the 5-way 1-shot and 5-shot classification results on the CUB dataset with
the ConvNet backbone. The results on CUB are consistent with the trend on the MiniImageNet
and TieredImageNet datasets. Embedding adaptation indeed assists the embedding encoder for
few-shot classification tasks. Facilitated by the set-function property, DEEPSETS works better
than its BILSTM counterpart. Among all methods, FEAT obtains top-tier results.

Interpolation and extrapolation of classification ways. Next, we study different set-to-set
functions on their capability of interpolating and extrapolating across the number of classification
ways. To do so, we train each variant of the embedding adaptation functions with 1-shot 20-way
and 1-shot 5-way tasks, and measure the performance change as a function of the number of
categories at test time. We report the mean accuracies evaluated on few-shot classification
with N = {5, 10, 15, 20} classes, and show the results in Figure 7.3. Surprisingly, we observe that
FEAT achieves almost the same numerical performance in both the extrapolation and interpolation
scenarios, which further displays its strong capability for learning the set-to-set transformation.
Meanwhile, we observe that DEEPSETS works well for interpolation but fails at extrapolation,
as its performance drops significantly with larger N. In contrast, GCN achieves strong
extrapolation performance but does not work as effectively for interpolation. BILSTM performs the
worst in both cases, as it is by design not permutation-invariant and may have fitted an arbitrary
dependency between instances.
Table 7.4: Number of parameters introduced by each set-to-set function in addition to the
backbone's parameters.

           BILSTM   DEEPSETS   GCN    FEAT
ConvNet    25K      82K        33K    16K
ResNet     2.5M     8.2M       3.3M   1.6M
Parameter efficiency. Table 7.4 shows the number of additional parameters introduced by each
set-to-set function. We observe that, with both the ConvNet and ResNet backbones, FEAT has the
smallest number of parameters among all approaches while achieving the best performance in the
various respects discussed above, which highlights its high parameter efficiency.

From all of the above, we conclude that: 1) learning embedding adaptation with a set-to-set model is
very effective for modeling task-specific embeddings in few-shot learning; and 2) FEAT is the most
parameter-efficient function approximator and achieves the best empirical performance, together
with the desirable permutation-invariance property and strong interpolation/extrapolation capability
over the classification way.
7.5.2.2 Ablation Studies
We analyze FEAT and its ablated variants on the MiniImageNet dataset with the ConvNet backbone.

What does the embedding adaptation look like qualitatively? We sample four few-shot learning
tasks and learn a principal component analysis (PCA) model (which projects embeddings into a
2-D space) using the instance embeddings of the test data. We then apply this learned PCA
projection to both the support set's pre-adapted and post-adapted embeddings. The results are
shown in Figure 7.1 (at the beginning of the paper). In three out of the four examples, the post-adaptation
embeddings of FEAT improve over the pre-adaptation embeddings. Interestingly, we found that the
embedding adaptation step of FEAT has the tendency of pushing the support embeddings apart
from the clutter, such that they better fit the test data of their categories. In the negative example,
where post-adaptation degrades the performance, we observe that the embedding adaptation
step has pushed the two support embeddings "Golden Retriever" and "Lion" too close to each other.
This qualitatively shows that the adaptation is crucial for obtaining superior performance and helps to
contrast against task-agnostic embeddings.
Table 7.5: We evaluate our model on two additional few-shot learning tasks: (a) multi-domain
few-shot learning and (b) transductive few-shot learning. We observe that FEAT consistently
outperforms all previous methods or baselines.

(a) Multi-domain FSL
             C → C    C → R
Supervised   34.38    29.49
ProtoNet     35.51    29.47
FEAT         36.83    30.89

(b) Transductive FSL
             1-Shot   5-Shot
TPN [139]    55.51    69.86
TEAM [173]   56.57    72.04
FEAT         57.04    72.89
[Figure 7.4: two example cross-domain tasks; the train (support) set comes from the "Clipart"
domain and the test set from the "Real World" domain, over classes such as Drill, Bed, TV, Flower,
Screwdriver, Curtains, Refrigerator, and Sneakers.]

Figure 7.4: Qualitative results of few-shot domain generalization for FEAT. Correctly classified
examples are shown in red boxes and incorrectly classified ones in blue boxes. We visualize one
task where FEAT succeeds (top) and one where it fails (bottom).
7.5.3 Extended Few-Shot Learning Tasks
In this section, we evaluate FEAT on two extended few-shot learning tasks: multi-domain FSL and
transductive FSL [139, 181].

Multi-domain FSL. This setting assumes that the examples in the UNSEEN support and test sets can
come from different domains, e.g., sampled from different distributions [46, 90]. An example of this
task can be found in Figure 7.4. It requires a model to recognize the intrinsic properties of objects
rather than their texture, and is de facto analogical recognition.

Transductive FSL. The key difference between standard and transductive FSL is whether test
instances arrive one at a time or all simultaneously. The latter setup allows the structure of the
unlabeled test instances to be utilized. Therefore, the prediction depends on both the training
(support) instances and all the available test instances of the target task from the UNSEEN categories.
7.5.3.1 Few-Shot Domain Generalization
We show that FEAT learns to adapt to the intrinsic structure of tasks and generalizes across domains,
i.e., predicting test instances even when the visual appearance is changed.

Setups. We train the FSL model on the standard domain and evaluate it on cross-domain tasks,
where the N categories are aligned but the domains differ. In detail, a model is trained on tasks
from the "Clipart" domain of the OfficeHome dataset [224], and is then required to generalize
to both "Clipart (C)" and "Real World (R)" test instances. In other words, we need to classify
complex real images after seeing only a few sketches (Figure 7.4 gives an overview of the data).

Results. Table 7.5 (a) gives the quantitative results and Figure 7.4 examines them qualitatively.
Here, "Supervised" denotes a model trained with the standard classification strategy, whose
penultimate-layer output features are used for a nearest neighbor classifier. We observe that
ProtoNet outperforms this baseline when evaluating instances from "Clipart" but not those from
"Real World". In contrast, FEAT improves "Real World" few-shot classification even though it only
sees support data from "Clipart".
7.5.3.2 Transductive Few-Shot Learning
We show that, without additional modeling effort, FEAT outperforms existing methods in
transductive FSL.

Setups. We further study this semi-supervised setting to see how well FEAT can incorporate
test instances into joint embedding adaptation. Specifically, we use the unlabeled test instances to
augment the key and value sets of the Transformer, so that the embedding adaptation takes the
relationships among all test instances into consideration. We evaluate this setting on the transductive
protocol of MiniImageNet [181]. With the adapted embeddings, FEAT makes predictions based on
Semi-ProtoNet [181].

Results. We compare with two previous approaches, TPN [139] and TEAM [173]. The results
are shown in Table 7.5 (b). We observe that FEAT improves over its standard FSL performance
(cf. Table 7.1) and also outperforms the previous semi-supervised approaches by a margin.
Chapter 8
Generalized Few-shot Learning with Classifier Synthesis
Object recognition in the real world requires handling long-tailed or even open-ended data. An
ideal visual system needs to reliably recognize the populated head visual concepts and meanwhile
efficiently learn about emerging new tail categories with a few training instances. Class-balanced
many-shot learning and few-shot learning each tackle one side of this problem, by either learning
strong classifiers for the head or learning to learn few-shot classifiers for the tail. In this chapter, we
focus on the problem of generalized few-shot learning (GFSL), in which a model at deployment time
is required to learn about tail categories with few shots and simultaneously classify the head classes.
We propose Adaptive ClAssifier SynThesis LEarning (ACASTLE), a learning framework that
learns how to synthesize calibrated few-shot classifiers in addition to the multi-class classifiers of the
head classes with a shared neural dictionary. Specifically, ACASTLE adapts the head classifiers
conditioned on the incoming tail training examples, yielding a framework that allows effective
backward knowledge transfer. We validate the proposed model on large-scale generalized few-shot
learning benchmarks and demonstrate superior performance over existing GFSL algorithms and
strong baselines on the MiniImageNet as well as TieredImageNet datasets.
8.1 Motivation
Visual recognition for objects in the "long tail" has been an important challenge to address [91, 141,
238, 265]. We often have a very limited amount of data for those objects, as they are infrequently
observed and/or visual exemplars of them are hard to collect. As such, state-of-the-art methods
(e.g., deep learning) cannot be directly applied due to their notorious demand for large amounts of
annotated data [67, 113, 197].

Few-shot learning (FSL) [228] is mindful of the limited data per tail concept (i.e., shots), and
attempts to address this challenging problem by distinguishing between the data-rich head
categories as SEEN classes and the data-scarce tail categories as UNSEEN classes. While it is difficult
[Figure 8.1: schematic comparison; panel (a) shows few-shot learning, which trains on the SEEN
"head" classes and transfers inductively to evaluate only on the UNSEEN "tail" classes; panel (b)
shows generalized few-shot learning, which evaluates on the entire distribution of "head" + "tail"
categories while avoiding interference.]

(a) Few-shot learning (FSL) (b) Generalized Few-shot Learning (GFSL)
Figure 8.1: A conceptual diagram comparing Few-Shot Learning (FSL) and Generalized Few-Shot
Learning (GFSL). GFSL requires extracting inductive biases from SEEN categories to facilitate
efficient learning on few-shot UNSEEN tail categories, while maintaining discernibility on the head
classes.
to build classifiers with data from UNSEEN classes, FSL mimics the test scenario by sampling few-shot
tasks from the SEEN class data, and extracts inductive biases for effective classifier acquisition on the
UNSEEN ones. Instance embeddings [186, 198, 228, 250], model initialization [10, 53, 161], image
generators [239], and optimization flows [123, 180] are popular forms of meta-knowledge that are
usually incorporated into FSL.

This type of learning makes it difficult to directly combine the few-shot classifier for UNSEEN classes
with the many-shot classifier for SEEN classes; however, the demand to recognize all object categories
simultaneously is essential in object recognition as well. In this chapter, we study the problem of
Generalized Few-Shot Learning (GFSL), which focuses on the joint classification of both data-rich
and data-poor categories. Figure 8.1 illustrates the high-level idea of GFSL, contrasting it with
standard FSL. In particular, our goal is for the model trained on the SEEN categories to be capable of
incorporating the limited UNSEEN class examples, and to make predictions for both the head and tail
classes of the entire distribution.

One naive GFSL solution is to train a single classifier over the imbalanced long-tail distribution [65,
141, 238, 265] and re-balance it [20, 40, 91]. One main advantage of such a joint learning objective
over all classes is that it characterizes both SEEN and UNSEEN classes simultaneously. In other words,
training of one part (e.g., the head) naturally takes the other part (e.g., the tail) into consideration,
and promotes knowledge transfer between classes. However, such a transductive learning paradigm
requires collecting the limited tail data in advance, which is violated in many real-world tasks. In
contrast, our learning setup requires inductive modeling of the tail, which is therefore more
challenging, as we assume no knowledge about the UNSEEN tail categories is available during the
model learning phase.
There are two key questions in the inductive GFSL problem: (1) how to construct the many-
shot and few-shot classifiers; (2) how to calibrate the many-shot and few-shot classifiers. To deal
with these, we propose ClAssifier SynThesis LEarning (CASTLE), which synthesizes the few-shot
classifiers using a neural dictionary that captures the common characteristics across classes. The
synthesized few-shot classifiers are then used together with the many-shot classifiers, and the whole
model is learned end-to-end. Specifically, we create a learning scenario by sampling a set of data
instances from SEEN categories and pretending that they come from UNSEEN categories, and we apply
the synthesized classifiers (based on the above instances) as if they were many-shot classifiers to
optimize multi-class classification together with the remaining many-shot SEEN classifiers. In other
words, we construct few-shot classifiers that not only perform well on the few-shot classes but are also
competitive with the many-shot classifiers of populated classes. Such contrastive learning benefits the
few-shot classification in two aspects: (1) it provides high discernibility for the synthesized classifiers;
and (2) it makes the synthesized classifiers automatically calibrated with the many-shot classifiers.
Taking a step further, we introduce an adaptive version of ClAssifier SynThesis LEarning, denoted
ACASTLE, which adds the flexibility to adapt the many-shot classifiers based on the few-shot training
examples. As a result, it allows backward knowledge transfer [142]: new knowledge learned from novel
few-shot training examples can benefit the existing many-shot classifiers. In ACASTLE, we expand
the neural dictionary to include task-specific neural bases in addition to the globally shared bases. As
a result, this hybrid dictionary summarizes the generality of the many-shot visual classes and the
specialty of the current few-shot categories. This improved neural dictionary facilitates the adaptation
of the many-shot classifiers conditioned on the limited tail training examples. The adapted many-shot
classifiers in ACASTLE are then used together with the (jointly) synthesized few-shot classifiers for
GFSL classification.
We first verify the effectiveness of the synthesized GFSL classifiers on multi-domain GFSL tasks,
where the UNSEEN classes come from diverse domains. ACASTLE best handles such task heterogeneity
due to its ability to adapt the head classifiers. Next, we empirically validate our approach on two
standard benchmark datasets, MiniImageNet [228] and TieredImageNet [181]. The proposed approach
retains competitive tail concept recognition performance while outperforming existing approaches on
generalized few-shot learning under criteria from different aspects. By carefully selecting a prediction
bias on the validation set, miscalibrated FSL approaches or other baselines can perform well in the
GFSL scenario; the implicit confidence calibration in CASTLE and ACASTLE works as well as or
even better than such post-calibration techniques. We note that CASTLE and ACASTLE are also
applicable to standard few-shot learning, staying competitive with and sometimes even outperforming
state-of-the-art methods when evaluated on two popular FSL benchmarks.

In the remaining sections of this chapter, we first compare the problem formulation of GFSL to
FSL in §8.3, and then introduce both CASTLE and ACASTLE in §8.4. We conduct thorough
experiments in §8.5 to verify the proposed CASTLE and ACASTLE across multiple benchmarks.
8.2 Related Work
FSL emphasizes building models of the UNSEEN classes, while the simultaneous recognition of the
many-shot head categories is also important in real-world use cases. Low-shot learning has been
studied in this manner [57, 65, 141, 239, 250]. The main aim is to recognize the entire set of concepts
in a transductive learning framework: during the training of the target model, it has access to both the
(many-shot) SEEN and the (few-shot) UNSEEN categories. The key difference from our Generalized
Few-Shot Learning (GFSL) is that we assume no access to UNSEEN classes in the model learning
phase, which requires the model to inductively transfer knowledge from SEEN classes to UNSEEN ones
during the model evaluation phase.

Some previous GFSL approaches [57, 65, 239] apply exemplar-based classification paradigms to both
SEEN and UNSEEN categories to resolve the transductive learning problem, which requires recomputing
the centroids of the SEEN categories after model updates. Others [141, 189, 238] usually ignore the
explicit relationship between SEEN and UNSEEN categories and learn separate classifiers. [60, 182]
propose to solve inductive GFSL via either composing UNSEEN with SEEN classifiers or meta-learning
with a recurrent back-propagation procedure. Gidaris et al. [60] is the work most related to CASTLE
and ACASTLE, as it composes the tail classifiers through a convex combination of the many-shot
classifiers. CASTLE differs from Gidaris et al. [60] in that it presents an end-to-end learnable
framework with improved training techniques, and it employs a shared neural dictionary to compose
the few-shot classifiers. Moreover, ACASTLE further relates the knowledge of both SEEN and UNSEEN
classes by constructing a neural dictionary with both shared (yet task-agnostic) and task-specific bases,
which allows backward knowledge transfer to benefit the SEEN classifiers with new knowledge of the
UNSEEN classes. As we demonstrate in §8.5.3, ACASTLE significantly improves the SEEN classifiers
when learning UNSEEN visual categories over heterogeneous visual domains.
8.3 Preliminary
As aforementioned in §6.2, the goal of generalized few-shot learning is to learn a function $f$
that classifies a test instance $x_j \sim \mathcal{D}^{\mathcal{U}}_{\mathrm{test}}$ as $\hat{y}_j = f(x_j; \mathcal{D}^{\mathcal{S} \cup \mathcal{U}}_{\mathrm{train}}) \in \{0, 1\}^N$ for classes sampled
from both the SEEN categories $\mathcal{S}$ and the UNSEEN categories $\mathcal{U}$. As a result, such a model needs to
deal with many-shot classification over the $|\mathcal{S}|$ SEEN classes alongside learning the $|\mathcal{U}|$ emerging
UNSEEN classes.^1 During training, a GFSL model only has access to the head classes $\mathcal{S}$, so

^1 $|\mathcal{S}|$ and $|\mathcal{U}|$ denote the total number of classes in the SEEN and UNSEEN class sets, respectively.
[Figure 8.2: schematic of CASTLE (left) and ACASTLE (right); class prototypes of the unseen tail
classes query the neural dictionary to synthesize few-shot classifiers, which are combined with the
many-shot classifiers of the seen head classes into a GFSL classifier.]

Figure 8.2: Illustration of CASTLE and ACASTLE. In CASTLE (left), the synthesized few-shot
classifiers are directly unioned with the many-shot classifiers to make the joint prediction. In contrast,
ACASTLE (right) synthesizes the joint classifiers, using both the many-shot and few-shot classifiers
as queries to the neural dictionary. This ensures backward knowledge transfer, where the many-shot
head classifiers co-adapt to the few-shot tail classifiers.
one needs to extract the knowledge for building a joint classifier over many-shot data and few-shot
data using only examples of the SEEN categories:

$\mathbb{E}_{\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}} \; \mathbb{E}_{(x_j, y_j) \in \mathcal{D}^{\mathcal{S} \cup \mathcal{U}}_{\mathrm{test}}} \big[ \ell\big(f(x_j; \mathcal{D}^{\mathcal{U}}_{\mathrm{train}}, \Theta_{\mathcal{S}}), y_j\big) \big]$   (8.1)

Here, $\Theta_{\mathcal{S}}$ is the set of class descriptors for all SEEN classes, which identifies each category.

To this purpose, we simulate many GFSL tasks from the SEEN classes. Each time, we split
the SEEN classes into a tail split with classes $\mathcal{C}$, and treat the remaining $|\mathcal{S}| - |\mathcal{C}|$ classes as the head
split. Eq. 6.2 is transformed into:

$\mathbb{E}_{\mathcal{D}^{\mathcal{C}}_{\mathrm{train}} \sim \mathcal{C}} \; \mathbb{E}_{(x_j, y_j) \in \mathcal{D}^{\mathcal{S}}_{\mathrm{test}}} \big[ \ell\big(f(x_j; \mathcal{D}^{\mathcal{C}}_{\mathrm{train}}, \Theta_{\mathcal{S} \setminus \mathcal{C}}), y_j\big) \big]$   (8.2)

Here, the simulated tail classes are a subset of the full SEEN classes, $\mathcal{C} \subset \mathcal{S}$, with the number of
classes equal to $N$. As a result, the function $f$ outputs an $|\mathcal{S}|$-way classifier in two steps: (1) for the
simulated tail split $\mathcal{C}$, it follows what $f$ does in standard few-shot learning (see §7.3) and generates
the classifiers of $\mathcal{C}$ using their few-shot training examples $\mathcal{D}^{\mathcal{C}}_{\mathrm{train}}$; (2) for the head split $\mathcal{S} \setminus \mathcal{C}$, the
function directly makes use of the many-shot classifiers $\Theta_{\mathcal{S} \setminus \mathcal{C}}$ of the $\mathcal{S} \setminus \mathcal{C}$ classes. Note that the
loss is measured on test examples from the entire distribution $\mathcal{D}^{\mathcal{S}}_{\mathrm{test}}$, which includes both the head
and the simulated tail classes.
8.4 Learning Adaptive Classifier Synthesis
In this section, we first present the concrete classifier composition model that generates both many-shot
and few-shot classifiers by querying a learnable neural dictionary. To simplify the exposition, we first
present model inference, where the goal is to compose few-shot classifiers from the few-shot training
data at evaluation time. Next, we introduce an effective algorithm that learns the many-shot and
few-shot classifiers simultaneously, in an end-to-end manner. There, we discuss in detail how
generalized few-shot learning tasks are simulated with only the data of SEEN classes.
8.4.1 Classifier Composition with a Neural Dictionary
We introduce a neural dictionary based method to construct both many-shot and few-shot classifiers.
The neural dictionary serves as a set of common bases $\mathcal{B} = \{b_k\}$ for constructing both head
and tail classifiers. Formally, the neural bases contain two sets of elements:

$\mathcal{B} = \mathcal{B}_{\mathrm{share}} \cup \mathcal{B}_{\mathrm{specific}}$

Here, $\mathcal{B}_{\mathrm{share}}$ has $M$ learnable elements, $\mathcal{B}_{\mathrm{share}} = \{b_1, b_2, \ldots, b_M\}$ with $b_k \in \mathbb{R}^d$, which are
globally shared across all tasks. Meanwhile, $\mathcal{B}_{\mathrm{specific}}$ introduces task-specific bases into the
construction of the joint classifier, which we will discuss later.

Based on this, we define learnable key and value embeddings, where each key and value is associated
with a neural basis that encodes shared primitives for composing the classifiers of $\mathcal{S} \cup \mathcal{U}$. Similar to
[220], the keys and values of the neural dictionary are generated by two linear projections $U$
and $V$ of the elements in the bases $\mathcal{B}$. For instance, $U b_k$ and $V b_k$ represent the generated key and
value embeddings. For a query to the neural dictionary, we first compute its similarity (a.k.a. the
attention) with all keys ($U b_k$), and the corresponding output of the query is the attention-weighted
combination of all the elements in the value set ($V b_k$).

Suppose we have sampled $K$-shot $N$-way few-shot training data $\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}$. We first compute
the class prototype of a category $c$ by averaging all $K$ instance embeddings of that class in
$\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}$ (in a $K$-shot $N$-way task):

$p_c = \frac{1}{K} \sum_{(x_i, y_i) \in \mathcal{D}_{\mathrm{train}}} \phi(x_i) \, \mathbb{I}[y_i = c]$   (8.3)

Here, $\mathbb{I}[y_i = c]$ is an indicator that selects the instances of class $c$. These class prototypes are then
used as the task-specific bases $\mathcal{B}_{\mathrm{specific}}$ for constructing the classifiers, which
indicates that $\mathcal{B} = \mathcal{B}_{\mathrm{share}} \cup \{p_c \mid c \in \mathcal{U}\}$. We then compute the attention coefficients $\alpha_c$ over
each basis $b_k \in \mathcal{B}$, which are used for assembling the classifier of class $c$:

$\alpha(p_c, b_k) \propto \exp\big(p_c^\top U b_k\big), \quad k = 1, \ldots, |\mathcal{B}|$

The coefficient $\alpha^k_c$ is normalized by the sum of the compatibility scores over all $|\mathcal{B}|$ bases, and is
then used to convexly combine the value embeddings and synthesize the classifier:

$W_c = p_c + \sum_{k=1}^{|\mathcal{B}|} \alpha(p_c, b_k) \, V b_k$   (8.4)

As a result, $W_c$ represents the synthesized classifier of class $c$ for the few-shot task built from
$\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}$. Such a composed classifier is then $\ell_2$-normalized and unioned with the many-shot
classifiers $\Theta_{\mathcal{S}}$ to make a joint prediction over all classes $\mathcal{S} \cup \mathcal{U}$:

$\hat{W} = W_{\mathcal{U}} \cup \Theta_{\mathcal{S}}$   (8.5)

We denote the above classifier composition method as ClAssifier SynThesis LEarning (CASTLE),
and illustrate its conceptual inference procedure in the left part of Figure 8.2.
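The composition of Eqs. 8.3-8.5 can be sketched as follows: class prototypes query the neural dictionary, the attention-weighted values form a residual, and the synthesized few-shot classifiers are $\ell_2$-normalized and unioned with the many-shot classifiers. Function and argument names, and the use of a plain softmax for the normalized coefficients, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def synthesize_castle(protos, bases_shared, U, V, seen_weights):
    """protos: (N, d) class prototypes of the UNSEEN classes (Eq. 8.3);
    bases_shared: (M, d) learnable shared bases; U, V: (d, d) key/value projections;
    seen_weights: (|S|, d) many-shot classifiers Theta_S."""
    bases = torch.cat([bases_shared, protos], dim=0)        # B = B_share U B_specific
    keys = bases @ U                                        # (|B|, d) key embeddings
    vals = bases @ V                                        # (|B|, d) value embeddings
    attn = torch.softmax(protos @ keys.t(), dim=-1)         # attention of each prototype over bases
    W_unseen = F.normalize(protos + attn @ vals, dim=-1)    # Eq. 8.4, then l2-normalize
    return torch.cat([seen_weights, W_unseen], dim=0)       # Eq. 8.5: joint classifier
```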
The adaptive CASTLE (ACASTLE). One important limitation of CASTLE is its lack of adaptation
of the many-shot classifiers $\Theta_{\mathcal{S}}$, due to the union operation used to form the joint classifier.
To overcome this drawback, we further propose an adaptive version of CASTLE, denoted
ACASTLE, in which we use the neural dictionary to synthesize classifiers for both the many-shot
SEEN categories $\mathcal{S}$ and the few-shot UNSEEN categories $\mathcal{U}$. The right part of Figure 8.2 illustrates the
conceptual inference procedure of ACASTLE.

ACASTLE presents two key differences. First, it additionally includes the many-shot classifiers
in the task-specific bases of the neural dictionary:

$\mathcal{B}_{\mathrm{specific}} = \{p_c \mid c \in \mathcal{U}\} \cup \Theta_{\mathcal{S}}$   (8.6)

which means that the synthesis of the few-shot classifiers can better leverage the context of the
many-shot classifiers. More importantly, all the classifiers $\hat{W}_c, \; \forall c \in \mathcal{S} \cup \mathcal{U}$ are synthesized using
the dictionary:

$\hat{W}_c = \begin{cases} p_c + \sum_{k=1}^{|\mathcal{B}|} \alpha(p_c, b_k) \, V b_k, & \forall c \in \mathcal{U} \\ \theta_c + \sum_{k=1}^{|\mathcal{B}|} \alpha(\theta_c, b_k) \, V b_k, & \forall c \in \mathcal{S} \end{cases}$   (8.7)

Here, $\hat{W}$ corresponds to the final joint classifier. We specifically note that ACASTLE allows
backward knowledge transfer, from the few-shot training data $\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}$ to the many-shot classifiers $\Theta_{\mathcal{S}}$.
CASTLE is a degenerate version of ACASTLE.
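A hedged extension of the same sketch covers ACASTLE: per Eqs. 8.6-8.7, the many-shot classifiers also join the task-specific bases and are themselves re-synthesized as queries.

```python
import torch
import torch.nn.functional as F

def synthesize_acastle(protos, bases_shared, U, V, seen_weights):
    """Jointly synthesize SEEN and UNSEEN classifiers (cf. Eqs. 8.6-8.7)."""
    bases = torch.cat([bases_shared, protos, seen_weights], dim=0)   # Eq. 8.6
    keys, vals = bases @ U, bases @ V
    queries = torch.cat([seen_weights, protos], dim=0)               # every class is adapted
    attn = torch.softmax(queries @ keys.t(), dim=-1)
    return F.normalize(queries + attn @ vals, dim=-1)                # joint classifier W_hat
```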
8.4.2 Unified Learning of Few-Shot and Many-Shot Classifiers
In generalized few-shot learning, the few-shot classifiers are required to perform well together with the
many-shot classifiers. Suppose we have sampled a $K$-shot $N$-way few-shot learning task $\mathcal{D}^{\mathcal{U}}_{\mathrm{train}}$
containing $|\mathcal{U}|$ UNSEEN visual categories; a GFSL classifier $f$ should have a low expected error as in
Eq. 8.1. Specifically, as described in §8.4.1, we use the neural dictionary to implement the joint
classifier $f(x_j; \mathcal{D}^{\mathcal{U}}_{\mathrm{train}}, \Theta_{\mathcal{S}})$ via either CASTLE or ACASTLE. The classifier $f$ then predicts a
test example in $\mathcal{D}^{\mathcal{S} \cup \mathcal{U}}_{\mathrm{test}}$ over both the tail classes $\mathcal{U}$ and the head classes $\mathcal{S}$. However, since we
have no access to the UNSEEN classes $\mathcal{U}$, learning can only happen on the data of the SEEN classes $\mathcal{S}$.
Therefore, similar to the meta-learning procedure of standard FSL (details in §7.3), we simulate the
generalized few-shot learning situation using the data of the SEEN categories $\mathcal{S}$, to mimic both head
and tail classes.
the SEEN categoriesS, to mimic both head and tail classes.
Unified learning objective. Suppose we sample aK-shotN-way few-shot task with categories
C to simulate the GFSL learning situation, whereC are a subset ofS. Therefore, given the
simulated few-shot task, we treat the remainingSC classes as the simulated head classes, whose
corresponding many-shot classifiers are
SC
. With either CASTLE or ACASTLE, we can then
obtain the join classifiers
^
W, that is used for predicting examples from allS classes. As a result,
we optimize the learning objective as follows:
min
f;B;fsg;U;Vg
X
CS
X
(x
j
;y
j
)S
`
^
W
>
x
j
;y
j
(8.8)
In addition to the learnable neural basesB,U andV are two projections in the neural bases to
facilitate the synthesis of the classifier, and there is no bias term in our implementation. Despite that
the few-shot classifiers
^
W
C
are synthesized using withK training instances, they are optimized to
perform well on all the instances fromC and moreover, to perform well against all the instances
from other SEEN categoriesSC.
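One simulated GFSL training step of Eq. 8.8 can be sketched as below. The `synthesize` argument stands for a CASTLE- or ACASTLE-style closure over the dictionary parameters (as in the earlier sketches), and the label layout in the joint space is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def gfsl_episode_loss(encoder, synthesize, head_weights, support_x, support_y,
                      test_x, test_y, n_way):
    """Simulated GFSL step for Eq. 8.8.

    support_x/y: K-shot N-way data of the simulated tail split C, labels in [0, n_way)
    test_x/y:    SEEN-class test instances, labelled in the joint space where the head
                 classes take the first indices and C takes the last n_way indices
    head_weights: (num_head, d) many-shot classifiers of the simulated head split
    """
    s_emb = encoder(support_x)
    protos = torch.stack([s_emb[support_y == c].mean(0) for c in range(n_way)])
    joint_w = synthesize(protos, head_weights)     # CASTLE/ACASTLE synthesis (closure over B, U, V)
    logits = encoder(test_x) @ joint_w.t()         # inner products as class scores
    return F.cross_entropy(logits, test_y)
```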
Reusing many-shot classifiers. We optimize Eq. 8.8 by using the many-shot classifier over $\mathcal{S}$ to
initialize the embedding. In detail, an $|\mathcal{S}|$-way many-shot classifier is trained over all SEEN classes
with the cross-entropy loss, and its backbone is used to initialize the embedding of the GFSL
classifier. We empirically observed that such initialization is essential for the prediction calibration
between SEEN and UNSEEN classes.
[Figure 8.3: dataset construction; the standard benchmark data (SEEN "head" classes for
meta-training, UNSEEN "tail" classes for meta-test) is augmented with additional head-class instances
for meta-test, and a generalized few-shot task consists of a few-shot training set from the tail plus
head and tail test sets.]

Figure 8.3: The split of data in the generalized few-shot classification scenario. In addition to a
standard dataset such as MiniImageNet (blue part), we collect non-overlapping augmented head-class
instances from the corresponding categories in ImageNet (red part), to measure the classification
ability on the SEEN classes. In a generalized few-shot classification task, few-shot instances are
sampled from each of the UNSEEN classes, while the model should be able to predict instances from
both the head and tail classes.
Multi-classifier learning. A natural way to minimize Eq. 8.8 is to take a stochastic gradient
descent step in each mini-batch by sampling one GFSL task, which contains a $K$-shot $N$-way
training set together with a set of test instances $(x_j, y_j)$ from $\mathcal{S}$. It is clear that increasing the
number of GFSL tasks per gradient step can improve the optimization stability. Therefore, we
propose an efficient implementation that utilizes a large number of GFSL tasks to compute
gradients. Specifically, we sample two sets of instances from all SEEN classes, i.e., $\mathcal{D}^{\mathcal{S}}_{\mathrm{train}}$ and
$\mathcal{D}^{\mathcal{S}}_{\mathrm{test}}$. We then construct a large number of joint classifiers $\hat{W}_z$ with either CASTLE or ACASTLE
on different class splits $\mathcal{C}$, and compute the loss of Eq. 8.8 averaged over $z$. Note that there is only
a single forward pass to obtain the embeddings of the involved instances, and we mimic multiple
GFSL tasks through different random partitions of the simulated few-shot and many-shot classes.
We always use multi-classifier learning unless explicitly mentioned otherwise.
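The efficiency comes from embedding the sampled batch once and then re-partitioning its classes into many simulated tail splits; a hedged sketch of the re-partitioning step (the 24-way batch size and 64 tasks follow the setup described later, while the function name is an assumption):

```python
import torch

def sample_task_partitions(class_ids, n_way=5, n_tasks=64, generator=None):
    """Re-sample many simulated tail splits C (n_way classes each) from one batch.

    class_ids: 1-D tensor of the classes present in the batch (e.g. 24 SEEN classes).
    Returns a list of n_tasks tensors, each holding the n_way classes of one GFSL task;
    the instance embeddings are computed once outside and simply re-indexed per task.
    """
    tasks = []
    for _ in range(n_tasks):
        perm = torch.randperm(len(class_ids), generator=generator)
        tasks.append(class_ids[perm[:n_way]])
    return tasks
```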
8.5 Experiments
8.5.1 Experimental Setups
This section details the experimental setups, including the general data split strategy, the
pre-training technique, the specification of the feature backbone, and the evaluation metrics for
GFSL.
Data splits. We visualize the general data split strategy in Figure 8.3. There are two parts of the
dataset for standard meta-learning tasks: the meta-training set for model learning (corresponding to
the SEEN classes), and the meta-val/test part for model evaluation (corresponding to the UNSEEN
classes). To evaluate a GFSL model, we augment the meta-training set with new instances, so that
the classification performance on the SEEN classes can be measured. During the inference phase, a
few-shot training set from the UNSEEN classes is provided to the model, and the model should make
a joint prediction over instances from both the head and tail classes. We describe the detailed splits
for the particular datasets in later sections.
Pre-training strategy. Before the meta-training stage, we try to find a good initialization for the
embedding, and we then reuse this many-shot classifier as well as the embedding to facilitate the
training of the GFSL model. In later sections, we verify that this pre-training strategy does not
influence the few-shot classification performance much, but it is essential for making the GFSL
classifier well-calibrated. In particular, on MiniImageNet, we add a linear layer on top of the backbone
output and optimize a 64-way classification problem on the meta-training set with the cross-entropy
loss. Stochastic gradient descent with an initial learning rate of 0.1 and momentum of 0.9 is used for
this optimization. The 16 classes of MiniImageNet reserved for model selection also assist the choice
of the pre-trained model: after each epoch, we use the current embedding and measure the
nearest-neighbor-based few-shot classification performance on few-shot tasks sampled from these 16
classes. The most suitable embedding function is recorded, and the learned backbone is used to
initialize the embedding part of the whole model. The same strategy is also applied to the
meta-training sets of the TieredImageNet, Heterogeneous, and Office-Home datasets, where 351-way,
100-way, and 25-way classifiers are pre-trained, respectively.
Feature network specification. Following the setting of most recent methods [174, 186, 250], we
use ResNet variants [17, 67] to implement the embedding backbone. We follow [174, 186] when
investigating multi-domain GFSL, where images are resized to 84 × 84 × 3. Concretely, three
residual blocks are used after an initial convolutional layer (with stride 1 and padding 1) over the
image, which have channels 160/320/640, stride 2, and padding 2. After a global average pooling
layer, this leads to a 640-dimensional embedding. For the benchmark experiments on MiniImageNet
and TieredImageNet, we follow [123] to set the architecture of the ResNet, which contains 12 layers
and uses DropBlock [59] to prevent over-fitting.

We use the pre-trained backbone to initialize the embedding part of the model for CASTLE/ACASTLE
and our re-implemented comparison methods, such as MC+kNN, ProtoNet+ProtoNet, MC+ProtoNet,
L2ML [238], and DFSL [60]. When there is a backbone initialization, we set the initial learning rate
to 1e-4 and optimize the model with momentum SGD. The learning rate is halved after optimizing
2,000 mini-batches. During meta-learning, all methods are optimized over 5-way few-shot tasks,
where the number of shots in a task is consistent with the
inference (meta-test) stage. For example, if the goal is a 1-shot 5-way model, we sample a 1-shot
5-way $\mathcal{D}^{\mathcal{S}}_{\mathrm{train}}$ during meta-training, together with 15 instances per class in $\mathcal{D}^{\mathcal{S}}_{\mathrm{test}}$.

For CASTLE/ACASTLE, we use the multi-classifier training technique to improve learning
efficiency. Specifically, we randomly sample a 24-way task from $\mathcal{S}$ in each mini-batch, and
re-sample 64 5-way tasks from it. Note that all instances in the 24-way task are encoded by the
ResNet backbone with the same parameters in advance. Therefore, by embedding the synthesized
5-way few-shot classifiers into the global many-shot classifier, we obtain 64 different configurations
of the generalized few-shot classifier. To evaluate the classifiers, we randomly sample instances with
batch size 128 from $\mathcal{S}$ and compute the GFSL objective in Eq. 8.2.
8.5.2 Evaluation Measures
We take advantage of the auxiliary meta-training set from the benchmark datasets during GFSL
evaluation; an illustration of the dataset construction can be found in Figure 8.3. The notation
$X \rightarrow Y$ with $X, Y \in \{\mathcal{S}, \mathcal{U}, \mathcal{S} \cup \mathcal{U}\}$ means computing prediction results for instances from $X$
over the label space of $Y$. For example, $\mathcal{S} \rightarrow \mathcal{S} \cup \mathcal{U}$ means we first select the instances coming from
the SEEN class set ($x \in \mathcal{S}$) and predict them into the joint label space ($y \in \mathcal{S} \cup \mathcal{U}$). For a GFSL
model, we consider its performance under several measurements.
Few-shot accuracy. Following the standard protocol [53, 198, 228, 250], we sample 10,000 $K$-shot
$N$-way tasks from $\mathcal{U}$ during inference. In detail, we first sample $N$ classes from $\mathcal{U}$, and then sample
$K + 15$ instances for each class. The first $NK$ labeled instances ($K$ instances from each of the $N$
classes) are used to build the few-shot classifier, and the remaining $15N$ instances (15 per class) are
used to evaluate the quality of this few-shot classifier. During testing, we consider $K = 1$ and $K = 5$
as in the literature, and vary $N$ over $\{5, 10, 15, \ldots, |\mathcal{U}|\}$ as a more robust measure. Note that in this
test stage, all instances come from $\mathcal{U}$ and are predicted into classes in $\mathcal{U}$ ($\mathcal{U} \rightarrow \mathcal{U}$).
Generalized few-shot accuracy. Different from the many-shot and few-shot evaluations, generalized
few-shot learning takes the joint instance and label spaces into consideration. In other words, the
instances come from $\mathcal{S} \cup \mathcal{U}$ and their predicted labels are also in $\mathcal{S} \cup \mathcal{U}$ ($\mathcal{S} \cup \mathcal{U} \rightarrow \mathcal{S} \cup \mathcal{U}$). This is
obviously more difficult than the many-shot ($\mathcal{S} \rightarrow \mathcal{S}$) and few-shot ($\mathcal{U} \rightarrow \mathcal{U}$) tasks. During the test,
with a slight abuse of notation, we sample $K$-shot $(|\mathcal{S}| + N)$-way tasks from $\mathcal{S} \cup \mathcal{U}$. Concretely, we
first sample a $K$-shot $N$-way task from $\mathcal{U}$, with $NK$ training and $15N$ test instances, respectively.
Then, we randomly sample $15N$ instances from $\mathcal{S}$. Thus, in a GFSL evaluation task, there are $NK$
labeled instances from $\mathcal{U}$ and $30N$ test instances from $\mathcal{S} \cup \mathcal{U}$. We compute the accuracy over
$\mathcal{S} \cup \mathcal{U}$ as the final measure, abbreviated as "Mean Acc." or "Acc.".
[Figure 8.4: diagram of the evaluation protocols; many-shot learning ($\mathcal{S} \rightarrow \mathcal{S}$) and few-shot learning
($\mathcal{U} \rightarrow \mathcal{U}$) are contrasted with the joint-label-space measures $\mathcal{S} \rightarrow \mathcal{S} \cup \mathcal{U}$ and $\mathcal{U} \rightarrow \mathcal{S} \cup \mathcal{U}$, whose
accuracies are combined by a harmonic mean.]

Figure 8.4: An illustration of the harmonic-mean-based criterion for GFSL evaluation. $\mathcal{S}$ and $\mathcal{U}$
denote the SEEN and UNSEEN instances ($x$) and labels ($y$), respectively; $\mathcal{S} \cup \mathcal{U}$ is the joint set of $\mathcal{S}$
and $\mathcal{U}$. The notation $X \rightarrow Y$, with $X, Y \in \{\mathcal{S}, \mathcal{U}, \mathcal{S} \cup \mathcal{U}\}$, means computing prediction results for
instances from $X$ over the labels of $Y$. By computing a performance measure (such as accuracy) on
the joint-label-space predictions of SEEN and UNSEEN instances separately, a harmonic mean is
computed to obtain the final measure.
Generalized few-shot Δ-value. Since the problem becomes more difficult when the predicted label space expands from S → S to S → S ∪ U (and likewise from U → U to U → S ∪ U), the accuracy of a model will drop. To measure how the classification ability of a GFSL model changes when working in the GFSL scenario, Ren et al. [182] propose the Δ-value, which measures the average accuracy drop. In detail, for each sampled GFSL task, we first compute its many-shot accuracy (S → S) and few-shot accuracy (U → U). Then we calculate the corresponding accuracies of SEEN and UNSEEN instances in the joint label space, i.e., S → S ∪ U and U → S ∪ U. The Δ-value is the average decrease in accuracy across these two cases. We denote this measure as "Δ-value".
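Under this description, the Δ-value of a task can be computed roughly as follows (a sketch of our reading of the definition in [182]; the four accuracies are assumed to be given):

def delta_value(acc_seen_alone, acc_unseen_alone, acc_seen_joint, acc_unseen_joint):
    # Average accuracy drop when the label space grows to S ∪ U.
    #   acc_seen_alone  : accuracy of S -> S
    #   acc_unseen_alone: accuracy of U -> U
    #   acc_seen_joint  : accuracy of S -> S ∪ U
    #   acc_unseen_joint: accuracy of U -> S ∪ U
    drop_seen = acc_seen_alone - acc_seen_joint
    drop_unseen = acc_unseen_alone - acc_unseen_joint
    return 0.5 * (drop_seen + drop_unseen)

# Example: a model that loses 4 and 10 points, respectively, has a delta of about 7.
print(delta_value(0.80, 0.60, 0.76, 0.50))   # ≈ 0.07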
Generalized few-shot harmonic mean. Directly computing the accuracy is still biased towards the populated classes, so we also consider the harmonic mean as a more balanced measure [246]. We compute a performance measurement such as top-1 accuracy for S → S ∪ U and U → S ∪ U, and use the harmonic mean to average the performance of these two cases as the final measure. In other words, denoting the accuracies for S → S ∪ U and U → S ∪ U as Acc_S and Acc_U, respectively, the value 2 · Acc_S · Acc_U / (Acc_S + Acc_U) is used as the final measure. An illustration is given in Figure 8.4. We denote this measure as "HM" or "HM Acc.".
Generalized few-shot AUSUC. Chao et al. [27] propose a calibration-agnostic criterion for generalized zero-shot learning. To avoid evaluating a model under the influence of a calibration factor between SEEN and UNSEEN classes, they propose to first determine the range of the calibration factor over all instances, and then plot the SEEN-UNSEEN accuracy curve based on different configurations of the calibration values. Finally, the area under the SEEN-UNSEEN curve is used as a more robust criterion. We follow [27] to compute the AUSUC value for sampled GFSL tasks. We denote this measure as "AUSUC".
Figure 8.5: An illustration of the Heterogeneous and Office-Home datasets. Both datasets contain multiple domains. In the Heterogeneous dataset, each class belongs to only one domain, while in Office-Home, a class has images from all three domains.
Table 8.1: Generalized 1-shot classification performance (mean accuracy and harmonic mean accuracy) on (a) the Heterogeneous dataset with 100 Head and 5 Tail categories and (b) the Office-Home dataset with 25 Head and 5 Tail categories. S → S ∪ U and U → S ∪ U denote the joint classification accuracy for SEEN class and UNSEEN class instances, respectively. CASTLE⁻ is the variant of CASTLE without the neural dictionary.

(a) Heterogeneous dataset
Measures   | S ∪ U → S ∪ U | S → S ∪ U    | U → S ∪ U    | HM Acc.
DFSL [60]  | 48.13 ± 0.12  | 46.33 ± 0.12 | 48.25 ± 0.22 | 47.27 ± 0.12
CASTLE⁻    | 48.29 ± 0.12  | 45.13 ± 0.13 | 50.14 ± 0.22 | 47.50 ± 0.12
CASTLE     | 50.16 ± 0.13  | 48.05 ± 0.13 | 50.86 ± 0.22 | 49.05 ± 0.12
ACASTLE    | 53.01 ± 0.12  | 56.18 ± 0.12 | 49.84 ± 0.22 | 52.81 ± 0.13

(b) Office-Home dataset
Measures   | S ∪ U → S ∪ U | S → S ∪ U    | U → S ∪ U    | HM Acc.
DFSL [60]  | 35.72 ± 0.12  | 28.42 ± 0.12 | 39.77 ± 0.22 | 33.15 ± 0.12
CASTLE⁻    | 35.74 ± 0.13  | 27.93 ± 0.13 | 42.59 ± 0.22 | 33.73 ± 0.13
CASTLE     | 35.77 ± 0.13  | 29.03 ± 0.13 | 42.46 ± 0.22 | 34.48 ± 0.13
ACASTLE    | 39.99 ± 0.14  | 40.29 ± 0.13 | 39.68 ± 0.22 | 39.98 ± 0.14
8.5.3 Pilot Study on Multi-Domain GFSL
We first present a pilot study to demonstrate the effectiveness of ACASTLE, which leverages adaptive classifiers synthesized for both SEEN and UNSEEN classes. To this end, we investigate two multi-domain datasets, "Heterogeneous" and "Office-Home", with more challenging settings, where a GFSL model is required to transfer knowledge in the backward direction (adapting SEEN classifiers based on UNSEEN ones) to obtain superior joint classification performance over heterogeneous domains.
8.5.3.1 Dataset
We construct a Heterogeneous dataset based on 5 fine-grained classification datasets, namely
AirCraft [149], Car-196 [111], Caltech-UCSD Birds (CUB) 200-2011 [230], Stanford Dog [99],
and Indoor Scenes [176]. Since these datasets have apparent heterogeneous semantics, we treat
images from different datasets as different domains. 20 classes with 50 images in each of them
are randomly sampled from each of the 5 datasets to construct the meta-training set. The same
sampling strategy is also used to sample classes for model validation (meta-val) and evaluation
(meta-test) sets. Therefore, there are 100 classes in each of the meta-training/val/test sets, with 20 classes from each fine-grained dataset. To evaluate the performance of a GFSL model, we augment the meta-training set by sampling another 15 images from the corresponding classes for each of the SEEN classes.
We also investigate the Office-Home [224] dataset, which originates from a domain adaptation
task. There are 65 classes and 4 domains of images per class. Considering the scarcity of images in one particular domain, we select three of the four domains, "Clipart", "Product",
and “Real World” to construct our dataset. The number of instances in a class per domain is not
equal. We randomly sample 25 classes (with all selected domains) for meta-training, 15 classes
for meta-validation, and the remaining 25 classes are used for meta-test. Similarly, we hold out 10
images per domain for each SEEN class to evaluate the generalized classification ability of a GFSL
model.
Note that in addition to the class label, images in these two datasets are also equipped with at least one domain label. In particular, classes in the Heterogeneous dataset belong to a single domain corresponding to "aircraft", "bird", "car", "dog", or "indoor scene", while classes in Office-Home possess images from all 3 domains, namely "Clipart", "Product", and "Real World". Figure 8.5 presents an illustration of sampled images (of different domains) from these two datasets.
The key difference from standard GFSL (see §8.5.4.3) is that here the SEEN categories are collected from multiple (heterogeneous) visual domains and used for training the inductive GFSL model. During the evaluation, the few-shot training instances of the tail classes come from one single domain. With this key difference, we note that the UNSEEN few-shot classes are close to a certain sub-domain of SEEN classes and relatively far away from the others. Therefore, a model capable of adapting its SEEN classifiers can take advantage of this and adapt itself to the domain of the UNSEEN classes.
8.5.3.2 Baselines and Comparison Methods
Besides CASTLE and ACASTLE, we consider two other baseline models. The first one optimizes Eq. 8.2 directly but without the neural dictionary, relying on both the (fixed) linear classifier U_S and the few-shot prototypes to make a GFSL prediction (we denote it as CASTLE⁻); the second one is DFSL [60], which requires a two-stage training of the GFSL model. It trains a many-shot classifier with cosine similarity in the first stage. Then it freezes the backbone model as a feature extractor and optimizes a similar form of Eq. 8.2 by composing new few-shot classifiers as convex combinations of the many-shot classifiers. It can be viewed as a degenerated neural dictionary, where DFSL sets a size-|S| "shared" basis B_share equal to the many-shot classifier U_S. We observe that DFSL is unstable when trained end-to-end. This is potentially because the few-shot classifier composition uses the many-shot classifiers as bases, but those bases are optimized to be both good bases and good classifiers, which are likely to conflict to some degree. It is also worth noting that all the baselines except ACASTLE only modify the few-shot classifiers, so it is impossible for them to perform backward knowledge transfer.
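The composition step can be sketched as follows, with a softmax attention standing in for the exact composition rule used by DFSL [60] (the names and shapes are assumptions for illustration only):

import numpy as np

rng = np.random.default_rng(3)
num_seen, d, n_way, k_shot = 64, 64, 5, 1

U_S = rng.normal(size=(num_seen, d))            # many-shot classifiers double as the shared bases
support = rng.normal(size=(n_way, k_shot, d))   # few-shot UNSEEN support embeddings
prototypes = support.mean(axis=1)               # n_way x d

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Attention of each novel prototype over the |S| shared bases ...
attn = softmax(prototypes @ U_S.T)              # n_way x num_seen, rows sum to 1
# ... gives each UNSEEN classifier as a convex combination of the many-shot classifiers.
unseen_classifiers = attn @ U_S                 # n_way x d
print(unseen_classifiers.shape, attn.sum(axis=1))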
8.5.3.3 GFSL over Heterogeneous Dataset
The Heterogeneous dataset has 100 SEEN classes in the meta-training set, 20 per domain. We consider the case where, during inference, all of the tail classes come from one particular domain. For example, the tail classes are different kinds of birds, and we need to perform a joint classification over all SEEN classes from the heterogeneous domains and the newly arriving tail classes with limited instances. To mimic this inference scenario, we sample "fake" few-shot tasks with classes drawn randomly from one of the five domains, and contrast the discriminative ability of the sampled classes against the remaining SEEN classes as in Eq. 8.2.
Note that we train DFSL strictly following the strategy in [60], and train the other GFSL models with a pre-trained embedding and the multi-classifier technique to improve training efficiency. Following [60, 189, 246], we compute the 1-shot 5-way GFSL classification mean accuracy and harmonic mean accuracy over 10,000 sampled tasks, and record the results in Table 8.1. S → S ∪ U and U → S ∪ U denote the average accuracy of the joint prediction for SEEN and UNSEEN instances, respectively.
From the results in Table 8.1, DFSL does not work well due to its fixed embedding and restricted bases. CASTLE⁻ is able to balance the training accuracy of both SEEN and UNSEEN classes, benefiting from the pre-training strategy and the unified learning objective, and achieves the highest joint classification performance over UNSEEN classes. The discriminative ability is further improved with the help of the neural dictionary. CASTLE performs better than its degenerated version, which verifies the effectiveness of the learned neural bases. The neural dictionary encodes the common characteristics among all classes for GFSL classification, so CASTLE obtains better mean accuracy and harmonic mean accuracy than CASTLE⁻. Since ACASTLE is able to adapt both many-shot and few-shot classifiers conditioned on the context of the tail instances, it obtains the best GFSL performance in this case. It is notable that ACASTLE gets much higher joint classification accuracy for SEEN classes than the other methods, which validates its ability to adapt the many-shot classifiers over the SEEN classes based on the context of the tail classes.
8.5.3.4 GFSL over Office-Home Dataset
We also investigate a similar multi-domain GFSL classification task over the Office-Home dataset. However, in this case, a single class can belong to all three domains. We consider the scenario of classifying classes in a single domain, where the domain of the classes should be inferred from the limited tail instances. In other words, we train a GFSL model over 25 classes, and each class has 3 sets of instances corresponding to the three domains. In meta-training, a 25-way SEEN class classifier is constructed. During inference, the model is provided with another 5-way 1-shot set of UNSEEN class instances from one single domain. The model is required to output a joint classifier for test instances from the whole 30 classes whose domain is the same as that of the UNSEEN class set.
For such a multi-domain GFSL task, we train a GFSL model by keeping the instances in both the few-shot fake tail task and the corresponding test set within the same domain. We use the same set of comparison methods and evaluation protocols as in the previous subsection. The mean accuracy, harmonic mean accuracy, and the per-set accuracies for SEEN and UNSEEN classes are shown in Table 8.1.
Due to the ambiguity of domains for each class, GFSL classification over Office-Home is a more difficult problem, yet the results in Table 8.1(b) reveal a similar trend to those in Table 8.1(a). Since a single GFSL model on Office-Home needs to make joint predictions over classes from multiple domains conditioned on different configurations of the tail few-shot tasks, stationary SEEN class classifiers are not suitable for classification over different domains. In this case, ACASTLE still achieves the best performance under the different GFSL criteria, and by larger margins over the comparison methods.
8.5.4 Experiments on GFSL
In this section, we design experiments on benchmark datasets to validate the effectiveness of CASTLE and ACASTLE in GFSL (see §8.5.4.3). After a comprehensive comparison with competitive methods using various protocols, we analyze different aspects of GFSL approaches, and we observe that post-calibration turns FSL methods into strong GFSL baselines. We verify that CASTLE/ACASTLE learn a better calibration between SEEN and UNSEEN classifiers, and that the neural dictionary lets CASTLE/ACASTLE preserve their discriminative ability as tail few-shot instances arrive incrementally. Finally, we show that CASTLE/ACASTLE also benefit standard FSL performance (see §8.5.4.4).
8.5.4.1 Datasets
Two benchmark datasets are used in our experiments. The MiniImageNet dataset [228] is a subset
of the ILSVRC-12 dataset [184]. There are 100 classes in total, with 600 examples in each class. For evaluation, we follow the split of [180] and use 64 of the 100 classes for meta-training, 16 for validation, and 20 for meta-test (model evaluation). In other words, a model is trained on few-shot tasks sampled from the 64 SEEN classes during meta-training, and the best model is selected based on the few-shot classification performance over the 16 validation classes. The final model is evaluated on few-shot tasks sampled from the 20 UNSEEN classes.
TieredImageNet [181] is a more complicated version of MiniImageNet. It contains 34 super-categories in total, with 20 for meta-training, 6 for validation (meta-val), and 8 for model testing (meta-test). Each super-category has 10 to 30 classes. In detail, there are 351, 97, and 160 classes for meta-training, meta-validation, and meta-test, respectively. The divergence among super-concepts leads to a more difficult few-shot classification problem.
Since both datasets are constructed from images in ILSVRC-12, we augment the meta-training set of each dataset by sampling non-overlapping images from the corresponding classes in ILSVRC-12. The auxiliary meta-train set is used to measure the generalized few-shot classification performance on the SEEN class set. For example, for each of the 64 SEEN classes in MiniImageNet, we collect 200 additional non-overlapping images per class from ILSVRC-12 as the test set for many-shot classification. An illustration of the dataset split is shown in Figure 8.3.
8.5.4.2 Baselines and Prior Methods
We explore several (strong) choices for deriving classifiers for the SEEN and UNSEEN classes: Multiclass Classifier (MC) + kNN, which contains a |S|-way classifier trained on the SEEN classes in a supervised manner as in standard many-shot classification, whose embedding with a nearest-neighbor classifier is used for GFSL inference; ProtoNet + ProtoNet, where the embeddings trained by the Prototypical Network [198] are used, and 100 training instances are sampled from each SEEN category to act as the SEEN class prototypes; and MC + ProtoNet, where we combine the learning objectives of the previous two baselines to jointly learn the MC classifier and the feature embedding. The details of these baselines are listed in the Appendix.
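For example, the ProtoNet + ProtoNet baseline reduces to nearest-prototype classification over the joint label space, roughly as in the following sketch (random features replace real embeddings; 100 sampled instances per SEEN class form the head prototypes):

import numpy as np

rng = np.random.default_rng(4)
num_seen, per_class, d, n_way, k_shot = 64, 100, 64, 5, 1

# SEEN prototypes: mean of 100 sampled training embeddings per head class.
seen_train = rng.normal(size=(num_seen, per_class, d))
seen_prototypes = seen_train.mean(axis=1)               # 64 x d

# UNSEEN prototypes from the few-shot support set.
support = rng.normal(size=(n_way, k_shot, d))
unseen_prototypes = support.mean(axis=1)                # 5 x d

# GFSL inference: nearest prototype over the joint S ∪ U label space.
all_prototypes = np.concatenate([seen_prototypes, unseen_prototypes])
query = rng.normal(size=(8, d))
dists = ((query[:, None] - all_prototypes[None]) ** 2).sum(-1)
print(dists.argmin(axis=1))                             # indices 0-63 = SEEN, 64-68 = UNSEEN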
Table 8.2: Generalized few-shot classification performance (mean accuracy, Δ-value, and harmonic mean accuracy) on MiniImageNet when there are 64 Head and 5 Tail categories.

Setups                | 1-Shot Mean Acc. ↑ | 1-Shot Δ ↓ | 5-Shot Mean Acc. ↑ | 5-Shot Δ ↓ | 1-Shot HM Acc. ↑ | 5-Shot HM Acc. ↑
IFSL [182]            | 54.95 ± 0.30       | 11.84      | 63.04 ± 0.30       | 10.66      | -                | -
L2ML [238]            | 46.25 ± 0.04       | 27.49      | 45.81 ± 0.03       | 35.53      | 2.98 ± 0.06      | 1.12 ± 0.04
DFSL [60]             | 63.36 ± 0.11       | 13.71      | 72.58 ± 0.09       | 13.33      | 62.08 ± 0.13     | 71.26 ± 0.09
MC + kNN              | 46.17 ± 0.03       | 29.70      | 46.18 ± 0.03       | 40.21      | 0.00 ± 0.00      | 0.00 ± 0.00
MC + ProtoNet         | 45.31 ± 0.03       | 29.71      | 45.85 ± 0.03       | 39.82      | 0.00 ± 0.00      | 0.00 ± 0.00
ProtoNet + ProtoNet   | 50.49 ± 0.08       | 25.64      | 71.75 ± 0.08       | 13.65      | 19.26 ± 0.18     | 67.73 ± 0.12
Ours: CASTLE          | 67.13 ± 0.11       | 10.09      | 76.78 ± 0.09       | 9.88       | 66.22 ± 0.15     | 76.32 ± 0.09
Ours: ACASTLE         | 68.70 ± 0.11       | 9.98       | 78.63 ± 0.09       | 8.08       | 66.24 ± 0.15     | 78.33 ± 0.09
Besides, we also compare our approach with L2ML [238], Dynamic Few-Shot Learning without forgetting (DFSL) [60], and the newly proposed Incremental Few-Shot Learning (IFSL) [182]. For CASTLE, we use the many-shot classifiers {U_S} (see §8.4.2) for the SEEN classes and the synthesized classifiers for the UNSEEN classes to classify an instance into all classes, and then select the prediction with the highest confidence score. For ACASTLE, we adapt the head classifiers to {Û_S} with the help of the tail classes.
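The joint prediction step can be sketched as below; cosine scoring is an assumption for illustration, and the synthesis of the few-shot and adapted classifiers is omitted:

import numpy as np

rng = np.random.default_rng(5)
num_seen, n_way, d = 64, 5, 64

U_S = rng.normal(size=(num_seen, d))        # many-shot SEEN classifiers
W_U = rng.normal(size=(n_way, d))           # synthesized few-shot UNSEEN classifiers

def l2_normalize(w):
    return w / np.linalg.norm(w, axis=-1, keepdims=True)

def joint_predict(x, seen_w, unseen_w):
    # Classify embeddings x into the joint S ∪ U label space by the
    # highest (cosine) confidence across all concatenated classifiers.
    logits = l2_normalize(x) @ l2_normalize(np.concatenate([seen_w, unseen_w])).T
    return logits.argmax(axis=1)

x = rng.normal(size=(4, d))
print(joint_predict(x, U_S, W_U))           # CASTLE: fixed U_S
# ACASTLE would first adapt U_S to an adjusted set of head classifiers conditioned
# on the tail support set, then call joint_predict in exactly the same way.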
8.5.4.3 Main Results
We first evaluate all GFSL methods on MiniImageNet with the criteria in [60, 182]: the mean accuracy over all classes (the higher the better) and the Δ-value (the lower the better). An effective GFSL approach not only predicts well in the joint label space (high accuracy) but also retains its classification ability when moving from the many-shot/few-shot setting to the generalized few-shot case (low Δ-value).
The main results are shown in Table 8.2. We find that ACASTLE outperforms all existing methods as well as our proposed baseline systems in terms of mean accuracy. Meanwhile, when looking at the Δ-value, the CASTLE variants are the least affected between predicting for SEEN/UNSEEN classes separately and predicting over all classes jointly.
However, we find that neither mean accuracy nor Δ-value alone is informative enough to characterize a GFSL algorithm's performance. For example, the baseline ProtoNet + ProtoNet performs better than IFSL in terms of 5-shot mean accuracy but not Δ-value. This is consistent with the observation in [182] that the Δ-value should be considered together with the mean accuracy. In this case, how shall we rank these two systems? To answer this question, we propose to use another evaluation measure: the harmonic mean of the mean accuracies for the SEEN and UNSEEN categories [189, 246] when they are classified jointly.
Table 8.3: Generalized few-shot classification accuracies on MiniImageNet with 64 head categories and 20 tail categories. The left block reports classification over the 20 UNSEEN categories (U → U); the remaining blocks report joint classification over the 64 SEEN + 20 UNSEEN categories.

Perf. Measures        | U → U 1-Shot | U → U 5-Shot | S → S∪U 1-Shot | S → S∪U 5-Shot | U → S∪U 1-Shot | U → S∪U 5-Shot | HM Acc. 1-Shot | HM Acc. 5-Shot
L2ML [238]            | 27.79 ± 0.07 | 43.42 ± 0.06 | 90.99 ± 0.03   | 90.99 ± 0.03   | 0.64 ± 0.00    | 1.21 ± 0.01    | 1.27 ± 0.09    | 2.38 ± 0.02
DFSL [60]             | 33.02 ± 0.08 | 50.96 ± 0.07 | 61.68 ± 0.06   | 66.06 ± 0.05   | 31.13 ± 0.07   | 47.16 ± 0.06   | 41.21 ± 0.07   | 54.95 ± 0.05
MC + kNN              | 31.58 ± 0.08 | 56.08 ± 0.06 | 92.35 ± 0.03   | 92.38 ± 0.03   | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00
MC + ProtoNet         | 31.82 ± 0.06 | 56.16 ± 0.06 | 91.39 ± 0.03   | 92.99 ± 0.03   | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00
ProtoNet + ProtoNet   | 32.90 ± 0.08 | 55.69 ± 0.06 | 89.15 ± 0.04   | 85.17 ± 0.04   | 9.89 ± 0.05    | 41.17 ± 0.06   | 17.71 ± 0.08   | 55.51 ± 0.06
Ours: CASTLE          | 35.69 ± 0.08 | 56.97 ± 0.06 | 80.32 ± 0.06   | 80.43 ± 0.06   | 29.42 ± 0.08   | 42.55 ± 0.05   | 43.06 ± 0.07   | 55.65 ± 0.07
Ours: ACASTLE         | 36.38 ± 0.08 | 57.29 ± 0.06 | 81.36 ± 0.05   | 87.40 ± 0.04   | 29.95 ± 0.08   | 41.64 ± 0.06   | 43.63 ± 0.08   | 56.33 ± 0.06
Table 8.4: Generalized few-shot classification accuracy on TieredImageNet with 351 head categories and 160 tail categories. The left block reports classification over the 160 UNSEEN categories (U → U); the remaining blocks report joint classification over the 351 SEEN + 160 UNSEEN categories.

Perf. Measures        | U → U 1-Shot | U → U 5-Shot | S → S∪U 1-Shot | S → S∪U 5-Shot | U → S∪U 1-Shot | U → S∪U 5-Shot | HM Acc. 1-Shot | HM Acc. 5-Shot
DFSL [60]             | 15.79 ± 0.02 | 30.69 ± 0.02 | 11.29 ± 0.05   | 14.95 ± 0.06   | 14.24 ± 0.06   | 27.22 ± 0.07   | 12.60 ± 0.11   | 19.29 ± 0.05
MC + kNN              | 14.12 ± 0.02 | 30.02 ± 0.02 | 68.32 ± 0.02   | 68.33 ± 0.02   | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.01 ± 0.00    | 0.01 ± 0.00
MC + ProtoNet         | 14.13 ± 0.02 | 30.05 ± 0.02 | 68.34 ± 0.02   | 68.33 ± 0.02   | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00    | 0.00 ± 0.00
ProtoNet + ProtoNet   | 14.52 ± 0.02 | 29.38 ± 0.02 | 62.37 ± 0.02   | 61.15 ± 0.02   | 4.83 ± 0.03    | 22.69 ± 0.02   | 8.97 ± 0.02    | 33.09 ± 0.02
Ours: CASTLE          | 15.97 ± 0.02 | 30.44 ± 0.02 | 26.94 ± 0.08   | 34.98 ± 0.02   | 16.17 ± 0.06   | 31.61 ± 0.05   | 20.20 ± 0.05   | 33.20 ± 0.02
Ours: ACASTLE         | 16.36 ± 0.02 | 30.75 ± 0.02 | 27.01 ± 0.08   | 35.41 ± 0.08   | 16.17 ± 0.06   | 31.86 ± 0.05   | 22.23 ± 0.05   | 33.54 ± 0.02
Harmonic mean accuracy measures GFSL performance better. Since the numbers of SEEN and UNSEEN classes are usually unequal, e.g., 64 vs. 5 in our case, directly computing the mean accuracy over all classes is almost always biased. For example, a many-shot classifier that only classifies samples into SEEN classes can receive a better score than one that recognizes both SEEN and UNSEEN classes. Therefore, we argue that the harmonic mean over the per-set mean accuracies better assesses a classifier's performance, as the measure is now penalized when a classifier ignores classes (e.g., the MC classifier gets 0% harmonic mean). Specifically, we compute the top-1 accuracy for instances from SEEN and UNSEEN classes, and take their harmonic mean as the performance measure. The results are included in the right part of Table 8.2.
We find that the harmonic mean accuracy takes holistic account of both the "absolute" joint classification performance and the "relative" performance drop when classifying over the joint set. For example, the many-shot baseline MC+kNN, which has a decent mean accuracy but a high Δ-value, has extremely low harmonic mean performance as it tends to ignore UNSEEN categories. Meanwhile, CASTLE and ACASTLE remain the best when ranked by harmonic mean accuracy against the others.
Evaluate GFSL beyond 5 UNSEEN categories. Besides using harmonic mean accuracy, we argue that another important aspect of evaluating GFSL is to go beyond the 5 sampled UNSEEN categories, since restricting to 5 is rarely the case in the real world. On the contrary, we care most about GFSL with a large number of UNSEEN classes, which also measures the ability of the model to extrapolate to more novel classes in the UNSEEN few-shot task. To this end, we consider an extreme case, evaluating GFSL with all available SEEN and UNSEEN categories over both MiniImageNet and TieredImageNet, and report the results in Table 8.3 and Table 8.4.
Together with the harmonic mean accuracy over all categories, we also report the tail classification performance, which is a more challenging few-shot classification task (the standard FSL results can be found in §8.5.4.4). In addition, the joint classification accuracies for SEEN class instances (S → S ∪ U) and UNSEEN class instances (U → S ∪ U) are also listed.
Methods without a clear consideration of the head-tail trade-off (e.g., ProtoNet + ProtoNet) fail to make a joint prediction over both SEEN and UNSEEN classes. We observe that CASTLE and ACASTLE outperform all approaches in the UNSEEN and, more importantly, the ALL categories sections, across the two datasets.
Confidence calibration matters in GFSL. In generalized zero-shot learning, Chao et al. [27] identified a significant bias between the classification confidences of SEEN and UNSEEN classifiers. We find a similar phenomenon in GFSL. For instance, the few-shot ProtoNet + ProtoNet baseline is far more confident when predicting SEEN categories than UNSEEN categories (the scale of its confidence is on average 2.1 times higher). To address this issue, we compute a calibration factor based on the meta-validation set of UNSEEN categories, such that the prediction logits are calibrated by subtracting this factor from the confidences of the SEEN categories' predictions. With 5 UNSEEN classes from MiniImageNet, the GFSL results of all comparison methods before and after calibration are shown in Figure 8.6. We observe consistent and obvious improvements in harmonic mean accuracy for all methods. For example, although the FSL approach ProtoNet neglects the classification performance over SEEN categories outside the sampled task during meta-learning, with such post-calibration it obtains an even better harmonic mean accuracy than the GFSL method DFSL (62.70% vs. 62.38%), which makes it a very strong GFSL baseline. Note that CASTLE and ACASTLE are the least affected by the selected calibration factor. This suggests that the CASTLE variants, learned with the unified GFSL objective, have well-calibrated classification confidence and do not require additional data or an extra learning phase to search for this calibration factor.
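A sketch of this post-calibration procedure is given below; selecting the factor by maximizing harmonic mean accuracy on the meta-validation split is our assumption of a reasonable selection criterion, not necessarily the exact rule used:

import numpy as np

def harmonic_mean(a, b):
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def apply_calibration(logits, num_seen, gamma):
    # Subtract gamma from the SEEN-class logits before the joint argmax.
    out = logits.copy()
    out[:, :num_seen] -= gamma
    return out.argmax(axis=1)

def select_gamma(val_logits, val_labels, num_seen, candidates):
    # Pick the factor maximizing harmonic mean accuracy on meta-val.
    best_gamma, best_hm = 0.0, -1.0
    seen_mask = val_labels < num_seen
    for gamma in candidates:
        pred = apply_calibration(val_logits, num_seen, gamma)
        acc_s = (pred[seen_mask] == val_labels[seen_mask]).mean()
        acc_u = (pred[~seen_mask] == val_labels[~seen_mask]).mean()
        hm = harmonic_mean(acc_s, acc_u)
        if hm > best_hm:
            best_gamma, best_hm = gamma, hm
    return best_gamma

rng = np.random.default_rng(6)
logits = rng.normal(size=(200, 69))                 # 64 SEEN + 5 UNSEEN joint logits (random stand-ins)
labels = rng.integers(0, 69, size=200)
gamma = select_gamma(logits, labels, 64, np.linspace(-2, 2, 41))
print("selected calibration factor:", gamma)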
Moreover, we use the area under the SEEN-UNSEEN curve (AUSUC) as a measure of different GFSL algorithms [27]. AUSUC is a performance measure that factors out the effect of the calibration factor. To do so, we enumerate a large range of calibration factors and subtract each of them from the confidence scores of the SEEN classifiers.
Figure 8.6: Calibration's effect on the 1-shot harmonic mean accuracy on MiniImageNet. Baseline models improve substantially with the help of the calibration factor.
Figure 8.7: The 1-shot AUSUC performance
with two configurations of UNSEEN classes on
MiniImageNet. The larger the area under the
curve, the better the GFSL ability.
Figure 8.8: Results of 1-shot GFSL harmonic mean accuracy with an increasing number of UNSEEN classes on MiniImageNet. Note that MC+kNN and MC+ProtoNet are biased towards SEEN classes and get nearly zero harmonic mean accuracy.
Figure 8.9: Post-calibrated results of 1-shot GFSL harmonic mean accuracy with an increasing number of UNSEEN classes on MiniImageNet. All methods select their best calibration factors on the meta-val data split.
Through this process, the joint prediction performances over SEEN and UNSEEN categories, denoted as S → S ∪ U and U → S ∪ U, vary as the calibration factor changes. For instance, when the calibration factor is infinitely large, we measure a classifier that only predicts UNSEEN categories. We denote the resulting trace as the SEEN-UNSEEN curve. The 1-shot GFSL results with 5 UNSEEN classes from MiniImageNet are shown in Figure 8.7. We observe that ACASTLE and CASTLE achieve the largest areas under the curve, which indicates that the CASTLE variants are in general better than the other algorithms across different calibration factors.
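The sweep can be sketched as follows (assumed logits and labels; the trapezoid rule approximates the area under the traced curve):

import numpy as np

def ausuc(logits, labels, num_seen, gammas):
    # Area under the SEEN-UNSEEN accuracy curve, traced by sweeping a
    # calibration factor gamma that is subtracted from the SEEN logits.
    seen_mask = labels < num_seen
    points = []
    for gamma in gammas:
        shifted = logits.copy()
        shifted[:, :num_seen] -= gamma
        pred = shifted.argmax(axis=1)
        acc_s = (pred[seen_mask] == labels[seen_mask]).mean()    # S -> S ∪ U
        acc_u = (pred[~seen_mask] == labels[~seen_mask]).mean()  # U -> S ∪ U
        points.append((acc_s, acc_u))
    points.sort()                      # order by SEEN accuracy
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

rng = np.random.default_rng(7)
logits = rng.normal(size=(500, 69))    # random stand-ins for real joint logits
labels = rng.integers(0, 69, size=500)
print("AUSUC:", ausuc(logits, labels, 64, np.linspace(-5, 5, 101)))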
Robust evaluation of GFSL. Beyond the harmonic mean accuracy over all SEEN and UNSEEN categories shown in Table 8.3 and Table 8.4, we study how the harmonic mean accuracy changes as the number of UNSEEN tail concepts grows. In other words, we show the GFSL performance w.r.t. different numbers of tail concepts, and use this as a robust evaluation of each system's GFSL capability.
Table 8.5: Few-shot classification accuracy on MiniImageNet with different types of backbones. Our methods are evaluated over 10,000 few-shot tasks.

Setup                 | Backbone | 1-Shot 5-Way | 5-Shot 5-Way
IFSL [182]            | Res10    | 55.72 ± 0.41 | 70.50 ± 0.36
DFSL [60]             | Res10    | 56.20 ± 0.86 | 73.00 ± 0.64
ProtoNet [198]        | Res12    | 61.40 ± 0.12 | 76.56 ± 0.20
TapNet [253]          | Res12    | 61.65 ± 0.15 | 76.36 ± 0.10
MTL [205]             | Res12    | 61.20 ± 1.80 | 75.50 ± 0.90
MetaOptNet [123]      | Res12    | 62.64 ± 0.61 | 78.63 ± 0.46
FEAT [250]            | Res12    | 66.78 ± 0.20 | 82.05 ± 0.14
SimpleShot [237]      | Res18    | 62.85 ± 0.20 | 80.02 ± 0.14
CTM [127]             | Res18    | 64.12 ± 0.82 | 80.51 ± 0.13
LEO [186]             | WRN      | 61.76 ± 0.08 | 77.59 ± 0.12
Ours: CASTLE          | Res12    | 66.75 ± 0.20 | 81.98 ± 0.14
Ours: ACASTLE         | Res12    | 66.83 ± 0.20 | 82.08 ± 0.14
Table 8.6: Few-shot classification accuracy on TieredImageNet with different types of backbones. Our methods are evaluated over 10,000 few-shot tasks.

Setup                 | Backbone | 1-Shot 5-Way | 5-Shot 5-Way
ProtoNet [198]        | Conv     | 53.31 ± 0.89 | 72.69 ± 0.74
IFSL [182]            | Res18    | 51.12 ± 0.45 | 66.40 ± 0.36
DFSL [60]             | Res18    | 50.90 ± 0.46 | 66.69 ± 0.36
TapNet [253]          | Res12    | 63.08 ± 0.15 | 80.26 ± 0.12
MTL [205]             | Res12    | 65.60 ± 1.80 | 78.60 ± 0.90
MetaOptNet [123]      | Res12    | 65.99 ± 0.72 | 81.56 ± 0.63
FEAT [250]            | Res12    | 70.80 ± 0.23 | 84.79 ± 0.16
SimpleShot [237]      | Res18    | 69.09 ± 0.22 | 84.58 ± 0.16
CTM [127]             | Res18    | 68.41 ± 0.39 | 84.28 ± 1.73
LEO [186]             | WRN      | 66.33 ± 0.05 | 81.44 ± 0.09
Ours: CASTLE          | Res12    | 71.14 ± 0.02 | 84.34 ± 0.16
Ours: ACASTLE         | Res12    | 71.63 ± 0.02 | 85.28 ± 0.15
In addition to the test instances from the head 64 classes in
MiniImageNet, 5 to 20 novel classes are included to compose the generalized few-shot tasks.
Concretely, only one instance per novel class is used to construct the tail classifier, with which the model is asked to perform a joint classification over both SEEN and UNSEEN classes. Figure 8.8 records the change of generalized few-shot learning performance (harmonic mean) as more UNSEEN classes emerge. We omit the results of MC+kNN and MC+ProtoNet since they are biased towards SEEN classes and get nearly zero harmonic mean accuracy in all cases. We observe that ACASTLE consistently outperforms all baseline approaches in each evaluation setup, with a clear margin. We also compute the harmonic mean after selecting the best calibration factor on the meta-val set (see Figure 8.9). Almost all baseline models achieve improvements, consistent with the phenomenon in Figure 8.6. The GFSL results of ACASTLE and CASTLE are nearly unaffected by the post-calibration technique, and ACASTLE still retains its superiority in this case.
8.5.4.4 Standard Few-Shot Learning Results
Finally, we also evaluate our proposed approaches on two standard few-shot learning benchmarks, i.e., the MiniImageNet and TieredImageNet datasets. In other words, we evaluate the classification performance on few-shot UNSEEN class instances of models trained with our GFSL objective. We compare our approaches with state-of-the-art methods in both the 1-shot 5-way and 5-shot 5-way scenarios. We cite the results of the comparison methods from their published papers and note the backbones used by each method to train the FSL model. The mean accuracies and 95% confidence intervals are shown in Table 8.5 and Table 8.6.
It is notable that some comparison methods such as CTM [127] are evaluated over only 600
UNSEEN class FSL tasks, while we test both CASTLE and ACASTLE over 10,000 tasks, leading to
more stable results. CASTLE and ACASTLE achieve nearly the best 1-shot and 5-shot classification results on both datasets. These results support our hypothesis that jointly learning with many-shot classification forces the few-shot classifiers to be discriminative.
Part IV
Conclusion
Chapter 9
Conclusion
9.1 Summary
To recapitulate, this dissertation presents research on learning the meaning of language in the visual, dynamic, and long-tailed physical world. Specifically, we strive to develop models and algorithms that put language learning in the context of perception data and embodied experiences.
Towards learning visually grounded concepts, Chapter 2 of this dissertation presents algorithms and models that establish multi-granular associations from words, phrases, and sentences to images, using a graph database generated from parallel pairs of pictures and sentences. Chapter 3 further studies the problem of aligning long paragraphs with videos. It proposes a hierarchical model that concurrently learns the global alignment between a paragraph and a video and the local correspondence between sentences and short clips.
Towards learning agents that understand the intent behind instructions, Chapter 4 studies a generalization setting where agents are required to transfer their skills compositionally across tasks and environments, and introduces a compositional model to meet this challenge. Chapter 5 develops an agent that excels at cross-horizon transfer in grounded instruction following, where agents trained on tasks of a pre-determined horizon length are required to generalize to instructions describing either shorter or longer horizons.
Moreover, our research studies learning with limited and growing data, closely aligned with machine learning in the real world. Chapters 7 and 8 focus on the problems of few-shot learning and generalized few-shot learning, and introduce models and algorithms that tackle the challenges arising from these problems.
9.2 Future Directions
Moving forward, I am broadly interested in two parallel themes of future research centered around language learning: 1) leveraging the structure of language to guide learning in other modalities, and 2) using multi-modal data as supervision to improve the generalization of language representations. Within these two themes, I identify two concrete research topics that extend my past and ongoing research, in §9.2.1 and §9.2.2.
9.2.1 Leverage Structure of Language for Modeling Visual Concepts.
Neural networks [32, 36, 48, 105, 143] often learn visual concepts end-to-end, typically aligning image and text information without explicitly modeling their structures. While such end-to-end learning has shown strong generalization on test examples that are i.i.d. with respect to the training distribution [45], it often struggles with out-of-domain examples that exhibit novel compositional structures, across many tasks [11, 29, 52, 70, 89, 98, 118, 170, 183].
Figure 9.1: Illustration of the graph we envision to guide the modeling of visual concepts.
Therefore, one interesting future research topic is investigating how complex concepts, composed of simpler ones, are robustly grounded in images. In particular, one can investigate whether the structure of how those concepts are composed can be exploited as a modeling prior to improve visual grounding. Specifically, we can design a graph database similar to the one introduced in Chapter 2, but with additional elements denoted as predicates, i.e., semantic snippets that define the rules for combining child concepts into composite concepts. Figure 9.1 provides an illustration of this high-level idea. Such predicates can explicitly encode richer information than the subsumption relationships implicitly expressed in the denotation graphs. On top of this new form of graph database, I intend to investigate how compositional models can be built to explicitly take advantage of the structural information.
9.2.2 Learning Language Representation with Visual Supervision.
This dissertation has explored methods for learning the visual grounding of language using models that take both vision and language as input. Such methods are therefore mainly evaluated on vision-and-language (V+L) tasks, such as cross-modal retrieval and captioning. I envision that multi-modal data would also be beneficial for improving the language representation itself when models are properly trained. A recent work [211] has demonstrated that state-of-the-art text representations can be further improved by using additional word-level visual supervision. Along this line of thinking, I believe there are opportunities to transfer techniques such as the denotation graph (introduced in Chapter 2) to improve the quality of visual supervision, which can potentially lead to higher-quality text representations.
Figure 9.2: Illustration of using visual data as supervision for language learning [211].
Bibliography
[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding
for attribute-based classification. In CVPR, 2013.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen
Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual
question answering. In IEEE Conference Computer Vision and Pattern Recognition, 2018.
[3] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf,
Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-Language Navigation:
Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
[4] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks.
In CVPR, 2016.
[5] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning
with policy sketches. In ICML, 2017.
[6] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning
with policy sketches. In ICML, 2017.
[7] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman,
David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by
gradient descent. In NeurIPS, 2016.
[8] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan
Russell. Localizing moments in video with natural language. In ICCV, 2017.
[9] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV,
2015.
[10] Antreas Antoniou, Harrison Edwards, and Amos J. Storkey. How to train your MAML. In
ICLR, 2019.
[11] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm
de Vries, and Aaron Courville. Systematic generalization: What is required and can it be
learned? In ICLR, 2019.
[12] Kobus Barnard and David Forsyth. Learning the semantics of words and pictures. In ICCV,
2001.
[13] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and
Michael I Jordan. Matching words and pictures. JMLR, 2003.
[14] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, David Silver,
and Hado P. van Hasselt. Successor features for transfer in reinforcement learning. In NIPS,
2017.
[15] Emily M. Bender and Alexander Koller. Climbing towards nlu: On meaning, form, and
understanding in the age of data. In ACL, 2020.
[16] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum
learning. In ICML, 2009.
[17] Luca Bertinetto, João F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning
with differentiable closed-form solvers. In ICLR, 2019.
[18] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai,
Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto,
and Joseph Turian. Experience grounds language. In EMNLP, 2020.
[19] Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large
annotated corpus for learning natural language inference. In ACL, 2015.
[20] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbal-
anced datasets with label-distribution-aware margin loss. In NeurIPS, 2019.
[21] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis
Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D
data in indoor environments. In 3DV, 2017.
[22] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for
zero-shot learning. In CVPR, 2016.
[23] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for
zero-shot learning. In IEEE Conference Computer Vision and Pattern Recognition, 2016.
[24] Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. Predicting visual exemplars of unseen
classes for zero-shot learning. In ICCV, 2017.
[25] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Classifier and exemplar
synthesis for zero-shot learning. IJCV, 2020.
[26] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and
analysis of generalized zero-shot learning for object recognition in the wild. In ECCV,
2016.
[27] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and
analysis of generalized zero-shot learning for object recognition in the wild. In ECCV,
2016.
[28] Wei-Lun Chao, Hexiang Hu, and Fei Sha. Being negative but constructively: Lessons learnt
from creating better visual question answering datasets. NAACL, 2018.
[29] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra, Rama Kumar Pasumarthi,
Dheeraj Rajagopal, and Ruslan Salakhutdinov. Gated-attention architectures for task-
oriented language grounding. In AAAI, 2018.
[30] David L Chen and Raymond J Mooney. Learning to interpret natural language navigation
instructions from observations. In AAAI, 2011.
[31] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown:
Natural language navigation and spatial reasoning in visual street environments. In CVPR,
2019.
[32] Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. Learning the best
pooling strategy for visual semantic embedding. In CVPR, 2021.
[33] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A
closer look at few-shot classification. In ICLR, 2019.
[34] Xinlei Chen and C Lawrence Zitnick. Mind’s eye: A recurrent visual representation for
image caption generation. In CVPR, 2015.
[35] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár,
and C Lawrence Zitnick. Microsoft COCO Captions: Data collection and evaluation server.
ArXiv 1504.00325, 2015.
[36] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan,
Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. Euro-
pean Conference Computer Vision, 2020.
[37] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. In NeurIPS workshop,
2014.
[38] Guillem Collell and Marie-Francine Moens. Is an image worth more than a thousand words?
on the fine-grain semantic differences between visual and linguistic representations. In
COLING, 2016.
[39] Wikipedia Contributors. English Wikipedia corpus, 2021. URL https://en.wikipedia.org/wiki/Wikipedia:Database_download.
[40] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-balanced
loss based on effective number of samples. In CVPR, 2019.
[41] Robert Dale and Ehud Reiter. Computational interpretations of the gricean maxims in the
generation of referring expressions. Cognitive science, 1995.
[42] Hal Daumé. Language bias and black sheep, 2016. Extracted from an academic blog post:
https://nlpers.blogspot.com/2016/06/language-bias-and-black-sheep.html.
[43] Peter Dayan. Improving generalization for temporal difference learning: The successor
representation. Neural Computation, 1993.
[44] Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. Learning
modular neural network policies for multi-task and multi-robot transfer. In ICRA. IEEE,
2017.
[45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[46] Nanqing Dong and Eric P. Xing. Domain adaption in one-shot learning. In ECML PKDD,
2018.
[47] Aviv Eisenschtat and Lior Wolf. Linking image and text with 2-way nets. In IEEE
Conference Computer Vision and Pattern Recognition, 2017.
[48] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improved
visual-semantic embeddings. In BMVC, 2017.
[49] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improving
visual-semantic embeddings with hard negatives. In BMVC, 2018.
[50] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár,
Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. From captions to visual
concepts and back. In CVPR, 2015.
[51] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream
network fusion for video action recognition. In CVPR, 2016.
[52] Catherine Finegan-Dollak, Jonathan K. Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh Sadasivam, Rui Zhang, and Dragomir Radev. Improving text-to-SQL evaluation methodology. In ACL, 2018.
[53] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In ICML, 2017.
[54] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel.
Reverse curriculum generation for reinforcement learning. In CoRL, 2017.
[55] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe
Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-
follower models for vision-and-language navigation. In NeurIPS, 2018.
[56] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio
Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In
NeurIPS, 2013.
[57] Hang Gao, Zheng Shou, Alireza Zareian, Hanwang Zhang, and Shih-Fu Chang. Low-shot
learning via covariance-preserving adversarial augmentation networks. In NeurIPS, 2018.
[58] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V . Le. Dropblock: A regularization method for
convolutional networks. In NeurIPS, 2018.
[59] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V . Le. Dropblock: A regularization method for
convolutional networks. In NeurIPS, 2018.
[60] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting.
In CVPR, 2018.
[61] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsuper-
vised domain adaptation. CVPR, 2012.
[62] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A multi-view embedding
space for modeling internet images, tags, and their semantics. IJCV, 2014.
[63] Jiuxiang Gu, Jianfei Cai, Shafiq R Joty, Li Niu, and Gang Wang. Look, imagine and match:
Improving textual-visual cross-modal retrieval with generative models. In IEEE Conference
Computer Vision and Pattern Recognition, 2018.
[64] Liang-Yan Gui, Yu-Xiong Wang, Deva Ramanan, and José M. F. Moura. Few-shot human
motion prediction via meta-learning. In ECCV, 2018.
[65] Bharath Hariharan and Ross B. Girshick. Low-shot visual recognition by shrinking and
hallucinating features. In ICCV, 2017.
[66] Bharath Hariharan and Ross B. Girshick. Low-shot visual recognition by shrinking and
hallucinating features. In ICCV, 2017.
[67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In CVPR, 2016.
[68] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Ac-
tivitynet: A large-scale video benchmark for human activity understanding. In CVPR,
2015.
[69] Chris Herd. A black sheep image, 2021. Photo extracted from the
post: https://chrisherd.medium.com/why-being-a-black-sheep-is-the-only-way-to-become-
a-billionaire-fc797ad495db.
[70] Karl Moritz Hermann, Felix Hill, Simon Green, Fumin Wang, Ryan Faulkner, Hubert Soyer,
David Szepesvari, Wojciech Marian Czarnecki, Max Jaderberg, Denis Teplyashin, et al.
Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551,
2017.
[71] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep
Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep
neural networks for acoustic modeling in speech recognition: The shared views of four
research groups. IEEE Signal processing magazine, 2012.
[72] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
ArXiv, 2015.
120
[73] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in
Neural Information Processing Systems, 2016.
[74] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation,
1997.
[75] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
1997.
[76] Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a
ranking task: Data, models and evaluation metrics. JAIR, 2013.
[77] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom
embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
[78] Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. In
ICLR, 2019.
[79] Hexiang Hu, Wei-Lun Chao, and Fei Sha. Learning answer embeddings for visual question
answering. In CVPR, 2018.
[80] Hexiang Hu, Liyu Chen, Boqing Gong, and Fei Sha. Synthesized policies for transfer and
adaptation across tasks and environments. In NeurIPS, 2018.
[81] Hexiang Hu, Ishan Misra, and Laurens van der Maaten. Evaluating text-to-image matching
using binary image selection (BISON). In ICCV workshop, 2019.
[82] Haoshuo Huang, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason
Baldridge, and Eugene Ie. Transferable representation learning in vision-and-language
navigation. In ICCV, 2019.
[83] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial
attacks on neural network policies. ArXiv, 2017.
[84] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and
order for image and sentence matching. In IEEE Conference Computer Vision and Pattern
Recognition, 2018.
[85] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift. In ICML, 2015.
[86] Phillip Isola, Joseph J Lim, and Edward H Adelson. Discovering states and transformations
in image collections. In CVPR, 2015.
[87] Max Jaderberg, Volodymyr Mnih, Wojciech Czarnecki, Tom Schaul, Joel Z. Leibo, David
Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks.
ArXiv, 2016.
[88] Vihan Jain, Gabriel Magalhaes, Alex Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge.
Stay on the path: Instruction fidelity in vision-and-language navigation. In EMNLP, 2019.
[89] J. Johnson, B. Hariharan, L.J.P. van der Maaten, L. Fei-Fei, C.L. Zitnick, and R.B. Girshick.
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In
CVPR, 2017.
[90] Bingyi Kang and Jiashi Feng. Transferable meta learning across domains. In UAI, 2018.
[91] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng,
and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition.
In ICLR, 2020.
[92] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image
descriptions. In IEEE Conference Computer Vision and Pattern Recognition, 2015.
[93] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and
Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR,
2014.
[94] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans
for improved quality, stability, and variation. In ICLR, 2018.
[95] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya-
narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human
action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[96] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame:
Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[97] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao,
Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in
vision-and-language navigation. In CVPR, 2019.
[98] Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii
Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry
Tsarkov, Xiao Wang, Marc van Zee, and Olivier Bousquet. Measuring compositional
generalization: A comprehensive method on realistic data. In ICLR, 2020.
[99] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Li Fei-Fei. Novel dataset
for fine-grained image categorization. In 1st Workshop on Fine-Grained Visual Categoriza-
tion, CVPR, June 2011.
[100] Douwe Kiela and Léon Bottou. Learning Image Embeddings using Convolutional Neural
Networks for Improved Multi-Modal Semantics. In EMNLP, 2014.
[101] Joohyun Kim and Raymond Mooney. Adapting discriminative reranking to grounded
language learning. In ACL, 2013.
[102] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
[103] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR,
2015.
122
[104] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In ICLR, 2017.
[105] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic em-
beddings with multimodal neural language models. NeurIPS Workshop Deep Learning,
2014.
[106] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[107] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel
Urtasun, and Sanja Fidler. Skip-thought vectors. In NeurIPS, 2015.
[108] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for
one-shot image recognition. In ICML Deep Learning Workshop, 2015.
[109] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi.
Ai2-thor: An interactive 3d environment for visual ai. ArXiv, 2017.
[110] Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. What are you
talking about? text-to-image coreference. In CVPR, 2014.
[111] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for
fine-grained categorization. In 3DRR Workshop, Sydney, Australia, 2013.
[112] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-
captioning events in videos. In ICCV, 2017.
[113] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In NeurIPS, 2012.
[114] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi,
Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple
image descriptions. T-PAMI, 2013.
[115] Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel Gershman. Deep
successor reinforcement learning. ArXiv, 2016.
[116] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh,
Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural
questions: a benchmark for question answering research. Transactions of the Association
for Computational Linguistics, 2019.
[117] Alice Lai and Julia Hockenmaier. Learning to predict denotational probabilities for modeling
entailment. In ACL, 2017.
[118] Brenden Lake and Marco Baroni. Generalization without systematicity: On the composi-
tional skills of sequence-to-sequence recurrent networks. In ICML, 2018.
[119] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot
learning of simple visual concepts. In CogSci, 2011.
[120] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept
learning through probabilistic program induction. Science, 2015.
[121] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classifica-
tion for zero-shot visual object categorization. TPAMI, 2014.
[122] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. Stacked cross
attention for image-text matching. In European Conference Computer Vision, 2018.
[123] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning
with differentiable convex optimization. In CVPR, 2019.
[124] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise
metric and subspace. In ICML, 2018.
[125] Fei-Fei Li, Robert Fergus, and Pietro Perona. One-shot learning of object categories.
TPAMI, 2006.
[126] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal
encoder for vision and language by cross-modal pre-training. AAAI, 2019.
[127] Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding
task-relevant features for few-shot learning by category traversal. In CVPR, 2019.
[128] Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for
paragraphs and documents. ACL, 2015.
[129] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Visual semantic reasoning
for image-text matching. In International Conference Computer Vision, 2019.
[130] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A
simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557,
2019.
[131] Xinzhe Li, Qianru Sun, Yaoyao Liu, Qin Zhou, Shibao Zheng, Tat-Seng Chua, and Bernt
Schiele. Learning to self-train for semi-supervised few-shot classification. In NeurIPS,
2019.
[132] Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah
Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling.
In EMNLP, 2019.
[133] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly localizing and
describing events for dense video captioning. In CVPR, 2018.
[134] Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. Symmetry and group in attribute-object
compositions. In CVPR, 2020.
[135] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly
for few shot learning. ArXiv, 2017.
[136] Zhizhong Li and Derek Hoiem. Learning without forgetting. T-PAMI, 2018.
[137] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European Conference Computer Vision, 2014.
[138] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou,
and Yoshua Bengio. A structured self-attentive sentence embedding. In ICLR, 2017.
[139] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and
Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot
learning. In ICLR, 2019.
[140] Yaoyao Liu, An-An Liu, Yuting Su, Bernt Schiele, and Qianru Sun. Mnemonics training:
Multi-class incremental learning without forgetting. In CVPR, 2020.
[141] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu.
Large-scale long-tailed recognition in an open world. In CVPR, 2019.
[142] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual
learning. In NeurIPS, 2017.
[143] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic
visiolinguistic representations for vision-and-language tasks. In International Conference
Computer Vision, 2019.
[144] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to
attention-based neural machine translation. EMNLP, 2015.
[145] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and
Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In
ICLR, 2019.
[146] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful
agent: Heuristic-aided navigation through progress estimation. In CVPR, 2019.
[147] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.
[148] Gabriel Magalhaes, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. Effective
and general evaluation for instruction conditioned navigation using dynamic time warping.
In NeurIPS ViGIL Workshop, 2019.
[149] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew B. Blaschko, and Andrea Vedaldi.
Fine-grained visual classification of aircraft. ArXiv, 2013.
[150] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The
neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural
supervision. In ICLR, 2019.
[151] Hongyuan Mei, Mohit Bansal, and Matthew R Walter. Listen, attend, and walk: Neural
mapping of navigational instructions to action sequences. In AAAI, 2016.
[152] Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Learning
unsupervised learning rules. ArXiv, 2018.
[153] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In NeurIPS, 2013.
[154] George A Miller. Wordnet: a lexical database for english. Communications of the ACM,
1995.
[155] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition
with context. In CVPR, 2017.
[156] Ishan Misra, Abhinav Gupta, and Martial Hebert. From red wine to red tomato: Composition
with context. In CVPR, 2017.
[157] Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg,
Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. Midge: Generating image
descriptions from computer vision detections. In EACL, 2012.
[158] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G.
Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig
Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Ku-
maran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through
deep reinforcement learning. Nature, 2015.
[159] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal
reasoning and matching. In IEEE Conference Computer Vision and Pattern Recognition,
2017.
[160] Khanh Nguyen and Hal Daumé III. Help, anna! visual navigation with natural multimodal
assistance via retrospective curiosity-encouraging imitation learning. In EMNLP, 2019.
[161] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms.
ArXiv, 2018.
[162] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Hierarchical multimodal
lstm for dense visual-semantic embedding. In ICCV, 2017.
[163] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generaliza-
tion with multi-task deep reinforcement learning. ArXiv, 2017.
[164] Boris N. Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: task dependent
adaptive metric for improved few-shot learning. In NeurIPS, 2018.
[165] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical recurrent
neural encoder for video representation with application to captioning. In CVPR, 2016.
[166] Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask
and transfer reinforcement learning. In NeurIPS, 2015.
[167] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. In ICML, 2013.
[168] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors
for word representation. In EMNLP, 2014.
[169] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton
Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
[170] Sandro Pezzelle and Raquel Fernández. Is the red square big? malevic: Modeling adjectives
leveraging visual contexts. In EMNLP, 2019.
[171] Steven Pinker. Language as an adaptation to the cognitive niche. Studies in the Evolution of
Language, 2003.
[172] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier,
and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. In ICCV, 2015.
[173] Limeng Qiao, Yemin Shi, Jia Li, Yaowei Wang, Tiejun Huang, and Yonghong Tian. Trans-
ductive episodic-wise adaptive metric for few-shot learning. In ICCV, 2019.
[174] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L. Yuille. Few-shot image recognition by
predicting parameters from activations. In CVPR, 2018.
[175] Zhaofan Qiu, Ting Yao, and Tao Mei. Deep quantization: Encoding convolutional activa-
tions with deep generative model. In CVPR, 2017.
[176] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In CVPR, 2009.
[177] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language
understanding by generative pre-training. Technical Report, 2018.
[178] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. Technical Report, 2019.
[179] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+
questions for machine comprehension of text. In EMNLP, 2016.
[180] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR,
2017.
[181] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B.
Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-learning for semi-supervised
few-shot classification. In ICLR, 2018.
[182] Mengye Ren, Renjie Liao, Ethan Fetaya, and Richard Zemel. Incremental few-shot learning
with attention attractor networks. In NeurIPS, 2019.
[183] Laura Ruis, Jacob Andreas, Marco Baroni, Diane Bouchacourt, and Brenden M Lake. A
benchmark for systematic generalization in grounded language understanding. In NeurIPS,
2020.
[184] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg,
and Fei-Fei Li. Imagenet large scale visual recognition challenge. IJCV, 2015.
[185] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 3rd
edition, 2009.
[186] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon
Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR,
2019.
[187] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural
networks. In ICLR, 2018.
[188] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function
approximators. In ICML, 2015.
[189] Edgar Schönfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata.
Generalized zero- and few-shot learning via aligned variational autoencoders. In CVPR,
2019.
[190] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding
for face recognition and clustering. In CVPR, 2015.
[191] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-
dimensional continuous control using generalized advantage estimation. ArXiv, 2015.
[192] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. ArXiv, 2017.
[193] Tyler R. Scott, Karl Ridgeway, and Michael C. Mozer. Adapted deep embeddings: A
synthesis of methods for k-shot inductive transfer learning. In NeurIPS, 2018.
[194] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A
cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[195] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai,
Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P.
Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play
with a general reinforcement learning algorithm. ArXiv, 2017.
[196] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action
recognition in videos. In NeurIPS, 2014.
[197] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. In ICLR, 2015.
[198] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot
learning. In NeurIPS, 2017.
[199] Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y.
Ng. Grounded compositional semantics for finding and describing images with sentences.
Transactions of the Association for Computational Linguistics, 2014.
[200] Sungryull Sohn, Junhyuk Oh, and Honglak Lee. Hierarchical reinforcement learning for
zero-shot generalization with subtask dependencies. In NeurIPS, 2018.
[201] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101
human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[202] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT:
Pre-training of generic visual-linguistic representations. ICLR, 2020.
[203] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. A corpus of natural language for
visual reasoning. In ACL, 2017.
[204] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus
for reasoning about natural language grounded in photographs. In NAACL, 2018.
[205] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for
few-shot learning. In CVPR, 2019.
[206] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M.
Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[207] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Conference on Neural Information Processing Systems, 2014.
[208] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. IEEE
Transactions on Neural Networks, 1998.
[209] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A
framework for temporal abstraction in reinforcement learning. Artificial intelligence, 1999.
[210] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy
gradient methods for reinforcement learning with function approximation. In NeurIPS,
2000.
[211] Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contex-
tualized, visual-grounded supervision. In EMNLP, 2020.
[212] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to navigate unseen environments: Back
translation with environmental dropout. In EMNLP, 2019.
[213] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains:
A survey. Journal of Machine Learning Research, 2009.
[214] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell,
Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In
Advances in Neural Information Processing Systems, 2017.
[215] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. Vision-and-dialog
navigation. In CoRL, 2019.
[216] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning
spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
[217] Eleni Triantafillou, Richard S. Zemel, and Raquel Urtasun. Few-shot learning through an
information retrieval lens. In NeurIPS, 2017.
[218] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. Learning robust
visual-semantic embeddings. In ICCV, 2017.
[219] Benjamin D Van Durme. Extracting implicit knowledge from text. University of Rochester,
2009.
[220] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[221] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on
Neural Information Processing Systems, 2017.
[222] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[223] Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images
and language. In ICLR, 2016.
[224] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan.
Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
[225] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor
Darrell, and Kate Saenko. Sequence to sequence-video to text. In ICCV, 2015.
[226] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney,
and Kate Saenko. Translating videos to natural language using deep recurrent neural
networks. NAACL-HLT, 2015.
[227] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A
neural image caption generator. In CVPR, 2015.
[228] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra.
Matching networks for one shot learning. In NeurIPS, 2016.
[229] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets,
Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al.
Starcraft ii: a new challenge for reinforcement learning. ArXiv, 2017.
[230] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-
200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology,
2011.
[231] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R
Bowman. Glue: A multi-task benchmark and analysis platform for natural language
understanding. In ACL, 2018.
[232] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix
Hill, Omer Levy, and Samuel R Bowman. Superglue: A stickier benchmark for general-
purpose language understanding systems. In NeurIPS, 2019.
[233] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc
Van Gool. Temporal segment networks: Towards good practices for deep action recognition.
In ECCV, 2016.
[234] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc
Van Gool. Temporal segment networks for action recognition in videos. arXiv preprint
arXiv:1705.02953, 2017.
[235] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text
embeddings. In CVPR, 2016.
[236] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang
Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-
supervised imitation learning for vision-language navigation. In CVPR, 2019.
[237] Yan Wang, Wei-Lun Chao, Kilian Q Weinberger, and Laurens van der Maaten. Simpleshot:
Revisiting nearest-neighbor classification for few-shot learning. ArXiv, 2019.
[238] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In
NeurIPS, 2017.
[239] Yu-Xiong Wang, Ross B. Girshick, Martial Hebert, and Bharath Hariharan. Low-shot
learning from imaginary data. In CVPR, 2018.
[240] Xiu-Shen Wei, Peng Wang, Lingqiao Liu, Chunhua Shen, and Jianxin Wu. Piecewise
classifier mappings: Learning fine-grained learners for novel categories with few examples.
TIP, 2019.
[241] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus
for sentence understanding through inference. In NAACL-HLT, 2018.
[242] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement
learning: a hierarchical bayesian approach. In ICML, 2007.
[243] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha, Alexander J Smola, and Philipp
Krähenbühl. Compressed video action recognition. CVPR, 2018.
[244] Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, and Wei-Ying
Ma. Unified visual-semantic embeddings: Bridging vision and language with structured
meaning representations. In IEEE Conference Computer Vision and Pattern Recognition,
2019.
[245] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural
machine translation system: Bridging the gap between human and machine translation.
ArXiv 1609.08144, 2016.
[246] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and
the ugly. In CVPR, 2017.
[247] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated
residual transformations for deep neural networks. In CVPR, 2016.
[248] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov,
Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation
with visual attention. In ICML, 2015.
[249] Han-Jia Ye, Hong-You Chen, De-Chuan Zhan, and Wei-Lun Chao. Identifying and com-
pensating for feature deviation in imbalanced deep learning. ArXiv, 2020.
[250] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding
adaptation with set-to-set functions. In CVPR, 2020.
[251] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding
adaptation with set-to-set functions. In CVPR, 2020.
[252] Han-Jia Ye, Hexiang Hu, and De-Chuan Zhan. Learning adaptive classifiers synthesis for
generalized few-shot learning. International Journal of Computer Vision, 2021.
[253] Sung Whan Yoon, Jun Seo, and Jaekyun Moon. Tapnet: Neural network augmented with
task-adaptive projection for few-shot learning. In ICML, 2019.
[254] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions
to visual denotations: New similarity metrics for semantic inference over event descriptions.
TACL, 2014.
[255] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning
using hierarchical recurrent neural networks. In CVPR, 2016.
[256] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
[257] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan R. Salakhut-
dinov, and Alexander J. Smola. Deep sets. In NeurIPS, 2017.
[258] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition:
Visual commonsense reasoning. In CVPR, 2019.
[259] Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. Real-time action
recognition with enhanced motion vector cnns. In CVPR, 2016.
[260] Bowen Zhang, Hexiang Hu, and Fei Sha. Cross-modal and hierarchical modeling of video
and text. In ECCV, 2018.
[261] Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, and Fei Sha. Learning to represent
image and text with denotation graph. In EMNLP, 2020.
[262] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in
deep reinforcement learning. ArXiv, 2018.
[263] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long
short-term memory. In ECCV, 2016.
[264] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin.
Temporal action detection with structured segment networks. ICCV, 2017.
[265] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. BBN: bilateral-branch network
with cumulative learning for long-tailed visual recognition. In CVPR, 2020.
[266] Wang Zhu, Hexiang Hu, Jiacheng Chen, Zhiwei Deng, Vihan Jain, Eugene Ie, and Fei Sha.
Babywalk: Going farther in vision-and-language navigation by taking baby steps. In ACL,
2020.
[267] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question
answering in images. In CVPR, 2016.
[268] Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh
Mottaghi, and Ali Farhadi. Visual semantic planning using deep successor representations.
In ICCV, 2017.
[269] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and
Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement
learning. ICRA, 2017.
[270] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and
Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement
learning. In ICRA, 2017.
[271] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Tor-
ralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations
by watching movies and reading books. In ICCV, 2015.
Part V
Appendices
Appendix A
Details and Additional Experiments for Chapter 2
A.1 Additional Implementation Details
Constructing Denotation Graphs We summarize the procedure used to extract a DG from V+L
datasets; for details, please refer to [254]. We used the publicly available tool¹. The analysis
consists of several steps: (1) spell-check the sentences; (2) tokenize the sentences into words; (3) tag
the words with part-of-speech labels and chunk words into phrases; (4) abstract semantics by using
WordNet [154] to construct a hypernym lexicon table that replaces nouns with more generic terms;
(5) apply 6 types of templated rules to create fine-to-coarse (i.e., specific to generic) semantic
concepts and connect the concepts with edges.
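To make step (4) concrete, below is a minimal sketch of hypernym-based noun abstraction using NLTK's WordNet interface. It is only an illustration of the idea, not the actual toolkit cited in the footnote; the `max_levels` parameter mirrors the 3-level cap discussed next.

```python
# A minimal sketch of WordNet-based hypernym abstraction (step 4 above);
# the actual DG construction uses the publicly available tool cited in the footnote.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def hypernym_chain(noun, max_levels=3):
    """Return the noun followed by up to `max_levels` increasingly generic terms."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return [noun]
    chain, synset = [noun], synsets[0]
    for _ in range(max_levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
        chain.append(synset.lemmas()[0].name().replace("_", " "))
    return chain

# e.g., hypernym_chain("dog") -> ['dog', 'canine', 'carnivore', 'placental']
```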
We set 3 as the maximum number of levels (counting from the sentence level) used to extract abstract
semantic concepts. This is due to the computation budget we can afford, as the final graphs can be
huge in both the number of nodes and the number of edges. Specifically, without the maximum-level
constraint, we obtain 2.83M concept nodes in total for the Flickr dataset. If training were run on all
these nodes, it would require 19 times more iterations than training on the original dataset, which
has 145K sentences [254]; as a result, every experiment would take much longer. With 3 layers of
the DG above the leaf concepts, we have 597K nodes, and the training time is reduced to 4.1 times
that of the original dataset.
Nonetheless, we also experimented with more than 3 levels, training ViLBERT + Flickr-DG with
maximum levels of 5 and 7, respectively. The training hyper-parameters remain the same as for
ViLBERT + Flickr-DG with a maximum of 3 levels. The aim is to check how much gain we could
obtain from the additional annotations. We report the results in Table A.1. It shows that the model
trained with 3 levels of DG actually achieves the best performance. This might be because the
higher-level layers of the DG (counting from the sentences) contain very abstract text concepts,
such as "entity" and "physical object", which are uninformative for learning visual grounding.
¹ https://github.com/aylai/DenotationGraph
Table A.1: Text-based Image Retrieval Performance of ViLBERT trained with different numbers of
DG levels
# of DG levels R@1 R@5 R@10 RSUM
3 levels 65.9 91.4 95.5 252.7
5 levels 62.5 86.4 92.3 241.2
7 levels 62.8 86.3 91.6 240.7
Once the graph is constructed, we attach the images to the proper nodes by taking the set union of
the images of each node's children, starting from the sentence-level nodes.
Implementation Details for Zero-shot Referring Expression Specifically, the learned ViLBERT
and ViLBERT w/ DG models are first used to produce a base matching score s_BASE between the
expression to be referred and the whole image. We then compute the matching score s_MASKED
between the expression and the image with each region feature replaced, in turn, by a random
feature. The masked region which causes the largest drop (s_BASE - s_MASKED) is the model's
prediction of which region the expression refers to.
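A minimal sketch of this masking procedure follows; `model.score` stands in for whichever image-text matching head (ViLBERT or UNITER) produces s_BASE, and is an assumed interface rather than the actual API.

```python
import torch

def predict_referred_region(model, expression, region_feats):
    """region_feats: (num_regions, feat_dim) tensor of image patch features.
    Returns the index of the region whose removal hurts the matching score most."""
    s_base = model.score(expression, region_feats)           # s_BASE
    drops = []
    for i in range(region_feats.size(0)):
        masked = region_feats.clone()
        masked[i] = torch.randn_like(masked[i])               # replace region i with a random feature
        s_masked = model.score(expression, masked)            # s_MASKED
        drops.append(float(s_base - s_masked))
    return max(range(len(drops)), key=lambda i: drops[i])     # predicted region index
```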
Model architectures of ViLBERT and UNITER A comparison of these models is schematically
illustrated in Fig. A.1.
• ViLBERT. It has 6 basic Transformer layers for the text stream and 8 layers for the image
stream. For all the Transformer layers on the text side, we use 12 attention heads and 256
feature dimensions, then linearly project down to 1024 feature dimensions. For all the
Transformer layers on the image side, we use 8 attention heads and 128 feature dimensions,
which are likewise combined into 1024 feature dimensions.
• UNITER. All the Transformer layers have 12 heads and 256 feature dimensions.
The major difference between UNITER and ViLBERT is how attention is used. In ViLBERT,
one modality is used as the query, and the other is used as the key and value. In UNITER, however,
both modalities are used as query, key, and value. Additionally, UNITER is similar to another
model, Unicoder-VL [126]; however, the latter has not released publicly available code for
experimentation.
For the ViLBERT model, each text and image co-attention Transformer layer contains 8 attention
heads with 1024 dimensions in total. The text Transformer layer contains 12 attention heads with
3072 hidden dimensions in total, whereas the image Transformer layer has 8 attention heads
with 1024 hidden dimensions in total. For the UNITER model, each cross-attention Transformer layer
contains 12 heads with 3072 hidden dimensions in total.
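The query/key/value difference can be sketched with a single PyTorch attention layer; the shapes and the shared layer below are illustrative placeholders rather than the exact ViLBERT or UNITER modules.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=1024, num_heads=8, batch_first=True)
text_feats = torch.randn(2, 20, 1024)   # (batch, text tokens, dim) -- illustrative shapes
img_feats = torch.randn(2, 36, 1024)    # (batch, image regions, dim)

# ViLBERT-style co-attention: one modality queries the other
# (the parallel stream does the same with the roles swapped).
text_attended, _ = attn(query=text_feats, key=img_feats, value=img_feats)

# UNITER-style joint attention: both modalities are concatenated and
# serve as query, key, and value of the same Transformer.
joint = torch.cat([text_feats, img_feats], dim=1)
joint_attended, _ = attn(query=joint, key=joint, value=joint)
```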
[Figure A.1: (a) ViLBERT — an image encoder and a text encoder feed an image Transformer and a
text Transformer coupled through image/text co-attention Transformer layers; each stream ends
with MLP & pooling before the two outputs are combined into the matching score. (b) UNITER —
the image and text encoders feed a single cross-modal Transformer, and the [CLS] embedding
produces the matching score.]
Figure A.1: Architecture of (a) ViLBERT and (b) UNITER. The ⊗ denotes element-wise product.
[CLS] represents the embedding of the [CLS] token in the last UNITER layer.
The ViLBERT model contains 121 million parameters, while UNITER contains 111 million
parameters.
Training Details All models are optimized with the Adam optimizer [102]. The learning rate is
initialized to 4e-5. Following ViLBERT [143], a warm-up training session is employed, during
which we linearly increase the learning rate from 0 to 4e-5 over the first 1.5% of the training
epochs. The learning rate is dropped to 4e-6 and 4e-7 at the 10th and the 15th epochs, respectively.
For ViLBERT (Reduced), we randomly initialize the model parameters in the image stream. The
text stream is initialized from the first 3 layers of the pre-trained BERT model, and its co-attention
Transformer layers are randomly initialized. For ViLBERT (Full) and UNITER [36], we initialize
from the models' weights pre-trained on the Conceptual Captions dataset.
Training ViLBERT (Full) + DG with a minibatch size of 64 takes 2 to 3 days on a server with 8
TitanXp GPUs, or 1 day on TPU v2 cloud. The GPU server is equipped with Intel Xeon Gold 6154
CPUs and 256GB of RAM.
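The warm-up plus step schedule above can be written as a small helper; this is only a sketch of the schedule as described, with `total_epochs` and fractional epochs as assumptions of the illustration.

```python
def learning_rate(epoch, total_epochs, base_lr=4e-5):
    """Warm-up then step decay, following the schedule described above."""
    warmup_epochs = 0.015 * total_epochs          # linear warm-up over the first 1.5% of epochs
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    if epoch < 10:
        return base_lr                            # 4e-5
    if epoch < 15:
        return base_lr / 10                       # 4e-6
    return base_lr / 100                          # 4e-7
```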
Text Pre-processing We follow BERT [45] and use the WordPiece [245] tokenizer to tokenize
the texts. For ViLBERT (Reduced) and ViLBERT (Full), we use the uncased tokenizer with
a vocabulary size of 30,522. For UNITER, we use the cased tokenizer with a vocabulary size
of 28,996. After tokenization, the tokens are transformed into 768-dimensional features by a word
embedding initialized from the pre-trained BERT model. The 768-dimensional position features are
included in the input to represent the position of each token.
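For illustration, the same tokenization can be reproduced with the HuggingFace tokenizers; the caption string is a made-up example.

```python
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")  # vocab size 30,522 (ViLBERT)
cased = BertTokenizer.from_pretrained("bert-base-cased")      # vocab size 28,996 (UNITER)

caption = "A man rides a bike down the street."
tokens = uncased.tokenize(caption)                 # WordPiece tokens
token_ids = uncased.convert_tokens_to_ids(tokens)  # integer ids fed to the word embedding
```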
Visual Pre-processing For both ViLBERT and UNITER, we use the image patch features
generated by the bottom-up attention model, as suggested by the original papers [2]. The
image patch features contain up to 100 image patches, each of dimension 2048.
Table A.2: Results on Cross-Modal Retrieval on the COCO dataset, 1K test split (higher is better)

Text-based Image Retrieval
Method R@1 R@5 R@10 RSUM
Models run or implemented by us
ViLBERT 62.3 89.5 95.0 246.8
ViLBERT + DG 65.9 91.4 95.5 252.7
UNITER 60.7 88.0 93.8 242.5
UNITER + DG 62.7 88.8 94.4 245.9
Known results from literature
VSE++ [49] 52.0 84.3 92.0 228.3
SCO [84] 56.7 87.5 94.8 239.0
SCAN [122] 58.8 88.4 94.8 242.0
VSRN [129] 62.8 89.7 95.1 247.6

Image-based Text Retrieval
Method R@1 R@5 R@10 RSUM
Models run or implemented by us
ViLBERT 77.0 94.1 97.2 268.3
ViLBERT + DG 79.0 96.2 98.6 273.8
UNITER 74.4 93.9 97.1 265.4
UNITER + DG 77.7 95.0 97.5 270.2
Known results from literature
VSE++ [49] 64.6 90.0 95.7 250.3
SCO [84] 69.9 92.9 97.5 260.3
SCAN [122] 72.7 94.8 98.4 265.9
VSRN [129] 76.2 94.8 98.2 269.2
Besides this, a positional feature is used to represent the spatial location of the bounding boxes for
both ViLBERT and UNITER. Specifically, ViLBERT uses a 5-dimensional position feature that
encodes the normalized coordinates of the upper-left and lower-right corners of the bounding box,
as well as one additional dimension encoding the normalized patch size. UNITER uses two additional
spatial features that encode the normalized width and height of the object bounding box.
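A sketch of the two position encodings follows; box coordinates are assumed to be absolute (x1, y1, x2, y2) pixel values.

```python
def vilbert_position_feature(box, img_w, img_h):
    """5-d feature: normalized corners plus normalized patch area."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area]

def uniter_position_feature(box, img_w, img_h):
    """Adds the normalized width and height of the box on top of the 5-d feature."""
    x1, y1, x2, y2 = box
    return vilbert_position_feature(box, img_w, img_h) + [(x2 - x1) / img_w, (y2 - y1) / img_h]
```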
A.2 Additional Experimental Results
In this section, we include additional experimental results referred to by the main text. Specifically,
we include results from a variety of models (e.g., ViLBERT, ViLBERT + DG, UNITER, and
UNITER + DG) on the COCO dataset 5K test split [92]. Then we provide a comprehensive ablation
study on the impact of the two weighting coefficients in Eq. 7 of the main text.
Complete Results on COCO Dataset We report the full results on the COCO dataset (1K test
split and 5K test split) in Table A.2 and Table A.3, and contrast them with other existing
approaches on these tasks. ViLBERT + DG and UNITER + DG improve the performance over their
counterparts without DG by a significant margin on both the COCO 1K and 5K test splits; the only
exception is that on image-based text retrieval (5K split), UNITER performs better than UNITER + DG.
These results support our claim that training with DG helps the model learn better visual
and linguistic features. Although ViLBERT and UNITER have different architectures, training
with DG improves the performance consistently.
Table A.3: Results on Cross-Modal Retrieval on the COCO dataset, 5K test split (higher is better)

Text-based Image Retrieval
Method R@1 R@5 R@10 RSUM
Models run or implemented by us
ViLBERT 38.6 68.2 79.0 185.7
ViLBERT + DG 41.8 71.5 81.5 194.8
UNITER 37.8 67.3 78.0 183.1
UNITER + DG 39.1 68.0 78.3 185.4
Known results from literature
VSE++ [49] 30.3 59.4 72.4 162.1
SCO [84] 33.1 62.9 75.5 171.5
SCAN [122] 38.6 69.3 80.4 188.3
VSRN [129] 40.5 70.6 81.1 192.2
UNITER [36]† 48.4 76.7 85.9 211.0

Image-based Text Retrieval
Method R@1 R@5 R@10 RSUM
Models run or implemented by us
ViLBERT 53.5 79.7 87.9 221.1
ViLBERT + DG 57.5 84.0 90.1 232.2
UNITER 52.8 79.7 87.8 220.3
UNITER + DG 51.4 78.7 87.0 217.1
Known results from literature
VSE++ [49] 41.3 71.1 81.2 193.6
SCO [84] 42.8 72.3 83.0 198.1
SCAN [122] 50.4 82.2 90.0 222.6
VSRN [129] 53.0 81.1 89.4 223.5
UNITER [36]† 63.3 87.0 93.1 243.4

†: The UNITER [36] model performs an additional online hard-negative mining (which we did
not) during the training of image-text matching to improve its results, which is computationally
very costly.
Table A.4: Results on Text-based Image Retrieval on Flickr test split (Higher is better)
Method R@1 R@5 R@10 RSUM
Models ran or implemented by us
ViLBERT 59.1 85.7 92.0 236.7
ViLBERT + DG 63.8 87.3 92.2 243.3
UNITER 62.9 87.2 92.7 242.8
UNITER + DG 66.4 88.2 92.2 246.8
Known results from literature
VSE++[49] 39.6 70.1 79.5 189.2
SCO[84] 41.1 70.5 80.1 191.7
SCAN[122] 48.6 77.7 85.2 211.5
VSRN[129] 54.7 81.8 88.2 224.7
ViLBERT[143] 58.2 84.9 91.5 234.6
UNITER[36] 71.5 91.2 95.2 257.9
Complete Results on Flickr Dataset We contrast with other existing approaches in Table A.4 on
the task of text-based image retrieval on the Flickr dataset.
Table A.5: Transferability of the learned representations
SOURCE → TARGET      Flickr → COCO            COCO → Flickr
Model                R@1 R@5 R@10 RSUM        R@1 R@5 R@10 RSUM
ViLBERT              43.5 72.5 83.4 199.4     49.0 76.0 83.9 209.0
ViLBERT + SOURCE DG  44.9 72.7 83.0 200.5     52.8 79.2 86.2 218.2
Transfer Learning Results Table A.5 reports the full set of evaluation metrics for transferring
across datasets. Training with DG noticeably improves over training without DG.
Appendix B
Details and Additional Experiments for Chapter 4
B.1 Additional Implementation Details
B.1.1 Details on simulators
B.1.1.1 Details about GRIDWORLD Configurations
As mentioned in the main text, there are in total 20 environments in this simulator. The
tasks in this simulator involve sequentially picking up two treasures of different colors. The agent
can observe the layout of the environment inside a 3x3 square centered at its current position (see
Figure B.1 (a) for details). The agent can take 5 actions, which include moving in the four directions
and picking up an object directly below it. Note that in each run of a given task, the locations of
both the agent and the treasures are randomized.
In terms of the reward setting, we follow common practice and set the reward for moving
one step to be -0.01 and for touching a wall to be an additional -0.01. Picking up a target treasure
gives 1 unit of reward, and completing a task gives 10 units of reward. Picking up a wrong
target directly ends the episode and gives a reward of -10. During training, we use an optimal
planner with a shortest-path search algorithm as the expert policy. To represent a state for our network,
we follow the practice of DQN [158] and concatenate the last four observations as the input to the
policy.
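The reward specification above can be summarized in a small helper; the event flags are hypothetical names used only for this sketch.

```python
STEP_PENALTY = -0.01
WALL_PENALTY = -0.01          # added on top of the step penalty
TREASURE_REWARD = 1.0
TASK_COMPLETE_REWARD = 10.0
WRONG_TARGET_REWARD = -10.0   # also terminates the episode

def gridworld_reward(hit_wall, picked_up, is_correct_target, task_completed):
    """Return (reward, done) for a single step, following the specification above."""
    reward = STEP_PENALTY + (WALL_PENALTY if hit_wall else 0.0)
    if picked_up:
        if not is_correct_target:
            return reward + WRONG_TARGET_REWARD, True
        reward += TREASURE_REWARD
        if task_completed:
            return reward + TASK_COMPLETE_REWARD, True
    return reward, False
```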
B.1.1.2 Details about THOR Configurations.
THOR [109] is a recently developed 3D robotic simulator for simulating the indoor environments
a robot could encounter. The agent works like a real robot with a first-person view camera,
which delivers RGB images in an egocentric view (see Figure B.1 (b) for details). The environment
has interactable components that an agent can manipulate, which enables the learning of human-like
behaviors such as semantic planning [268] and indoor navigation [269]. We describe the concrete
settings we used as follows.
(a) Agent’s View in GRIDWORLD (b) Egocentric View in THOR
Figure B.1: Demonstrations of agent’s view in two simulators. In the left, we present the agent’s
input state of GRIDWORLD. An agent only have the vision to its surrounding context and the
locations of all treasures (see (a)). Similarly, in the THOR, an agent has access to an egocentric
image that represents the first-person viewpoint (see (b)).
We extract image features using convolutional neural networks to represent an observation
for each egocentric view of the robotic agent. Specifically, we extract the activation output of the
penultimate layer of a ResNet-101 [67] pre-trained on ImageNet [184], which has a dimensionality
of 2048. Similar to the GRIDWORLD experiments, we then concatenate the features of the last
four observations as the input to the policy network. The agent can take 7 actions in THOR: move
ahead, turn left, turn right, look up, look down, open/close an object, and pick up/put down an object.
We set the reward for moving one step to be -0.01 and for executing invalid actions to be -0.01. The
reward for picking up the correct object is 1, and the reward for finishing the task is 10. Picking
up the wrong object or putting the object in the wrong receptacle ends the episode and gives a reward
of -10. The interactable objects, receptacles, and environment indices (kitchens) are
listed in Table B.1. In our experiments we selected environments of similar size (see Table B.1
for the complete list).
Table B.1: Interactable objects, receptacles, and environment indexes in THOR
Entries Values
Objects Container, Lettuce, Mug, Tomato, Plate, Apple, Bowl
Receptacles Fridge, Microwave, Sink
Environments Kitchen {1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 18, 22, 23, 24, 25, 27, 28, 20, 30}
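A minimal sketch of the observation encoding described earlier in this subsection (penultimate ResNet-101 activations stacked over the last four frames) is given below; the image preprocessing and the exact stacking order are assumptions of this illustration.

```python
import torch
import torchvision

# Pre-trained ResNet-101; replacing the classifier head exposes the 2048-d penultimate features.
backbone = torchvision.models.resnet101(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def encode_observation(frames):
    """frames: list of the 4 most recent RGB tensors of shape (3, 224, 224), already normalized."""
    feats = backbone(torch.stack(frames))   # (4, 2048)
    return feats.t()                        # (2048, 4) stacked features fed to the policy network
```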
B.1.2 Imitation Learning Algorithm and Optimization Details
As mentioned in the main text, we now describe the imitation learning algorithm used for learning
SYNPO and all baseline models. The concrete details are presented in Algorithm 2.
Algorithm 2 Policy Imitation Learning Algorithm.
Input: training simulators simulator(z), where z ∈ (E, T)_train
Initialize the expert replay memory D_E with capacity N
for episode = 1, ..., M do
    Sample z ∈ (E, T)_train
    TRAJ_z({s_i, a_i, r_i}; π_E) = ROLLOUT(π_E^z, simulator(z))
    Store TRAJ_z({s_i, a_i, r_i}; π_E) in D_E
    Sample a random mini-batch B with |B| trajectories from D_E
    Compute the gradient ∇L and update the parameters with the specified optimizer
end for
In each episode, we sample a trajectory using the expert policy and store it in the replay buffer,
whose capacity for expert trajectories is 20,000. At the end of each episode, we uniformly sample
64 trajectories from the replay buffer (coming from different (ε, τ) pairs) to compute the total loss.
We set the hyper-parameters as follows: the coefficient of the reward-prediction loss is 0.01, and
the coefficients of the environment and task disentanglement losses are 0.1 and 0.001, respectively.
The dimensionality of both the environment embedding and the task embedding is 128. We use
Adam [102] as the optimizer with the initial learning rate set to 0.001, and we set the weight decay
factor to 0.001 in all our experiments.
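A condensed Python rendering of Algorithm 2 with the hyper-parameters above is sketched below; `rollout`, the `train_pairs` list, and the `policy(states, env_id, task_id)` interface are hypothetical helpers, and the behavior-cloning cross-entropy stands in for the full training loss.

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

def imitation_learning(policy, expert, train_pairs, simulator, optimizer,
                       episodes=200_000, batch_size=64, capacity=20_000):
    replay = deque(maxlen=capacity)                          # expert replay memory D_E
    for _ in range(episodes):
        env_id, task_id = random.choice(train_pairs)         # sample z = (eps, tau) from the training split
        traj = rollout(expert, simulator, env_id, task_id)   # [(state, action), ...] from the expert
        replay.append((env_id, task_id, traj))
        batch = random.sample(replay, min(batch_size, len(replay)))
        loss = 0.0
        for env_id, task_id, traj in batch:
            states = torch.stack([s for s, _ in traj])
            actions = torch.tensor([a for _, a in traj])
            logits = policy(states, env_id, task_id)         # (T, num_actions)
            loss = loss + F.cross_entropy(logits, actions)
        loss = loss / len(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```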
B.1.3 Reinforcement Learning Algorithm and Optimization Details
As mentioned in the main text, we employ reinforcement learning to further fine-tune our
model, which yields improvements in transfer learning performance. We now describe the
detailed setup of these experiments. We use PPO [192] to fine-tune our model, optimized with
RMSProp with a learning rate of 0.000025 and a weight decay of 0.0001. We use GAE [191]
to calculate advantages, with γ = 0.99 and λ = 0.95; the entropy weight is 0.01, the rollout length is
128, and the objective clipping ratio is 0.1. Gradient norms are clipped to 0.5. We divide the collected
trajectories into 4 mini-batches and take four optimization steps on each update. We fine-tuned our
model for 2 × 10^7 steps. During RL fine-tuning we also included our disentanglement objectives as
auxiliary losses.
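For reference, the GAE advantages with γ = 0.99 and λ = 0.95 can be computed as in the sketch below.

```python
def generalized_advantage_estimation(rewards, values, gamma=0.99, lam=0.95):
    """rewards: list of length T; values: list of length T + 1 (last entry is the bootstrap value)."""
    advantages, gae = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```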
B.1.4 Detailed Configuration of Methods
Details about our Policy Network for SYNPO in GRIDWORLD First, we introduce the specific
setup used for the policy network in GRIDWORLD. We directly parameterize the outcome of the
dot product between the shared basis and the action embeddings as a tensor, for the sake of
computational efficiency in practice. However, our model, as mentioned in the main text, is indeed
a bilinear policy. Therefore, in a more general scenario where the action space |A| is large, we can
apply the original form of our approach and learn separate action embeddings with the shared basis.
The coefficient functions that compose the environment and task embeddings are one-hidden-layer
MLPs with 512 hidden units and an output size of 128. The dimension of the state feature φ_s
extracted from the ResNet, before the bilinear weight U, is 128. The state feature extractor is a
customized ResNet; its concrete structure is shown in Table B.2. The dimensionality of the
environment embeddings e_ε and the task embeddings e_τ is 128.
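A minimal sketch of such a composable bilinear policy is given below; the basis size, the exact coefficient parameterization, and the embedding tables are assumptions of this illustration rather than the precise SYNPO configuration.

```python
import torch
import torch.nn as nn

class BilinearPolicySketch(nn.Module):
    """Scores actions with a state-action bilinear form whose weight is composed
    from a shared basis using coefficients predicted from (environment, task) embeddings."""

    def __init__(self, num_envs=20, num_tasks=20, state_dim=128, embed_dim=128,
                 num_actions=5, num_basis=32):
        super().__init__()
        self.env_emb = nn.Embedding(num_envs, embed_dim)     # e_eps
        self.task_emb = nn.Embedding(num_tasks, embed_dim)   # e_tau
        self.action_emb = nn.Parameter(0.01 * torch.randn(num_actions, embed_dim))
        # shared basis of bilinear maps from state features to the action-embedding space
        self.basis = nn.Parameter(0.01 * torch.randn(num_basis, state_dim, embed_dim))
        # coefficient function: one-hidden-layer MLP over the concatenated embeddings
        self.coef = nn.Sequential(nn.Linear(2 * embed_dim, 512), nn.ReLU(),
                                  nn.Linear(512, num_basis))

    def forward(self, phi_s, env_id, task_id):
        pair = torch.cat([self.env_emb(env_id), self.task_emb(task_id)], dim=-1)
        coeffs = self.coef(pair)                                   # (B, num_basis)
        U = torch.einsum('bk,kse->bse', coeffs, self.basis)        # composed bilinear weight U
        return torch.einsum('bs,bse,ae->ba', phi_s, U, self.action_emb)  # action logits
```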
Table B.2: Structure of the state feature function φ_s in GRIDWORLD
group name    output size      block type                    stride
input         16 × 16 × 3      -                             -
conv 1        8 × 8 × 32       [3 × 3, 32; 3 × 3, 32]        2 2
conv 2        4 × 4 × 64       [3 × 3, 64; 3 × 3, 64]        2 2
conv 3        2 × 2 × 128      [3 × 3, 128; 3 × 3, 128]      2 2
conv 4        2 × 2 × 256      [3 × 3, 256; 3 × 3, 256]      2 0
fc            128              [1024 × 128]                  -
Details about our Policy Network for SYNPO in THOR Next, we describe the network setup
used in THOR. Again, we directly parameterize the outcome of the dot product between the shared
basis and the action embeddings as a tensor, since the action space is small (|A| = 7) in this
simulator. With the stacked 2,048 × 4 dimensional ResNet-101 features as input, we learn two 1-D
convolutional layers with a kernel size of 3 and a stride of 2, which first reduce the dimensionality
of the features to 1,024 and then aggregate over the temporal axis. Next, the encoding of the visual
feature is concatenated with an embedding e_obj that represents the object the agent is carrying.
The concatenated feature vector is then fed into a one-hidden-layer MLP with a hidden size of 2,048.
The output of this MLP (which is also the final output of the state feature function φ_s) has a
dimension of 256. The concrete configuration is shown in Table B.3. The dimensionality of the
environment embeddings e_ε and the task embeddings e_τ is 128.
Table B.3: Structure of the state feature function φ_s in THOR
group name     output size    block type       stride
image input    2048 × 4       -                -
conv 1         1024 × 2       [3 × 1, 1024]    2
conv 2         1024 × 1       [3 × 1, 1024]    2
concat         1056           concat e_obj     -
fc1            2048           [1056 × 2048]    -
fc2            256            [2048 × 256]     -
Details about learning the Disentanglement Objective In addition to both of the above settings,
we apply another set of one-hidden-layer MLPs f_ε and f_τ (hidden size 512) as auxiliary functions
that project the high-dimensional trajectory feature x to the embedding spaces of e_ε and e_τ.
Note that these functions are only used in the disentanglement objective and can be discarded
when the policy network is deployed.
B.2 Additional Experimental Results
Complete Details of Main Results and Comparison between Methods As mentioned in the
main text, we present our complete results on GRIDWORLD here. We report not only the average
success rate (AvgSR.) but also the average reward (AvgReward), on both seen and unseen pairs.
Table B.4: Performance of the best model for each method on GRIDWORLD (Seen/Unseen =
144/256). All algorithms are trained using three random seeds and reported with mean and standard
deviation. On each (ε, τ) pair, we sample the locations of the agent and treasures 100 times to
evaluate the performance.
Method            SF           ModuleNet      MLP           MTL           SYNPO
AvgSR. (SEEN)     0.0 ± 0.0%   50.9 ± 33.8%   69.0 ± 2.0%   64.1 ± 1.2%   83.3 ± 0.5%
AvgSR. (UNSEEN)   0.0 ± 0.0%   30.4 ± 20.1%   66.1 ± 2.6%   41.5 ± 1.4%   82.1 ± 1.5%
[Figure B.2 plots: (a) AvgSR. over time on SEEN, (b) AvgSR. over time on UNSEEN, (c) AvgReward
over time on SEEN, (d) AvgReward over time on UNSEEN; each panel shows curves for MLP, MTL,
ModuleNet, SF, and SynPo over 200,000 training iterations.]
Figure B.2: Results on GRIDWORLD. (a)-(b): Comparison of the average success rate (AvgSR.)
of the algorithms on the seen split and the unseen split. (c)-(d): Comparison of the average
accumulated reward (AvgReward) of the algorithms in each episode on the seen split and the unseen
split. Results are reported for the setting with |E| = 20 and |T| = 20. For each intermediate
checkpoint, we sample 100 (ε, τ) combinations and test one configuration each to evaluate the
performance. We evaluate models trained with 3 random seeds and report the mean AvgSR and its
standard deviation.
We found that the trend of the average reward on the seen and unseen splits is quite similar to the
trend of the average success rate. We also note that the reward for successor features (SF) is stable
around -3, which indicates that the agent only tries to avoid negative reward and fails to learn to
obtain positive reward. On the contrary, all methods that make progress later start with a lower
average reward, meaning that the agent tries to complete the task by picking up objects but fails
frequently at the beginning.
Specifically, we find that SYNPO consistently performs better across all metrics, in terms of both
convergence and final performance. On the seen split, MTL and MLP have similar performance,
while MTL has much worse generalization performance on the unseen split than MLP, possibly due
to over-fitting or a lack of capability in recognizing environments. At the same time, it is worth
noting that the Module Network has a significantly larger variance in its performance compared
with all other approaches. This is possibly due to the fact that the environment modules and task
modules are coupled together during inference, where instability can occur. A similar issue has also
been reported by Devin et al. [44].
[Figure B.3 plots: (a) AvgSR. over time on SEEN, (b) AvgSR. over time on UNSEEN, (c) AvgReward
over time on SEEN, (d) AvgReward over time on UNSEEN; each panel shows curves for SynPo,
SynPo w/o EnvDisentg, SynPo w/o TaskDisentg, and SynPo w/o Disentg over 200,000 training
iterations.]
Figure B.3: An ablation study of our learning objectives. We report the results of the ablated
versions without the disentanglement loss (Disentg) on the environment (EnvDisentg) and on the
task (TaskDisentg). (a)-(b): Comparison of the average success rate (AvgSR.) of the variants on
the SEEN split and the UNSEEN split. (c)-(d): Comparison of the average accumulated reward
(AvgReward) in each episode on the SEEN split and the UNSEEN split. Results are reported for the
setting with |E| = 20 and |T| = 20. Similarly, for each intermediate checkpoint, we sample 100
(ε, τ) combinations to evaluate the performance.
In addition, even in its best-performing cases, ModuleNet only achieves performance similar to
MLP and remains far from approaching SYNPO's performance.
Ablation Studies of the Learning Objectives How does each component in the objective
function of our approach affect the performance of our model? Figure B.3 shows that the task
disentanglement loss is crucial for achieving good success rates on both seen and unseen pairs.
This is probably because the differences between tasks are very subtle, making it hard for the agent
to find the right distinct embeddings for them without the explicit task disentanglement loss. In
contrast, the variant without the environment disentanglement loss can still reach a high success
rate, though it converges a bit more slowly.
Detailed Transfer Learning Experiments As mentioned in the main text, here we include the
complete splits for the transfer learning study (these experiments evaluate transfer learning results
with respect to the ratio of the number of seen pairs to the total number of pairs).
[Figure B.4 heatmaps: per-pair average test success rates for the splits (a) 10 train / 90 test,
(b) 20 train / 80 test, (c) 30 train / 70 test, and (d) 40 train / 60 test; rows are tasks (ordered color
pairs such as (R, B), (B, G), ...), columns are environments env_0 through env_9, and the success
rate of our method is marked in each cell.]
Figure B.4: Average test success rate on each environment-task combination. Blue grids represent
seen combinations and red grids represent unseen combinations.
The success rate of our method on each (ε, τ) pair is marked on the matrices; the full success rate
matrices are shown in Figure B.4 and Figure B.5.
Specifically, we present a case study of the situation where this ratio is 0.2. The detailed transfer
learning performance is shown in Figure B.6. Each row corresponds to a task and each column
corresponds to an environment; the red grids represent the unseen pairs and the purple grids
represent the seen pairs. We mark the average success rate (over 100 evaluation runs) in each grid
to quantitatively identify the performance for each (ε, τ) pair. The darker the color of a grid, the
better the corresponding performance.
[Figure B.5 heatmaps: per-pair average test success rates for the splits (a) 50 train / 50 test,
(b) 60 train / 40 test, (c) 70 train / 30 test, and (d) 80 train / 20 test; rows are tasks, columns are
environments env_0 through env_9, and the success rate of our method is marked in each cell.]
Figure B.5: Average test success rate on each environment-task combination. Blue grids represent
seen combinations and red grids represent unseen combinations.
We can see that for the row "(O, R)" and the column "env_0", although only one entry along the
row and column is seen by the model, transfer learning does not fail completely; instead, many
entries along the row and column achieve a high success rate. This supports our claim about the
disentanglement of the environment and task embeddings, and at the same time indicates success
in learning compositionality.
[Figure B.6 heatmap: per-pair average test success rates when 20 of the 100 (ε, τ) combinations
are seen; rows are tasks, columns are environments env_0 through env_9, and the success rate of
our method is marked in each cell.]
Figure B.6: Case study for the situation where the ratio of the number of seen combinations to the
total is 0.2.
Details on Experiments of transfer setting 2 and setting 3 In this section, we describe the details of the transfer learning settings. In both setting 2, "Incremental learning of small pieces and integrating knowledge later", and setting 3, "Learning in giant jumps and connecting dots", we fix all parameters of the policy basis pre-trained on P and fine-tune the network to learn new (randomly initialized) embeddings for environments and tasks. In this stage, we use only one demonstration from each (environment, task) pair to fine-tune the embeddings, and we find that our network is able to generalize to new environments and/or tasks. Concretely, we randomly initialize the 10 new environment embeddings and the 10 new task embeddings for additional learning. In transfer setting 2, we sample only one expert trajectory as demonstration data for each (environment, task) pair in the upper right and lower left quadrants. In transfer setting 3, we sample only one expert trajectory as demonstration data for each (environment, task) pair in the lower right quadrant. Following the same routine as Algorithm 1, we train the embeddings for 10000 iterations and then test the performance of the models on the entire matrix of (environment, task) pairs. The result is shown in Figure B.8. Besides what we have mentioned in the main text, we plot more visually discernible success rate matrices in Figure B.8 (a) and (b). We observe that in both cases, transfer learning across the task axis is easier than across the environment axis.
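To make the procedure concrete, the following is a minimal PyTorch sketch of this embedding-only adaptation. The agent interface (agent, demos.sample()) and the behavior-cloning loss are hypothetical stand-ins; only the pattern of freezing the pre-trained policy basis and optimizing the freshly initialized environment/task embeddings reflects the setup described above.

```python
import torch

def finetune_new_embeddings(agent, demos, num_envs=10, num_tasks=10,
                            emb_dim=128, iters=10000, lr=1e-3):
    # Freeze every pre-trained parameter of the policy basis.
    for p in agent.parameters():
        p.requires_grad_(False)

    # Randomly initialize new environment / task embeddings.
    env_emb = torch.nn.Embedding(num_envs, emb_dim)
    task_emb = torch.nn.Embedding(num_tasks, emb_dim)
    optim = torch.optim.Adam(list(env_emb.parameters()) + list(task_emb.parameters()), lr=lr)

    for _ in range(iters):
        # One expert demonstration per (environment, task) pair.
        env_id, task_id, states, expert_actions = demos.sample()
        logits = agent(states, env_emb(env_id), task_emb(task_id))
        loss = torch.nn.functional.cross_entropy(logits, expert_actions)  # behavior cloning
        optim.zero_grad()
        loss.backward()
        optim.step()
    return env_emb, task_emb
```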
[Figure B.7 learning curves: four panels, (a) AvgSR. on SEEN Split, (b) AvgSR. on UNSEEN Split, (c) AvgReward on SEEN Split, (d) AvgReward on UNSEEN Split; x-axis: iteration (0 to 200000), y-axis: average success rate / average reward; curves for MLP, MTL, and SynPo.]
Figure B.7: Results of "A blind agent scenario" on GRIDWORLD with window size of 0. (a)-(b): Comparison of the average success rate (AvgSR.) of the algorithms on the seen split and the unseen split. (c)-(d): Comparison of the average accumulated reward (AvgReward.) of the algorithms in each episode on the seen split and the unseen split. Results are reported in the setting with |E| = 20 and |T| = 20. For each intermediate performance, we sample 100 (environment, task) combinations and test one configuration each to evaluate the performance. We evaluate models trained with 3 random seeds and report results in terms of the mean AvgSR. and its standard deviation.
An extreme study of the effectiveness of environment embeddings. As mentioned in the main text, to study the effectiveness of the environment embedding, we run an additional experiment as a sanity check. In this setting, we make the agent's observation window size 1, so that the agent is only capable of seeing itself and the locations of the treasures on the map, without any knowledge about the maze. We denote this agent as a "blind" agent. Therefore, such an agent would need to remember the structure of the maze to perform well under this circumstance. We follow our original imitation training and evaluation processes, test three representative methods in this setting, and report the results in Table B.5. As we expected, we observe that algorithms such as MTL, which do not distinguish between environments, fail severely. They can still succeed in some cases, e.g., when the treasures are generated in the same room as the agent or very close by. With the additional environment embedding, a simple algorithm such as MLP can significantly outperform this degenerate multi-task model. In addition, SYNPO achieves performance almost as good as in the normal circumstance, demonstrating its strong capability in memorizing the environments.
[Figure B.8 heatmaps: panel (a) Transfer Setting 2 and panel (b) Transfer Setting 3; grids of per-combination success rates over environments Env_0–Env_19 and 20 task color pairs; individual cell values omitted.]
Figure B.8: Visualizing the effectiveness of transferring. Average success rates are marked in the grid (more visually discernible plots are in the Suppl. Materials). The purple cells are from the Q set and the red cells represent the rest. The darker the color is, the better the corresponding performance.
Table B.5: Performance of SynPo, MTL and MLP on GRIDWORLD (SEEN/UNSEEN = 144/256) with window size = 0. All algorithms are trained using three random seeds and reported with mean and std.
Method MLP MTL SYNPO
AvgSR. (SEEN) 56.8 ± 0.9% 16.4 ± 0.4% 80.9 ± 1.5%
AvgSR. (UNSEEN) 51.8 ± 1.7% 6.1 ± 0.2% 76.8 ± 1.4%
Appendix C
Details and Additional Experiments for Chapter 5
C.1 Details on BABY-STEP Identification and Trajectory Alignments
In this section, we describe the details of how BABY-STEPs are identified in the annotated natural
language instructions and how expert trajectory data are segmented to align with BABY-STEP
instructions.
C.1.1 Identify BABY-STEPs
We identify the navigable BABY-STEPs from the natural language instructions of R2R, R4R, R6R
and R8R, based on the following 6 steps:
1. Split sentence and chunk phrases. We split the instructions by periods. For each sentence,
we perform POS tagging using the SpaCy [77] package to locate and chunk all plausible noun
phrases and verb phrases.
2. Curate noun phrases. We curate noun phrases by removing the stop words (e.g., the, for, from, etc.) and isolated punctuation among them and lemmatizing each word. The purpose is to collect a concentrated set of semantic noun phrases that contain potential visual objects.
3. Identify “landmark words”. Next, given the set of candidate visual object words, we filter out
a blacklist of words that either do not correspond to any visual counterpart or are mis-classified
by the SpaCy package. The word blacklist includes:
end, 18 inch, head, inside, forward, position, ground, home, face,
walk, feet, way, walking, bit, veer, ’ve, next, stop, towards,
right, direction, thing, facing, side, turn, middle, one, out,
piece, left, destination, straight, enter, wait, don’t, stand,
back, round
We use the remaining noun phrases as the “landmark words” of the sentences. Note that this
step identifies the “landmark words” for the later procedure which aligns BABY-STEPs and
expert trajectories.
4. Identify verb phrases. Similarly, we use a verb blacklist to filter out verbs that require no
navigational actions of the agent. The blacklist includes: make, turn, face, facing,
veer.
5. Merge non-actionable sentences. We merge sentences that contain neither landmarks nor verbs into the next sentence, as they are likely not actionable.
6. Merge stop sentences. There are sentences that only describe the stop condition of a navigation action, which include verb-noun compositions indicating the stop condition. We detect sentences starting with wait, stop, there, remain, you will see as the sentences that only describe the stop condition and merge them into the previous sentence. Similarly, we detect sentences starting with with, facing and merge them into the next sentence.
After applying the above 6 heuristic rules to the language instruction, we obtain chunks of sentences that describe the navigable BABY-STEPs of the whole task (i.e., a sequence of navigational sub-goals).
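For concreteness, the following is a minimal Python sketch of how these heuristics could be chained with the SpaCy package. The blacklists are abbreviated, the rule for sentences starting with "with"/"facing" is omitted, and the chunking details differ from our actual implementation; this is an illustration rather than the exact code.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Abbreviated blacklists for illustration; the full lists are given above.
NOUN_BLACKLIST = {"end", "head", "forward", "position", "way", "turn", "left", "right"}
VERB_BLACKLIST = {"make", "turn", "face", "facing", "veer"}
STOP_PREFIXES = ("wait", "stop", "there", "remain", "you will see")

def identify_baby_steps(instruction):
    sentences = [s.strip() for s in instruction.split(".") if s.strip()]
    chunks, pending = [], ""
    for sent in sentences:
        doc = nlp(sent)
        # Landmark words: curated noun-phrase heads that survive the blacklist.
        landmarks = [c.root.lemma_ for c in doc.noun_chunks
                     if not c.root.is_stop and c.root.lemma_ not in NOUN_BLACKLIST]
        verbs = [t.lemma_ for t in doc
                 if t.pos_ == "VERB" and t.lemma_ not in VERB_BLACKLIST]
        if not landmarks and not verbs:
            pending += sent + ". "            # rule 5: merge into the next sentence
        elif sent.lower().startswith(STOP_PREFIXES) and chunks:
            chunks[-1] += " " + sent + "."    # rule 6: merge into the previous sentence
        else:
            chunks.append(pending + sent + ".")
            pending = ""
    if pending and chunks:                     # trailing non-actionable text
        chunks[-1] += " " + pending.strip()
    return chunks
```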
C.1.2 Align Expert Trajectories with identified BABY-STEPs
In the previous section, we describe the algorithm for identifying BABY-STEP instructions from the
original natural language instructions of the dataset. Now we are going to describe the procedure of
aligning BABY-STEPs with the expert trajectories, which segments the expert trajectories according
to the BABY-STEPs to create the training data for the learning pipeline of our BABYWALK agent.
Note that during the training, our BABYWALK does not rely on the existence of ground-truth
alignments between the (micro)instructions and BABY-STEP trajectories.
Main Idea The main idea here is to: 1) perform visual landmark classification to produce
confidence scores of landmarks for each visual state s along expert trajectories; 2) use the
predicted landmark scores and the “landmark words” in BABY-STEPs to guide the alignment
between the expert trajectory and BABY-STEPs. To achieve this, we train a visual landmark
classifier with weak supervision — trajectory-wise existence of landmark objects. Next, based on
the predicted landmark confidence scores, we use dynamic programming (DP) to chunk the expert
trajectory into segments and assign the segments to the BABY-STEPs.
Weakly Supervised Learning of the Landmark Classifier Given the pairs of aligned instructions and trajectories (X, Y) from the original dataset, we train a landmark classifier to detect landmarks mentioned in the instructions. We formulate it as a multi-label classification problem that asks a classifier f_LDMK(s_t; O) to predict all the landmarks O_X of the instruction X given the corresponding trajectory Y. Here, we denote the set of all possible landmarks from the entire dataset as O, and the landmarks of a specific instruction X as O_X. Concretely, we first train a convolutional neural network (CNN) based on the visual state features s_t to independently predict the existence of landmarks at every time step, then we aggregate the predictions across all time steps to get the trajectory-wise logits via max-pooling over all states of the trajectory:

    max{ f_LDMK(s_t; O) | t = 1, ..., |Y| }.

Here f_LDMK denotes the independent state-wise landmark classifier, and the max-pooled output is the logits before normalization for computing the landmark probabilities. For the specific details of f_LDMK, we input the 6 × 6 panorama visual feature (i.e., the ResNet-152 feature) into a two-layer CNN (with kernel size 3, hidden dimension 128 and ReLU as the non-linearity) to produce feature activations with spatial extents, followed by a global averaging operator over the spatial dimensions and a multi-layer perceptron (2 layers with hidden dimension 512 and ReLU as the non-linearity) that outputs the state-wise logits for all visual landmarks O. We then max-pool all the state-wise logits along the trajectory and compute the loss using a trajectory-wise binary cross-entropy between the ground-truth landmark labels (of existence) and the predictions.
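The PyTorch sketch below illustrates this weak-supervision scheme (state-wise classifier, max-pooling over time, trajectory-wise binary cross-entropy). The feature dimensions and the number of landmark classes are assumptions chosen for illustration, not the exact values of our implementation.

```python
import torch
import torch.nn as nn

class LandmarkClassifier(nn.Module):
    """A sketch of the state-wise landmark classifier f_LDMK."""
    def __init__(self, in_channels=2048, num_landmarks=200, hidden=128, mlp_hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(                      # two-layer CNN over the 6x6 panorama grid
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.mlp = nn.Sequential(                      # 2-layer MLP producing per-state logits
            nn.Linear(hidden, mlp_hidden), nn.ReLU(),
            nn.Linear(mlp_hidden, num_landmarks),
        )

    def forward(self, states):                         # states: (T, C, 6, 6) for one trajectory
        h = self.cnn(states).mean(dim=(2, 3))          # global average pooling over space
        return self.mlp(h)                             # (T, num_landmarks) state-wise logits

def trajectory_loss(model, states, landmark_targets):
    # Weak supervision: only trajectory-wise existence labels are available.
    state_logits = model(states)                       # (T, num_landmarks)
    traj_logits = state_logits.max(dim=0).values       # max-pool over time steps
    return nn.functional.binary_cross_entropy_with_logits(traj_logits, landmark_targets)
```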
Aligning BABY-STEPs and Trajectories with Visual Landmarks Now, suppose we have a sequence of BABY-STEP instructions X = {x_m; m = 1, ..., M} and its expert trajectory Y = {s_t; t = 1, ..., |Y|}. We can compute the averaged landmark score of the landmarks O_{x_m} that exist in this sub-task instruction x_m on a single state s_t:

    Ψ(t, m) = (1 / |O_{x_m}|) · 1[o_m ∈ O]^T f_LDMK(s_t; O).

Here 1[o_m ∈ O] represents the one-hot encoding of the landmarks that exist in the BABY-STEP x_m, and |O_{x_m}| is the total number of such landmarks. We then apply dynamic programming (DP) to solve the trajectory segmentation specified by the following Bellman equation (in a recursive form):

    Φ(t, m) = Ψ(t, m),                                         if t = 1,
    Φ(t, m) = Ψ(t, m) + max_{i ∈ {1, ..., t−1}} Φ(i, m−1),     otherwise.
Figure C.1: Our network architecture at the m-th BABY-STEP sub-task. The red line represents the procedure of encoding the context variable z_m via summarizing the BABY-STEP trajectory f_SUMMARY(v(ŷ_1), ..., v(ŷ_{m−1})) and the corresponding (micro)instruction f_SUMMARY(u(x_1), ..., u(x_{m−1})) in the memory buffer. The blue line represents the procedure of encoding the (micro)instruction u(x_m) of the current BABY-STEP. The purple line represents the detailed decision making process of our BABYWALK policy (A_{s_t} is denoted as the set of navigable directions at s_t as defined by Fried et al. [55]).
Here, Φ(t, m) represents the maximum potential of choosing the state s_t as the end point of the BABY-STEP instruction x_m. Solving this DP leads to a set of correspondingly segmented trajectories Y = {y_m; m = 1, ..., M}, with y_m being the m-th BABY-STEP sub-trajectory.
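A minimal Python sketch of this DP is given below, assuming a pre-computed score matrix psi[t, m] of averaged landmark scores (0-indexed). The handling of the base case and the back-tracing are illustrative choices rather than the exact implementation.

```python
import numpy as np

def segment_trajectory(psi):
    """psi[t, m]: averaged landmark score of state t for BABY-STEP m.
    Returns the end-state index of each BABY-STEP."""
    T, M = psi.shape
    phi = np.full((T, M), -np.inf)
    back = np.zeros((T, M), dtype=int)
    phi[:, 0] = psi[:, 0]                       # first BABY-STEP: potential is its own score
    for m in range(1, M):
        for t in range(1, T):
            i = int(np.argmax(phi[:t, m - 1]))  # best end point of the previous BABY-STEP
            phi[t, m] = psi[t, m] + phi[i, m - 1]
            back[t, m] = i
    # Trace back the segment boundaries from the best final end point.
    ends = [int(np.argmax(phi[:, M - 1]))]
    for m in range(M - 1, 0, -1):
        ends.append(back[ends[-1], m])
    return ends[::-1]
```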
C.2 Additional Implementation Details
C.2.1 Navigation Agent Configurations
Figure C.1 gives an overview of the unrolled version of our full navigation agent.
Panoramic State-Action Space [55] We set up the states s_t as the stacked visual features of agent-centric panoramic views in 12 headings × 3 elevations with 30 degree intervals. The visual feature of each view is a concatenation of the ResNet-152 feature vector of size 2048 and the orientation feature vector of size 128 (the 4-dimensional orientation feature [sin and cos of the heading angle; sin and cos of the elevation angle] is tiled 32 times). We use a similar single-view visual feature of size 2176 as our action embedding.
Encoders The instruction encoder u(·) is a single-directional LSTM with hidden size 512 and a word embedding layer of size 300 (initialized with GloVe embeddings [168]). We use the same encoder for encoding the past experienced instructions and the currently executing instruction. The trajectory encoder v(·) contains two separate bidirectional LSTMs (Bi-LSTM), both with hidden size 512. The first Bi-LSTM encodes a_{t_i} and outputs a hidden state for each time step t_i. Then we attend the hidden state to the panoramic view s_{t_i} to get a state feature of size 2176 for each time step. The second Bi-LSTM encodes the state features. We use the trajectory encoder only for encoding the past experienced trajectories.
BABYWALK Policy The BABYWALK policy network consists of one LSTM with two attention layers and an action predictor. First we attend the hidden state to the panoramic view s_t to get a state feature of size 2176. The state feature is concatenated with the previous action embedding as a variable to update the hidden state using an LSTM with hidden size 512. The updated hidden state is then attended to the context variables (the output of u(·)). For the action predictor module, we concatenate the output of the text attention layer with the summarized past context ẑ_m to get an action prediction variable. We then pass the action prediction variable through a 2-layer MLP and take a dot product with the navigable action embeddings to obtain the probability of the next action.
Model Inference During inference, the BABYWALK policy only requires running the heuristic BABY-STEP identification on the test-time instruction. No oracle BABY-STEP trajectories are needed at this time, as the BABYWALK agent rolls out each BABY-STEP by itself.
C.2.2 Details of Reward Shaping for RL
As mentioned in the main text, we learn the policy via optimizing the fidelity-oriented reward [88]. Now we give the complete details of this reward function. Suppose the total number of roll-out steps is T = Σ_{i=1}^{M} |ŷ_i|; we would have the following form of the reward function:

    r(s_t, a_t) = 0,                         if t < T,
    r(s_t, a_t) = SR(Y, Ŷ) + CLS(Y, Ŷ),      if t = T.

Here, Ŷ = ŷ_1 ⊕ ... ⊕ ŷ_M represents the concatenation of the BABY-STEP trajectories produced by the navigation agent (and we denote ⊕ as the concatenation operation).
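A minimal sketch of this sparse reward, assuming SR and CLS are available as functions that score an agent path against the reference path:

```python
def fidelity_reward(t, total_steps, agent_path, reference_path, success_fn, cls_fn):
    """Sparse fidelity-oriented reward: zero everywhere except the final step,
    where it is the sum of the success indicator and the CLS path-fidelity score.
    `success_fn` and `cls_fn` are assumed to implement SR and CLS for a pair of paths."""
    if t < total_steps - 1:       # 0-indexed steps; the text above uses 1-indexed t and T
        return 0.0
    return success_fn(reference_path, agent_path) + cls_fn(reference_path, agent_path)
```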
C.2.3 Optimization Hyper-parameters
For each BABY-STEP task, we set the maximal number of steps to be 10, and truncate the
corresponding BABY-STEP instruction length to be 100. During both the imitation learning and
the curriculum reinforcement learning procedures, we fix the learning rate to be 1e-4. In the
imitation learning, the mini-batch size is set to be 100. In the curriculum learning, we reduce
the mini-batch size as curriculum increases to save memory consumption. For the 1st, 2nd, 3rd
and 4th curriculum, the mini-batch size is set to be 50, 32, 20, and 20 respectively. During the
learning, we pre-train our BABYWALK model for 50000 iterations using the imitation learning
as a warm-up stage. Next, in each lecture (up to 4) of the reinforcement learning (RL), we train
the BABYWALK agent for an additional 10000 iterations, and select the best performing model
in terms of SDTW to resume the next lecture. For executing each instruction during the RL, we
sample 8 navigation episodes before performing any back-propagation. For each learning stage,
we use separate Adam optimizers to optimize all the parameters. Meanwhile, we use L2 weight decay as the regularizer with its coefficient set to 0.0005. In the reinforcement learning, the discount factor γ is set to 0.95.
C.3 Additional Experimental Results
In this section, we describe a comprehensive set of evaluation metrics and then show transfer
results of models trained on each dataset, with all metrics. We provide additional analysis studying
the effectiveness of template based BABY-STEP identification.
Complete set of Evaluation Metrics. We adopt the following set of metrics:
• Path Length (PL) is the length of the agent’s navigation path.
• Navigation Error (NE) measures the distance between the goal location and final location of the
agent’s path.
• Success Rate (SR) measures the average rate of the agent stopping within a specified distance near the goal location [3].
• Success weighted by Path Length (SPL) [3] measures the success rate weighted by the inverse trajectory length, to penalize very long successful trajectories.
• Coverage weighted by Length Score (CLS) [88] measures the fidelity of the agent's path to the reference, weighted by the length score.
• The newly proposed Normalized Dynamic Time Warping (NDTW) measures, in more fine-grained detail, the spatio-temporal similarity of the paths by the agent and the human expert [148].
• Success rate weighted normalized Dynamic Time Warping (SDTW) further measures the spatio-temporal similarity of the paths weighted by the success rate [148].
CLS, NDTW and SDTW explicitly measure the agent's ability to follow instructions; in particular, it was shown that SDTW corresponds to human preferences the most.
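As a concrete reference, below is a minimal sketch of nDTW and SDTW, assuming the standard DTW recursion and the exponential normalization of [148] with a success-threshold distance d_th (3 meters in R2R-style environments); `dist` is a user-supplied distance function.

```python
import math

def ndtw(agent_path, ref_path, dist, d_th=3.0):
    """Normalized Dynamic Time Warping between an agent path and a reference path."""
    n, m = len(agent_path), len(ref_path)
    dtw = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(agent_path[i - 1], ref_path[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    # Normalize the accumulated cost by the reference length and the threshold.
    return math.exp(-dtw[n][m] / (m * d_th))

def sdtw(agent_path, ref_path, dist, success, d_th=3.0):
    # SDTW weights nDTW by the binary success indicator of the episode.
    return float(success) * ndtw(agent_path, ref_path, dist, d_th)
```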
Data Splits R2R Validation Unseen
Perf. Measures PL NE↓ SR↑ SPL
Reported Results
SEQ2SEQ [55] - 7.07 31.2 -
SF+ [55] - 6.62 35.5 -
RCM+ [236] 14.84 5.88 42.5 -
REGRETFUL+* [146] - 5.32 50.0 41.0
FAST+* [97] 21.17 4.97 56.0 43.0
Re-implemented Version
SEQ2SEQ 15.76 6.71 33.6 25.5
SF+ 15.55 6.52 35.8 27.6
RCM+ 11.15 6.18 42.4 38.6
REGRETFUL+* 13.74 5.38 48.7 39.7
FAST+* 20.45 4.97 56.6 43.7
Table C.1: Sanity check of model trained on R2R and evaluated on its validation unseen split (+: pre-trained with data augmentation; *: re-implemented or re-adapted from the original authors' released code).
Sanity Check between Prior Methods and Our Re-implementation As mentioned in the main text, we compare our re-implementations with the originally reported results of the baseline methods on the R2R dataset in Table C.1. We find that the results are mostly very similar, indicating that our re-implementations are reliable.
Complete Curriculum Learning Results We present the curriculum learning results with all
evaluation metrics in Table C.2.
Results of BABY-STEP Identification We present an additional analysis comparing different
BABY-STEP identification methods. We compare our template-based BABY-STEP identification with a simple method that treats each sentence as a BABY-STEP (referred to as sentence-wise), both using the complete BABYWALK model with the same training routine. The results are shown in Table C.3. Generally speaking, the template-based BABY-STEP identification provides better performance.
Datasets Metrics IL IL+RL | IL + CRL w/ LECTURE #: 1st 2nd 3rd 4th
R2R
PL 22.4 12.0 11.6 13.2 10.6 9.6
NE# 6.8 7.1 6.8 6.8 6.7 6.6
SR" 28.1 29.8 29.9 33.2 32.2 34.1
SPL" 15.7 24.3 24.9 26.6 27.5 30.2
CLS" 28.9 46.2 46.6 47.2 48.1 50.4
NDTW" 30.6 43.8 42.5 41.0 47.7 50.0
SDTW" 16.5 23.2 23.1 24.3 25.7 27.8
R4R
PL 43.4 22.8 23.9 25.5 21.4 19.0
NE# 8.4 8.6 8.5 8.4 8.0 8.2
SR" 24.7 25.0 24.1 26.7 27.9 27.3
SPL" 8.2 11.2 11.0 12.3 13.7 14.7
CLS" 27.9 45.5 44.8 45.9 47.4 49.4
NDTW" 24.3 34.4 32.8 33.7 38.4 39.6
SDTW" 11.1 13.6 13.5 15.2 17.0 17.3
R6R
PL 68.8 35.3 37.0 40.6 33.2 28.7
NE# 9.4 9.5 9.4 9.4 8.9 9.2
SR" 22.7 23.7 21.9 23.4 24.7 25.5
SPL" 4.2 7.2 6.4 6.8 8.1 9.2
CLS" 24.4 43.0 41.8 42.3 44.2 47.2
NDTW" 17.8 28.1 26.0 26.9 30.9 32.7
SDTW" 7.7 10.8 9.7 11.0 12.7 13.6
R8R
PL 93.1 47.5 50.0 55.3 45.2 39.9
NE# 10.0 10.2 10.2 10.1 9.3 10.1
SR" 21.9 21.4 20.4 22.1 23.1 23.1
SPL" 4.3 6.1 5.5 6.1 6.8 7.4
CLS" 24.1 42.1 41.0 41.5 43.9 46.0
NDTW" 15.5 24.6 22.9 23.8 27.7 28.2
SDTW" 6.4 8.3 7.9 9.2 10.5 11.1
Average
PL 51.8 26.8 27.9 30.6 25.1 22.1
NE# 8.5 8.7 8.5 8.5 8.1 8.3
SR" 24.7 25.5 24.6 27.0 27.5 28.1
SPL" 8.6 13.1 12.9 13.9 15.1 16.5
CLS" 26.6 44.5 43.9 44.6 46.2 48.6
NDTW" 23.0 33.9 32.2 32.4 37.4 39.0
SDTW" 11.0 14.8 14.4 15.7 17.3 18.4
Table C.2: Ablation on BABYWALK after each learning stage (trained on R4R).
Datasets Metrics Sentence-wise Template based
R2R
PL 10.3 9.6
NE# 6.8 6.6
SR" 28.7 34.1
SPL" 24.9 30.2
CLS" 48.3 50.4
NDTW" 43.6 50.0
SDTW" 22.4 27.8
R4R
PL 20.9 19.0
NE# 8.2 8.2
SR" 26.3 27.3
SPL" 12.7 14.7
CLS" 46.4 49.4
NDTW" 35.5 39.6
SDTW" 15.9 17.3
R6R
PL 32.1 28.7
NE# 9.0 9.2
SR" 22.5 25.5
SPL" 7.5 9.2
CLS" 44.2 47.2
NDTW" 29.3 32.7
SDTW" 11.1 13.6
R8R
PL 42.9 39.9
NE# 9.8 10.1
SR" 21.2 23.1
SPL" 6.3 7.4
CLS" 43.2 46.0
NDTW" 25.5 28.2
SDTW" 9.3 11.1
Average
PL 24.2 22.1
NE# 8.3 8.3
SR" 25.2 28.1
SPL" 13.8 16.5
CLS" 45.9 48.6
NDTW" 34.6 39.0
SDTW" 15.4 18.4
Table C.3: BABYWALK Agent performances between different segmentation rules (trained on
R4R). Refer to text for more details.
In-domain Results of Models Trained on Instructions with Different lengths As mentioned
in the main text, we display all the in-domain results of navigation agents trained on R2R, R4R,
Datasets Metrics SEQ2SEQ SF+ RCM(GOAL)+ RCM(FIDELITY)+ BABYWALK BABYWALK+
R2R → R2R
PL 15.8 15.6 11.1 10.2 10.7 10.2
NE# 6.7 6.5 6.2 6.2 6.2 5.9
SR" 33.6 35.8 42.4 42.1 42.6 43.8
SPL" 25.5 27.6 38.6 38.6 38.3 39.6
CLS" 38.5 39.8 52.7 52.6 52.9 54.4
NDTW" 39.2 41.0 51.0 50.8 53.4 55.3
SDTW" 24.9 27.2 33.5 34.4 35.7 36.9
R4R → R4R
PL 28.5 26.1 12.3 26.4 23.8 19.0
NE# 8.5 8.3 7.9 8.4 7.9 8.2
SR" 25.7 24.9 28.7 24.7 29.6 27.3
SPL" 14.1 16.0 22.1 11.6 14.0 14.7
CLS" 20.7 23.6 36.3 39.2 47.8 49.4
NDTW" 20.6 22.7 31.3 31.3 38.1 39.6
SDTW" 9.0 9.2 13.2 13.7 18.1 17.3
R6R → R6R
PL 34.1 43.4 11.8 28.0 28.4 27.2
NE# 9.5 9.6 9.2 9.4 9.4 9.3
SR" 18.1 17.8 18.2 20.5 21.7 22.0
SPL" 9.6 7.9 14.8 7.4 7.8 8.1
CLS" 23.4 20.3 31.6 39.0 47.1 47.4
NDTW" 19.3 17.8 25.9 25.8 32.6 33.4
SDTW" 6.5 5.9 7.6 9.5 11.5 11.8
R8R → R8R
PL 40.0 53.0 12.4 42.3 35.6 39.1
NE# 9.9 10.1 10.2 10.7 9.6 9.9
SR" 20.2 18.6 19.7 18.2 22.3 22.0
SPL" 12.4 9.8 15.4 5.3 7.3 7.0
CLS" 19.8 16.3 25.7 37.2 46.4 46.4
NDTW" 15.8 13.5 19.4 21.6 29.6 28.3
SDTW" 5.1 4.4 5.8 7.6 10.4 10.1
Table C.4: In-domain results. Each model is trained on the training set of the R2R, R4R, R6R and R8R datasets, and evaluated on the corresponding unseen validation set (+: pre-trained with data augmentation).
R6R, and R8R, respectively. The complete results on all the different metrics are included in Table C.4. We note that our BABYWALK agent consistently outperforms the baseline methods on each dataset. It is worth noting that on the R4R, R6R and R8R datasets, RCM(GOAL)+ achieves better results in SPL. This is due to the aforementioned fact that it often takes short-cuts to directly reach the goal, with a significantly shorter trajectory. As a consequence, the success rate weighted by the inverse path length is high.
Datasets Metrics SEQ2SEQ SF+ RCM(GOAL)+ RCM(FIDELITY)+ REGRETFUL+* FAST+* BABYWALK BABYWALK+
R2R → R4R
PL 28.6 28.9 13.2 14.1 15.5 29.7 19.5 17.9
NE# 9.1 9.0 9.2 9.3 8.4 9.1 8.9 8.9
SR" 18.3 16.7 14.7 15.2 19.2 13.3 22.5 21.4
SPL" 7.9 7.4 8.9 8.9 10.1 7.7 12.6 11.9
CLS" 29.8 30.0 42.5 41.2 46.4 41.8 50.3 51.0
NDTW" 25.1 25.3 33.3 32.4 31.6 33.5 38.9 40.3
SDTW" 7.1 6.7 7.3 7.2 9.8 7.2 14.5 13.8
R2R → R6R
PL 39.4 41.4 14.2 15.7 15.9 32.0 29.1 25.9
NE# 9.6 9.8 9.7 9.8 8.8 9.0 10.1 9.8
SR" 20.7 17.9 22.4 22.7 24.2 26.0 21.4 21.7
SPL" 11.0 9.1 17.7 18.3 16.6 16.5 7.9 8.8
CLS" 25.9 26.2 37.1 36.4 40.9 37.7 48.4 49.0
NDTW" 20.5 20.8 26.6 26.1 16.2 21.9 30.8 32.6
SDTW" 7.7 7.2 8.2 8.4 6.8 8.5 11.2 11.2
R2R → R8R
PL 52.3 52.2 15.3 16.9 16.6 34.9 38.3 34.0
NE# 10.5 10.5 11.0 11.1 10.0 10.6 11.1 10.5
SR" 16.9 13.8 12.4 12.6 16.3 11.1 19.6 20.7
SPL" 6.1 5.6 7.4 7.5 7.7 6.2 6.9 7.8
CLS" 22.5 24.1 32.4 30.9 35.3 33.7 48.1 48.7
NDTW" 17.1 18.2 23.9 23.3 8.1 14.5 26.7 29.1
SDTW" 4.1 3.8 4.3 4.3 2.4 2.4 9.4 9.8
Average
PL 40.1 40.8 14.2 15.6 16.0 32.2 29.0 25.9
NE# 9.7 9.8 10.0 10.1 9.1 9.6 10.0 9.7
SR" 18.6 16.1 16.5 16.8 19.9 16.8 21.2 21.3
SPL" 8.3 7.4 11.3 11.6 11.5 10.1 9.1 9.5
CLS" 26.1 26.8 37.3 36.2 40.9 37.7 48.9 49.6
NDTW" 20.9 21.4 27.9 27.3 18.6 23.3 32.1 34.0
SDTW" 6.3 5.9 6.6 6.6 6.3 6.0 11.7 11.6
Datasets Metrics SEQ2SEQ SF+ RCM(GOAL)+ RCM(FIDELITY)+ REGRETFUL+* FAST+* BABYWALK BABYWALK+
R4R → R2R
PL 16.2 17.4 10.2 17.7 20.0 26.5 12.1 9.6
NE# 7.8 7.3 7.1 6.7 7.5 7.2 6.6 6.6
SR" 16.3 22.5 25.9 29.1 22.8 25.1 35.2 34.1
SPL" 9.9 14.1 22.5 18.2 14.0 16.3 28.3 30.2
CLS" 27.1 29.5 44.2 34.3 32.6 33.9 48.5 50.4
NDTW" 29.3 31.8 41.1 33.5 28.5 27.9 46.5 50.0
SDTW" 10.6 14.8 20.2 18.3 13.4 14.2 27.2 27.8
R4R → R6R
PL 40.8 38.5 12.8 33.0 19.9 26.6 37.0 28.7
NE# 9.9 9.5 9.2 9.3 9.5 8.9 8.8 9.2
SR" 14.4 15.5 19.3 20.5 18.0 22.1 26.4 25.5
SPL" 6.8 8.4 15.2 8.5 10.6 13.7 8.1 9.2
CLS" 17.7 20.4 31.8 38.3 31.7 31.5 44.9 47.2
NDTW" 16.4 18.3 23.5 23.7 23.5 23.0 30.1 32.7
SDTW" 4.6 5.2 7.3 7.9 7.5 7.7 13.1 13.6
R4R → R8R
PL 56.4 50.8 13.9 38.7 20.7 28.2 50.0 39.9
NE# 10.1 9.5 9.5 9.9 9.5 9.1 9.3 10.1
SR" 20.7 21.6 22.8 20.9 18.7 27.7 26.3 23.1
SPL" 10.4 11.8 16.9 9.0 9.2 13.7 7.2 7.4
CLS" 15.0 17.2 27.6 34.6 29.3 29.6 44.7 46.0
NDTW" 13.4 15.1 19.5 21.7 19.0 17.7 27.1 28.2
SDTW" 4.7 5.0 5.1 6.1 5.6 6.9 11.5 11.1
Average
PL 37.8 35.6 12.3 29.8 20.2 27.1 33.0 26.1
NE# 9.3 8.8 8.6 8.6 8.8 8.4 8.2 8.6
SR" 17.1 19.9 22.7 23.5 19.8 25.0 29.3 27.6
SPL" 9.0 11.4 18.2 11.9 11.3 14.6 14.5 15.6
CLS" 19.9 22.4 34.5 35.7 31.2 31.7 46.0 47.9
NDTW" 19.7 21.7 28.0 26.3 23.7 22.9 34.6 37.0
SDTW" 6.6 8.3 10.9 10.8 8.8 9.6 17.3 17.5
(a) R2R trained model (b) R4R trained model
Table C.5: Transfer results of the R2R and R4R trained models evaluated on their complementary unseen validation datasets (+: pre-trained with data augmentation; *: re-implemented or re-adapted from the original authors' released code).
Transfer Results of Models Trained on Instructions with Different Lengths For completeness, we also include all the transfer results of navigation agents trained on R2R, R4R, R6R, and R8R, respectively. The complete results on all the different metrics are included in Table C.5 and Table C.6. According to these tables, we note that the models trained on R8R achieve the best overall transfer learning performance. This could be because the R8R trained model only needs to deal with interpolating to shorter instructions, rather than extrapolating to longer ones, which is intuitively an easier direction.
Datasets Metrics SEQ2SEQ SF+ RCM(GOAL)+ RCM(FIDELITY)+ BABYWALK BABYWALK+
R6R → R2R
PL 14.5 19.4 8.1 15.5 9.4 9.2
NE# 7.7 7.1 7.6 7.5 6.8 6.8
SR" 19.3 21.9 19.6 22.6 31.3 30.6
SPL" 13.3 11.6 17.2 14.1 28.3 27.8
CLS" 32.1 26.2 43.2 34.3 49.9 50.0
NDTW" 31.9 30.8 39.7 32.4 49.5 49.4
SDTW" 13.1 13.3 15.3 14.3 25.9 25.4
R6R → R4R
PL 25.2 33.0 11.6 25.7 18.1 17.7
NE# 8.7 8.6 8.5 8.4 8.4 8.2
SR" 24.2 22.4 23.6 25.4 24.3 24.3
SPL" 13.7 9.3 17.5 10.6 12.8 12.9
CLS" 25.8 21.4 35.8 34.8 48.6 48.6
NDTW" 22.9 20.6 29.8 26.5 39.0 39.4
SDTW" 9.3 7.5 10.8 11.1 15.1 15.1
R6R → R8R
PL 43.0 52.8 14.2 29.9 38.3 36.8
NE# 9.9 9.9 9.6 9.7 10.2 10.0
SR" 20.1 20.3 20.3 22.4 20.8 21.0
SPL" 11.2 9.4 14.9 8.1 6.6 6.8
CLS" 20.6 18.3 27.7 38.9 45.9 46.3
NDTW" 16.3 15.2 21.9 22.2 28.4 29.3
SDTW" 5.6 5.0 6.4 6.8 9.6 9.9
Average
PL 27.6 35.1 11.3 23.7 21.9 21.2
NE# 8.8 8.5 8.6 8.5 8.5 8.3
SR" 21.2 21.5 21.2 23.5 25.5 25.3
SPL" 12.7 10.1 16.5 10.9 15.9 15.8
CLS" 26.2 22.0 35.6 36.0 48.1 48.3
NDTW" 23.7 22.2 30.5 27.0 39.0 39.4
SDTW" 9.3 8.6 10.8 10.7 16.9 16.8
Datasets Metrics SEQ2SEQ SF+ RCM(GOAL)+ RCM(FIDELITY)+ BABYWALK BABYWALK+
R8R → R2R
PL 13.7 19.3 7.8 17.8 9.1 9.8
NE# 7.6 7.3 8.0 8.2 6.8 6.7
SR" 18.7 23.4 14.8 19.2 30.0 32.1
SPL" 13.3 12.9 12.9 10.6 27.0 28.2
CLS" 32.7 26.6 37.9 28.9 49.5 49.3
NDTW" 32.4 29.9 34.9 25.9 48.9 48.9
SDTW" 12.7 14.5 11.1 10.5 24.6 26.2
R8R → R4R
PL 23.1 31.7 11.1 32.5 17.4 19.0
NE# 8.7 8.8 8.7 9.2 8.2 8.5
SR" 23.6 21.8 23.2 21.7 24.4 24.4
SPL" 15.1 10.5 18.2 7.4 12.6 12.5
CLS" 24.9 20.8 32.3 29.4 48.1 48.5
NDTW" 22.3 19.7 26.4 20.6 39.1 38.5
SDTW" 8.8 7.7 9.3 8.4 14.9 15.2
R8R → R6R
PL 30.9 42.2 11.9 39.9 26.6 29.2
NE# 9.7 9.9 9.9 10.1 9.0 9.3
SR" 15.4 14.7 14.8 20.0 22.9 22.9
SPL" 8.6 6.7 11.6 5.3 8.4 7.9
CLS" 22.2 18.5 29.1 33.5 46.9 46.6
NDTW" 18.5 15.9 22.5 20.1 33.3 31.8
SDTW" 5.5 4.7 6.0 7.8 12.1 11.8
Average
PL 22.6 31.1 10.3 30.1 17.7 19.3
NE# 8.7 8.7 8.9 9.2 8.0 8.2
SR" 19.2 20.0 17.6 20.3 25.8 26.5
SPL" 12.3 10.0 14.2 7.8 16.0 16.2
CLS" 26.6 22.0 33.1 30.6 48.2 48.1
NDTW" 24.4 21.8 27.9 22.2 40.4 39.7
SDTW" 9.0 9.0 8.8 8.9 17.2 17.7
(c) R6R trained model (d) R8R trained model
Table C.6: Transfer results of the R6R and R8R trained models evaluated on their complementary unseen validation datasets (+: pre-trained with data augmentation; *: re-implemented or re-adapted from the original authors' released code).
Appendix D
Details and Additional Experiments for Chapter 7
D.1 Additional Implementation Details
Backbone architecture. We consider three backbones, as suggested in the literature, as the instance embedding function E for the purpose of fair comparisons. We resize the input images to 84 × 84 × 3 before feeding them into the backbones.
• ConvNet. The 4-layer convolution network [198, 217, 228] contains 4 repeated blocks. In each block, there is a convolutional layer with a 3 × 3 kernel, a Batch Normalization layer [85], a ReLU, and a max pooling with size 2. We set the number of convolutional channels in each block to 64. Slightly different from the literature, we add a global max pooling layer at the end to reduce the dimension of the embedding. Based on our empirical observations, this does not influence the results, but greatly reduces the computational burden of the later transformations. (A minimal sketch of this backbone is given after this list.)
• ResNet. We use the 12-layer residual network in [123] (the source code of this ResNet is publicly available at https://github.com/kjunelee/MetaOptNet). The DropBlock [58] is used in this ResNet architecture to avoid over-fitting. Slightly different from the ResNet-12 in [123], we apply a global average pooling after the final layer, which leads to 640-dimensional embeddings.
• WRN. We also consider the Wide residual network [186, 256]. We use the WRN-28-10 structure
as in [174, 186], which sets the depth to 28 and width to 10. After a global average pooling in
the last layer of the backbone, we get a 640 dimensional embedding for further prediction.
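The following is a minimal PyTorch sketch of the 4-block ConvNet backbone described in the first bullet above (the module names are illustrative, not our exact code).

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One block: 3x3 conv -> BatchNorm -> ReLU -> 2x2 max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ConvNet4(nn.Module):
    """4-block ConvNet backbone with 64 channels per block and a global max pooling head."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(3, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
        )
        self.pool = nn.AdaptiveMaxPool2d(1)  # global max pooling over the spatial map

    def forward(self, x):                    # x: (batch, 3, 84, 84)
        feat = self.encoder(x)
        return self.pool(feat).flatten(1)    # (batch, channels) instance embeddings
```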
Datasets. Four datasets, MiniImageNet [228], TieredImageNet [181], Caltech-UCSD Birds
(CUB) 200-2011 [230], and OfficeHome [224] are investigated in this paper. Each dataset is
split into three parts based on different non-overlapping sets of classes, for model training (a.k.a.
meta-training in the literature), model validation (a.k.a. meta-val in the literature), and model
evaluation (a.k.a. meta-test in the literature). The CUB dataset is initially designed for fine-grained
classification. It contains in total 11,788 images of birds over 200 species. On CUB, we randomly
sampled 100 species as SEEN classes, another two 50 species are used as two UNSEEN sets for
model validation and evaluation [217]. For all images in the CUB dataset, we use the provided
bounding box to crop the images as a pre-processing [217]. Before input into the backbone
network, all images in the dataset are resized based on the requirement of the network.
Pre-training strategy. As mentioned before, we apply an additional pre-training strategy as
suggested in [174, 186]. The backbone network, appended with a softmax layer, is trained to
classify all classes in the SEEN class split (e.g., 64 classes in the MiniImageNet) with the cross-
entropy loss. In this stage, we apply image augmentations like random crop, color jittering, and
random flip to increase the generalization ability of the model. After each epoch, we validate the
performance of the pre-trained weights based on its few-shot classification performance on the
model validation split. Specifically, we randomly sample 200 1-shotN-way few-shot learning
tasks (N equals the number of classes in the validation split, e.g., 16 in the MiniImageNet), which
contains 1 instance per class in the support set and 15 instances per class for evaluation. Based
on the penultimate layer instance embeddings of the pre-trained weights, we utilize the nearest
neighbor classifiers over the few-shot tasks and evaluate the quality of the backbone. We select
the pre-trained weights with the best few-shot classification accuracy on the validation set. The
pre-trained weights are used to initialize the embedding backbone E, and the weights of the whole
model are then optimized together during the model training.
Transformer Hyper-parameters. We follow the architecture presented in [222] to build our FEAT model. The hidden dimension d′ for the linear transformations in our FEAT model is set to 64 for ConvNet and 640 for ResNet/WRN. The dropout rate in the transformer is set to 0.5. We empirically observed that the shallow transformer (with one set of projections and one stacked layer) gives the best overall performance (also studied in §D.2).
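The following is a minimal PyTorch sketch of such a one-layer, one-head set-to-set attention module. The residual connection, layer normalization, and projection layout follow the standard Transformer recipe [222] and are assumptions wherever the text above does not pin them down.

```python
import torch
import torch.nn as nn

class SetToSetAttention(nn.Module):
    """One-head, one-layer self-attention that adapts a set of instance embeddings jointly."""
    def __init__(self, dim, hidden, dropout=0.5):
        super().__init__()
        self.q = nn.Linear(dim, hidden, bias=False)
        self.k = nn.Linear(dim, hidden, bias=False)
        self.v = nn.Linear(dim, hidden, bias=False)
        self.out = nn.Linear(hidden, dim)
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (set_size, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)
        adapted = self.out(attn @ v)
        return self.norm(x + self.drop(adapted))           # residual connection + layer norm
```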
Optimization. Following the literature, different optimizers are used for the backbones during model training. For the ConvNet backbone, stochastic gradient descent with the Adam [103] optimizer is employed, with the initial learning rate set to 0.002. For the ResNet and WRN backbones, vanilla stochastic gradient descent with Nesterov acceleration is used with an initial rate of 0.001. We fix the weight decay in SGD to 5e-4 and the momentum to 0.9. The schedule of the optimizers is tuned over the validation part of each dataset. As the backbone network is initialized with the pre-trained weights, we scale the learning rate for those parameters by 0.1.
Table D.1: Few-shot classification accuracy ± 95% confidence interval on MiniImageNet with ConvNet and ResNet backbones. Our implemented methods are measured over 10,000 test trials.
Setups → 1-Shot 5-Way 5-Shot 5-Way
Backbone Network → ConvNet ResNet ConvNet ResNet
MatchNet [228] 43.40 ± 0.78 - 51.09 ± 0.71 -
MAML [53] 48.70 ± 1.84 - 63.11 ± 0.92 -
ProtoNet [198] 49.42 ± 0.78 - 68.20 ± 0.66 -
RelationNet [206] 51.38 ± 0.82 - 67.07 ± 0.69 -
PFA [174] 54.53 ± 0.40 - 67.87 ± 0.20 -
TADAM [164] - 58.50 ± 0.30 - 76.70 ± 0.30
MetaOptNet [123] - 62.64 ± 0.61 - 78.63 ± 0.46
Baselines
MAML 49.24 ± 0.21 58.05 ± 0.10 67.92 ± 0.17 72.41 ± 0.20
MatchNet 52.87 ± 0.20 65.64 ± 0.20 67.49 ± 0.17 78.72 ± 0.15
ProtoNet 52.61 ± 0.20 62.39 ± 0.21 71.33 ± 0.16 80.53 ± 0.14
Embedding Adaptation
BILSTM 52.13 ± 0.20 63.90 ± 0.21 69.15 ± 0.16 80.63 ± 0.14
DEEPSETS 54.41 ± 0.20 64.14 ± 0.22 70.96 ± 0.16 80.93 ± 0.14
GCN 53.25 ± 0.20 64.50 ± 0.20 70.59 ± 0.16 81.65 ± 0.14
Ours: FEAT 55.15 ± 0.20 66.78 ± 0.20 71.61 ± 0.16 82.05 ± 0.14
Table D.2: Few-shot classification performance with the Wide ResNet (WRN)-28-10 backbone on the MiniImageNet dataset (mean accuracy ± 95% confidence interval). Our implemented methods are measured over 10,000 test trials.
Setups → 1-Shot 5-Way 5-Shot 5-Way
PFA [174] 59.60 ± 0.41 73.74 ± 0.19
LEO [186] 61.76 ± 0.08 77.59 ± 0.12
SimpleShot [237] 63.50 ± 0.20 80.33 ± 0.14
ProtoNet (Ours) 62.60 ± 0.20 79.97 ± 0.14
Ours: FEAT 65.10 ± 0.20 81.11 ± 0.14
D.2 Additional Experimental Results
In this section, we show more experimental results on the MiniImageNet and CUB datasets, the ablation studies, and the extended few-shot learning settings.
Table D.3: Few-shot classification performance with the Wide ResNet (WRN)-28-10 backbone on the TieredImageNet dataset (mean accuracy ± 95% confidence interval). Our implemented methods are measured over 10,000 test trials.
Setups → 1-Shot 5-Way 5-Shot 5-Way
LEO [186] 66.33 ± 0.05 81.44 ± 0.09
SimpleShot [237] 69.75 ± 0.20 85.31 ± 0.15
Ours: FEAT 70.41 ± 0.23 84.38 ± 0.16
Table D.4: Few-shot classification performance with the ConvNet backbone on the CUB dataset (mean accuracy ± 95% confidence interval). Our implemented methods are measured over 10,000 test trials.
Setups → 1-Shot 5-Way 5-Shot 5-Way
MatchNet [228] 61.16 ± 0.89 72.86 ± 0.70
MAML [53] 55.92 ± 0.95 72.09 ± 0.76
ProtoNet [198] 51.31 ± 0.91 70.77 ± 0.69
RelationNet [206] 62.45 ± 0.98 76.11 ± 0.69
Instance Embedding
MatchNet 67.73 ± 0.23 79.00 ± 0.16
ProtoNet 63.72 ± 0.22 81.50 ± 0.15
Embedding Adaptation
BILSTM 62.05 ± 0.23 73.51 ± 0.19
DEEPSETS 67.22 ± 0.23 79.65 ± 0.16
GCN 67.83 ± 0.23 80.26 ± 0.15
Ours: FEAT 68.87 ± 0.22 82.90 ± 0.15
Additional Results with Wide ResNet Backbone We also investigate the Wide ResNet (WRN) backbone on MiniImageNet, which is also a popular choice in [174, 186]. SimpleShot [237] is a recently proposed embedding-based few-shot learning approach that takes full advantage of the pre-trained embeddings. We cite the results of PFA [174], LEO [186], and SimpleShot [237] from their papers. The results can be found in Table D.2. We re-implement ProtoNet and our FEAT approach with WRN. It is notable that in this case, our FEAT achieves more promising results than the current state-of-the-art approaches. Table D.3 shows the classification results with WRN on the TieredImageNet dataset, where our FEAT still keeps its superiority on 1-shot tasks.
Table D.4 shows the 5-way 1-shot and 5-shot classification results on the CUB dataset based
on the ConvNet backbone. The results on CUB are consistent with the trend on the MiniImageNet
Table D.5: Ablation studies on whether the embedding adaptation improves the discerning quality of the embeddings. After embedding adaptation, FEAT improves substantially over the pre-adaptation embeddings for few-shot classification.
1-Shot 5-Way 5-Shot 5-Way
Pre-Adapt 51.60 ± 0.20 70.40 ± 0.16
Post-Adapt 55.15 ± 0.20 71.61 ± 0.16
Table D.6: Ablation studies on the position to average the same-class embeddings when there are multiple shots per class in FEAT (tested on 5-Way tasks with different numbers of shots). "Pre-Avg" and "Post-Avg" mean we get the embedding center for each class before or after the set-to-set transformation, respectively.
Setups → Pre-Avg Post-Avg
5 71.61 ± 0.16 70.70 ± 0.16
15 77.76 ± 0.14 76.58 ± 0.14
30 79.66 ± 0.13 78.77 ± 0.13
Table D.7: Ablation studies on the number of heads in the Transformer of FEAT (with the number of layers fixed to one).
Setups → 1-Shot 5-Way 5-Shot 5-Way
1 55.15 ± 0.20 71.57 ± 0.16
2 54.91 ± 0.20 71.44 ± 0.16
4 55.05 ± 0.20 71.63 ± 0.16
8 55.22 ± 0.20 71.39 ± 0.16
Table D.8: Ablation studies on the number of layers in the Transformer of FEAT (with the number of heads fixed to one).
Setups → 1-Shot 5-Way 5-Shot 5-Way
1 55.15 ± 0.20 71.57 ± 0.16
2 55.42 ± 0.20 71.44 ± 0.16
3 54.96 ± 0.20 71.63 ± 0.16
dataset. Embedding adaptation indeed assists the embedding encoder for the few-shot classification tasks. Facilitated by the set function property, DEEPSETS works better than its BILSTM counterpart. Among all the results, the transformer-based FEAT achieves the top-tier results.
Additional Ablation Studies In this section, we perform further analyses for our proposed FEAT
and its ablated variants classifying in the ProtoNet manner, on the MiniImageNet dataset, using
the ConvNet as the backbone network.
Do the adapted embeddings improve the pre-adapted embeddings? We report few-shot
classification results by using the pre-adapted embeddings of support data (i.e., the embedding
before adaptation), against those using adapted embeddings, for constructing classifiers. Table D.5
shows that task-specific embeddings after adaptation improve over task-agnostic embeddings in few-shot classification.
When to average the embeddings of the same class? When there is more than one instance per class, i.e., M > 1, we average the instances in the same class and use the class center to make predictions. There are two positions to construct the prototypes in FEAT: before the set-to-set transformation (Pre-Avg) and after the set-to-set transformation (Post-Avg). In Pre-Avg, we adapt the embeddings of the centers, and a test instance is predicted based on its distance to the nearest adapted center; in Post-Avg, the instance embeddings are adapted by the set-to-set function first, and the class centers are computed based on the adapted instance embeddings. We investigate the two choices in Table D.6, where we fix the number of ways to 5 (N = 5) and vary the number of shots (M) among {5, 15, 30}. The results demonstrate that the Pre-Avg version performs better than Post-Avg in all cases, which shows that averaging the instances of the same class provides a more precise input to the set-to-set function and leads to better results. We therefore use the Pre-Avg strategy as the default option in our experiments.
Will a deeper and multi-head transformer help? In our current implementation of the set-to-set transformation function, we make use of a shallow and simple transformer, i.e., one layer and one head (one set of projections). From [222], the transformer can be equipped with more complex components using multiple heads and deeper stacked layers. We evaluate this augmented structure, with the number of attention heads increased to 2, 4, and 8, as well as with the number of layers increased to 2 and 3. As shown in Table D.7 and Table D.8, we empirically observe that more complicated structures do not result in improved performance. We find that with more transformer layers stacked, the difficulty of optimization increases and it becomes harder to train the models to convergence. For models with more heads, the models seem to over-fit heavily to the training data, even with the auxiliary loss term (the contrastive loss in our approach). It might require careful regularization to prevent over-fitting, which we leave for future work.
The effectiveness of the contrastive loss. Table D.9 shows the few-shot classification results with different weights λ of the contrastive loss term in FEAT. From the results, we can find
Table D.9: Ablation studies on the effect of the contrastive learning of the set-to-set function in FEAT.
Setups → 1-Shot 5-Way 5-Shot 5-Way
λ = 10 53.92 ± 0.20 70.41 ± 0.16
λ = 1 54.84 ± 0.20 71.00 ± 0.16
λ = 0.1 55.15 ± 0.20 71.61 ± 0.16
λ = 0.01 54.67 ± 0.20 71.26 ± 0.16
Table D.10: Ablation studies on the prediction strategy (with cosine similarity or euclidean distance) of FEAT.
Setups → 1-Shot 5-Way 5-Shot 5-Way
Backbone → ConvNet ResNet ConvNet ResNet
Cosine Similarity-based Prediction
FEAT 54.64 ± 0.20 66.26 ± 0.20 71.72 ± 0.16 81.83 ± 0.15
Euclidean Distance-based Prediction
FEAT 55.15 ± 0.20 66.78 ± 0.20 71.61 ± 0.16 82.05 ± 0.14
that the balance of the contrastive term in the learning objective can influence the final results. Empirically, we set λ = 0.1 in our experiments.
The influence of the prediction strategy. We investigate two embedding-based prediction strategies for few-shot classification, i.e., using the cosine similarity and the negative euclidean distance, respectively, to measure the relationship between objects. We compare these two choices in Table D.10. The two strategies in Table D.10 only differ in their similarity measures. In other words, with more than one shot per class in the task training set, we average the same-class embeddings first, and then make the classification by computing the cosine similarity or the negative euclidean distance between a test instance and a class prototype. During the optimization, we tune the logit scale temperature for both of these methods. We find that using the euclidean distance usually requires a small temperature (e.g., 1/64), while a large temperature (e.g., 1) works well with the normalized cosine similarity. The former choice achieves slightly better performance than the latter one.
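A minimal sketch of the two prediction strategies, with a temperature factor on the logits (whether the temperature multiplies or divides the logits is a convention choice; here it multiplies):

```python
import torch
import torch.nn.functional as F

def predict_logits(queries, prototypes, metric="euclidean", temperature=1.0 / 64):
    """Temperature-scaled logits from either negative euclidean distances or
    cosine similarities between query embeddings and class prototypes."""
    if metric == "euclidean":
        # (num_query, num_class) squared distances, negated so larger is better.
        logits = -torch.cdist(queries, prototypes) ** 2
    else:
        logits = F.normalize(queries, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return temperature * logits   # softmax over these logits gives class probabilities
```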
Multi-Domain Few-Shot Learning We show that FEAT learns to adapt the intrinsic structure of tasks and generalizes across domains, i.e., predicting test instances even when the visual appearance is changed.
Table D.11: Cross-Domain 1-shot 5-way classification results of the FEAT approach.
C → C C → R R → R
Supervised 34.38 ± 0.16 29.49 ± 0.16 37.43 ± 0.16
ProtoNet 35.51 ± 0.16 29.47 ± 0.16 37.24 ± 0.16
FEAT 36.83 ± 0.17 30.89 ± 0.17 38.49 ± 0.16
Setups. We train a few-shot learning model in the standard domain and evaluate it with cross-domain tasks, where the N categories are aligned but the domains are different. In detail, a model is trained on tasks from the "Clipart" domain of the OfficeHome dataset [224]; then the model is required to generalize to both "Clipart (C)" and "Real World (R)" instances. In other words, we need to classify complex real images by seeing only a few sketches, or even based on the instances in the "Real World (R)" domain.
Results. Table D.11 gives the quantitative results. Here, "Supervised" refers to a model trained with standard classification and then used as a nearest neighbor classifier with its penultimate layer's output features. We observe that ProtoNet can outperform this baseline when evaluating instances from "Clipart" but not ones from "real world". However, FEAT improves "real world" few-shot classification even when only seeing the support data from "Clipart". Besides, when the support set and the test set of the target task are sampled from the same but new domain, e.g., the training and test instances both come from "real world", FEAT also improves the classification accuracy w.r.t. the baseline methods. This verifies the domain generalization ability of the FEAT approach.
Additional Discussions on Transductive FSL We list the results of the transductive few-shot
classification in Table D.12, where the unlabeled test instances arrive simultaneously, so that
the common structure among the unlabeled test instances could be captured. We compare with
three approaches: Semi-ProtoNet [181], TPN [139], and TEAM [173]. Semi-ProtoNet utilizes the unlabeled instances to facilitate the computation of the class centers and makes predictions similarly to the prototypical network; TPN meta-learns a label propagation scheme to take the relationships among the unlabeled instances into consideration; TEAM explores the pairwise constraints in each task,
and formulates the embedding adaptation into a semi-definite programming form. We cite the
results of Semi-ProtoNet from [181], and cite the results of TPN and TEAM from [173]. We
also re-implement Semi-ProtoNet with our pre-trained backbone (the same pre-trained ConvNet
weights as the standard few-shot learning setting) for a fair comparison.
In this setting, our model leverages the unlabeled test instances to augment the transformer as
discussed in §7.4.2 and the embedding adaptation takes the relationship of all test instances into
Table D.12: Results of models for transductive FSL with the ConvNet backbone on MiniImageNet. We cite the results of Semi-ProtoNet and TPN from [181] and [173], respectively. For TEAM [173], the authors do not report the confidence intervals, so we set them to 0.00 in the table. FEAT† and FEAT‡ adapt embeddings with the joint set of labeled training and unlabeled test instances, and make predictions via ProtoNet and Semi-ProtoNet, respectively.
Setups → 1-Shot 5-Way 5-Shot 5-Way
Standard
ProtoNet 52.61 ± 0.20 71.33 ± 0.16
FEAT 55.15 ± 0.20 71.61 ± 0.16
Transductive
Semi-ProtoNet [181] 50.41 ± 0.31 64.39 ± 0.24
TPN [139] 55.51 ± 0.84 69.86 ± 0.67
TEAM [173] 56.57 ± 0.00 72.04 ± 0.00
Semi-ProtoNet (Ours) 55.50 ± 0.10 71.76 ± 0.08
FEAT† 56.49 ± 0.16 72.65 ± 0.20
FEAT‡ 57.04 ± 0.16 72.89 ± 0.20
consideration. Based on the embeddings adapted by the joint set of labeled training instances and unlabeled test instances, we can make predictions with two strategies. First, we still compute the centers from the labeled instances, while such adapted embeddings are influenced by the unlabeled instances (we denote this approach as FEAT†, which works the same way as standard FEAT except for the augmented input of the embedding transformation function). Second, we take advantage of the unlabeled instances and use their adapted embeddings to construct better class prototypes as in Semi-ProtoNet (we denote this approach as FEAT‡).
By using more unlabeled test instances in the transductive environment, FEAT† achieves a further performance improvement compared with the standard FEAT, which verifies that the unlabeled instances can assist the embedding adaptation of the labeled ones. With more accurate class center estimation, FEAT‡ obtains a further improvement. The performance gain induced by the transductive FEAT is more significant in the one-shot learning setting compared with the five-shot scenario, since the helpfulness of unlabeled instances decreases when there are more labeled instances.
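Below is a minimal sketch of the Semi-ProtoNet-style prototype refinement used by FEAT‡, applied to already-adapted embeddings; the soft-assignment form and the temperature are illustrative assumptions rather than our exact implementation.

```python
import torch

def refine_prototypes(prototypes, unlabeled, support, support_labels, temperature=1.0):
    """Soft-assign unlabeled embeddings to the current prototypes and fold them
    into new, weighted class centers.
    prototypes: (C, d), unlabeled: (U, d), support: (S, d), support_labels: (S,) in [0, C)."""
    num_classes = prototypes.shape[0]
    # Soft assignment of each unlabeled instance to every class.
    logits = -torch.cdist(unlabeled, prototypes) ** 2 / temperature
    soft = torch.softmax(logits, dim=-1)                                      # (U, C)
    # One-hot weights for the labeled support instances.
    hard = torch.nn.functional.one_hot(support_labels, num_classes).float()   # (S, C)
    weights = torch.cat([hard, soft], dim=0)                                  # (S+U, C)
    feats = torch.cat([support, unlabeled], dim=0)                            # (S+U, d)
    # Weighted mean per class gives the refined prototypes.
    return (weights.t() @ feats) / weights.sum(dim=0, keepdim=True).t()
```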
Large-Scale Low-Shot Learning The large-scale low-shot learning [60, 66, 239] considers
the few-shot classification ability on both SEEN and UNSEEN classes on the full ImageNet [184]
dataset. There are in total 389 SEEN classes and 611 UNSEEN classes [66]. We follow the setting
(including the splits) of the prior work [66] and use features extracted based on the pre-trained
ResNet-50 [67]. Three evaluation protocols are adopted, namely the top-5 few-shot accuracy on
Table D.13: The top-5 low-shot learning accuracy over all classes on the large scale ImageNet [184]
dataset (w/ ResNet-50).
UNSEEN 1-Shot 2-Shot 5-Shot 10-Shot 20-Shot
ProtoNet [198] 49.6 64.0 74.4 78.1 80.0
PMN [239] 53.3 65.2 75.9 80.1 82.6
FEAT 53.8 65.4 76.0 81.2 83.6
All 1-Shot 2-Shot 5-Shot 10-Shot 20-Shot
ProtoNet [198] 61.4 71.4 78.0 80.0 81.1
PMN [239] 64.8 72.1 78.8 81.7 83.3
FEAT 65.1 72.5 79.3 82.1 83.9
All w/ Prior 1-Shot 2-Shot 5-Shot 10-Shot 20-Shot
ProtoNet [198] 62.9 70.5 77.1 79.5 80.8
PMN [239] 63.4 70.8 77.9 80.9 82.7
FEAT 63.8 71.2 78.1 81.3 83.4
the UNSEEN classes, on the combined set of both SEEN and UNSEEN classes, and the calibrated accuracy weighted by a selected set prior on the combined set of both SEEN and UNSEEN classes. The results are listed in Table D.13.
Appendix E
Details and Additional Experiments for Chapter 8
E.1 Additional Implementation Details
Pre-training Strategy In particular, on MiniImageNet, we add a linear layer on the backbone
output and optimize a 64-way classification problem on the meta-training set with the cross-entropy
loss function. Stochastic gradient descent with initial learning rate 0.1 and momentum 0.9 is used
to complete such optimization. The 16 classes in MiniImageNet reserved for model selection also assist the choice of the pre-trained model. After each epoch, we use the current embedding and measure the nearest-neighbor-based few-shot classification performance on few-shot tasks sampled from these 16 classes. The most suitable embedding function is recorded. After that, the learned backbone is used to initialize the embedding part of the whole model. The same strategy is also applied to the meta-training sets of the TieredImageNet, Heterogeneous, and Office-Home datasets, where 351-way, 100-way, and 25-way classifiers are pre-trained, respectively.
Feature Network Specification We follow [174, 186] when investigating the multi-domain GFSL, where images are resized to 84 × 84 × 3. Concretely, three residual blocks are used after an initial convolutional layer (with stride 1 and padding 1) over the image; they have 160/320/640 channels, stride 2, and padding 2. After a global average pooling layer, this leads to a 640-dimensional embedding. For the benchmark experiments on MiniImageNet and TieredImageNet, we follow [123] to set the architecture of the ResNet, which contains 12 layers and uses DropBlock [59] to prevent over-fitting.
We use the pre-trained backbone to initialize the embedding part of a model for CAS-
TLE/ACASTLE and our re-implemented comparison methods such as MC+kNN, ProtoNet+ProtoNet,
MC+ProtoNet, L2ML [238], and DFSL [60]. When there exists a backbone initialization, we
set the initial learning rate as 1e-4 and optimize the model with Momentum SGD. The learning
rate will be halved after optimizing 2,000 mini-batches. During meta-learning, all methods are
optimized over 5-way few-shot tasks, where the number of shots in a task is consistent with the
inference (meta-test) stage. For example, if the goal is a 1-shot 5-way model, we sample 1-shot 5-way D^S_train during meta-training, together with 15 instances per class in D^S_test.
For both CASTLE and ACASTLE, we take advantage of the multi-classifier training technique to improve learning efficiency. We randomly sample a 24-way task from S in each mini-batch, and re-sample 64 5-way tasks from it. Note that all instances in the 24-way task are encoded by the ResNet backbone with the same parameters in advance. Therefore, by embedding the synthesized 5-way few-shot classifiers into the global many-shot classifier, we obtain 64 different configurations of the generalized few-shot classifiers. To evaluate these classifiers, we randomly sample instances with batch size 128 from S and compute the GFSL objective in Eq. 8.2.
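To make the sampling and classifier-combination logic concrete, here is a minimal sketch of one such mini-batch; `synthesize_classifier` (the neural-dictionary synthesis) and `gfsl_loss` (the unified objective of Eq. 8.2) are hypothetical helpers standing in for the components described in Chapter 8, and, for simplicity, the evaluation instances are drawn from the already-encoded 24-way task rather than from all of S.

```python
import torch

def multi_classifier_step(backbone, synthesize_classifier, gfsl_loss,
                          seen_images, seen_labels, global_classifier,
                          num_tasks=64, way=5, eval_batch=128):
    """One mini-batch of multi-classifier GFSL training (illustrative).

    seen_images/seen_labels: a sampled 24-way task from the SEEN set S.
    global_classifier: the many-shot classifier over all SEEN classes.
    """
    # Encode every instance of the 24-way task once with the shared backbone.
    feats = backbone(seen_images)                       # (N, d)
    classes_in_task = torch.unique(seen_labels)

    total_loss = 0.0
    for _ in range(num_tasks):
        # Re-sample a 5-way task from the 24-way task and synthesize its classifier.
        fake_unseen = classes_in_task[torch.randperm(len(classes_in_task))[:way]]
        mask = torch.isin(seen_labels, fake_unseen)
        few_shot_classifier = synthesize_classifier(feats[mask], seen_labels[mask])

        # Embed the synthesized 5-way classifier into the global many-shot classifier,
        # yielding one configuration of a generalized few-shot classifier.
        joint_classifier = (global_classifier, few_shot_classifier)

        # Evaluate the joint classifier on 128 sampled instances (Eq. 8.2 objective).
        idx = torch.randperm(feats.size(0))[:eval_batch]
        total_loss = total_loss + gfsl_loss(joint_classifier, feats[idx], seen_labels[idx])

    return total_loss / num_tasks
```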
Baselines for GFSL Benchmarks Here we describe some baseline approaches compared in the GFSL benchmarks in detail.
• (1) Multiclass Classifier (MC) + kNN. A |S|-way classifier is trained on the SEEN classes in a supervised learning manner as standard many-shot classification [67]. During the inference, test examples of the S categories are evaluated based on the |S|-way classifier, and the |U| categories are evaluated using the support embeddings from D_train^U with a nearest neighbor classifier. To evaluate the generalized few-shot classification task, we take the union of the multi-class classifier's confidence and the nearest neighbor confidence (the normalized negative distance values as in [198]) as joint classification scores on S ∪ U (a minimal sketch of this joint scoring appears after this list).
• (2) ProtoNet + ProtoNet. We train a few-shot classifier (initialized by the MC classifier's feature mapping) using the Prototypical Network [198] (a.k.a. ProtoNet), treating the SEEN classes as if they were few-shot. When evaluated on the SEEN categories, we randomly sample 100 training instances per category to compute the class prototypes. The class prototypes of UNSEEN classes are computed based on the sampled few-shot training set. During the inference of generalized few-shot learning, the confidence of a test instance is jointly determined by its (negative) distance to both SEEN and UNSEEN class prototypes.
• (3) MC + ProtoNet. We combine the learning objectives of the previous two baselines ((1) and (2)) to jointly learn the MC classifier and the feature embedding. Since there are two objectives for many-shot (cross-entropy loss on all SEEN classes) and few-shot (ProtoNet meta-learning objective) classification respectively, this trades off between many-shot and few-shot learning. Therefore, the learned model can be used as a multi-class linear classifier on the head categories and as ProtoNet on the tail categories. During the inference, the model predicts instances from the SEEN class set S with the MC classifier, while taking advantage of the few-shot prototypes to discern UNSEEN class instances. To evaluate the generalized few-shot classification task, we take the union of the multi-class classifier's confidence and the ProtoNet confidence as joint classification scores on S ∪ U.
• (4) L2ML. Wang et al. [238] propose learning to model the "tail" (L2ML) by connecting a few-shot classifier with the corresponding many-shot classifier. The method is designed to learn classifier dynamics from the data-poor "tail" classes to the data-rich head classes. L2ML is originally designed to learn with both SEEN and UNSEEN classes in a transductive manner, so in our experiment we adapt it to our setting. We learn a classifier mapping based on few-shot tasks sampled from the SEEN class set S, which transforms a few-shot classifier for the UNSEEN class set U inductively. Following [238], we first train a many-shot classifier W upon the ResNet backbone on the SEEN class set S. We use the same residual architecture as in [238] to implement the classifier mapping f, which transforms a few-shot classifier into a many-shot classifier. During the meta-learning stage, an S-way few-shot task is sampled in each mini-batch, which produces an S-way linear few-shot classifier Ŵ based on the fixed pre-trained embedding. The objective of L2ML not only regresses the mapped few-shot classifier f(Ŵ) towards the many-shot one W under a squared loss, but also minimizes the classification loss of f(Ŵ) over randomly sampled instances from S. L2ML uses the pre-trained multi-class classifier W for the head categories and the predicted few-shot classifiers produced by f for the tail categories.
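As referenced in baseline (1), the sketch below illustrates one way to form the joint scores over S ∪ U for the MC + kNN baseline; how exactly the two confidence sets are normalized before being concatenated is an assumption here (softmax over logits and over negative distances), and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def joint_gfsl_scores(query_feat, mc_classifier, unseen_support_feats):
    """Joint classification scores over S ∪ U for the MC + kNN baseline (illustrative).

    query_feat:           (d,) embedding of one test instance.
    mc_classifier:        nn.Linear producing |S| logits for the SEEN classes.
    unseen_support_feats: (|U|, d) one support embedding per UNSEEN class.
    """
    seen_scores = mc_classifier(query_feat.unsqueeze(0)).squeeze(0)            # (|S|,)

    # Nearest-neighbor confidence: normalized negative Euclidean distances (as in [198]).
    dists = torch.cdist(query_feat.unsqueeze(0), unseen_support_feats).squeeze(0)
    unseen_scores = F.softmax(-dists, dim=0)                                   # (|U|,)

    # Take the union of both confidences as the joint score on S ∪ U.
    return torch.cat([F.softmax(seen_scores, dim=0), unseen_scores])
```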
E.2 Additional Experimental Results
In this appendix, we conduct analyses to show the influence of training a GFSL model by reusing the many-shot classifier, and we study different implementation choices in the proposed methods. We mainly investigate and report results for CASTLE on MiniImageNet; we observe that the results on ACASTLE and the other datasets reveal similar trends.
Reusing the many-shot classifier facilitates the calibration for GFSL We compare the strategy of training CASTLE from scratch with that of fine-tuning it from the many-shot classifier. We report both the 1-Shot 5-Way few-shot classification performance and the GFSL performance with 5 UNSEEN tasks for CASTLE when trained from random initialization or from the provided initialization. From the results in Table E.1, we find that training from scratch yields only slightly lower few-shot classification results than the fine-tuning strategy, but a much lower GFSL harmonic mean accuracy. Therefore, reusing the parameters of the many-shot classifier benefits a GFSL model's predictions on both SEEN and UNSEEN classes, and we use the pre-trained embedding to initialize the backbone.
Table E.1: The difference between training with a pre-trained backbone or from scratch with 1-Shot 5-Way Tasks on MiniImageNet. "MA" and "HM" denote the Mean Accuracy and Harmonic Mean Accuracy, respectively.
Perf. Measures FSL MA GFSL HM
CASTLE w/ pre-train 66.83 ± 0.21 66.22 ± 0.15
CASTLE w/o pre-train 64.23 ± 0.21 38.24 ± 0.09
Table E.2: Comparison between CASTLE variants and the incremental learning methods on MiniImageNet. The harmonic mean accuracy in different evaluation scenarios is recorded.
Classification on 5-Way 20-Way
Setups 1-Shot 5-Shot 1-Shot 5-Shot
LwF [136] 60.18 ± 0.15 73.48 ± 0.09 28.70 ± 0.06 39.88 ± 0.06
iCARL [136] 61.14 ± 0.15 73.58 ± 0.09 31.60 ± 0.06 46.55 ± 0.06
CASTLE 66.22 ± 0.15 76.32 ± 0.09 43.06 ± 0.07 55.65 ± 0.07
ACASTLE 66.24 ± 0.15 78.33 ± 0.09 43.63 ± 0.08 56.33 ± 0.06
Comparison with One-Phase Incremental Learning Methods The inductive generalized few-shot learning is also related to one-phase incremental learning [136, 140], where a model is required to adapt itself to an open-set environment. In other words, after training over the closed-set categories, a classifier should be updated based on data with novel distributions or categories accordingly. One important thread of incremental learning methods relies on experience replay, where a set of closed-set instances is preserved and the classifier for all classes is optimized based on the saved and novel few-shot data. In our inductive GFSL, the CASTLE variants do not save SEEN class instances and instead rely on the neural dictionary to adapt the classifier for joint classification. Thus, the CASTLE variants have lower computational (time) costs during the inference stage.
For comprehensive comparisons, we also investigate two popular incremental learning methods, i.e., LwF [136] and iCARL [136]. We randomly save 5 images per SEEN class for both methods. By combining the stored images and the newly given UNSEEN class images, the model is updated based on a cross-entropy loss and a distillation loss [72]. We tune the balance weight between the classification and distillation losses, the initial learning rate for fine-tuning, and the number of optimization steps for both methods over the validation set. The harmonic mean accuracy in various evaluation scenarios over 10,000 tasks is listed in Table E.2.
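To make the replay-based update concrete, the following is a hedged sketch of a single fine-tuning step combining the cross-entropy and distillation losses; the balance weight `lam`, the temperature, and the assumption that the first `num_seen` outputs of the adapted model correspond to the SEEN classes are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def replay_finetune_step(model, old_model, optimizer, images, labels,
                         lam=1.0, temperature=2.0):
    """One update on a batch mixing stored SEEN images and new UNSEEN images (illustrative).

    old_model: a frozen copy of the model before adaptation, providing soft targets
               for the distillation loss [72] over the SEEN-class outputs.
    """
    logits = model(images)
    ce = F.cross_entropy(logits, labels)

    with torch.no_grad():
        old_logits = old_model(images)
    num_seen = old_logits.size(1)                      # distill only over the old (SEEN) outputs
    distill = F.kl_div(
        F.log_softmax(logits[:, :num_seen] / temperature, dim=1),
        F.softmax(old_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    loss = ce + lam * distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```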
Table E.3: The light-weight model adaptation by fine-tuning the scale and bias weights based on the classifier initialization from CASTLE variants. The harmonic mean accuracy in different evaluation scenarios on MiniImageNet is recorded. The superscript † denotes the method with an additional light-weight update step.
Classification on 5-Way 20-Way
Setups 1-Shot 5-Shot 1-Shot 5-Shot
CASTLE 66.22 ± 0.15 76.32 ± 0.09 43.06 ± 0.07 55.65 ± 0.07
CASTLE† 66.24 ± 0.15 76.43 ± 0.09 43.12 ± 0.07 55.85 ± 0.07
ACASTLE 66.24 ± 0.15 78.33 ± 0.09 43.63 ± 0.08 56.33 ± 0.06
ACASTLE† 66.33 ± 0.15 78.93 ± 0.09 43.68 ± 0.08 56.42 ± 0.06
Table E.4: The performance with different classifier synthesis strategies when tested with 5-Shot 5-Way UNSEEN Tasks on MiniImageNet. We denote the option of computing the embedding prototype first as "Pre-AVG" and the option of averaging the synthesized classifiers as "Post-AVG".
Perf. Measures FSL Mean Acc. GFSL HM Acc.
CASTLE w/ Pre-AVG 81.98 ± 0.20 76.32 ± 0.09
CASTLE w/ Post-AVG 82.00 ± 0.20 76.28 ± 0.09
In our empirical evaluations, we find that the incremental learning methods can achieve better results than our baselines since they fine-tune the model with the distillation loss. However, their results are not stable since there are many hyper-parameters involved. Compared with these approaches, our CASTLE variants still keep their superiority over all criteria.
Light-Weight Adaptation on CASTLE Variants As shown in the previous paragraph, directly fine-tuning the whole model is prone to over-fitting even with an additional distillation loss. Inspired by [131, 205], we consider a light-weight fine-tuning step based on the classifiers synthesized by the CASTLE variants. Given a few-shot task with UNSEEN class instances, the model is updated as follows: 5 images per SEEN class are randomly selected; after freezing the backbone, the classifier W, the scale, and the bias are optimized with a cross-entropy loss over images from both the stored SEEN classes and the UNSEEN classes. We tune the initial learning rate and the number of optimization steps over the validation set.
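A minimal sketch of this light-weight adaptation step is shown below; it assumes the synthesized classifier `W`, a learnable `scale`, and a `bias` are exposed as plain tensors, and the learning rate and number of steps are placeholders for the values tuned on the validation set.

```python
import torch
import torch.nn.functional as F

def lightweight_adapt(backbone, W, scale, bias, images, labels, lr=1e-3, steps=50):
    """Fine-tune only the classifier, scale, and bias with a frozen backbone (illustrative)."""
    for p in backbone.parameters():          # freeze the embedding
        p.requires_grad_(False)
    W, scale, bias = (t.detach().clone().requires_grad_(True) for t in (W, scale, bias))
    optimizer = torch.optim.SGD([W, scale, bias], lr=lr)

    with torch.no_grad():                    # features can be pre-computed once
        feats = backbone(images)

    for _ in range(steps):
        logits = scale * feats @ W.t() + bias   # joint SEEN and UNSEEN classifier
        loss = F.cross_entropy(logits, labels)  # over stored SEEN and new UNSEEN images
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return W.detach(), scale.detach(), bias.detach()
```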
The results of such model adaptation strategies are listed in Table E.3. With further model
adaptation, both CASTLE and ACASTLE could be improved.
[Figure E.1: The 1-shot 5-way accuracy on UNSEEN classes of MiniImageNet with different sizes of the neural dictionary (x-axis: dictionary size as a ratio w.r.t. the SEEN class size; y-axis: mean accuracy, %).]
[Figure E.2: The 64-way multi-class accuracy on SEEN classes of MiniImageNet with the 1-shot trained model (x-axis: classifier types, namely ProtoNet, Synthesized w/o ULO, Synthesized, and Multi-Class; y-axis: mean accuracy on SEEN, %).]
Table E.5: The GFSL performance (harmonic mean accuracy) changes with different numbers of classifiers when tested with 1-Shot 5-Way UNSEEN Tasks on MiniImageNet.
# of Classifiers 1 64 128 256
CASTLE 64.53 ± 0.15 65.61 ± 0.15 66.22 ± 0.15 66.72 ± 0.15
Effects of the neural dictionary size |B| We show the effects of the dictionary size (as a ratio of the SEEN class size, 64) on standard few-shot learning (measured by the mean accuracy when there are 5 UNSEEN classes) in Figure E.1. We observe that a neural dictionary with a ratio of 2 or 3 works best among all dictionary sizes. Therefore, we fix the dictionary size to 128 across all experiments. Note that when |B| = 0, our method degenerates to optimizing the unified objective in Eq. 8.2 without the neural dictionary (the CASTLE model in §8.5.3).
How well do the synthesized classifiers compare with multi-class classifiers? To assess the quality of the synthesized classifiers, we compare them against ProtoNet and the multi-class classifier on the head SEEN concepts. To do so, we sample few-shot training instances from each SEEN category to synthesize classifiers (or compute class prototypes for ProtoNet), and then use the synthesized classifiers/class prototypes alone to evaluate multi-class accuracy. The results are shown in Figure E.2. We observe that the learned synthesized classifiers outperform ProtoNet. Also, the model trained with the unified learning objective improves over the vanilla synthesized classifiers. Note that there is still a gap against multi-class classifiers trained on the entire dataset. This suggests that the learned classifier synthesis is effective compared with using instance embeddings alone.
Table E.6: The performance gap between CASTLE variants and a kind of "many-shot" upper bound (denoted as "UB") on MiniImageNet. The ability of FSL classification is measured by the mean accuracy, while the harmonic mean accuracy is used as the criterion for GFSL. 5-Shot classification performance of CASTLE and ACASTLE is listed for comparison.
Setups 5-Way 20-Way
Measures FSL GFSL FSL GFSL
CASTLE 81.98 ± 0.14 76.32 ± 0.09 56.97 ± 0.06 43.06 ± 0.07
ACASTLE 82.08 ± 0.14 78.33 ± 0.09 57.29 ± 0.06 56.33 ± 0.06
UB 87.08 ± 0.10 80.23 ± 0.09 68.25 ± 0.05 68.72 ± 0.12
Different choices of the classifier synthesis As in Eq. 8.3, when there is more than one instance per class in a few-shot task (i.e., K > 1), CASTLE computes the averaged embeddings first, and then uses the prototype of each class as the input to the neural dictionary to synthesize the corresponding classifier. Here we explore another choice for dealing with multiple instances per class: we synthesize classifiers based on each instance first, and then average the corresponding synthesized classifiers for each class. This option amounts to an ensemble strategy that averages the prediction results of each instance's synthesized classifier. We denote the pre-average strategy (the one used in CASTLE) as "Pre-AVG" and the post-average strategy as "Post-AVG". The 5-Shot 5-Way classification results on MiniImageNet for these two strategies are shown in Table E.4. From the results, "Post-AVG" does not noticeably improve the FSL and GFSL performance. Since averaging the synthesized classifiers after synthesis costs more memory during meta-training, we choose the "Pre-AVG" option to synthesize classifiers when there is more than 1 shot per class. In our experiments, the same conclusion also applies to ACASTLE.
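The difference between the two strategies can be summarized with the short snippet below, assuming a hypothetical `synthesize` function that maps a single embedding to a classifier vector via the neural dictionary; only the order of averaging is shown.

```python
import torch

def pre_avg_classifier(synthesize, class_embeddings):
    """Pre-AVG: average the K instance embeddings into a prototype, then synthesize once."""
    prototype = class_embeddings.mean(dim=0)          # (d,)
    return synthesize(prototype)

def post_avg_classifier(synthesize, class_embeddings):
    """Post-AVG: synthesize one classifier per instance, then average the classifiers."""
    classifiers = torch.stack([synthesize(e) for e in class_embeddings])  # (K, d)
    return classifiers.mean(dim=0)
```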
How does multi-classifier learning impact training? Both CASTLE and ACASTLE adopt a multi-classifier training strategy (as described in §8.4), i.e., considering multiple GFSL tasks with different combinations of classifiers in a single mini-batch. In Table E.5, we show the influence of the multi-classifier training method based on the GFSL performance (harmonic mean). It shows that with a larger number of classifiers during training, the performance of CASTLE asymptotically converges to its upper bound. We find that ACASTLE shares a similar trend.
The gap to the performance "Upper Bound" (UB) We focus on the (generalized) few-shot learning scenario where only a budgeted number of examples is available in the UNSEEN class tasks. To show the potential room for improvement in such tasks, we also investigate a kind of upper-bound model where all the available images are used to build the UNSEEN class classifier during the inference stage. We implement the upper-bound model based on ProtoNet, and the results are in Table E.6. Specifically, in the FSL classification scenario, all the UNSEEN class images except those preserved for evaluation are used to build more precise prototypes, and the mean accuracy over 10,000 tasks is recorded; in the GFSL classification scenario, the many-shot UNSEEN class images are utilized as well, and the calibrated harmonic mean is used as the performance measure.
Since the upper bound takes advantage of all the available training images for the few-shot categories, it performs better than the few-shot CASTLE and ACASTLE in all scenarios. The gap between the few-shot learning methods and the upper bound becomes larger when more UNSEEN classes (ways) are involved.
Abstract
Human ability to understand language is general, flexible, and, more importantly, grounded in the physical world. We digest natural language not by looking at the co-occurrence statistics of words in sentences, but by associating its meaning with the corresponding situation and interacting accordingly within the physical environment. Language learning requires going beyond text. In particular, building intelligent agents that understand the meaning of language asks for access to the multi-modal and physical world. Towards this goal, this thesis describes techniques to understand visually grounded concepts and follow instructions in embodied environments. Specifically, we present three primary research directions. The first part of this thesis proposes learning the concepts described by language from perception data, by developing models that associate words, phrases, sentences, and paragraphs with both the static and the temporally expanded visual world. Building on that, the second part focuses on learning the underlying intent of language instructions and proposes models that execute the instructions faithfully in dynamic visual environments to achieve substantial generalization performance. Finally, the third part studies more realistic and challenging learning situations, developing methods to handle learning from long-tailed and growing data distributions. In all three parts, we conduct extensive empirical studies on multiple large-scale datasets and demonstrate the superior performance of the proposed models and learning algorithms.