Transfer Learning for Intelligent Systems in the Wild
by
Wei-Lun Chao
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2018
Copyright 2018 Wei-Lun Chao
Acknowledgments
I would like to express my sincere appreciation to my fantastic mentors, colleagues, friends,
and family for their generous contributions to the work presented in this thesis and toward com-
pleting my Ph.D. degree.
Special mention goes to my enthusiastic adviser, Prof. Fei Sha. My Ph.D. has been an
amazing five-year journey, and I thank Fei for his tremendous academic guidance and support,
giving me many wonderful opportunities to explore and grow. As a mentor, Fei is knowledgeable
and inspiring, always motivating me to think outside the box and challenge myself to the limit. As
a research scientist, Fei holds extremely high standards and never stops pursuing impactful work,
from which I can always derive “gradients” to improve. It was my great honor to work with him
and follow his career path to become a faculty member, and I am especially thankful for the excellent
support and suggestions he provided during my job search.
Similarly, profound gratitude goes to my long-term collaborators and mentors Prof. Kristen
Grauman and Dr. Boqing Gong. Kristen is a deep thinker who can always point out the key aspects
that raise our work to the next level. She helped me set up a productive and solid path of research in
the first half of my Ph.D., and was enthusiastically supportive when I was on the job market. Boqing
is my very best collaborator, friend, and teacher. It was he who guided me, step by step, through every
aspect of doing research. I will never forget those paper deadlines we faced together and how
we used to minimize the overlap of our sleeping time to maintain progress. I am greatly inspired and
motivated by having watched him grow from a hardworking senior Ph.D. student into a productive young faculty member.
Like Fei and Kristen, Boqing offered tremendous support throughout my job search. Thank you.
I thank Prof. Winston Hsu and Prof. Jian-Jiun Ding, who led me into research and prepared
me for the program when I was an M.S. student at NTU, Taiwan. My Ph.D. journey would not
have started without them.
I would like to thank my dissertation defense committee members, Prof. Laurent Itti, Ja-
son Lee, Panayiotis Georgiou, and Joseph Lim, for their interest, valuable time, and helpful
comments. The same appreciation goes to my proposal committee members, Prof. Antonio Or-
tega, Haipeng Luo, and Meisam Razaviyayn. In particular, Haipeng, Meisam, and Joseph set up
mock interviews and job talks for me, greatly improving my preparation for job interviews. I also
want to thank Lizsl De Leon and Jennifer Gerson at the CS department and the Viterbi School for their
professional service and excellent guidance in resolving many issues during my Ph.D.
My sincere gratitude also goes to Prof. Justin Solomon and Dominik Michels, and Dr.
Hoifung Poon, Kris Quirk, and Xiaodong He, who provided me with opportunities to collaborate on
different projects, greatly expanding my research horizons and interests. The experience and skills
I learned from them definitely facilitated my thesis work. I would like to specifically thank Justin for his
help throughout my job search.
During my Ph.D., I was so fortunate to join the TEDS lab and have many talented and fun lab
mates to work and live with. Kuan and Ali brought me many precious memories during the first half
of my Ph.D.; the same appreciation goes to Dong, Yuan, Franziska, and Wenzhe. I thank Zhiyun
for bringing me a great number of enjoyable moments in the lab, and Chao-Kai for many
inspiring theoretical discussions. I also enjoyed chatting about life with both of them. I want to
give special thanks to Beer, Hexiang (Frank), and Ke for their hard work while we collaborated
on zero-shot learning, visual question answering, and video summarization, respectively. Beer
always works hard, thinks hard, plays hard, and sleeps hard. He cares a lot about people and
everything but his daily routine. Frank has endless energy and passion for research and acquiring
knowledge, and Ke is always humble and thoughtful. It was a tremendously productive period
working with them. In addition, I was so grateful to meet many new members in my final
year, including Seb, Shariq, Aaron, Liyu, Ivy, Bowen, Chin-Cheng (Jeremy), Yuri, Yiming,
Melissa, Han-Jia, and Marc. I learned so much from them, not only about research but also about
cultures and lifestyles. I also thank them for giving me opportunities to practice being a mentor.
It was my great pleasure to discuss and work with so many gifted colleagues on diverse research
problems, and I thank all of them for helping me improve my job talks and defense presentation.
Finally, but by no means least, thanks to my family for all their love and encouragement.
This thesis is dedicated to them. My parents have always been supportive of me and my younger sister
Yung-Hsuan in every decision we made, especially studying abroad. It is amazing that we
both studied at USC, obtained our Ph.D. degrees, and attended the commencement in 2018.
My Ph.D. life had many ups and downs, and I really thank them for their company in sharing
my happiness and my low moments. In the end, I would like to thank my loving, encouraging, patient,
and supportive wife Feng-Ju (Claire). We have been life partners for seven years, since our M.S.
studies in Taiwan, and it is she who brings me love and joy, teaches me how to pursue and fulfill
dreams and never give up, and makes me the most fortunate person. Thank you.
Table of Contents
Acknowledgments ii
List of Tables ix
List of Figures xiii
Abstract xviii
I Background 1
1 Introduction 2
1.1 Machine learning for intelligent systems . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Challenges in the wild . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Transfer learning for intelligent systems . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions and outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Published work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.1 Zero-shot learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5.2 Domain generalization for visual question answering . . . . . . . . . . . 7
1.5.3 Other work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
II Zero-shot Learning 9
2 Introduction to Zero-shot Learning 10
2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Problem formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 The framework of algorithms . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Outline of Part II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Literature Survey on Zero-Shot Learning 15
3.1 Semantic representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Visual attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Vector representations (word vectors) of class names . . . . . . . . . . . 16
3.1.3 Textural descriptions of classes . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.4 Hierarchical class taxonomies . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.5 Other forms of semantic representations . . . . . . . . . . . . . . . . . . 18
3.2 Algorithms for conventional ZSL . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Embedding-based methods . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Similarity-based methods . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Other approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Algorithms for generalized ZSL . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Related tasks to zero-shot learning . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.1 Transductive and semi-supervised zero-shot learning . . . . . . . . . . . 25
3.4.2 Zero-shot learning as the prior for active learning . . . . . . . . . . . . . 25
3.4.3 Few-shot learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4 Synthesize Classifier (SynC) for Zero-Shot Learning 26
4.1 Main idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Manifold learning with phantom classes . . . . . . . . . . . . . . . . . . 27
4.2.3 Learning phantom classes . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Comparison to existing methods . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Hyper-parameter tuning: cross-validation (CV) strategies . . . . . . . . . . . . . 31
4.5 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.4 Large-scale zero-shot learning . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.5 Detailed analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.5.6 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5 Generalized Zero-Shot Learning 40
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 Generalized zero-shot learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Conventional and generalized zero-shot learning . . . . . . . . . . . . . 41
5.2.2 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.3 Generalized ZSL is hard . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Approach for GZSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.1 Calibrated stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.2 Area Under Seen-Unseen Accuracy Curve (AUSUC) . . . . . . . . . . . 44
5.3.3 Comparisons to alternative approaches . . . . . . . . . . . . . . . . . . . 45
5.4 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.4.2 Hyper-parameter tuning strategies . . . . . . . . . . . . . . . . . . . . . 46
5.4.3 Which method to use to perform GZSL? . . . . . . . . . . . . . . . . . . 47
5.4.4 Which zero-shot learning approach is more robust to GZSL? . . . . . . . 47
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6 From Zero-Shot Learning to Conventional Supervised Learning 50
6.1 Comparisons among different learning paradigms . . . . . . . . . . . . . . . . . 50
6.2 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
7 Improving Semantic Representations by Predicting Visual Exemplars (EXEM) 55
7.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.1.1 Learning to predict the visual exemplars from the semantic representations 56
7.1.2 Zero-shot learning based on the predicted visual exemplars . . . . . . . . 56
7.1.2.1 Predicted exemplars as training data . . . . . . . . . . . . . . 56
7.1.2.2 Predicted exemplars as the ideal semantic representations . . . 56
7.1.3 Comparison to related approaches . . . . . . . . . . . . . . . . . . . . . 57
7.2 Other details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
7.3 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.3.3 Predicted visual exemplars . . . . . . . . . . . . . . . . . . . . . . . . . 59
7.3.4 Results on the conventional setting . . . . . . . . . . . . . . . . . . . . . 61
7.3.4.1 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.3.4.2 Large-scale zero-shot classification results . . . . . . . . . . . 62
7.3.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3.5 Results on the generalized setting . . . . . . . . . . . . . . . . . . . . . 64
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
III Domain Generalization for Visual Question Answering 66
8 Introduction to Visual Question Answering and Its Challenges 67
8.1 Review of existing work on Visual QA . . . . . . . . . . . . . . . . . . . . . . . 68
8.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
8.1.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.1.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.1.4 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.3 Contributions and outline of Part III . . . . . . . . . . . . . . . . . . . . . . . . 70
9 Creating Better Visual Question Answering Datasets 73
9.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
9.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
9.3 Analysis of decoy answers’ effects . . . . . . . . . . . . . . . . . . . . . . . . . 75
9.3.1 Visual QA models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
9.3.2 Analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.4 Creating better Visual QA datasets . . . . . . . . . . . . . . . . . . . . . . . . . 77
9.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
9.4.2 Comparison to other datasets . . . . . . . . . . . . . . . . . . . . . . . . 78
9.5 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.5.2 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
9.5.3 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.5.4 Additional results and analysis . . . . . . . . . . . . . . . . . . . . . . . 83
9.5.5 Qualitative results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
10 Cross-dataset Adaptation 88
10.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
10.2 Visual QA and bias in the datasets . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.2.1 Visual QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.2.2 Bias in the datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.3 Cross-dataset adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.3.1 Main idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.3.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.3.3 Joint optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.3.4 Related work on domain adaptation . . . . . . . . . . . . . . . . . . . . 95
10.4 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.4.3 Experimental results on Visual7W and VQA . . . . . . . . . . . . . . 96
10.4.4 Experimental results across five datasets . . . . . . . . . . . . . . . . . . 100
10.4.5 Additional experimental results . . . . . . . . . . . . . . . . . . . . . . 101
10.5 Details on the proposed domain adaptation algorithm . . . . . . . . . . . . . . . 103
10.5.1 Approximating the JSD divergence . . . . . . . . . . . . . . . . . . . . 103
10.5.2 Details on the proposed algorithm . . . . . . . . . . . . . . . . . . . . . 103
10.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
11 Learning Answer Embedding for Visual Question Answering 105
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11.2.1 Setup and notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11.2.2 Main idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.2.3 Large-scale stochastic optimization . . . . . . . . . . . . . . . . . . . . 109
11.2.4 Defining the weighting function . . . . . . . . . . . . . . . . . . . . . 110
11.2.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.2.6 Comparison to existing algorithms . . . . . . . . . . . . . . . . . . . . . 110
11.3 Empirical studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.3.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.3.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.3.3 Results on individual Visual QA datasets . . . . . . . . . . . . . . . . . 113
11.3.4 Ablation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
11.3.5 Transfer learning across datasets . . . . . . . . . . . . . . . . . . . . . . 116
11.3.6 Analysis with seen/unseen answers . . . . . . . . . . . . . . . . . . . . 117
11.3.7 Visualization on answer embeddings . . . . . . . . . . . . . . . . . . . . 118
11.3.8 Analysis on answer embeddings . . . . . . . . . . . . . . . . . . . . . . 118
11.3.9 Inference efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
11.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
IV Conclusion 121
12 Conclusion 122
12.1 Remarks on future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
12.1.1 Advanced zero-shot learning . . . . . . . . . . . . . . . . . . . . . . . . 123
12.1.2 Advanced transfer learning for AI . . . . . . . . . . . . . . . . . . . . . 123
12.1.3 Principled frameworks for transferable machine learning . . . . . . . . . 124
V Bibliography 126
Bibliography 127
List of Tables
4.1 Key characteristics of studied datasets . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Comparison between our results and the previously published results in multi-
way classification accuracies (in %) on the task of zero-shot learning. For each
dataset, the best is in red and the 2nd best is in blue. . . . . . . . . . . . . . . . . 34
4.3 Comparison between results by ConSE and our method on ImageNet. For both
types of metrics, the higher the better. . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Comparison between sample- and class-wise cross-validation for hyper-parameter
tuning on CUB (learning with the one-versus-other loss). . . . . . . . . . . . . . 35
4.5 Detailed analysis of various methods: the effect of feature and attribute types on
multi-way classification accuracies (in %). Within each column, the best is in red
and the 2nd best is in blue. We cite both previously published results (numbers in
bold italics) and results from our implementations of those competing methods
(numbers in normal font) to enhance comparability and to ease analysis (see texts
for details). We use the shallow features provided by [106], [83], [142] for AwA,
CUB, SUN, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.6 Effect of types of semantic representations on AwA. . . . . . . . . . . . . . . . . 37
4.7 Effect of learning semantic representations . . . . . . . . . . . . . . . . . . . . . 37
5.1 Classification accuracies (%) on conventional ZSL (A_{U→U}), multi-class classi-
fication for seen classes (A_{S→S}), and GZSL (A_{S→T} and A_{U→T}), on AwA and
CUB. Significant drops are observed from A_{U→U} to A_{U→T}. . . . . . . . . . . . 43
5.2 Comparison of performance measured in AUSUC between two cross-validation
strategies on AwA and CUB. One strategy is based on accuracies (A_{S→S} and
A_{U→U}) and the other is based on AUSUC. See text for details. . . . . . . . . . . 47
5.3 Performances measured in AUSUC of several methods for Generalized Zero-Shot
Learning on AwA and CUB. The higher the better (the upper bound is 1). . . . . 48
5.4 Performances measured in AUSUC by different zero-shot learning approaches on
GZSL on ImageNet, using our method of calibrated stacking. . . . . . . . . . . 48
6.1 Comparison of performances measured in AUSUC between GZSL (using WORD2VEC
and G-attr) and multi-class classification on ImageNet-2K. Few-shot results
are averaged over 100 rounds. GZSL with G-attr improves upon GZSL with
WORD2VEC significantly and quickly approaches multi-class classification per-
formance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Comparison of performances measured in AUSUC between GZSL with WORD2VEC
and GZSL with G-attr on the full ImageNet with over 20,000 unseen classes.
Few-shot results are averaged over 20 rounds. . . . . . . . . . . . . . . . . . . . 54
7.1 We compute the Euclidean distance matrix between the unseen classes based on
semantic representations (D_{a_u}), predicted exemplars (D_{ψ(a_u)}), and real exem-
plars (D_{v_u}). Our method leads to D_{ψ(a_u)} that is better correlated with D_{v_u} than
D_{a_u} is. See text for more details. . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.2 Comparison between existing ZSL approaches in multi-way classification accu-
racies (in %) on four benchmark datasets. For each dataset, we mark the best
in red and the second best in blue. Italic numbers denote per-sample accuracy
instead of per-class accuracy. On ImageNet, we report results for both types of
semantic representations: Word vectors (wv) and MDS embeddings derived from
WordNet (hie). All the results are based on GoogLeNet features [175]. . . . . . . 62
7.3 Comparison between existing ZSL approaches on ImageNet using word vectors
of the class names as semantic representations. For both metrics (in %), the
higher the better. The best is in red. The numbers of unseen classes are listed in
parentheses. †: our implementation. . . . . . . . . . . . . . . . . . . . . . . . . 63
7.4 Comparison between existing ZSL approaches on ImageNet (with 20,842 unseen
classes) using MDS embeddings derived from WordNet [123] as semantic rep-
resentations. The higher, the better (in %). The best is in red. . . . . . . . . . . . 63
7.5 Accuracy of EXEM (1NN) on AwA, CUB, and SUN when predicted exemplars are
from original visual features (No PCA) and PCA-projected features (PCA with
d = 1024 and d = 500). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.6 Comparison between EXEM (1NN) with support vector regressors (SVR) and with
2-layer multi-layer perceptron (MLP) for predicting visual exemplars. Results on
CUB are for the first split. Each number for MLP is an average over 3 random
initializations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.7 Generalized ZSL results in Area Under Seen-Unseen accuracy Curve (AUSUC)
on AwA, CUB, and SUN. For each dataset, we mark the best in red and the
second best in blue. All approaches use GoogLeNet as the visual features and
calibrated stacking to combine the scores for seen and unseen classes. . . . . . . 65
9.1 Accuracy of selecting the right answers out of 4 choices (%) on the Visual QA
task on Visual7W. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.2 Summary of Visual QA datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
9.3 Test accuracy (%) on Visual7W. . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.4 Accuracy (%) on the validation set in VQA. . . . . . . . . . . . . . . . . . . . . 82
9.5 Test accuracy (%) on qaVG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.6 Using models trained on qaVG to improve Visual7W and VQA (Accuracy in %). 84
9.7 Accuracy (%) on VQA
-2014val, which contains 76,034 triplets. . . . . . . . . 84
9.8 Test accuracy (%) on COCOQA. . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.9 Test accuracy (%) on VQA2-2017val. . . . . . . . . . . . . . . . . . . . . . . . 85
9.10 Test accuracy (%) on VQA2
-2017val, which contains 134,813 triplets. . . . . . 86
9.11 Test accuracy (%) on Visual7W, comparing different embeddings for questions
and answers. The results are reported for the IoU +QoU-decoys. . . . . . . . . . 86
9.12 Test accuracy (%) on Visual7W, comparing different random decoy strategies to
our methods: (A) Orig + uniformly random decoys from unique correct answers,
(B) Orig + weighted random decoys w.r.t. their frequencies, and All (Orig+IoU
+QoU). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
10.1 Results of Name That Dataset! . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.2 Various settings for cross-dataset adaptation. The source domain always provides I,
Q, and A (T+D), while the target domain provides the same only during testing. . . 92
10.3 Domain adaptation (DA) results (in %) on original VQA [3] and Visual7W [230].
Direct: direct transfer without DA. [174]: CORAL. [183]: ADDA. Within: ap-
ply models trained on the target domain if supervised data is provided. (best DA
result in bold) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.4 Domain adaptation (DA) results (in %) on revised VQA and Visual7W from
Chapter 9. (best DA result in bold) . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.5 DA results (in %) on original datasets, with target data sub-sampling by 1/16.
FT: fine-tuning. (best DA result in bold) . . . . . . . . . . . . . . . . . . . . . . 98
10.6 DA results (in %) on revised datasets, with target data sub-sampling by 1/16. FT:
fine-tuning. (best DA result in bold) . . . . . . . . . . . . . . . . . . . . . . . . 98
10.7 DA results (in %) on VQA and Visual7W (both original and revised) using a
variant of the SMem model [203]. . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.8 DA results (in %) on VQA and Visual7W (both original and revised) using a
variant of the HieCoAtt model [122]. . . . . . . . . . . . . . . . . . . . . . . . . 100
10.9 Transfer results (in %) across different datasets (the decoys are generated accord-
ing to Chapter 9). The setting for domain adaptation (DA) is on [Q+T+D] using
1/16 of the training examples of the target domain. . . . . . . . . . . . . . . . . 101
10.10 Domain adaptation (DA) results (in %) with or without the discriminative loss
surrogate term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
10.11 OE results (VQA → COCOQA, sub-sampled by 1/16). . . . . . . . . . . . . . 103
10.12 Transfer results (in %) across datasets (the decoys are generated according to
Chapter 9). The setting for domain adaptation (DA) is on [Q+T+D] using all the
training examples of the target domain. . . . . . . . . . . . . . . . . . . . . . . . 103
11.1 Summary statistics of Visual QA datasets. . . . . . . . . . . . . . . . . . . . . . 111
11.2 The answer coverage of each dataset. . . . . . . . . . . . . . . . . . . . . . . . . 112
11.3 Results (%) on Visual QA with different settings: open-ended (Top-K) and multiple-
choice (MC) based for different datasets. The omitted entries are due to their
absence in the corresponding work. . . . . . . . . . . . . . . . . . . . . . . . . 114
11.4 The effect of negative sampling (M = 3,000) on fPMC. The number is the
accuracy in each question type on VQA2 (val). . . . . . . . . . . . . . . . . . . 115
11.5 Detailed analysis of different(a;d) for weighted likelihood. The reported num-
ber is the accuracy on VQA2 (validation). . . . . . . . . . . . . . . . . . . . . . 116
11.6 The # of common answers across datasets (training set) . . . . . . . . . . . . . . 116
11.7 Results of cross-dataset transfer using either classification-based models or our
models (PMC) for Visual QA (f = SAN). . . . . . . . . . . . . . . . . . . . . . 117
11.8 Transferring is improved on the VQA2 dataset without Yes/No answers (and the
corresponding questions) (f = SAN). . . . . . . . . . . . . . . . . . . . . . . . 117
11.9 Analysis of cross dataset performance over Seen/Unseen answers using either
CLS or PMC for Visual QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
11.10 Results for the baseline method that fixes the answer embedding to GloVe. (We show
results with SAN as f(i, q)). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
11.11 Efficiency study among CLS(MLP), uPMC(MLP), and fPMC(MLP). The reported
numbers are the average inference time of a mini-batch of 128 (|T| = 1000). . . 120
List of Figures
1.1 In films that aim to render our future world, intelligent systems that can perform
visual recognition, language understanding, and reasoning on top of the two in-
formation play a significant role and can always catch our eyes. . . . . . . . . . . 2
1.2 The performance trend of machines on classifying among 1,000 common object
in ILSVRC [157]. The numbers in bars are the top-5 error rates. The leading algo-
rithms shown in the figure are SIFT + FVs [161], AlexNet [101], VGGNet [170],
GoogLeNet [175], and ResNet [73]. . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The visual question answering (Visual QA) task [14]: given an image an intelli-
gent system needs to answer questions related to the image. . . . . . . . . . . . . 3
1.4 An illustration on developing a Visual QA system, which involves collecting
training data and performing (supervised) learning. The resulting system then
can be applied to an environment similar to the training data; i.e., answering (rec-
ognizing) familiar questions (objects). . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 An illustration on how a learned system in Fig. 1.4 will fail in the wild to answer
unfamiliar question or recognize unseen objects. “?” means the system refuses
to answer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 An illustration of the long-tailed distribution on object categories in nature images
(from the SUN dataset [202]) [228]. The vertical axis corresponds to the
number of examples. The blue curve in the inset shows a log-log plot, along
with a best-fit line in red. This suggests that the distribution follows a long-tailed
power law. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 An illustration on developing a Visual QA system using transfer learning. The
learned model can not only perform well in an environment similar to the training
data, but transfer its abilities to the wild to further answering unfamiliar questions
and recognize unseen objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 We consider zero-shot learning for visual recognition by ignoring the question. . 10
2.2 An illustration on the possibility of zero-shot learning (ZSL). Given two images
and two categories, Okapi and Araripe Manakin, telling which image belongs
to which class can be hard if we have not seen them before (i.e., a ZSL task).
However, if we are further provided with the class semantic representations (e.g.,
Okapi has stripes and a black body), then the task becomes much simpler to us.
This is because we can visually understand the semantic representation, probably
learned from other animals we have seen before. . . . . . . . . . . . . . . . . . . 11
3.1 Illustration on visual attributes and annotations for (a) animals and (b) human faces. 16
3.2 Illustrations of vector representation of words. (a) Two-dimensional PCA projec-
tion of the skip-gram vectors of countries and their capital cities [131]. (b) t-SNE
visualization [184] of the skip-gram vectors [131] of visual object names [50]. . . 17
3.3 Example annotations on bird and flower images [148]. . . . . . . . . . . . . . . 18
3.4 The class taxonomies of animals [68] derived from WordNet [132, 46]. . . . . . . 18
3.5 An illustration of the DAP model [105, 106]. The figure is from [105, 106] so
that the notations are not the same as the ones defined in the thesis. In the figure,
{a_1, ..., a_M} corresponds to M attributes, {y_1, ..., y_K} corresponds to seen
classes, and {z_1, ..., z_L} corresponds to unseen classes. . . . . . . . . . . . . . 20
3.6 An illustration of the SJE model [6, 5]. The figure is from [5] so that the notations
are not the same as the ones defined in the thesis. In the figure, θ(x) is equivalent
to x in the thesis; φ(y_i) is equivalent to a_{y_i} in the thesis. . . . . . . . . . . . . . 21
3.7 An illustration of the IAP model [105, 106]. The figure is from [105, 106] so
that the notations are not the same as the ones defined in the thesis. In the figure,
{a_1, ..., a_M} corresponds to M attributes, {y_1, ..., y_K} corresponds to seen
classes, and {z_1, ..., z_L} corresponds to unseen classes. . . . . . . . . . . . . . 23
4.1 Illustration of our method SynC for zero-shot learning. Object classes live in
two spaces. They are characterized in the semantic space with semantic repre-
sentations (a's) such as attributes and word vectors of their names. They are also
represented as models for visual recognition (w's) in the model space. In both
spaces, those classes form weighted graphs. The main idea behind our approach
is that these two spaces should be aligned. In particular, the coordinates in the
model space should be the projection of the graph vertices from the semantic
space to the model space—preserving class relatedness encoded in the graph.
We introduce adaptable phantom classes (b and v) to connect seen and unseen
classes—classifiers for the phantom classes are bases for synthesizing classifiers
for real classes. In particular, the synthesis takes the form of convex combination. 27
4.2 Data splitting for different cross-validation (CV) strategies: (a) the seen-unseen
class splitting for zero-shot learning, (b) the sample-wise CV, (c) the class-wise CV. 31
4.3 We vary the number of phantom classes R as a percentage of the number of seen
classes S and investigate how much that will affect classification accuracy (the
vertical axis corresponds to the ratio with respect to the accuracy when R = S).
The base classifiers are learned with Ours^{o-vs-o}. . . . . . . . . . . . . . . . . . . 38
4.4 Qualitative results of our method (Ours^{struct}) on AwA. (Top) We list the 10 unseen
class labels. (Middle) We show the top-5 images classified into each class, ac-
cording to the decision values. Misclassified images are marked with red bound-
aries. (Bottom) We show the first misclassified image (according to the decision
value) into each class and its ground-truth class label. . . . . . . . . . . . . . . . 39
5.1 Comparisons of (a) conventional ZSL and (b) generalized ZSL in the testing
phase—conventional ZSL assumes the absence of seen classes’ instances and
only classifies test instances into one of the unseen classes. The notations follow
those in Section 5.2.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 We observed that seen classes usually give higher scores than unseen classes, even
to an unseen class instance (e.g., a zebra image). We thus introduce a calibration
factor γ, either to reduce the scores of seen classes or to increase those of unseen
classes (cf. eq. (5.2)). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3 The Seen-Unseen accuracy Curve (SUC) obtained by varying γ in the calibrated
stacking classification rule eq. (5.2). The AUSUC summarizes the curve by com-
puting the area under it. We use the method SynC^{o-vs-o} on the AwA dataset, and
tune hyper-parameters as in Table 5.1. The red cross denotes the accuracies by
direct stacking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.4 Comparison between several ZSL approaches on the task of GZSL for AwA and
CUB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Comparison between ConSE and SynC of their performances on the task of
GZSL for ImageNet where the unseen classes are within 2 tree-hops from seen
classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.1 The comparison of zero-shot, few-shot, and conventional supervised learning
(i.e., many-shot learning). For all the paradigms, categories of interest can be
separated into two portions: one with many training examples per class; one with
zero, few, or again many examples. For ZSL, the first (second) portion is called
seen (unseen) classes, and extra class semantic representations a_c are provided.
In our SynC algorithm, we learn a mechanism h to synthesize the classifier w_c
given the corresponding a_c. We can actually learn and apply the same mechanism
to the other paradigms if we have a_c: for example, constructing a_c by averaging
visual features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2 We contrast the performances of GZSL to multi-class classifiers trained with la-
beled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL
uses WORD2VEC (in red) and the idealized visual features (G-attr) as seman-
tic representations (in black). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.1 Given the semantic information and visual features of the seen classes, our method
learns a kernel-based regressor ψ(·) such that the semantic representation a_c of
class c can predict well its class exemplar (center) v_c that characterizes the clus-
tering structure. The learned ψ(·) can be used to predict the visual feature vectors
of the unseen classes for nearest-neighbor (NN) classification, or to improve the
semantic representations for existing ZSL approaches. . . . . . . . . . . . . . . 55
7.2 t-SNE [184] visualization of randomly selected real images (crosses) and pre-
dicted visual exemplars (circles) for the unseen classes on (from left to right)
AwA, CUB, SUN, and ImageNet. Different colors of symbols denote different
unseen classes. Perfect predictions of visual features would result in well-aligned
crosses and circles of the same color. Plots for CUB and SUN are based on their
first splits. Plots for ImageNet are based on randomly selected 48 unseen classes
from 2-hop and word vectors as semantic representations. Best viewed in color. . 60
8.1 The visual question answering (Visual QA) task [14]: given an image an intelli-
gent system needs to answer questions related to the image. . . . . . . . . . . . . 67
8.2 To learn a Visual QA system, we need to collect multiple question-answer pairs
for an image (in black color). However, human language is extremely flexible—
there can be exponentially many distinct questions or answers with respect to the
vocabulary size and the text length. Moreover, people can have different language
styles. As a consequence, there can be many ways of phrasing questions (or
answers) of the same semantic meaning (in gray color). It is thus desirable to
have a system capable of dealing with such unfamiliar language usage. . . . . . 68
8.3 We experiment with knowledge transfer across five popular datasets: VQA [14], Vi-
sual7W [230], Visual Genome (VG) [100], COCOQA [150], and VQA2 [67].
We train a model on one dataset and investigate how well it can perform on the
others. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.4 We introduced a framework by adapting the unfamiliar language usage (target do-
main) to what the learned Visual QA model has been trained on (source domain)
so that we can re-use the model without re-training. . . . . . . . . . . . . . . . . 71
8.5 Denoting i as an image, q as a question, and c as a candidate answer, we aim to
learn a scoring function f(i, q, c) so that it gives a high score if c is the target
answer of the (i, q) pair. We factorize f(i, q, c) into h(i, q) and g(c), in which
we can take advantage of existing joint embeddings of vision and language for
h(i, q). Moreover, g(c) can effectively capture the answer semantics ignored in
many state-of-the-art models. The scoring function is learned to maximize the
likelihood of outputting the target answer from a set of stochastically sampled
candidates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
9.1 An illustration of how the shortcuts in the Visual7W dataset [230] should be
remedied. In the original dataset, the correct answer “A train” is easily selected
by a machine as it is far more often used as the correct answer than the other decoy
(negative) answers. (The numbers in the brackets are probability scores com-
puted using eq. (9.2)). Our two procedures—QoU and IoU (cf. Section 9.4)—
create alternative decoys such that both the correct answer and the decoys are
highly likely by examining either the image or the question alone. In these cases,
machines make mistakes unless they consider all information together. Thus,
the alternative decoys suggested by our procedures are better designed to gauge how
well a learning algorithm can understand all information equally well. . . . . . . 74
9.2 Example image-question-target triplets from Visual7W, VQA, and VG, together
with our IoU-decoys (A, B, C.) and QoU-decoys (D, E, F). G is the target. Ma-
chine’s selections are denoted by green ticks (correct) or red crosses (wrong). . . 87
9.3 Ambiguous examples by our IoU-decoys (A, B, C) and QoU-decoys (D, E, F). G
is the target. Ambiguous decoys F are marked. . . . . . . . . . . . . . . . . . . . 87
10.1 An illustration of the dataset bias in visual question answering. Given the same
image, Visual QA datasets like VQA [14] (right) and Visual7W [230] (left) pro-
vide different styles of questions, correct answers (red), and candidate answer
sets, each of which can contribute to the bias that prevents cross-dataset generalization. 89
10.2 An illustration of the MLP-based model for multiple-choice Visual QA. Given an
IQA triplet, we compute the M(I, Q, C_k) score for each candidate answer C_k.
The candidate answer that has the highest score is selected as the model’s answer. 90
10.3 Domain adaptation (DA) results (in %) with limited target data, under Setting [Q+T+D]
with the weighting hyper-parameter set to 0.1. A sub-sampling rate a means using
1/2^a of the target data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10.4 Qualitative comparison on different types of questions when transferring from
VQA to Visual7W (on the original datasets). . . . . . . . . . . . . . . . . . . . 101
10.5 Results obtained by varying the weighting hyper-parameter on the original VQA and Visual7W datasets, for both the
[Q+T] and [Q+T+D] settings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.1 Conceptual diagram of our approach. We learn two embedding functions to trans-
form an image-question pair (i, q) and a (possible) answer a into a joint embedding
space. The distance (by inner products) between the embedded (i, q) and a is
then measured and the closest a (in red) would be selected as the output answer. . 106
11.2 Detailed analysis on the size of negative sampling to fPMC(MLP) and fPMC(SAN)
at each mini-batch. The reported number is the accuracy on VQA2 (val). . . . . . 115
11.3 t-SNE visualization. We randomly select 1000 answers from Visual7W and vi-
sualize them in the initial answer embedding and learned answer embeddings.
Each answer is marked with a different color according to its question type
(e.g., when, how, who, where, why, what). To make the figure clear for reading,
we randomly sub-sampled the text among those 1000 answers to visualize. . . . 119
11.4 Inference time vs. mini-batch index. The fPMC(MLP) and CLS(MLP) models are
10x faster than uPMC(MLP) (using PyTorch v0.2.0 + Titan XP + CUDA 8 + cuDNN v5). 120
Abstract
Developing intelligent systems for vision and language understanding has long been a crucial
part of how people dream about the future. In the past few years, with access to large-scale
data and advances in machine learning algorithms, vision and language understanding has made
significant progress in constrained environments. However, it remains challenging in unconstrained
environments in the wild, where an intelligent system needs to tackle unseen objects and
unfamiliar language usage that it has not been trained on. Transfer learning, which aims to transfer
and adapt knowledge learned from the training environment to a different but related test
environment, has thus emerged as a promising framework to remedy this difficulty.
In my thesis, I focus on two challenging paradigms of transfer learning: zero-shot learning
and domain adaptation. I will begin with zero-shot learning (ZSL), which aims to expand the
learned knowledge from seen objects, of which we have training data, to unseen objects, of
which we have no training data. I will present an algorithm SynC that can construct the classifier
of any object class given its semantic representation, even without training data, followed by a
comprehensive study on how to apply it to different environments. The study further suggests
directions to improve the semantic representation, leading to an algorithm EXEM that can widely
benefit existing ZSL algorithms.
I will then describe an adaptive visual question answering (Visual QA) framework that builds
upon the insight of zero-shot learning and can further adapt its knowledge to new environments
given limited information. Along the way, we also revisit and revise existing Visual QA datasets
to ensure that a learned model faithfully comprehends and reasons over both the visual and
language information, rather than relying on incidental statistics to perform the task.
For both zero-shot learning for object recognition and domain adaptation for visual ques-
tion answering, we conduct extensive empirical studies on multiple (large-scale) datasets and
experimental settings to demonstrate the superior performance and applicability of our proposed
algorithms toward developing intelligent systems in the wild.
Part I
Background
Chapter 1
Introduction
Intelligent systems have long played the main role in our dreams about the future. While “intel-
ligence” can be defined in many different ways to include the capacity for logic, understanding,
self-awareness, learning, reasoning, planning, and problem solving, to us human beings the most
useful intelligent systems in our daily lives will be those that can fluently interact with us and the
environment via visual perception and natural language. Specifically, in films that aim to render
our future world, most intelligent systems are equipped with the abilities to visually recognize
the environment, understand human language, and reason on top of both sources of information
(see Fig. 1.1). These systems exercise those abilities not only in environments they are familiar
with (e.g., at home or in the office, and with familiar people), but also in the wild—interacting
with new environments and even updating themselves to get familiar with them, just like we humans do.
Figure 1.1: In films that aim to render our future world, intelligent systems that can perform
visual recognition, language understanding, and reasoning on top of both sources of information play a
significant role and can always catch our eye.
Figure 1.2: The performance trend of machines on classifying among 1,000 common object categories
ILSVRC [157]. The numbers in bars are the top-5 error rates. The leading algorithms shown
in the figure are SIFT + FVs [161], AlexNet [101], VGGNet [170], GoogLeNet [175], and
ResNet [73].
Figure 1.3: The visual question answering (Visual QA) task [14]: given an image, an intelligent
system needs to answer questions related to the image.
Beyond just dreaming, it has taken us a great amount of time and effort toward developing
intelligent systems, with several milestones being achieved in the past few years. One striking
success is on visual object recognition, in which an intelligent system (or machine) needs to tell
the object category pictured in an image. In the ImageNet Large Scale Visual Recognition Chal-
lenge (ILSVRC) [157], where there are 1,000 common object categories, a machine can achieve
a 4% top-5 error rate, better than the 5% attained by humans (see Fig. 1.2). We also have question
answering systems like Siri and Amazon Alexa that can interact with humans via natural language.
Building upon these technologies, we are now looking more into systems that can handle multi-
modal information, such as one that can perform visual question answering (Visual QA), in which,
given a visual input (e.g., an image), a machine needs to answer related questions (see Fig. 1.3).
Seen essentially as a form of (visual) Turing test that artificial intelligence should strive to achieve,
Visual QA has attracted a lot of attention lately. Just in the past three years, promising progress
has been made. For example, on the VQA dataset [14], where humans attain an accuracy of 88.5%,
the state-of-the-art model on the multiple-choice task has already achieved 71.4% [227].
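For readers less familiar with the metric reported in Fig. 1.2, the top-5 error counts a prediction as wrong only when the ground-truth class is absent from the model's five highest-scoring classes. The short sketch below is purely illustrative (random scores stand in for a real model's outputs).

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 highest-scoring classes."""
    top5 = np.argsort(-scores, axis=1)[:, :5]        # indices of the 5 largest scores per row
    hits = np.any(top5 == labels[:, None], axis=1)   # is the true label among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 1000))        # scores over 1,000 ILSVRC-style classes
labels = rng.integers(0, 1000, size=8)     # ground-truth class indices
print(f"top-5 error: {top5_error(scores, labels):.2%}")
```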
1.1 Machine learning for intelligent systems
Many of those progresses are attributed to the availability of large-scale training data and the
advance of machine learning algorithms. For example, to develop a Visual QA system, we need
to first collect training data (e.g., many image-question-answer triplets). Then we design the
system's model architecture and perform (supervised) learning to determine the parameters of
the system.¹ If everything goes well—collecting sufficient and high-quality data, designing a
suitable architecture and algorithm, and training the model until convergence—the resulting system
should perform well in a test environment similar to where the training data was collected.
¹The availability of large-scale data allows learning powerful models like deep neural networks [101, 175, 170, 73] and enables the learned model to generalize well to test data (sampled from the same distribution as the training data). For instance, the models for ILSVRC are trained using one million labeled images [38].
Figure 1.4: An illustration of developing a Visual QA system, which involves collecting training
data and performing (supervised) learning. The resulting system can then be applied to an envi-
ronment similar to the training data, i.e., answering (recognizing) familiar questions (objects).
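To make this pipeline concrete, here is a minimal sketch of one supervised training step for a Visual QA model that classifies over a fixed answer vocabulary. The SimpleVQA architecture, feature dimensions, and random tensors standing in for pre-extracted image and question features are all illustrative assumptions, not the models studied in this thesis.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Toy model: fuse image and question features, then classify over a fixed answer set."""
    def __init__(self, img_dim=2048, q_dim=300, hidden=512, num_answers=1000):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hidden)
        self.q_fc = nn.Linear(q_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat):
        # Element-wise (multiplicative) fusion of the two modalities.
        joint = torch.relu(self.img_fc(img_feat)) * torch.relu(self.q_fc(q_feat))
        return self.classifier(joint)

model = SimpleVQA()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a toy mini-batch of image-question-answer triplets.
img_feat = torch.randn(32, 2048)           # stand-in for pre-extracted image features
q_feat = torch.randn(32, 300)              # stand-in for pre-extracted question features
answers = torch.randint(0, 1000, (32,))    # answer indices in a fixed vocabulary

loss = loss_fn(model(img_feat, q_feat), answers)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.3f}")
```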
1.2 Challenges in the wild
A system constructed in such a way, however, may not perform well in the wild to answer un-
familiar questions or recognize unseen objects. For instance, the system learned in Fig. 1.4 will
fail if given an image of a zebra. The system has never seen a zebra before, so the best answer it can
guess is "a horse". On the other hand, if the system is given an unfamiliar question like "What
is the creature called?", then even if the image depicts a familiar object like a horse, the system
might refuse to give any answer. (See Fig. 1.5 for an illustration.) Therefore, in order to develop
a system that can work in the wild, we must resolve these two challenges.
A straightforward solution, following the conventional machine learning pipeline depicted
in Fig. 1.4, is to re-collect training data to cover those unseen and unfamiliar instances and re-
train the system from scratch. This method, however, is practically tedious and costly. There are
exponentially many possible questions and a huge number of object categories (depending on the
granularity). Moreover, many object categories in natural images follow the so-called long-tail
distribution [160, 228]: in contrast to common objects such as household items, they do not occur
frequently enough for us to collect and label a large set of representative images. (See Fig. 1.6 for
an illustration.) Last but not least, re-training models while ignoring the existing ones is simply
time- and computation-consuming. A more efficient and extensible solution is thus desirable.
Figure 1.5: An illustration of how the system learned in Fig. 1.4 fails in the wild when asked to answer
an unfamiliar question or recognize unseen objects. "?" means the system refuses to answer.
Figure 1.6: An illustration of the long-tailed distribution of object categories in natural images
(from the SUN dataset [202]) [228]. The vertical axis corresponds to the number of examples.
The blue curve in the inset shows a log-log plot, along with a best-fit line in red. This suggests
that the distribution follows a long-tailed power law.
1.3 Transfer learning for intelligent systems
In my thesis, I am dedicated to developing intelligent systems using the concept of transfer learn-
ing [139], which seeks to properly transfer data, knowledge, or models from related (training)
environments to the test environment. In our case, this amounts to designing transfer learning
algorithms so that the learned model can not only perform well in an environment similar to the
training data, but also transfer its abilities to the wild (with the help of external knowledge, or a limited
amount of data from the wild) to further answer unfamiliar questions and recognize unseen
objects. Fig. 1.7 gives an illustration.
Figure 1.7: An illustration of developing a Visual QA system using transfer learning. The learned
model can not only perform well in an environment similar to the training data, but also transfer its
abilities to the wild to further answer unfamiliar questions and recognize unseen objects.
To begin with, we should note that the two challenges are fundamentally different. According
to the training data depicted in Fig. 1.7 (on the top), we call zebra an unseen object, and we require
the system not only to recognize it (i.e., be able to tell the category) but also to differentiate it
from visually similar objects like horses. In contrast, we call "What is the creature called?" an
unfamiliar question because it semantically means the same as “What is the animal?”. In this
case, we would like the system to associate the two questions so as to apply its already learned
knowledge (i.e., generate the answer “a horse”).
We thus formulate them via two different paradigms of transfer learning and develop algo-
rithms accordingly. On one end, we view recognizing unseen object classes as a zero-shot learn-
ing (ZSL)² problem [138, 105]—in which no training data is available for (some of) the classes of
interest in the test environment. ZSL thus aims to expand and transfer the classifiers (or, more
abstractly, the learned discriminative knowledge) and the label space from seen classes, of which we
have access to labeled training data, to unseen ones using external class semantic representations.
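To make this setting concrete before the algorithms of Part II, the toy sketch below illustrates the bare mechanics of ZSL with a deliberately simple recipe (a ridge regression from class semantics to visual prototypes, not the SynC or EXEM algorithms developed later): the map is fit on seen classes only, and unseen classes are then recognized without any of their training data. All dimensions and data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 seen and 3 unseen classes, 10-dim semantic vectors, 50-dim visual features.
A_seen = rng.normal(size=(5, 10))     # semantic representations (e.g., attributes) of seen classes
A_unseen = rng.normal(size=(3, 10))   # semantic representations of unseen classes
V_seen = rng.normal(size=(5, 50))     # mean visual features of seen classes (from labeled data)

# Ridge regression from the semantic space to the visual space, fit on seen classes only.
lam = 1.0
W = np.linalg.solve(A_seen.T @ A_seen + lam * np.eye(10), A_seen.T @ V_seen)

# Predict a visual "prototype" for each unseen class from its semantics alone, then label a
# test instance by its nearest predicted prototype -- no unseen-class training data is used.
V_unseen_pred = A_unseen @ W
x_test = V_unseen_pred[1] + 0.1 * rng.normal(size=50)   # a noisy instance of unseen class 1
pred = int(np.argmin(np.linalg.norm(V_unseen_pred - x_test, axis=1)))
print("predicted unseen class:", pred)
```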
On the other end, we view answering unfamiliar questions as a domain adaptation (DA)
problem [64, 62]—the statistical distributions of the training and test data are not identical but related.
Domain adaptation thus aims to bridge (or reduce) the distribution difference so that the knowledge
learned from the training data can be applied to the test data.
²The term "shot" corresponds to the number of training examples for a certain category.
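As a reference point for what "bridging the distribution difference" can look like in its simplest form, the sketch below aligns the second-order statistics of source features to those of target features in the spirit of CORAL [174]. This is a classic baseline rather than the adaptation method developed in Chapter 10, and the Gaussian data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic source/target features whose distributions differ (shifted and rescaled).
Xs = rng.normal(size=(200, 16))
Xt = 1.5 * rng.normal(size=(300, 16)) + 2.0

def coral_align(Xs, Xt, eps=1e-5):
    """Whiten source features, then re-color them with the target covariance and mean."""
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])
    whiten = np.linalg.inv(np.linalg.cholesky(Cs)).T    # maps source covariance to identity
    recolor = np.linalg.cholesky(Ct).T                  # maps identity covariance to target's
    return (Xs - Xs.mean(axis=0)) @ whiten @ recolor + Xt.mean(axis=0)

Xs_aligned = coral_align(Xs, Xt)
gap_before = np.linalg.norm(np.cov(Xs, rowvar=False) - np.cov(Xt, rowvar=False))
gap_after = np.linalg.norm(np.cov(Xs_aligned, rowvar=False) - np.cov(Xt, rowvar=False))
print(f"covariance gap: {gap_before:.2f} -> {gap_after:.2f}")
```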
1.4 Contributions and outline
My thesis provides a comprehensive set of insights and techniques to improve zero-shot learning
(ZSL) for applications in the wild—from effectively leveraging the semantic representations in
relating classes (algorithm design), to revisiting and revising the ZSL setting (evaluation met-
ric), and to unifying ZSL with few-shot learning and applying the insights to improve semantic
representation (connection to other paradigms).
My thesis further provides a series of analyses and techniques to improve knowledge transfer across domains for visual question answering (Visual QA)—from revisiting and revising existing datasets (dataset design), to mitigating domain mismatch while ensuring consistency among modalities (algorithm design), and to developing a probabilistic framework that leverages answer semantics to account for out-of-vocabulary answers (ZSL for Visual QA).
The remainder of the thesis is organized as follows: Part II on zero-shot learning, Part III on domain generalization for visual question answering, and Part IV on the conclusion.
1.5 Published work
1.5.1 Zero-shot learning
Chapter 4 corresponds to our CVPR 2016 paper [26]:
• Soravit Changpinyo*, Wei-Lun Chao*, Boqing Gong, and Fei Sha. Synthesized classifiers
for zero-shot learning. In CVPR, 2016.
Chapter 5 and Chapter 6 correspond to our ECCV 2016 paper [28]:
• Wei-Lun Chao*, Soravit Changpinyo*, Boqing Gong, and Fei Sha. An empirical study
and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV,
2016.
Chapter 7 corresponds to our ICCV 2017 paper [27]:
• Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. Predicting visual exemplars of unseen
classes for zero-shot learning. In ICCV, 2017.
1.5.2 Domain generalization for visual question answering
Chapter 9 corresponds to our NAACL 2018 paper [30]:
• Wei-Lun Chao*, Hexiang Hu*, and Fei Sha. Being Negative but Constructively: Lessons
Learnt from Creating Better Visual Question Answering Datasets. In NAACL, 2018.
Chapter 10 corresponds to our CVPR 2018 paper [31]:
• Wei-Lun Chao*, Hexiang Hu*, and Fei Sha. Cross-dataset adaptation for visual question
answering. In CVPR, 2018.
Chapter 11 corresponds to our CVPR 2018 paper [78]:
• Hexiang Hu*, Wei-Lun Chao*, and Fei Sha. Learning answer embeddings for visual ques-
tion answering. In CVPR, 2018.
1.5.3 Other work
Besides the publications relevant to my thesis, we have also published other research accom-
plishments in NIPS 2014 [61], ICML 2015 [32], UAI 2015 [29], CVPR 2016 [216], and ECCV
2016 [217].
• Boqing Gong*, Wei-Lun Chao*, Kristen Grauman, and Fei Sha. Diverse sequential subset selection for supervised video summarization. In NIPS, 2014.
• Wei-Lun Chao, Justin Solomon, Dominik L. Michels, and Fei Sha. Exponential integration for Hamiltonian Monte Carlo. In ICML, 2015.
• Wei-Lun Chao*, Boqing Gong*, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In UAI, 2015.
• Ke Zhang*, Wei-Lun Chao*, Fei Sha, and Kristen Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In CVPR, 2016.
• Ke Zhang*, Wei-Lun Chao*, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In ECCV, 2016.
*: Equal contributions
Part II
Zero-shot Learning
Chapter 2
Introduction to Zero-shot Learning
In this part, we will focus on zero-shot learning (ZSL). Building upon the flow chart in Fig. 1.7, we make an assumption: the question is always “What is the animal (or object, scene, etc.)?”. We can thus ignore the question, leading to a visual recognition task (see Fig. 2.1). We will reconsider different questions in Part III.
Figure 2.1: We consider zero-shot learning for visual recognition by ignoring the question.
In contrast to conventional supervised learning for visual recognition, where both the training
and test data (e.g., images and the corresponding category labels) are assumed to come from
the same distribution, zero-shot learning (ZSL) distinguishes between two types of classes: seen
classes, of which we have access to labeled training data, and unseen ones, of which no training
data are available. ZSL then aims to transfer and adapt the classifiers (or more abstractly, learned
discriminative knowledge) and label space from seen classes to unseen ones.
To this end, we need to address two key interwoven issues [138]: (1) how to relate unseen
classes to seen ones and (2) how to attain discriminative performance on the unseen classes even
though we do not have their labeled data. Existing literature assumes the availability of class
semantic representations for both types of classes, e.g., human-annotated attributes of classes [44,
105, 140], word vectors of class names [131, 130, 143], textual descriptions of each class [148],
and hierarchical class taxonomies [132, 46]. Such representations provide the cue to relate classes
Figure 2.2: An illustration of the possibility of zero-shot learning (ZSL). Given two images and two categories, Okapi and Araripe Manakin, telling which image belongs to which class can be hard if we have not seen them before (i.e., a ZSL task). However, if we are further provided with the class semantic representations (e.g., Okapi has stripes and a black body), then the task becomes much simpler. This is because we can visually understand the semantic representations, probably learned from other animals we have seen before.
and design algorithms to recognize unseen ones. (See Fig. 2.2 for an illustration.) The learning
problem of ZSL can thus be formulated as follows.
2.1 Definition
2.1.1 Notations
Denote by $\mathcal{S} = \{1, 2, \cdots, S\}$ the label space of seen classes and $\mathcal{U} = \{S+1, \cdots, S+U\}$ the label space of unseen classes. We use $\mathcal{T} = \mathcal{S} \cup \mathcal{U}$ to represent the union of the two sets of classes. We then denote by $P_{\mathcal{S}}(y)$ the distribution on $\mathcal{S}$, $P_{\mathcal{U}}(y)$ the distribution on $\mathcal{U}$, $P_{\mathcal{T}}(y) = \alpha P_{\mathcal{S}}(y) + (1-\alpha) P_{\mathcal{U}}(y)$ the distribution on $\mathcal{T}$ (with $1 > \alpha > 0$), and $P_{X|Y}(x|y)$ the conditional feature distribution on $x \in \mathbb{R}^D$ given $y \in \mathcal{T}$, where $D$ is the dimensionality of features. Finally, we denote by $a_c \in \mathcal{A}$ the semantic representation of class $c \in \mathcal{T}$.¹
¹ Some existing approaches assume the availability of a similarity $s_{i,j}$ between a pair of classes $i$ and $j$. In this case, we can assume $s_{i,j} = s(a_i, a_j)$, where $s(\cdot,\cdot)$ is a certain similarity measure and $a_i$ and $a_j$ are derived accordingly.
2.1.2 Problem formulations
In ZSL, we are given the training data $\mathcal{D}_{\text{tr}} = \{(x_n \in \mathbb{R}^D, y_n)\}_{n=1}^N$, where $(x_n, y_n)$ is i.i.d. sampled from $P_{\mathcal{S}}(y) P_{X|Y}(x|y)$. That is, the label space of $\mathcal{D}_{\text{tr}}$ is $\mathcal{S} = \{1, 2, \cdots, S\}$. Additionally, we are given for each $c \in \mathcal{S}$ the corresponding $a_c$. The goal of ZSL is to learn from $\mathcal{D}_{\text{tr}}$ and $\{a_c\}_{c=1}^S$ so that in testing, given $\{a_c\}_{c=S+1}^{S+U}$ that correspond to the unseen classes $c \in \mathcal{U}$, we can further recognize instances of the unseen classes.²
According to how the test instances are generated, ZSL can be categorized into conventional ZSL and generalized ZSL:
• Conventional ZSL: A test instance $(x, y)$ is sampled from $P_{\mathcal{U}}(y) P_{X|Y}(x|y)$. That is, test instances come only from the unseen classes $\mathcal{U}$ and we only classify them among $\mathcal{U}$, implying the absence of seen classes' instances in the test environment.
• Generalized ZSL: A test instance $(x, y)$ is sampled from $P_{\mathcal{T}}(y) P_{X|Y}(x|y)$. That is, test instances can come from both seen and unseen classes. The label space is thus the union of them (i.e., $\mathcal{T}$).
So far, most of the existing work focuses on the conventional setting.
2.1.3 The framework of algorithms
Most ZSL algorithms, although not obviously so at first glance, aim to learn a scoring function $f(a, x): \mathcal{A} \times \mathcal{X} \mapsto \mathbb{R}$ so that we can assign a label $\hat{y}$ to the instance $x$ by
$\hat{y} = \arg\max_{c \in \mathcal{U}} f(a_c, x)$   (for conventional ZSL),
$\hat{y} = \arg\max_{c \in \mathcal{T}} f(a_c, x)$   (for generalized ZSL).
Some recently published work takes an alternative way of thinking, aiming to generate instances (images or visual features) of the unseen classes given the semantic representations. Conventional supervised learning algorithms can then be applied to train classifiers. We will provide more details on these algorithms in Chapter 3.
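To make the decision rule above concrete, the following is a minimal sketch (not tied to any particular method in this thesis), assuming a hypothetical bilinear scoring function $f(a_c, x) = a_c^\top W x$ with an already-learned matrix $W$:

    # A minimal sketch of the generic ZSL decision rule in Section 2.1.3, assuming a
    # hypothetical bilinear scoring function f(a_c, x) = a_c^T W x with a learned W.
    import numpy as np

    def zsl_predict(x, candidate_semantics, W):
        """x: (D,) visual feature; candidate_semantics: (C, K) semantic vectors a_c of
        the candidate classes (unseen only for conventional ZSL, seen plus unseen for
        the generalized setting); W: (K, D) compatibility matrix."""
        scores = candidate_semantics @ (W @ x)   # f(a_c, x) for every candidate class c
        return int(np.argmax(scores))            # index of the highest-scoring class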
2.2 Challenges
According to the problem formulations and the framework of algorithms presented above, ZSL
has several challenges categorized as follows.
Class semantic representations While different forms of class semantic representations have been exploited and compared in the existing literature, it remains unclear what the ideal semantic representations would be, and how to improve existing ones or design better ones accordingly.
² Some existing approaches assume the availability of $\{a_c\}_{c=S+1}^{S+U}$ in training, or perform training once $\{a_c\}_{c=S+1}^{S+U}$ is given. These approaches may need to store the training data or retrain the models when unseen classes change.
Algorithms Designing algorithms is the main focus of ZSL research in the literature. The challenges are how to effectively leverage the given semantic representations to relate classes, and how to define and learn the scoring function $f(a, x)$ presented in Section 2.1.3.
Experimental settings As mentioned in Section 2.1, most of the existing work focuses on the
conventional setting, in which the test environment only contains instances of unseen categories.
In real-world applications, categories that have available training data are likely the commonly-
seen ones. It is thus unrealistic to assume their absence in the test environment. It is important
to investigate how the existing work can be applied to the generalized setting. In our studies
(Chapter 5), we show that naively combining the scoring functions of seen and unseen classes as
in Section 2.1.3 leads to poor performance for the generalized setting.
The performance gap to supervised learning While much effort has been committed to ZSL and the results on benchmark datasets have improved significantly in the past few years, it remains unclear whether the state-of-the-art performance is good enough compared to training classifiers with labeled data of all the classes of interest. In the worst case, if the performance gap between these two paradigms (i.e., zero-shot learning vs. supervised learning) is large, it may imply that we should put more effort into collecting and labeling data instead, or that the existing literature misses essential factors in exploiting semantic representations or designing algorithms.
Theoretical foundations Last but not least, compared to conventional supervised learning, which has solid theoretical foundations with performance guarantees, zero-shot learning so far has no such notions, making it a rather ad hoc or empirical topic in machine learning.
[41, 167, 107, 53] have also pointed out other challenges including hubness and domain shift.
2.3 Contributions
We provide a comprehensive set of insights and techniques to improve zero-shot learning—from a principled algorithm to effectively leverage the semantic representations in relating classes (Chapter 4), to revisiting and revising the ZSL settings and evaluation metrics (Chapter 5), to investigating the gap between ZSL and conventional supervised learning as well as suggesting the ideal form of semantic representations (Chapter 6), and to improving the class semantic representations by incorporating visual domain knowledge (Chapter 7).
2.4 Outline of Part II
The remainder of this part is organized as follows:
Chapter 3 is a survey on zero-shot learning. We present the class semantic representations, algorithms, and settings of the existing work in the literature, and discuss tasks related to zero-shot learning.
13
Chapter 4 presents our synthesized classifiers (SynC) to construct the classifier of any class
given its semantic representation for ZSL.
Chapter 5 presents our studies on generalized ZSL, together with an effective calibration
framework to balance recognizing seen and unseen classes as well as a metric called Area Under
Seen and Unseen Curve (AUSUC) to characterize such a trade-off.
Chapter 6 presents the relationship and investigates the performance gap among zero-shot learning, few-shot learning, and conventional supervised learning. We show that, by designing the semantic representations in a certain way, the performance gap can be largely reduced.
Chapter 7 builds upon Chapter 6 and introduces a novel approach to improve semantic repre-
sentations by learning a mapping from the original representations to the average visual features.
Chapter 3
Literature Survey on Zero-Shot Learning
Zero-shot learning (ZSL) has attracted significant attention in recent years from computer vi-
sion [106, 197, 105, 128, 129, 5, 6, 189], machine learning [171, 156, 83, 136, 75, 91, 172,
138, 207, 112], and artificial intelligence [178, 146, 185]. The major focus is on the classi-
fication problem, the one formulated in Chapter 2, while some others work on reinforcement
learning [136, 75], imitation learning [141], generative models [91, 149], and visual question an-
swering and captioning [178, 146, 185]. In this chapter, we provide a survey on zero-shot learning
for classification¹—including semantic representations, algorithm design, and relations to other learning paradigms.
3.1 Semantic representations
Semantic representations are the essential information for performing zero-shot learning—without them we have no guidance on how to apply, transfer, or adapt the knowledge learned from the labeled training data to a test environment whose label space has little overlap with that of the training data. Since the goal of zero-shot learning for classification is to classify instances of classes that have no training data—a situation that typically results from rare observations or the high cost of collecting and annotating data—the semantic representations should be extracted from resources or modalities different from the training data. For example, in visual object recognition, the labeled data are images and their corresponding class labels. Therefore, the semantic representations are usually derived from textual descriptions of the classes. In the following we review the semantic representations that have been developed and exploited for visual recognition.
3.1.1 Visual attributes
Visual attributes are properties (e.g., shapes, materials, colors, textures, parts) commonly ob-
servable from the appearances of objects [44, 105, 140, 47, 187, 52] (or scenes [142], human
faces [103, 102], etc.) that have human-designated names (e.g., “cylindrical”, “furry”, “red”,
¹ We specifically focus on image-based visual recognition.
(a) Attributes for animals [105] (b) Attributes for human faces [103]
Figure 3.1: Illustration on visual attributes and annotations for (a) animals and (b) human faces.
“striped”, “four-legged”). See Fig. 3.1 for an illustration. A good dictionary of visual attributes should contain vocabularies that (1) collectively can concisely and discriminatively describe an object² and (2) individually are shared among several objects. These properties make visual attributes a compelling way to visually and semantically represent an object (or object class) and to measure the similarity (or difference) among objects. There has been extensive work in the computer vision literature on how to design a good dictionary and detect attributes from object appearances, and on how attributes can benefit visual recognition or other applications [44, 140, 47, 102, 118, 206, 19, 212, 145].
For zero-shot learning, we directly take the pre-defined dictionary and the ground-truth attribute annotations at the class level (mostly done by humans, especially domain experts) as the class semantic representations [105, 187, 142, 223]. We note that such human effort is nearly unavoidable—since we have no labeled images for the unseen classes, we can only rely on humans' semantic understanding or visual experience to annotate attributes for those classes. This fact makes visual attributes a less practical and attractive choice for zero-shot learning when a massive number of unseen classes must be accounted for. Nevertheless, visual attributes have so far been the most popular semantic representations in existing work.
3.1.2 Vector representations (word vectors) of class names
How to represent a word (or phrase) has long been a core task in natural language processing—good representations should faithfully describe the semantic similarities and differences among words. Vector representations (also known as word vectors) learned from word co-occurrence statistics on large-scale text corpora (such as Wikipedia or news corpora) have been shown to be a concise and powerful way to represent words [131, 130, 143]. See Fig. 3.2 for an illustration.
² An object can be described by a vector with the dictionary size as the dimensionality. Each entry indicates the existence ($\{0,1\}$) or the relative strength or observation probability ($\mathbb{R}$ or $[0,1]$) of the corresponding attribute.
(a) On countries and their capital cities [131]. (b) On visual object names [50].
Figure 3.2: Illustrations of vector representation of words. (a) Two-dimensional PCA projection
of the skip-gram vectors of countries and their capital cities [131]. (b) t-SNE visualization [184]
of the skip-gram vectors [131] of visual object names [50].
For zero-shot learning on visual recognition, as long as the class names show up in the corpora, we can automatically learn and extract the corresponding word vectors to serve as the class semantic representations³. Compared to visual attributes, word vectors of class names require much less human effort and are more suitable for large-scale tasks. However, since they are learned from word co-occurrence (or other objectives defined solely on the text) without explicitly taking visual information into account, they may not describe the visual similarity among objects as well as visual attributes do.
Word vectors have been used as semantic representations in much recent work [135, 50, 172,
55, 54, 9, 37, 191]. See [6] for a comparison on word vectors learned using different objectives.
3.1.3 Textual descriptions of classes
Instead of treating class names as words in a corpus, we can derive class semantic representations from textual descriptions of the classes, e.g., from each class's Wikipedia page. This idea has been used in [111, 42, 144, 148, 8]. In [148], the authors specifically collect a short textual description for each image of birds or flowers (see Fig. 3.3). To represent the textual descriptions, existing work uses bag-of-words representations (e.g., term frequency-inverse document frequency feature vectors) [42, 111, 144, 229] or encodes the text sequences by recurrent neural networks (RNN) or convolutional neural networks (CNN) [148]. One exception is [8], which discovers visual terms from documents and represents each class in a similar way to visual attributes.
³ For class names that contain multiple words (e.g., “baseball player”), we can treat the whole name as a single new token and re-learn the word vectors.
Figure 3.3: Example annotations on bird and flower images [148].
Figure 3.4: The class taxonomies of animals [68] derived from WordNet [132, 46].
3.1.4 Hierarchical class taxonomies
Hierarchical class taxonomies, such as WordNet [132, 46] or domain-specific taxonomies for
animals and plants, provide another source of information to relate classes. See Fig. 3.4 for
an illustration. In [123], Lu computes the shortest paths between class names on the WordNet hierarchy, transforms the path lengths into similarities, and performs multidimensional scaling (MDS) to obtain the class semantic representations. [5] constructs a binary vector of the size of the total number of nodes in the hierarchy to represent each leaf class $j$: the $i$-th element is 1 if the corresponding node is the leaf class $j$ (i.e., $i = j$) or its ancestor, and 0 otherwise. [6] constructs a real-valued vector of the same size as in [5]; the elements encode the similarities (computed from the hierarchy) between a leaf class and all the nodes. In [7], Al-Halah and Stiefelhagen propose a hierarchical attribute transfer method that combines visual attributes and class taxonomies for zero-shot learning. [155, 154] also consider extracting semantic representations or relationships among classes from the hierarchy. We note that class taxonomies are usually constructed by human experts. Therefore, they might suffer from the same difficulty as visual attributes.
3.1.5 Other forms of semantic representations
There are also other forms of semantic representations developed in the existing literature. [155, 154, 127] use search hit counts or text snippets from the World Wide Web. [43, 4] utilize part information (e.g., part descriptions or localizations) to obtain high-quality semantic representations for fine-grained classes like bird species, while [92, 120] investigate the use of gazes and similes. Knowledge bases (or graphs) have also been used recently to model relationships among classes [109, 191]. Finally, [84, 166] specifically work on how to combine multiple semantic representations for enhanced performance.
3.2 Algorithms for conventional ZSL
With the class semantic representations as well as the labeled training data of seen classes, zero-
shot learning algorithms are then designed to leverage such information so as to obtain discrimi-
native performance on unseen classes. According to how the semantic representations are being
used, existing algorithms can roughly be categorized into (1) embedding-based methods and (2)
similarity-based methods. In this section we survey several representative algorithms of each
category. We will also discuss some methods that do not fall into either category.
We note that some methods require the similarity (or relatedness) between pairs of classes as the semantic cue to relate classes (i.e., $s_{ij}$ for classes $i$ and $j$). By assuming that there exist a certain similarity measure $s(\cdot,\cdot)$ and class semantic representations $\{a_c\}_{c=1}^{S+U}$ such that $s_{ij} = s(a_i, a_j)\ \forall\, i, j \in \mathcal{T}$, we can still view those methods as taking $\{a_c\}_{c=1}^{S+U}$ as the semantic cues to perform zero-shot learning.
Without loss of generality, in the following we treat $a_c$ as either a binary or a real-valued vector.
3.2.1 Embedding-based methods
In the embedding-based approaches, one first maps the input image representation $x$ to the semantic representation space $\mathcal{A}$, and then infers the class label in this space by various similarity (or relatedness) measures to the unseen classes' semantic representations [5, 6, 50, 51, 53, 97, 106, 116, 135, 172, 192]—essentially a two-stage procedure. In other words, denoting by $m: \mathcal{X} \mapsto \mathcal{A}$ such a mapping and by $s(\cdot,\cdot)$ the similarity measure on $\mathcal{A}$, embedding-based approaches predict the label for the input feature vector $x$ by
$\hat{y} = \arg\max_{c \in \mathcal{U}}\ s(m(x), a_c).$   (3.1)
Note that $s(\cdot,\cdot)$ can be asymmetric. Approaches in this category differ in how they define $s(\cdot,\cdot)$ and how they learn the mapping $m(\cdot)$.
The concept behind this category of approaches is that the semantic representations, together with the measure $s(\cdot,\cdot)$, can capture the similarities among both seen and unseen classes. Moreover, each element of the representations has a certain meaning that is shared across multiple categories. For example, if the representations are attributes (i.e., each entry of $a_c$ corresponds to the existence of a certain attribute or not), there should be multiple classes sharing one attribute. Therefore, even though we do not have labeled images of unseen classes in training, we can still recognize those classes by detecting whether the test image has certain attributes—the attribute detectors can be learned from images of seen classes.
Direct attribute prediction (DAP) DAP [105, 106] assumes that the $K$-dimensional class semantic representations are binary vectors, and builds for each element $a[k]$ (i.e., the $k$-th attribute) a probabilistic detector $p(a[k]\,|\,x)$.
Figure 3.5: An illustration of the DAP model [105, 106]. The figure is from [105, 106], so the notations are not the same as the ones defined in the thesis. In the figure, $\{a_1, \cdots, a_M\}$ corresponds to $M$ attributes, $\{y_1, \cdots, y_K\}$ corresponds to seen classes, and $\{z_1, \cdots, z_L\}$ corresponds to unseen classes.
We use $a$ here to denote a random attribute vector, and $a_c$ the vector corresponding to class $c$. DAP then defines the posterior $p(c|x)$ on class label $c$ given $x$ as⁴
$p(c|x) = \sum_{a \in \{0,1\}^K} p(c|a)\, p(a|x) = \frac{p(c)}{p(a_c)} \prod_{k=1}^K p(a_c[k]\,|\,x),$   (3.2)
where normally the prior $p(c)$ is set as uniform over unseen classes, and $p(a_c) = \prod_{k=1}^K p(a_c[k])$ can be estimated empirically from the training data. The $k$-th detector $p(a[k]\,|\,x)$ can be learned by training a logistic regression to separate images of the seen classes that have the $k$-th attribute from images of those that do not. The maximum a posteriori (MAP) rule is then applied to assign the class label to $x$:
$\hat{y} = \arg\max_{c \in \mathcal{U}}\ p(c|x)$   (3.3)
$\quad\ = \arg\max_{c \in \mathcal{U}}\ \prod_{k=1}^K \frac{p(a_c[k]\,|\,x)}{p(a_c[k])}$   (3.4)
$\quad\ = \arg\max_{c \in \mathcal{U}}\ \prod_{k=1}^K \frac{p(a[k]=1\,|\,x)^{a_c[k]}\, \big(1 - p(a[k]=1\,|\,x)\big)^{(1 - a_c[k])}}{p(a_c[k])}.$   (3.5)
See Fig. 3.5 for an illustration.
In terms of Eq. (3.1), in DAP we have
$m(x) = [p(a[1]=1\,|\,x), \cdots, p(a[K]=1\,|\,x)]^\top,$   (3.6)
$s(m(x), a_c) = \prod_{k=1}^K \frac{m(x)[k]^{a_c[k]}\, (1 - m(x)[k])^{(1 - a_c[k])}}{p(a_c[k])}.$   (3.7)
⁴ The assumption $p(a_{c'}\,|\,c) = 1$ if $c' = c$, and $0$ otherwise, is imposed.
Figure 3.6: An illustration of the SJE model [6, 5]. The figure is from [5], so the notations are not the same as the ones defined in the thesis. In the figure, $\theta(x)$ is equivalent to $x$ in the thesis, and $\varphi(y_i)$ is equivalent to $a_{y_i}$ in the thesis.
DAP is among the first few algorithms for zero-shot learning and has a probabilistic interpretation. However, it cannot be directly applied to real-valued semantic representations. Moreover, DAP learns $p(a[k]\,|\,x)$ to minimize the error of predicting attributes, which does not necessarily guarantee good performance on predicting class labels. See [192, 83, 7] for extensions of DAP that incorporate correlations among attributes, account for the unreliability of attribute predictions, and define better decision rules based on hierarchical attribute transfer.
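To make the MAP rule concrete, below is a minimal sketch (not the authors' code) of DAP-style scoring from Eqs. (3.2)–(3.5), assuming the per-attribute probabilities $p(a[k]=1|x)$ have already been estimated, e.g., by $K$ independent logistic-regression attribute detectors trained on seen-class images:

    # A minimal sketch of DAP-style MAP scoring (Eqs. (3.2)-(3.5)), in log space.
    import numpy as np

    def dap_predict(attr_probs, unseen_attr, attr_prior, eps=1e-12):
        """attr_probs : (K,)  p(a[k]=1|x) for the test image x
           unseen_attr: (U, K) binary attribute signatures a_c of the unseen classes
           attr_prior : (K,)  empirical attribute priors p(a[k]=1) from seen-class data
           Returns the index of the unseen class with the highest posterior."""
        p = np.clip(attr_probs, eps, 1 - eps)
        prior = np.clip(attr_prior, eps, 1 - eps)
        # log p(a_c[k]|x) per class/attribute, and log p(a_c[k]) under the attribute prior
        log_like = unseen_attr * np.log(p) + (1 - unseen_attr) * np.log(1 - p)
        log_prior = unseen_attr * np.log(prior) + (1 - unseen_attr) * np.log(1 - prior)
        scores = (log_like - log_prior).sum(axis=1)   # log of Eq. (3.5), up to constants
        return int(np.argmax(scores))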
Structured joint embedding (SJE) SJE [6] learns a linear mapping $W \in \mathbb{R}^{D \times K}$ from the input image features $x$ to the semantic representation space (i.e., $m(x) = W^\top x$). It then measures the similarity between the image and the class $c$ by the inner product between the mapped features $W^\top x$ and $a_c$ (i.e., $x^\top W a_c$). The class decision rule is thus
$\hat{y} = \arg\max_{c \in \mathcal{U}}\ x^\top W a_c.$   (3.8)
SJE applies the structured SVM formulation to learn $W$ from the seen classes' data:
$\min_W\ \sum_n \max\big(0,\ \max_{c \in \mathcal{S}} \Delta(c, y_n) + x_n^\top W a_c - x_n^\top W a_{y_n}\big) + \lambda\, \Omega(W),$   (3.9)
where $\Omega(\cdot)$ is a certain regularizer on $W$. In [6], $\Delta(c, y_n) = \mathbb{1}[c \neq y_n]$ and $\lambda = 0$, and stochastic gradient descent with early stopping is applied to optimize $W$.
SJE, compared to DAP, can be applied to real-valued semantic representations, and $W$ is learned to optimize the class prediction loss on the seen classes' data. There are several other methods that share these advantages with SJE, including but not limited to [5, 50, 172, 156, 199, 54, 98, 115, 204, 133, 114, 215]. The main differences among these methods are in how they define the mapping $m(\cdot)$, the similarity, the loss function, and the regularization term. For example, [5] adapts the WSABIE loss [193], originally proposed for ranking. [172] minimizes the $\ell_2$ distance between $m(x_n)$ and $a_{y_n}$, where $m(\cdot)$ is modeled by a multi-layer perceptron (MLP). [24] considers learning a metric for minimizing the Mahalanobis distance. [156] adopts a regression loss and a special regularizer that jointly lead to a closed-form solution for $W$. [98, 33] exploit a reconstruction loss (inspired by auto-encoders) as a regularizer. [199, 43, 4] learn multiple $W$s at the same time to boost the performance.
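The following is a minimal sketch (not the authors' implementation) of SJE-style training for Eqs. (3.8)–(3.9): a bilinear compatibility $x^\top W a_c$ learned with the structured hinge loss by plain stochastic gradient descent, using $\Delta(c, y) = \mathbb{1}[c \neq y]$ and no regularizer as in [6]:

    # A minimal sketch of SJE-style training and prediction (Eqs. (3.8)-(3.9)).
    import numpy as np

    def train_sje(X, Y, A_seen, lr=0.01, epochs=10, seed=0):
        """X: (N, D) features; Y: (N,) seen-class labels in {0..S-1}; A_seen: (S, K)."""
        rng = np.random.default_rng(seed)
        D, K = X.shape[1], A_seen.shape[1]
        W = 1e-3 * rng.standard_normal((D, K))
        for _ in range(epochs):
            for n in rng.permutation(len(X)):
                x, y = X[n], Y[n]
                scores = x @ W @ A_seen.T                 # compatibility with every seen class
                margins = (np.arange(len(A_seen)) != y) + scores - scores[y]
                c = int(np.argmax(margins))               # most-violating class
                if margins[c] > 0 and c != y:             # hinge active: subgradient step
                    W -= lr * np.outer(x, A_seen[c] - A_seen[y])
        return W

    def sje_predict(x, W, A_unseen):
        return int(np.argmax(x @ W @ A_unseen.T))         # Eq. (3.8) over unseen classes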
21
Other methods Eq. (3.1) can be extended to
$\hat{y} = \arg\max_{c \in \mathcal{U}}\ s(m(x), g(a_c)),$   (3.10)
where $m: \mathcal{X} \mapsto \mathcal{B}$ and $g: \mathcal{A} \mapsto \mathcal{B}$ are learnable mappings that map the image features $x$ and the semantic representation $a$ into a joint embedding space $\mathcal{B}$, in which the similarity can be faithfully measured or noise in $x$ and $a$ can be suppressed. This extension has been considered in [207, 111, 220, 123, 221, 144, 148, 40, 86]. The mappings are mostly learned to minimize the classification error on the training data. The loss function and the forms of the mappings need careful design to prevent over-fitting to seen classes and poor generalization to unseen classes. For example, [144] learns linear mappings and imposes a sparsity constraint so as to robustly extract informative words from the documents to describe a class.
3.2.2 Similarity-based methods
In the similarity-based approaches, in contrast, one builds the classifiers for unseen classes by relating them to seen ones via class-wise similarities [55, 60, 127, 154, 155, 135, 105, 106]. Denote by $h_c: \mathcal{X} \mapsto \mathbb{R}$ or $[0, 1]$ the scoring function (classifier) for each seen class $c \in \mathcal{S}$. Similarity-based approaches construct the classifier of an unseen class $u \in \mathcal{U}$ by defining or learning a mapping $\psi$ so that $h_u = \psi(a_u, \{a_c\}_{c=1}^S, \{h_c(\cdot)\}_{c=1}^S)$. The concept behind this category of approaches is that the class semantic representations can convey the relatedness among classifiers, especially between the seen classes' classifiers and the unseen ones'.
One common drawback of the similarity-based approaches is that the learned seen class clas-
sifiers are not optimized to transfer the discriminative knowledge among classes.
Indirect attribute prediction (IAP) IAP [105, 106] is very similar to DAP: both estimate $p(a[k]\,|\,x)$ for the $K$ attributes and then apply the MAP rule to assign class labels to $x$. The main difference is in how to estimate $p(a[k]\,|\,x)$. Instead of training $K$ binary classifiers directly as in DAP, IAP first trains a probabilistic classifier $p(c|x)$ for $c \in \mathcal{S}$ from the seen classes' data. This can be done by learning a softmax classifier. IAP then obtains $p(a[k]\,|\,x)$ by the following formula:
$p(a[k]\,|\,x) = \sum_{c \in \mathcal{S}} p(a[k]\,|\,c)\, p(c|x) = \sum_{c \in \mathcal{S}} \mathbb{1}[a[k] = a_c[k]]\, p(c|x).$   (3.11)
See Fig. 3.7 for an illustration.
IAP enjoys the same probabilistic interpretation as DAP. However, it also suffers from the same disadvantage—the learned $p(c|x)$ cannot guarantee good classification performance at the final MAP step.
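The attribute-probability step of Eq. (3.11) is simple enough to sketch directly; the snippet below (a sketch, not the authors' code) assumes a hypothetical softmax over seen classes has already produced $p(c|x)$, and its output can be plugged into the DAP-style MAP rule sketched earlier:

    # A minimal sketch of the IAP attribute-probability step in Eq. (3.11).
    import numpy as np

    def iap_attribute_probs(seen_class_probs, seen_attr):
        """seen_class_probs: (S,)  p(c|x) from a softmax over seen classes
           seen_attr       : (S, K) binary attribute signatures a_c of the seen classes
           Returns (K,) values of p(a[k]=1|x) = sum_c 1[a_c[k]=1] p(c|x)."""
        return seen_class_probs @ seen_attr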
Convex combination of semantic embeddings (ConSE) ConSE [135], similar to IAP, makes use of pre-trained classifiers $p(c|x)$ for seen classes and their probabilistic outputs.
Figure 3.7: An illustration of the IAP model [105, 106]. The figure is from [105, 106], so the notations are not the same as the ones defined in the thesis. In the figure, $\{a_1, \cdots, a_M\}$ corresponds to $M$ attributes, $\{y_1, \cdots, y_K\}$ corresponds to seen classes, and $\{z_1, \cdots, z_L\}$ corresponds to unseen classes.
ConSE uses $p(c|x)$ to infer the semantic embedding of $x$, and then classifies it into an unseen class using the same rule as in Eq. (3.1):
$\hat{y} = \arg\max_{u \in \mathcal{U}}\ s(m(x), a_u) = \arg\max_{u \in \mathcal{U}}\ s\Big(\sum_{c \in \mathcal{S}} p(c|x)\, a_c,\ a_u\Big).$   (3.12)
In [135], the cosine similarity is used for $s(\cdot,\cdot)$.
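Below is a minimal sketch (not the authors' code) of ConSE prediction following Eq. (3.12): the image is embedded as a probability-weighted combination of seen-class semantic vectors, and the unseen class with the highest cosine similarity is returned; the optional top_t argument keeps only the T most confident seen classes, as in [135]:

    # A minimal sketch of ConSE prediction (Eq. (3.12)) with cosine similarity.
    import numpy as np

    def conse_predict(seen_class_probs, A_seen, A_unseen, top_t=None):
        """seen_class_probs: (S,) p(c|x); A_seen: (S, K); A_unseen: (U, K)."""
        p = seen_class_probs.copy()
        if top_t is not None:                        # zero out all but the top-T seen classes
            p[np.argsort(p)[:-top_t]] = 0.0
        m_x = p @ A_seen                             # weighted combination of seen embeddings
        m_x /= (np.linalg.norm(m_x) + 1e-12)
        A_norm = A_unseen / (np.linalg.norm(A_unseen, axis=1, keepdims=True) + 1e-12)
        return int(np.argmax(A_norm @ m_x))          # cosine similarity to each unseen class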
Co-occurrence statistics (COSTA) COSTA [127] constructs the classifier $h_u$ for an unseen class $u$ by
$h_u(x) = \sum_{c \in \mathcal{S}} s(a_u, a_c)\, h_c(x).$   (3.13)
COSTA considers several co-occurrence statistics to estimate $s_{ij} = s(a_i, a_j)$. This way of classifier construction is also adopted in [154, 155]:
$h_u(x) = \frac{1}{T} \sum_{c \in \mathcal{S}_T^{(u)}} h_c(x),$   (3.14)
where $\mathcal{S}_T^{(u)}$ is a subset of $\mathcal{S}$ containing the seen classes that have the $T$ highest similarities to $u$.
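As a concrete illustration, here is a minimal sketch (not the authors' code) of the similarity-weighted combination in Eq. (3.13), assuming linear seen-class classifiers and pre-computed similarities:

    # A minimal sketch of COSTA-style classifier construction (Eq. (3.13)).
    import numpy as np

    def costa_score(x, seen_classifiers, sim_to_seen):
        """x               : (D,) feature vector
           seen_classifiers: (S, D) one linear scoring vector h_c per seen class
           sim_to_seen     : (S,)  similarities s(a_u, a_c) of unseen class u to each seen class
           Returns the scalar score h_u(x)."""
        return float(sim_to_seen @ (seen_classifiers @ x))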
Other methods Elhoseiny et al. [42] propose to learn a mapping from $\mathcal{A}$ to $\mathcal{H}$ (the hypothesis space of classifiers). This is, in theory, the most straightforward way to perform zero-shot learning. In practice, however, we only have $S$ pairs of data $\{(a_c, h_c)\}_{c=1}^S$ to learn such a mapping. Therefore, how to regularize the learning process becomes extremely crucial.
Fu et al. [55] proposed a similar idea to IAP and ConSE, first getting $p(c|x)$ to represent $x$. They then applied a novel absorbing Markov chain process (AMP) on the graph relating seen and unseen classes (with edge weights based on $s(\cdot,\cdot)$) to predict the class label for $x$.
3.2.3 Other approaches
Predicting visual instances of unseen classes The main issue that leads to the need for zero-shot learning is the lack of (labeled) training data for unseen classes. One idea, beyond embedding-based and similarity-based methods, is to generate synthetic images (or the corresponding visual features) for each unseen class according to its semantic representation—once we have labeled data for the unseen classes, we can apply conventional supervised learning techniques to learn their classifiers.
This idea has gradually become popular for ZSL. According to whether they can generate multiple instances per class or not, methods can be separated into predicting visual exemplars [218, 189, 13, 119, 224] and predicting visual instances [149, 121, 25, 200, 190, 188, 229, 15, 225, 69]. Guo et al. [70] propose to transfer labeled examples from seen classes to unseen ones with instance weights, so that unseen classes can have pseudo-labeled data.
Predicting attributes from word vectors [9, 37] consider a setting called unsupervised zero-
shot learning, where attribute annotations are provided only for seen classes. They propose to
leverage word vectors (for both types of classes) to explicitly or implicitly predict the attributes
for unseen classes before performing zero-shot learning using attributes. The underlying belief is
that attributes provide better semantic information than word vectors for object recognition.
3.3 Algorithms for generalized ZSL
There has been very little work on generalized zero-shot learning. [50, 135, 128, 176] allow the label space of their classifiers to include seen classes but only test on data from the unseen classes. [172] proposes a two-stage approach that first determines whether a test data point is from a seen or an unseen class, and then applies the corresponding classifiers. However, their experiments are limited to only 2 or 6 unseen classes. In the domain of action recognition, [57] investigates the generalized setting with only up to 3 seen classes. [42] and [111] focus on training a zero-shot binary classifier for each unseen class (against seen ones)—it is not clear how to distinguish multiple unseen classes from the seen ones. Finally, open set recognition [162, 163, 81] considers testing on both types of classes, but treats the unseen ones as a single outlier class. In the following, we describe the method in [172], which is the most relevant to ours.
Socher et al. [172] propose a two-stage zero-shot learning approach that first predicts whether
an image is of seen or unseen classes and then accordingly applies the corresponding classifiers.
The first stage is based on the idea of novelty detection and assigns a high novelty score if it is
unlikely for the data point to come from seen classes. They experiment with two novelty detection
strategies: Gaussian and LoOP models [99]. The main idea is to assign a novelty score $N(x)$ to each sample $x$. With this novelty score, the final prediction rule becomes
$\hat{y} = \begin{cases} \arg\max_{c \in \mathcal{S}} f(a_c, x), & \text{if } N(x) \le \gamma, \\ \arg\max_{c \in \mathcal{U}} f(a_c, x), & \text{if } N(x) > \gamma, \end{cases}$   (3.15)
where $\gamma$ is the novelty threshold. Scores above this threshold indicate belonging to unseen classes. To estimate $N(x)$, for the Gaussian model, data points in seen classes are first modeled with a Gaussian mixture model. The novelty score of a data point is then its negative log probability value under this mixture model. Alternatively, the novelty score can be estimated using the Local Outlier Probabilities (LoOP) model [99]. The idea there is to compute the distances of $x$ to its nearest seen classes. Such distances are then converted to an outlier probability, interpreted as the likelihood of $x$ being from unseen classes.
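A minimal sketch (not the authors' code) of the two-stage rule in Eq. (3.15) follows; the novelty scorer, the threshold, and the two scoring functions are all assumed to be provided:

    # A minimal sketch of the two-stage seen/unseen routing rule in Eq. (3.15).
    import numpy as np

    def two_stage_predict(x, novelty_score, gamma, score_seen, score_unseen):
        """score_seen(x) and score_unseen(x) return arrays of f(a_c, x) over the
           seen and unseen label spaces, respectively; gamma is the novelty threshold."""
        if novelty_score(x) <= gamma:
            return ("seen", int(np.argmax(score_seen(x))))
        return ("unseen", int(np.argmax(score_unseen(x))))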
Recently Lee et al. [110] propose a hierarchical novelty detector to improve the performance.
3.4 Related tasks to zero-shot learning
In this section we present and discuss tasks related to the zero-shot learning problem formulations described in Chapter 2.
3.4.1 Transductive and semi-supervised zero-shot learning
[153, 51, 53, 97, 169, 222, 210, 134, 205, 173, 71] focus on the transductive setting, where they have access to unlabeled test data from unseen classes during the training stage. [113, 115] work on the semi-supervised setting, where a portion of unlabeled data (not used for testing) from unseen classes is available at training. For both settings, the unlabeled data from unseen classes can be used to refine the embedding function $m(\cdot)$ (cf. Section 3.2.1) or the semantic representations. One key difference between the two settings is that in the transductive setting, the test data are given as a whole (i.e., we can exploit certain properties like smoothness among the test data to perform joint prediction).
3.4.2 Zero-shot learning as the prior for active learning
Gavves et al. [60] consider the active learning problem—how to pick unlabeled data for acquiring
annotations so as to efficiently obtain supervised learning signal. For classes that previously have
no labeled data, they propose to use the classifiers constructed by zero-shot learning as the prior
knowledge and develop several informative measures to rank instances for querying annotations.
3.4.3 Few-shot learning
Few-shot learning [151, 176, 45, 104, 20, 171] considers the case where we only have for each
class of interest few labeled training examples (e.g., one-shot learning means each class has only
one labeled training example). Similar to zero-shot learning, few-shot learning usually assumes
the availability of either (1) sufficient labeled training data for a set of common classes or (2) class
semantic representations. This information enables inferring the data variation of those few-shot
classes for constructing robust classifiers. In Chapter 5, we will discuss how zero-shot learning
algorithms can be applied to one- or few-shot learning.
Chapter 4
Synthesized Classifiers (SynC) for Zero-Shot Learning
In this chapter, we introduce Synthesized Classifiers (SynC), a state-of-the-art zero-shot learning
algorithm for visual recognition. SynC effectively leverages the class semantic representations to
relate classes so that the discriminative knowledge (i.e., the classifiers or models) learned from the
seen classes can be transferred to the unseen ones. In contrast to the embedding- and similarity-
based approaches, we aim to learn to predict the (linear) classifier of a class given its semantic
representation. We describe the main idea first, followed by the details of the algorithm.
4.1 Main idea
Given class semantic representations, zero-shot learning aims to accurately recognize instances
of the unseen classes by associating them to the seen ones. We tackle this challenge with ideas
from manifold learning [16, 76], converging to a two-pronged approach. We view object classes
in a semantic space as a weighted graph where the nodes correspond to object class names and the
weights of the edges represent how they are related (according to the semantic representations).
On the other end, we view models for recognizing visual images of those classes as if they live in
a space of models. In particular, the parameters for each object model are nothing but coordinates
in this model space whose geometric configuration also reflects the relatedness among objects.
Fig. 4.1 illustrates this idea conceptually.
But how do we align the semantic space and the model space? The semantic space coor-
dinates of objects are designated or derived based on class semantic representations that do not
directly examine visual appearances at the lowest level, while the model space concerns itself largely with recognizing low-level visual features. To align them, we view the coordinates in
the model space as the projection of the vertices on the graph from the semantic space—there
is a wealth of literature on manifold learning for computing (low-dimensional) Euclidean space
embeddings from the weighted graph, for example, the well-known algorithm of Laplacian eigen-
maps [16].
To adapt the embeddings (or the coordinates in the model space) to data, we introduce a set of phantom object classes—the coordinates of these classes in both the semantic space and the model space are adjustable and optimized such that the resulting models for the real object classes achieve the best performance on discriminative tasks. However, as their name implies, those phantom classes do not correspond to and are not optimized to recognize any real classes directly. For mathematical convenience, we parameterize the weighted graph in the semantic space with the phantom classes in such a way that the model for any real class is a convex combination of the coordinates of those phantom classes.
Figure 4.1: Illustration of our method SynC for zero-shot learning. Object classes live in two spaces. They are characterized in the semantic space with semantic representations ($a$s) such as attributes and word vectors of their names. They are also represented as models for visual recognition ($w$s) in the model space. In both spaces, those classes form weighted graphs. The main idea behind our approach is that these two spaces should be aligned. In particular, the coordinates in the model space should be the projection of the graph vertices from the semantic space to the model space—preserving class relatedness encoded in the graph. We introduce adaptable phantom classes ($b$ and $v$) to connect seen and unseen classes—classifiers for the phantom classes are bases for synthesizing classifiers for real classes. In particular, the synthesis takes the form of convex combination.
In other words, the “models” for the phantom classes can
also be interpreted as bases (classifiers) in a dictionary from which a large number of classifiers
for real classes can be synthesized via convex combinations. In particular, when we need to
construct a classifier for an unseen class, we will compute the convex combination coefficients
from this class’s semantic space coordinates and use them to form the corresponding classifier.
4.2 Approach
4.2.1 Notations
We focus on linear classifiers in the visual feature space $\mathbb{R}^D$ that assign a label $\hat{y}$ to a data point $x$ by
$\hat{y} = \arg\max_c\ w_c^\top x,$   (4.1)
where $w_c \in \mathbb{R}^D$, although our approach can be readily extended to nonlinear settings by the kernel trick [165].
4.2.2 Manifold learning with phantom classes
We introduce a set of phantom classes associated with semantic representations $b_r$, $r = 1, 2, \ldots, R$. We stress that they are phantom as they themselves do not correspond to any real objects—they are introduced to increase the modeling flexibility, as shown below.
The real and phantom classes form a weighted bipartite graph, with the weights defined as
$s_{cr} = \dfrac{\exp\{-d(a_c, b_r)\}}{\sum_{r=1}^R \exp\{-d(a_c, b_r)\}}$   (4.2)
to correlate a real class $c$ and a phantom class $r$, where
$d(a_c, b_r) = (a_c - b_r)^\top \Sigma^{-1} (a_c - b_r),$   (4.3)
and $\Sigma^{-1}$ is a parameter that can be learned from data, modeling the correlation among semantic representations. For simplicity, we set $\Sigma = \sigma^2 I$ and tune the scalar free hyper-parameter $\sigma$ by cross-validation.
The specific form of defining the weights is motivated by several manifold learning methods such as SNE [76]. In particular, $s_{cr}$ can be interpreted as the conditional probability of observing class $r$ in the neighborhood of class $c$. However, other forms can be explored.
In the model space, each real class is associated with a classifier $w_c$ and each phantom class $r$ is associated with a virtual classifier $v_r$. We align the semantic and the model spaces by viewing $w_c$ (or $v_r$) as the embedding of the weighted graph. In particular, we appeal to the idea behind Laplacian eigenmaps [16], which seeks the embedding that maintains the graph structure as much as possible; equivalently, the distortion error
$\min_{w_c, v_r}\ \big\| w_c - \sum_{r=1}^R s_{cr} v_r \big\|_2^2$
is minimized. This objective has an analytical solution
$w_c = \sum_{r=1}^R s_{cr} v_r, \quad \forall\, c \in \mathcal{T} = \{1, 2, \cdots, S+U\}.$   (4.4)
In other words, the solution gives rise to the idea of synthesizing classifiers from those virtual classifiers $v_r$. For conceptual clarity, from now on we refer to $v_r$ as base classifiers in a dictionary from which new classifiers can be synthesized. We identify several advantages. First, we could construct an infinite number of classifiers as long as we know how to compute $s_{cr}$. Second, by making $R < S$, the formulation can significantly reduce the learning cost as we only need to learn $R$ base classifiers.
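A minimal sketch (not the authors' released code) of the synthesis step in Eqs. (4.2)–(4.4) is given below: it computes the bipartite weights $s_{cr}$ with a softmax over negative scaled squared distances (the scalar sigma2 plays the role of $\sigma^2$ in $\Sigma = \sigma^2 I$) and then forms each $w_c$ as a convex combination of the base classifiers $v_r$:

    # A minimal sketch of classifier synthesis (Eqs. (4.2)-(4.4)).
    import numpy as np

    def synthesize_classifiers(A, B, V, sigma2=1.0):
        """A: (C, K) semantic vectors of real classes; B: (R, K) phantom semantic vectors;
           V: (R, D) base classifiers; returns (C, D) synthesized classifiers w_c."""
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / sigma2   # d(a_c, b_r)
        d -= d.min(axis=1, keepdims=True)                             # numerical stability
        S = np.exp(-d)
        S /= S.sum(axis=1, keepdims=True)                             # Eq. (4.2)
        return S @ V                                                  # Eq. (4.4)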
4.2.3 Learning phantom classes
Learning base classifiers We learn the base classifiers $\{v_r\}_{r=1}^R$ from the training data (of the seen classes only). We experiment with two settings. To learn one-versus-other classifiers, we optimize
$\min_{v_1, \cdots, v_R}\ \sum_{c=1}^S \sum_{n=1}^N \ell(x_n, \mathbb{I}_{y_n,c}; w_c) + \frac{\lambda}{2} \sum_{c=1}^S \|w_c\|_2^2,$   (4.5)
$\text{s.t.}\ \ w_c = \sum_{r=1}^R s_{cr} v_r, \quad \forall\, c \in \{1, \cdots, S\},$
where $\ell(x, y; w) = \max(0,\ 1 - y\, w^\top x)^2$ is the squared hinge loss. The indicator $\mathbb{I}_{y_n,c} \in \{-1, 1\}$ denotes whether or not $y_n = c$. It is easy to show that this is a convex formulation in $v_r$ and can be efficiently solved, at the same computational cost as training $R$ one-versus-other linear classifiers.
Alternatively, we apply the Crammer-Singer multi-class SVM loss [35], given by
$\ell_{cs}(x_n, y_n; \{w_c\}_{c=1}^S) = \max\big(0,\ \max_{c \in \mathcal{S} \setminus \{y_n\}} \Delta(c, y_n) + w_c^\top x_n - w_{y_n}^\top x_n\big).$   (4.6)
We have the standard Crammer-Singer loss when the structured loss is $\Delta(c, y_n) = 1$ if $c \neq y_n$, which, however, ignores the semantic relatedness between classes. We additionally use the $\ell_2$ distance for the structured loss, $\Delta(c, y_n) = \|a_c - a_{y_n}\|_2^2$, to exploit the class relatedness in our experiments. These two learning settings have separate strengths and weaknesses in empirical studies.
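The sketch below (not the authors' implementation) illustrates learning the base classifiers of Eq. (4.5) by plain gradient descent on the one-versus-other squared hinge loss, with the constraint $w_c = \sum_r s_{cr} v_r$ substituted directly into the objective; S_mat denotes the weight matrix of Eq. (4.2) restricted to seen classes:

    # A minimal sketch of learning the base classifiers {v_r} in Eq. (4.5).
    import numpy as np

    def learn_bases(X, Y, S_mat, R, lam=1e-3, lr=1e-2, iters=200, seed=0):
        """X: (N, D); Y: (N,) seen labels in {0..S-1}; S_mat: (S, R). Returns V: (R, D)."""
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        num_seen = S_mat.shape[0]
        V = 1e-3 * rng.standard_normal((R, D))
        I = np.where(Y[:, None] == np.arange(num_seen)[None, :], 1.0, -1.0)  # labels in {-1,+1}
        for _ in range(iters):
            W = S_mat @ V                                # (S, D): synthesized seen classifiers
            M = X @ W.T                                  # (N, S): w_c^T x_n
            slack = np.maximum(0.0, 1.0 - I * M)         # squared-hinge margin violations
            G_W = -2.0 * ((slack * I).T @ X) + lam * W   # gradient w.r.t. each w_c
            V -= lr * (S_mat.T @ G_W)                    # chain rule through w_c = sum_r s_cr v_r
        return V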
Learning semantic representations for phantom classes The weighted graph in Eq. (4.2) is also parameterized by adaptable embeddings of the phantom classes $b_r$. For this work, however, for simplicity, we assume that each of them is a sparse linear combination of the seen classes' semantic representations:
$b_r = \sum_{c=1}^S \beta_{rc}\, a_c, \quad \forall\, r \in \{1, \cdots, R\}.$
Thus, to optimize those embeddings, we solve the following optimization problem:
$\min_{\{v_r\}_{r=1}^R,\ \{\beta_{rc}\}_{r,c=1}^{R,S}}\ \sum_{c=1}^S \sum_{n=1}^N \ell(x_n, \mathbb{I}_{y_n,c}; w_c) + \frac{\lambda}{2} \sum_{c=1}^S \|w_c\|_2^2 + \eta \sum_{r,c=1}^{R,S} |\beta_{rc}| + \frac{\gamma}{2} \sum_{r=1}^R \big(\|b_r\|_2^2 - h^2\big)^2,$   (4.7)
$\text{s.t.}\ \ w_c = \sum_{r=1}^R s_{cr} v_r, \quad \forall\, c \in \{1, \cdots, S\},$
where $h$ is a predefined scalar equal to the norm of the real semantic representations (i.e., 1 in our experiments since we perform $\ell_2$ normalization). Note that in addition to learning $\{v_r\}_{r=1}^R$, we learn the combination weights $\{\beta_{rc}\}_{r,c=1}^{R,S}$. Clearly, the constraint together with the third term in the objective encourages a sparse linear combination of the seen classes' semantic representations. The last term in the objective demands that the norm of $b_r$ is not too far from the norm of $a_c$.
We perform alternating optimization for minimizing the objective function with respect to $\{v_r\}_{r=1}^R$ and $\{\beta_{rc}\}_{r,c=1}^{R,S}$. While this process is nonconvex, there are useful heuristics to initialize the optimization routine. For example, if $R = S$, then the simplest setting is to let $b_r = a_r$ for $r = 1, \ldots, R$. If $R < S$, we can let them be (randomly) selected from the seen classes' semantic representations, $\{b_1, b_2, \cdots, b_R\} \subseteq \{a_1, a_2, \cdots, a_S\}$, or first perform clustering on $\{a_1, a_2, \cdots, a_S\}$ and then let each $b_r$ be a combination of the seen classes' semantic representations in cluster $r$. If $R > S$, we could use a combination of the above two strategies.
There are four hyper-parameters $\lambda$, $\sigma$, $\eta$, and $\gamma$ to be tuned (Section 4.2.3). To reduce the search space during cross-validation, we first fix $b_r = a_r$ for $r = 1, \ldots, R$ and tune $\lambda$ and $\sigma$. Then we fix $\lambda$ and $\sigma$ and tune $\eta$ and $\gamma$. We describe in more detail how to cross-validate hyper-parameters in Section 4.4.
Classification with synthesized classifiers Given a data sample $x$ from the $U$ unseen classes and their corresponding semantic representations (or coordinates in other semantic spaces), we classify it in the label space $\mathcal{U}$ by
$\hat{y} = \arg\max_{c \in \mathcal{U}}\ w_c^\top x,$   (4.8)
with the classifiers being synthesized according to Eq. (4.4). This is in sharp contrast to many existing two-stage methods (see Chapter 3). There, the image $x$ needs to be first mapped with embedding-based functions (e.g., classifiers), and then the outputs of those functions are combined to predict a label in $\mathcal{U}$.
4.3 Comparison to existing methods
We contrast our approach to some existing methods. [127] combines pre-trained classifiers of seen classes to construct new classifiers. To estimate the semantic representation (e.g., word vector) of a test image, [135] uses the decision values of pre-trained classifiers of seen objects to compute a weighted average of the corresponding semantic representations. Neither of them has the notion of base classifiers, which we introduce for constructing the classifiers and nothing else. We thus expect the base classifiers to be more effective in transferring knowledge between seen and unseen classes than overloading the pre-trained and fixed classifiers of the seen classes for dual duties. We note that [5] can be considered as a special case of our method. In [5], each attribute corresponds to a base, and each “real” classifier corresponding to an actual object is represented as a linear combination of those bases, where the weights are the real objects' “representations” in the form of attributes. This modeling is limiting as the number of bases is fundamentally restricted by the number of attributes. Moreover, the model is strictly a subset of our model.¹ [220, 221] propose similar ideas of aligning the visual and semantic spaces but take different approaches. Very recently, [191] extends our idea of predicting classifiers using graph convolutional neural networks.
¹ For interested readers, if we set the number of attributes as the number of phantom classes (each $b_r$ is the one-hot representation of an attribute), and use a Gaussian kernel with an isotropically diagonal covariance matrix in Eq. (4.3) with properly set bandwidths (either very small or very large) for each attribute, we will recover the formulation in [5] when the bandwidths tend to zero or infinity.
4.4 Hyper-parameter tuning: cross-validation (CV) strategies
Figure 4.2: Data splitting for different cross-validation (CV) strategies: (a) the seen-unseen class splitting for zero-shot learning, (b) the sample-wise CV, (c) the class-wise CV.
There are a few free hyper-parameters in our approach (Section 4.2.3). In the conventional cross-validation (CV) for multi-way classification, one splits the training data into several folds that share the same set of class labels with one another. Clearly, this strategy is not sensible for zero-shot learning as it does not imitate what actually happens at the test stage. We thus introduce a new strategy for performing CV, inspired by the hyper-parameter tuning in [156]. The key difference of the new scheme from the conventional CV is that we split the data into several folds such that the class labels of these folds are disjoint. For clarity, we denote the conventional CV as sample-wise CV and our scheme as class-wise CV. Figures 4.2(b) and 4.2(c) illustrate the two scenarios, respectively. We empirically compare them in Section 4.5. Note that several existing models [6, 42, 156, 220] also follow similar hyper-parameter tuning procedures.
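A minimal sketch (not the thesis code) of the class-wise split is shown below: seen classes are partitioned into disjoint folds, and each round holds out one fold of classes to play the role of validation "unseen" classes:

    # A minimal sketch of class-wise cross-validation folds (Section 4.4).
    import numpy as np

    def class_wise_folds(labels, num_folds=5, seed=0):
        """labels: (N,) seen-class labels. Yields (train_idx, val_idx) index arrays such
           that the class labels of the two sets are disjoint, imitating the ZSL test stage."""
        rng = np.random.default_rng(seed)
        classes = rng.permutation(np.unique(labels))
        for fold in np.array_split(classes, num_folds):
            val_mask = np.isin(labels, fold)
            yield np.where(~val_mask)[0], np.where(val_mask)[0]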
4.5 Empirical studies
We conduct extensive empirical studies of our approach SynC in the conventional zero-shot learning setting on four benchmark datasets—Animals with Attributes (AwA) [105], CUB-200-2011 Birds (CUB) [187], SUN Attribute (SUN) [142], and the full ImageNet Fall 2011 dataset [38] with more than 20,000 unseen classes.
Table 4.1: Key characteristics of the studied datasets.
Dataset name    # of seen classes    # of unseen classes    Total # of images
AwA†            40                   10                     30,475
CUB‡            150                  50                     11,788
SUN‡            645/646              72/71                  14,340
ImageNet§       1,000                20,842                 14,197,122
†: Following the prescribed split in [106].
‡: 4 (or 10, respectively) random splits, reporting the average.
§: Seen and unseen classes from ImageNet ILSVRC 2012 1K [157] and the Fall 2011 release [38, 50, 135].
4.5.1 Setup
Datasets We use four benchmark datasets in our experiments: the Animals with Attributes
(AwA) [106], CUB-200-2011 Birds (CUB) [187], SUN Attribute (SUN) [142], and the Ima-
geNet (with full 21,841 classes) [38]. Table 4.1 summarizes their key characteristics.
Semantic spaces For the classes in AwA, we use 85-dimensional binary or continuous attributes [106], as well as the 100- and 1,000-dimensional word vectors [130] derived from their class names and extracted by Fu et al. [51, 53]. For CUB and SUN, we use 312- and 102-dimensional continuous-valued attributes, respectively. We also threshold them at the global means to obtain binary-valued attributes, as suggested in [106]. Neither dataset has word vectors for its class names. For ImageNet, we train a skip-gram language model [130, 131] on the latest Wikipedia dump corpus² (with more than 3 billion words) to extract a 500-dimensional word vector for each class. We ignore classes without word vectors in the experiments, resulting in 20,345 (out of 20,842) unseen classes. We also derive 21,632-dimensional semantic vectors of the class names using multidimensional scaling (MDS) on the WordNet hierarchy, as in [123]. For both the continuous attribute vectors and the word vector embeddings of the class names, we normalize them to have unit $\ell_2$ norms unless stated otherwise.
Visual features Due to variations in features being used in literature, it is impractical to try
all possible combinations of features and methods. Thus, we make a major distinction in using
shallow features (such as color histograms, SIFT, PHOG, Fisher vectors) [5, 6, 83, 106, 155, 192]
and deep learning features in several recent studies of zero-shot learning. Whenever possible, we
use (shallow) features provided by those datasets or prior studies. For comparative studies, we
also extract the following deep features: AlexNet [101] for AwA and CUB and GoogLeNet [175]
for all datasets (all extracted with the Caffe package [85]). For AlexNet, we use the 4,096-
dimensional activations of the penultimate layer (fc7) as features. For GoogLeNet, we take the
1,024-dimensional activations of the pooling units, as in [6].
² http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 on September 1, 2015.
Evaluation protocols For AwA, CUB, and SUN, we use the multi-way classification accuracy (normalized by class size), as in previous work. Note that the accuracy is always computed on images from unseen classes.
Evaluating zero-shot learning on the large-scale ImageNet requires substantially different components from evaluating on the other three datasets. First, two evaluation metrics are used, as in [50]: Flat hit@K (F@K) and Hierarchical precision@K (HP@K).
F@K is defined as the percentage of test images for which the model returns the true label in its top K predictions. Note that F@1 is the multi-way classification accuracy. HP@K takes into account the hierarchical organization of object categories. For each true label, we generate a ground-truth list of the K closest categories in the hierarchy and compute the degree of overlap (i.e., precision) between the ground truth and the model's top K predictions. For a detailed description of this metric, please see the Appendix of [50].
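As a concrete reference, below is a minimal sketch (not the evaluation code of [50]) of Flat hit@K: the fraction of test images whose true label appears among the model's top-K predictions:

    # A minimal sketch of the Flat hit@K metric.
    import numpy as np

    def flat_hit_at_k(scores, true_labels, k=5):
        """scores: (N, C) class scores for N test images; true_labels: (N,) label indices."""
        topk = np.argsort(-scores, axis=1)[:, :k]                  # top-K predictions per image
        hits = (topk == np.asarray(true_labels)[:, None]).any(axis=1)
        return float(hits.mean())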
Secondly, following the procedure in [50, 135], we evaluate on three scenarios of increasing
difficulty:
• 2-hop contains 1,509 unseen classes that are within two tree hops of the seen 1K classes according to the ImageNet label hierarchy³.
• 3-hop contains 7,678 unseen classes that are within three tree hops of seen classes.
• All contains all 20,345 unseen classes in the ImageNet 2011 21K dataset that are not in the
ILSVRC 2012 1K dataset.
The numbers of unseen classes are slightly different from those used in [50, 135] due to the missing semantic representations (i.e., word vectors) for certain class names.
In addition to reporting published results, we have also reimplemented the state-of-the-art
method ConSE [135] on this dataset.
4.5.2 Implementation details
We cross-validate all hyper-parameters. For convenience, we set the number of phantom classes $R$ to be the same as the number of seen classes $S$, and set $b_r = a_c$ for $r = c$. We also experiment with setting different $R$ and learning $b_r$. Our study (cf. Fig. 4.3) shows that when $R$ is about 60% of $S$, the performance saturates. We denote the three variants of our method in constructing classifiers (Section 4.2.3) by Ours^o-vs-o (one-versus-other), Ours^cs (Crammer-Singer), and Ours^struct (Crammer-Singer with structured loss).
4.5.3 Main results
Table 4.2 compares the proposed methods to the state-of-the-art from the previously published
results on benchmark datasets. While there is a large degree of variation in implementation details, the main observation is that our methods attain the best performance in most scenarios.
In what follows, we analyze those results in detail.
We also point out that the settings in some existing work differ substantially from ours; to keep the comparison fair, we do not include their results [7, 51, 53, 55, 82, 97, 113, 212]. In some cases, even with additional data and attributes, those methods under-perform ours.
3: http://www.image-net.org/api/xml/structure_released.xml
Table 4.2: Comparison between our results and the previously published results in multi-way
classification accuracies (in %) on the task of zero-shot learning. For each dataset, the best is in
red and the 2nd best is in blue.
Methods            AwA    CUB     SUN    ImageNet
DAP [106]          41.4   -       22.2   -
IAP [106]          42.2   -       18.0   -
BN [192]           43.4   -       -      -
ALE [5]            37.4   18.0†   -      -
SJE [6]            66.7   50.1†   -      -
ESZSL [156]        49.3   -       -      -
ConSE [135]        -      -       -      1.4
SSE-ReLU [220]*    76.3   30.4†   -      -
[221]*             80.5   42.1†   -      -
Ours o-vs-o        69.7   53.4    62.8   1.4
Ours cs            68.4   51.6    52.9   -
Ours struct        72.9   54.7    62.7   1.5
†: Results reported on a particular seen-unseen split.
*: Results were just brought to our attention. Note that VGG [170] instead of GoogLeNet features were used, improving on AwA but worsening on CUB.
Table 4.3: Comparison between results by ConSE and our method on ImageNet. For both types
of metrics, the higher the better.
Scenarios   Methods         Flat hit@K                            Hierarchical precision@K
                            K=1    2      5      10     20        2      5      10     20
2-hop       ConSE [135]     9.4    15.1   24.7   32.7   41.8      21.4   24.7   26.9   28.4
            ConSE by us     8.3    12.9   21.8   30.9   41.7      21.5   23.8   27.5   31.3
            Ours o-vs-o     10.5   16.7   28.6   40.1   52.0      25.1   27.7   30.3   32.1
            Ours struct     9.8    15.3   25.8   35.8   46.5      23.8   25.8   28.2   29.6
3-hop       ConSE [135]     2.7    4.4    7.8    11.5   16.1      5.3    20.2   22.4   24.7
            ConSE by us     2.6    4.1    7.3    11.1   16.4      6.7    21.4   23.8   26.3
            Ours o-vs-o     2.9    4.9    9.2    14.2   20.9      7.4    23.7   26.4   28.6
            Ours struct     2.9    4.7    8.7    13.0   18.6      8.0    22.8   25.0   26.7
All         ConSE [135]     1.4    2.2    3.9    5.8    8.3       2.5    7.8    9.2    10.4
            ConSE by us     1.3    2.1    3.8    5.8    8.7       3.2    9.2    10.7   12.0
            Ours o-vs-o     1.4    2.4    4.5    7.1    10.9      3.1    9.0    10.9   12.5
            Ours struct     1.5    2.4    4.4    6.7    10.0      3.6    9.6    11.0   12.2
Table 4.4: Comparison between sample- and class-wise cross-validation for hyper-parameter tun-
ing on CUB (learning with the one-versus-other loss).
CV strategy     CUB (AlexNet)   CUB (GoogLeNet)
Sample-wise     44.7            52.0
Class-wise      46.6            53.4
4.5.4 Large-scale zero-shot learning
One major limitation of most existing work on zero-shot learning is that the number of unseen
classes is often small, dwarfed by the number of seen classes. However, real-world computer
vision systems need to face a very large number of unseen objects. To this end, we evaluate our
methods on the large-scale ImageNet dataset.
Table 4.3 summarizes our results and compares to the ConSE method [135], which is, to the best of our knowledge, the state-of-the-art method on this dataset (see footnote 4). Note that in some cases, our own implementation ("ConSE by us" in the table) performs slightly worse than the reported results, possibly attributed to differences in visual features, word vector embeddings, and other implementation details. Nonetheless, the proposed methods (using the same setting as "ConSE by us") always outperform both, especially in the very challenging scenario of All, where the number of unseen classes is 20,345, significantly larger than the number of seen classes. Note also that, for both types of metrics, when K is larger, the improvement over the existing approaches is more pronounced. It is also not surprising to notice that as the number of unseen classes increases from the setting 2-hop to All, the performance of both our methods and ConSE degrades.
4.5.5 Detailed analysis
We experiment extensively to understand the benefits of many factors in our and other algorithms.
While trying all possible combinations is prohibitively expensive, we have provided a compre-
hensive set of results for comparison and drawing conclusions.
Cross-validation (CV) strategies Table 4.4 shows the results on CUB (averaged over four splits) using the hyper-parameters tuned by class-wise CV and sample-wise CV, respectively. The results based on class-wise CV are about 2% better than those of sample-wise CV, verifying the necessity of simulating the zero-shot learning scenario while we tune the hyper-parameters at the training stage.
Advantage of continuous attributes It is clear from Table 4.5 that, in general, continuous attributes as semantic representations for classes attain better performance than binary attributes. This is especially true when deep learning features are used to construct classifiers. It is somewhat expected that continuous attributes provide a more accurate real-valued similarity measure among classes; this presumably is exploited further by more powerful classifiers.
4: We are aware of recent work by Lu [123] that introduces a novel form of semantic representations.
Table 4.5: Detailed analysis of various methods: the effect of feature and attribute types on
multi-way classification accuracies (in %). Within each column, the best is in red and the 2nd
best is in blue. We cite both previously published results (numbers in bold italics) and results
from our implementations of those competing methods (numbers in normal font) to enhance
comparability and to ease analysis (see texts for details). We use the shallow features provided
by [106], [83], [142] for AwA, CUB, SUN, respectively.
Methods           Attribute    Shallow features              Deep features
                  type         AwA     CUB     SUN           AwA           CUB            SUN
DAP [106]         binary       41.4    28.3    22.2          60.5 (50.0)   39.1 (34.8)    44.5
IAP [106]         binary       42.2    24.4    18.0          57.2 (53.2)   36.7 (32.7)    40.8
BN [192]          binary       43.4    -       -             -             -              -
ALE [5]‡          binary       37.4    18.0†   -             -             -              -
ALE               binary       34.8    27.8    -             53.8 (48.8)   40.8 (35.3)    53.8
SJE [6]           continuous   42.3‡   19.0†‡  -             66.7 (61.9)   50.1 (40.3)†   -
SJE               continuous   36.2    34.6    -             66.3 (63.3)   46.5 (42.8)    56.1
ESZSL [156]§      continuous   49.3    37.0    -             59.6 (53.2)   44.0 (37.2)    8.7
ESZSL             continuous   44.1    38.3    -             64.5 (59.4)   34.5 (28.0)    18.7
ConSE [135]       continuous   36.5    23.7    -             63.3 (56.5)   36.2 (32.6)    51.9
COSTA [127]♯      continuous   38.9    28.3    -             61.8 (55.2)   40.8 (36.9)    47.9
Ours o-vs-o       continuous   42.6    35.0    -             69.7 (64.0)   53.4 (46.6)    62.8
Ours cs           continuous   42.1    34.7    -             68.4 (64.8)   51.6 (45.7)    52.9
Ours struct       continuous   41.5    36.4    -             72.9 (62.8)   54.5 (47.1)    62.7
†: Results reported by the authors on a particular seen-unseen split.
‡: Based on Fisher vectors as shallow features, different from those provided in [83, 106, 142].
§: On the attribute vectors without ℓ2 normalization, while our own implementation shows that normalization helps in some cases.
♯: As co-occurrence statistics are not available, we combine pre-trained classifiers with the weights defined in eq. (4.2).
Table 4.6: Effect of types of semantic representations on AwA.
Semantic representations Dimensions Accuracy (%)
word vectors 100 42.2
word vectors 1000 57.5
attributes 85 69.7
attributes + word vectors 185 73.2
attributes + word vectors 1085 76.3
Table 4.7: Effect of learning semantic representations
Datasets Types of embeddings w/o learning w/ learning
AwA attributes 69.7% 71.1%
100-d word vectors 42.2% 42.5%
1000-d word vectors 57.6% 56.6%
CUB attributes 53.4% 54.2%
SUN attributes 62.8% 63.3%
Advantage of deep features It is also clear from Table 4.5 that, across all methods, deep features significantly boost the performance over shallow features. We use GoogLeNet and AlexNet (numbers in parentheses), and GoogLeNet generally outperforms AlexNet. It is worth-
while to point out that the reported results under deep features columns are obtained using linear
classifiers, which outperform several nonlinear classifiers that use shallow features. This seems
to suggest that deep features, often thought to be specifically adapted to seen training images, still
work well when transferred to unseen images [50].
Which types of semantic space? In Table 4.6, we show how effectively our proposed method (Ours o-vs-o) exploits the two types of semantic spaces: (continuous) attributes and word-vector embeddings on AwA (the only dataset with both embedding types). We find that attributes yield better performance than word-vector embeddings. However, combining the two gives the best result, suggesting that these two semantic spaces could be complementary and that further investigation is warranted.
Table 4.7 takes a different view on identifying the best semantic space. We study whether we can optimally learn the semantic representations (cf. Section 4.2.3) for the phantom classes that correspond to base classifiers. These preliminary studies seem to suggest that learning attributes could have a positive effect, though it is difficult to improve over word-vector embeddings. We plan to study this issue more thoroughly in the future.
How many base classifiers are necessary? In Fig. 4.3, we investigate how many base classifiers are needed—so far, we have set that number to be the number of seen classes out of convenience. The plot shows that, in fact, a smaller number (about 60%-70%) is enough for our algorithm to reach the plateau of the performance curve. Moreover, increasing the number of base classifiers does not seem to have an overwhelming effect.
Figure 4.3: We vary the number of phantom classes R as a percentage of the number of seen classes S (horizontal axis: ratio to the number of seen classes, in %; vertical axis: relative accuracy, in %; curves for AwA and CUB) and investigate how much that will affect classification accuracy (the vertical axis corresponds to the ratio with respect to the accuracy when R = S). The base classifiers are learned with Ours o-vs-o.
4.5.6 Qualitative results
In this subsection, we present qualitative results of our method. We first illustrate what visual
information the models (classifiers) for unseen classes capture, when provided with only semantic
embeddings (no example images). In Figure 4.4, we list (on top) the 10 unseen class labels of
AwA, and show (in the middle) the top-5 images classified into each class c, according to the
decision values w_c^T x. Misclassified images are marked with red boundaries. At the bottom, we show the first (highest-scoring) misclassified image (according to the decision value) for each class and its ground-truth class label. According to the top images, our method reasonably captures discriminative visual properties of each unseen class based solely on its semantic embedding. We can also see that the misclassified images have appearances so similar to those of the predicted class that even humans cannot easily distinguish between the two. For example, the pig image at the bottom of the second column looks very similar to images of hippos.
4.6 Summary
We have developed a novel classifier synthesis mechanism (SynC) for zero-shot learning by in-
troducing the notion of “phantom” classes. The phantom classes connect the dots between the
seen and unseen classes—the classifiers of the seen and unseen classes are constructed from the
same base classifiers for the phantom classes and with the same coefficient functions. As a result,
we can conveniently learn the classifier synthesis mechanism leveraging labeled data of the seen
classes and then readily apply it to the unseen classes. SynC is conceptually clean and flexible enough to incorporate various forms of similarity functions, classifiers, and semantic representations—with certain combinations, SynC recovers several existing methods and is essentially a superset of them. Moreover, it is widely applicable to different visual recognition tasks, including fine-grained object, scene, and large-scale object recognition. Specifically, on the setting proposed by Google [50], where over 20,000 unseen categories are to be recognized, SynC so far holds the best performance and outperforms other methods by a clear margin, as reported in a recent survey [201].
Figure 4.4: Qualitative results of our method (Ours struct) on AwA. (Top) We list the 10 unseen class labels: Persian cat, hippo, leopard, humpback whale, seal, chimpanzee, rat, giant panda, pig, and raccoon. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) into each class and its ground-truth class label.
Chapter 5
Generalized Zero-Shot Learning
In the previous chapter, we introduce our zero-shot learning algorithm SynC, which achieves
superior performance on four benchmark datasets for the conventional setting—once models for
unseen classes are constructed, they are judged based on their ability to discriminate among
unseen classes, assuming the absence of seen objects during the test phase. Originally proposed
in the seminal work of Lampert et al. [105], this setting has almost always been adopted for
evaluating ZSL methods [138, 214, 154, 90, 5, 212, 50, 127, 135, 82, 7, 6, 53, 55, 113, 156, 97,
220, 221].
But does this problem setting truly reflect what recognition in the wild entails? While the ability to learn novel concepts is by all means a trait that any zero-shot learning system should possess, it is merely one side of the coin. The other important—yet so far under-studied—trait is
the ability to remember past experiences, i.e., the seen classes.
Why is this trait desirable? Consider how data are distributed in the real world. The seen
classes are often more common than the unseen ones; it is therefore unrealistic to assume that we
will never encounter them during the test stage. For models generated by ZSL to be truly useful,
they should not only accurately discriminate among either seen or unseen classes themselves but
also accurately discriminate between the seen and unseen ones.
Thus, to understand better how existing ZSL approaches will perform in the real world, we
advocate evaluating them in the setting of generalized zero-shot learning (GZSL), where test data
are from both seen and unseen classes and we need to classify them into the joint labeling space
of both types of classes. Previous work in this direction is scarce. See Chapter 3 for more details.
5.1 Overview
We conduct an extensive empirical study of several existing ZSL approaches in the new GZSL
setting. We show that a straightforward application of classifiers constructed by those approaches
performs poorly. In particular, test data from unseen classes are almost always classified as a
class from the seen ones. We propose a surprisingly simple yet very effective method called
calibrated stacking to address this problem. This method is mindful of the two conflicting forces:
recognizing data from seen classes and recognizing data from unseen ones. We introduce a
new performance metric called Area Under Seen-Unseen accuracy Curve (AUSUC) that can
evaluate ZSL approaches on how well they can trade off between the two. We demonstrate the
utility of this metric by evaluating several representative ZSL approaches under this metric on the
benchmark datasets considered in the experiments of Chapter 4.
Figure 5.1: Comparisons of (a) conventional ZSL and (b) generalized ZSL in the testing phase—conventional ZSL assumes the absence of seen classes' instances and only classifies test instances into one of the unseen classes. The notations follow those in Section 5.2.2.
5.2 Generalized zero-shot learning
In this section, we review the setting of generalized zero-shot learning that has been defined in
Chapter 2. We then present empirical evidence to illustrate the difficulty of this problem.
5.2.1 Conventional and generalized zero-shot learning
Suppose we are given the training data D = {(x_n ∈ R^D, y_n)}_{n=1}^N with the labels y_n from the label space of seen classes S = {1, 2, ..., S}. Denote by U = {S+1, ..., S+U} the label space of unseen classes. We use T = S ∪ U to represent the union of the two sets of classes.
In the (conventional) zero-shot learning (ZSL) setting, the main goal is to classify test data
into the unseen classes, assuming the absence of the seen classes in the test phase. In other words,
each test data point is assumed to come from and will be assigned to one of the labels inU.
Existing research on ZSL has been almost entirely focusing on this setting [105, 138, 214,
154, 90, 5, 212, 50, 127, 135, 82, 7, 6, 53, 55, 113, 156, 97, 220, 221]. However, in real applica-
tions, the assumption of encountering data only from the unseen classes is hardly realistic. The
seen classes are often the most common objects we see in the real world. Thus, the objective
in the conventional ZSL does not truly reflect how the classifiers will perform recognition in the
wild.
Motivated by this shortcoming of the conventional ZSL, we advocate studying the more gen-
eral setting of generalized zero-shot learning (GZSL), where we no longer limit the possible class
memberships of test data—each of them belongs to one of the classes inT . (See Fig. 5.1.)
5.2.2 Classifiers
Without loss of generality, we assume that for each class c ∈ T, we have a discriminant scoring function f_c(x) (or more generally f(a_c, x)), from which we would be able to derive the label for x. For instance, for an unseen class u, SynC defines f_u(x) = w_u^T x, where w_u is the model parameter vector for the class u (see footnote 1), constructed from its semantic representation a_u (such as its attribute vector or the word vector associated with the name of the class). In ConSE [135], f_u(x) = cos(m(x), a_u), where m(x) is the predicted embedding of the data sample x. In DAP/IAP [106], f_u(x) is a probabilistic model of attribute vectors. We assume that similar discriminant functions for seen classes can be constructed in the same manner given their corresponding semantic representations.
How to assess an algorithm for GZSL? We define and differentiate the following performance metrics: A_U→U, the accuracy of classifying test data from U into U; A_S→S, the accuracy of classifying test data from S into S; and finally A_S→T and A_U→T, the accuracies of classifying test data from either seen or unseen classes into the joint labeling space. Note that A_U→U is the standard performance metric used for conventional ZSL and A_S→S is the standard metric for multi-class classification. Furthermore, note that we do not report A_T→T, as simply averaging A_S→T and A_U→T to compute A_T→T might be misleading when the two metrics are not balanced, as shown below.
5.2.3 Generalized ZSL is hard
To demonstrate the difficulty of GZSL, we report the empirical results of using a simple but intuitive algorithm for GZSL. Given the discriminant functions, we adopt the following classification rule

    ŷ = argmax_{c ∈ T} f_c(x) = argmax_{c ∈ T} f(a_c, x),    (5.1)

which we refer to as direct stacking.
We use the rule on “stacking” classifiers from the following zero-shot learning approaches:
DAP and IAP [106], ConSE [135], and Synthesized Classifiers (SynC). We tune the hyper-
parameters for each approach based on class-wise cross validation. We test GZSL on two datasets
AwA [106] and CUB [187]—details about those datasets can be found in Section 4.5.
Table 5.1 reports experimental results based on the 4 performance metrics we have described
previously. Our goal here is not to compare between methods. Instead, we examine the impact of
relaxing the assumption of the prior knowledge of whether data are from seen or unseen classes.
We observe that, in this setting of GZSL, the classification performance for unseen classes (A_U→T) drops significantly from the performance in conventional ZSL (A_U→U), while that of seen ones (A_S→T) remains roughly the same as in the multi-class task (A_S→S). That is, nearly all test data from unseen classes are misclassified into the seen classes. This unusual degradation in performance highlights the challenges of GZSL; as we only see labeled data from seen classes during training, the scoring functions of seen classes tend to dominate those of unseen classes, leading to biased predictions in GZSL and aggressively classifying a new data point into the label space of S, because classifiers for the seen classes do not get trained on "negative" examples from the unseen classes.
1: Note that in SynC, w_u is a function of a_u under fixed coordinates of phantom classes. Therefore, we can also view w_u^T x as f(a_u, x).
Table 5.1: Classification accuracies (%) on conventional ZSL (A_U→U), multi-class classification for seen classes (A_S→S), and GZSL (A_S→T and A_U→T), on AwA and CUB. Significant drops are observed from A_U→U to A_U→T.

                 AwA                                     CUB
Method           A_U→U   A_S→S   A_U→T   A_S→T           A_U→U   A_S→S   A_U→T   A_S→T
DAP [106]        51.1    78.5    2.4     77.9            38.8    56.0    4.0     55.1
IAP [106]        56.3    77.3    1.7     76.8            36.5    69.6    1.0     69.4
ConSE [135]      63.7    76.9    9.5     75.9            35.8    70.5    1.8     69.9
SynC o-vs-o      70.1    67.3    0.3     67.3            53.0    67.2    8.4     66.5
SynC struct      73.4    81.0    0.4     81.0            54.4    73.0    13.2    72.0
5.3 Approach for GZSL
The previous example shows that the classifiers for unseen classes constructed by conventional
ZSL methods should not be naively combined with models for seen classes to expand the labeling
space required by GZSL.
In what follows, we propose a simple variant of the naive direct stacking approach to curb this problem. We also develop a metric that measures the performance of GZSL, by acknowl-
edging that there is an inherent trade-off between recognizing seen classes and recognizing unseen
classes. This metric, referred to as the Area Under Seen-Unseen accuracy Curve (AUSUC), bal-
ances the two conflicting forces. We conclude this section by describing two related approaches:
despite their sophistication, they do not perform well empirically.
5.3.1 Calibrated stacking
Our approach stems from the observation that the scores of the discriminant functions for the
seen classes are often greater than the scores for the unseen classes. Thus, intuitively, we would
like to reduce the scores for the seen classes. This leads to the following classification rule:
    ŷ = argmax_{c ∈ T}  f_c(x) − γ · I[c ∈ S],    (5.2)

where the indicator I[·] ∈ {0, 1} indicates whether or not c is a seen class and γ is a calibration factor. We term this adjustable rule calibrated stacking. See Fig. 5.2 for an illustration.
Another way to interpret γ is to regard it as the prior likelihood of a data point coming from unseen classes. When γ = 0, the calibrated stacking rule reverts back to the direct stacking rule, described previously.
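A minimal sketch of the rule in eq. (5.2) is given below (illustrative only; scores is assumed to be an N x |T| array of discriminant values f_c(x) and seen_idx the indices of the seen classes). Setting gamma = 0 recovers the direct stacking rule in eq. (5.1).

import numpy as np

def calibrated_stacking_predict(scores, seen_idx, gamma):
    # Eq. (5.2): subtract the calibration factor gamma from the scores of
    # seen classes before taking the argmax over the joint label space T.
    adjusted = scores.astype(float).copy()
    adjusted[:, seen_idx] -= gamma
    return np.argmax(adjusted, axis=1)

# Toy example: classes 0, 1 are seen; classes 2, 3 are unseen.
scores = np.array([[2.0, 1.5, 1.8, 0.5]])
seen_idx = np.array([0, 1])
print(calibrated_stacking_predict(scores, seen_idx, gamma=0.0))  # [0] (direct stacking)
print(calibrated_stacking_predict(scores, seen_idx, gamma=0.5))  # [2] (an unseen class wins)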
It is also instructive to consider the two extreme cases of γ. When γ → +∞, the classification rule will ignore all seen classes and classify all data points into one of the unseen classes. When there is no new data point coming from seen classes, this classification rule essentially implements what one would do in the setting of conventional ZSL. On the other hand, when γ → −∞, the classification rule only considers the label space of seen classes, as in standard multi-way classification.
Figure 5.2: We observed that seen classes usually give higher scores than unseen classes, even to an unseen class instance (e.g., a zebra image). We thus introduce a calibration factor γ, either to reduce the scores of seen classes or to increase those of unseen classes (cf. eq. (5.2)).
Figure 5.3: The Seen-Unseen accuracy Curve (SUC) obtained by varying γ in the calibrated stacking classification rule eq. (5.2) (horizontal axis: A_U→T; vertical axis: A_S→T). The AUSUC summarizes the curve by computing the area under it. We use the method SynC o-vs-o on the AwA dataset (AUSUC = 0.398), and tune hyper-parameters as in Table 5.1. The red cross denotes the accuracies by direct stacking.
The calibrated stacking rule thus represents a middle ground between aggressively
classifying every data point into seen classes and conservatively classifying every data point into
unseen classes. Adjusting this hyperparameter thus gives a trade-off, which we exploit to define
a new performance metric.
5.3.2 Area Under Seen-Unseen Accuracy Curve (AUSUC)
Varying the calibration factor γ, we can compute a series of classification accuracies (A_U→T, A_S→T). Fig. 5.3 plots those points for the dataset AwA using the classifiers generated by SynC based on class-wise cross validation. We call such a plot the Seen-Unseen accuracy Curve (SUC). On the curve, γ = 0 corresponds to direct stacking, denoted by a cross. The curve is similar to many familiar curves for representing conflicting goals, such as the Precision-Recall (PR) curve and the Receiver Operating Characteristic (ROC) curve, with two ends for the extreme cases (γ → −∞ and γ → +∞).
A convenient way to summarize the plot with one number is to use the Area Under SUC (AUSUC) (see footnote 2). The higher the area is, the better an algorithm is able to balance A_U→T and A_S→T. We evaluate the performance of existing zero-shot learning methods under this metric, as well as provide further insights and analyses in Section 5.4.
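Concretely, AUSUC can be computed by sweeping γ, recording the resulting (A_U→T, A_S→T) pairs, and integrating the curve with the trapezoidal rule. The sketch below is a self-contained illustration under assumed inputs (decision values over the joint label space and true labels for the seen- and unseen-class test images); it is not the exact evaluation code.

import numpy as np

def seen_unseen_curve(scores_s, y_s, scores_u, y_u, seen_idx, gammas):
    # Sweep the calibration factor gamma (eq. (5.2)) and record (A_U->T, A_S->T).
    def per_class_acc(scores, y, gamma):
        adjusted = scores.astype(float).copy()
        adjusted[:, seen_idx] -= gamma
        pred = np.argmax(adjusted, axis=1)
        return np.mean([np.mean(pred[y == c] == c) for c in np.unique(y)])

    return np.array([(per_class_acc(scores_u, y_u, g),   # A_U->T
                      per_class_acc(scores_s, y_s, g))   # A_S->T
                     for g in gammas])

def ausuc(curve):
    # Area under the Seen-Unseen accuracy Curve via the trapezoidal rule,
    # with points sorted along the A_U->T axis.
    order = np.argsort(curve[:, 0])
    return float(np.trapz(curve[order, 1], curve[order, 0]))

The swept gammas should cover a wide range, including very negative and very positive values, so that the curve reaches both ends described above.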
An immediate and important use of the metric AUSUC is for model selection. Many ZSL methods require tuning hyperparameters—previous work tunes them based on the accuracy A_U→U. The selected model, however, does not necessarily balance optimally between A_U→T and A_S→T. Instead, we advocate using AUSUC for model selection and hyperparameter tuning. Models with higher values of AUSUC are likely to perform in a balanced manner on the task of GZSL. We provide detailed discussions in Section 5.4.2.
5.3.3 Comparisons to alternative approaches
As introduced in Chapter 3, Socher et al. [172] propose a two-stage zero-shot learning approach
that first predicts whether an image is of seen or unseen classes according to certain novelty
scores, and then accordingly applies the corresponding classifiers. If we define a new form of
novelty score N(x) = max_{u ∈ U} f_u(x) − max_{s ∈ S} f_s(x) in eq. (3.15), we recover the prediction rule in eq. (5.2). However, this relation holds only if we are interested in predicting one label ŷ. When we are interested in predicting a set of labels (for example, hoping that the correct labels are in the top K predicted labels, i.e., the Flat hit@K metric, cf. Section 5.4), the two prediction rules will give different results.
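To make the difference concrete, the sketch below contrasts top-K prediction under the two rules for a single test point. It is illustrative only: it assumes, as a simplification of the two-stage approach, that the novelty score N(x) = max_u f_u(x) − max_s f_s(x) is thresholded at −γ to decide whether to rank only the unseen or only the seen classes, whereas calibrated stacking ranks all classes with the seen scores reduced by γ.

import numpy as np

def topk_calibrated(scores, seen_idx, gamma, k):
    # Top-K labels over the joint space with seen scores reduced by gamma (eq. (5.2)).
    adjusted = scores.astype(float).copy()
    adjusted[seen_idx] -= gamma
    return np.argsort(-adjusted)[:k]

def topk_two_stage(scores, seen_idx, unseen_idx, gamma, k):
    # Two-stage rule (sketch): threshold N(x) = max_u f_u(x) - max_s f_s(x) at
    # -gamma, then rank labels only within the selected group of classes.
    novelty = scores[unseen_idx].max() - scores[seen_idx].max()
    group = unseen_idx if novelty > -gamma else seen_idx
    return group[np.argsort(-scores[group])][:k]

scores = np.array([2.0, 1.7, 1.9, 0.4])          # classes 0, 1 seen; 2, 3 unseen
seen, unseen = np.array([0, 1]), np.array([2, 3])
print(topk_calibrated(scores, seen, gamma=0.5, k=2))         # [2 0]: mixes both groups
print(topk_two_stage(scores, seen, unseen, gamma=0.5, k=2))  # [2 3]: unseen classes only

In this toy case both rules return class 2 at K = 1, but their top-2 sets already differ.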
5.4 Empirical studies
5.4.1 Setup
Datasets, features, and semantic representations We mainly use three benchmark datasets: the Animals with Attributes (AwA) [106], CUB-200-2011 Birds (CUB) [187], and ImageNet [157]. Please refer to Section 4.5 for details. We use the GoogLeNet deep features.
Compared methods We examine SynC and several representative conventional zero-shot learn-
ing approaches, described briefly below. Direct Attribute Prediction (DAP) and Indirect Attribute
Prediction (IAP) [106] are probabilistic models that perform attribute predictions as an interme-
diate step and then use them to compute MAP predictions of unseen class labels. ConSE [135]
makes use of pre-trained classifiers for seen classes and their probabilistic outputs to infer the semantic representation of each test example, and then classifies it into the unseen class with the most similar semantic representation. We use binary attributes for DAP and IAP, and continuous
attributes and WORD2VEC for ConSE and SynC, following [106, 135].
Generalized zero-shot learning tasks There are no previously established benchmark tasks
for GZSL. We thus define a set of tasks that reflects more closely how data are distributed in
real-world applications.
2: If a single γ is desired, the "F-score" that balances A_U→T and A_S→T can be used.
We construct the GZSL tasks by composing test data as a combination of images from both
seen and unseen classes. We follow existing splits of the datasets for the conventional ZSL
to separate seen and unseen classes. Moreover, for the datasets AwA and CUB, we hold out 20% of the data points from the seen classes (previously, all of them were used for training in the conventional zero-shot setting) and merge them with the data from the unseen classes to form the
test set; for ImageNet, we combine its validation set (having the same classes as its training set)
and the 21K classes that are not in the ILSVRC 2012 1K dataset.
Evaluation metrics While we will primarily report the performance of ZSL approaches under the metric Area Under Seen-Unseen accuracy Curve (AUSUC) developed in Section 5.3.2, we explain how its two accuracy components A_S→T and A_U→T are computed below.
For AwA and CUB, the seen and unseen accuracies correspond to the multi-way classification accuracy normalized by class size, where the seen accuracy is computed on the 20% of images held out from the seen classes and the unseen accuracy is computed on images from unseen classes.
For ImageNet, seen and unseen accuracies correspond to Flat hit@K (F@K), defined as the
percentage of test images for which the model returns the true label in its top K predictions. Note
that, F@1 is the unnormalized multi-way classification accuracy. Moreover, following the proce-
dure in [50, 135], we evaluate on three scenarios of increasing difficulty: (1) 2-hop contains 1,509 unseen classes that are within two tree hops of the 1K seen classes according to the ImageNet label hierarchy (see footnote 3). (2) 3-hop contains 7,678 unseen classes that are within three tree hops of the seen classes. (3) All contains all 20,345 unseen classes.
5.4.2 Hyper-parameter tuning strategies
Cross-validation with AUSUC In Section 5.3.2, we introduce the Area Under Seen-Unseen
accuracy Curve (AUSUC), which is analogous to many metrics in computer vision and machine
learning that balance two conflicting (sub)metrics, such as area under ROC. To tune the hyper-
parameters based on this metric (see footnote 4), we simulate the generalized zero-shot learning setting during cross-validation.
Concretely, we split the training data into 5 folds A1, A2, A3, A4 and A5 so that the class
labels of these folds are disjoint. We further split 80% and 20% of data from each fold (A1-A5,
respectively) into pseudo-train and pseudo-test sets, respectively. We then combine the pseudo-
train sets of four folds (for example, A1-A4) for training, and validate on (i) the pseudo-test sets
of such four folds (i.e., A1-A4) and (ii) the pseudo-train set of the remaining fold (i.e., A5). That
is, the remaining fold serves as the pseudo-unseen classes in cross-validation. We repeat this
process for 5 rounds—each round selects a fold as the “remaining” fold, and computes AUSUC
on the corresponding validation set. Finally, the average of AUSUCs over all rounds is used to
select hyper-parameters.
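The split construction can be sketched as follows (a schematic illustration with synthetic indices, not the exact script used in the experiments):

import numpy as np

def gzsl_cv_splits(labels, n_folds=5, pseudo_train_ratio=0.8, seed=0):
    # Simulate GZSL during cross-validation: classes are split into class-disjoint
    # folds A1..A5, and 80% / 20% of each class's data become pseudo-train /
    # pseudo-test. Each round trains on the pseudo-train sets of four folds and
    # validates on (i) their pseudo-test sets and (ii) the pseudo-train set of the
    # held-out fold, which plays the role of the pseudo-unseen classes.
    rng = np.random.RandomState(seed)
    classes = rng.permutation(np.unique(labels))
    folds = np.array_split(classes, n_folds)
    is_pseudo_train = np.zeros(len(labels), dtype=bool)
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        is_pseudo_train[idx[:int(pseudo_train_ratio * len(idx))]] = True
    for held_out in range(n_folds):
        unseen = np.isin(labels, folds[held_out])
        train_idx = np.where(~unseen & is_pseudo_train)[0]
        val_idx = np.where((~unseen & ~is_pseudo_train) | (unseen & is_pseudo_train))[0]
        yield train_idx, val_idx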
Comparison to an alternative strategy Another strategy for hyper-parameter tuning is to find two sets of hyper-parameters: one optimized for seen classes and the other for unseen classes. The standard cross-validation technique, where A_S→S is optimized, can be used for the former.
3: http://www.image-net.org/api/xml/structure_released.xml
4: AUSUC is computed by varying the γ factor within a range. If a single γ is desired, another measure such as the "F-score" balancing A_U→T and A_S→T can be used. One can also assume a prior probability of whether any instance is seen or unseen to select the factor.
Table 5.2: Comparison of performance measured in AUSUC between two cross-validation strategies on AwA and CUB. One strategy is based on accuracies (A_S→S and A_U→U) and the other is based on AUSUC. See text for details.

                 AwA                                  CUB
Method           CV by accuracies   CV by AUSUC       CV by accuracies   CV by AUSUC
DAP [106]        0.341              0.366             0.202              0.194
IAP [106]        0.366              0.394             0.194              0.199
ConSE [135]      0.443              0.428             0.190              0.212
SynC o-vs-o      0.539              0.568             0.324              0.336
SynC struct      0.551              0.583             0.356              0.356
For the latter, it has been shown that the class-wise cross-validation technique, where the conventional zero-shot learning task is simulated, outperforms the standard technique. In this case, A_U→U is optimized. We thus use the first set of hyper-parameters to construct the scoring functions for the seen classes, and use the second set for the unseen classes (cf. Section 5.2).
In this subsection, we show that the strategy that jointly optimizes hyper-parameters based
on AUSUC in most cases leads to better models for GZSL than the strategy that optimizes seen
and unseen classifiers’ performances separately. On AwA and CUB, we perform 5-fold cross-
validation based on the two strategies and compare the performance of those selected models in
Table 5.2. In general, cross-validation based on AUSUC leads to better models for GZSL (see footnote 5). In the following we thus stick with cross-validation with AUSUC.
5.4.3 Which method to use to perform GZSL?
Table 5.3 provides an experimental comparison between several methods utilizing seen and un-
seen classifiers for generalized ZSL, with hyperparameters cross-validated to maximize AUSUC.
The results show that, irrespective of which ZSL methods are used to generate models for
seen and unseen classes, our method of calibrated stacking for generalized ZSL outperforms
other methods. In particular, despite their probabilistic justification, the two novelty detection
methods do not perform well. We believe that this is because most existing zero-shot learning
methods are discriminative and optimized to take full advantage of class labels and semantic
information. In contrast, either Gaussian or LoOP approach models all the seen classes as a
whole, possibly at the cost of modeling inter-class differences.
5.4.4 Which zero-shot learning approach is more robust to GZSL?
Fig. 5.4 contrasts in detail several ZSL approaches when tested on the task of GZSL, using the
method of calibrated stacking. Clearly, the SynC method dominates all other methods over the whole range. The crosses on the plots mark the results of direct stacking (Section 5.2).
5: The exceptions are ConSE on AwA and DAP on CUB.
Table 5.3: Performances measured in AUSUC of several methods for Generalized Zero-Shot
Learning on AwA and CUB. The higher the better (the upper bound is 1).
                 AwA                                          CUB
Method           Novelty detection [172]      Calibrated      Novelty detection [172]      Calibrated
                 Gaussian      LoOP           stacking        Gaussian      LoOP           stacking
DAP              0.302         0.272          0.366           0.122         0.137          0.194
IAP              0.307         0.287          0.394           0.129         0.145          0.199
ConSE            0.342         0.300          0.428           0.130         0.136          0.212
SynC o-vs-o      0.420         0.378          0.568           0.191         0.209          0.336
SynC struct      0.424         0.373          0.583           0.199         0.224          0.356
Figure 5.4: Comparison between several ZSL approaches on the task of GZSL for AwA and CUB (Split 1), plotting A_S→T against A_U→T. AUSUC values—AwA: DAP 0.366, IAP 0.394, ConSE 0.428, SynC o-vs-o 0.568, SynC struct 0.583; CUB: DAP 0.205, IAP 0.211, ConSE 0.208, SynC o-vs-o 0.338, SynC struct 0.354.
Table 5.4: Performances measured in AUSUC by different zero-shot learning approaches on
GZSL on ImageNet, using our method of calibrated stacking.
Unseen classes   Method          Flat hit@K
                                 1        5        10       20
2-hop            ConSE           0.042    0.168    0.247    0.347
                 SynC o-vs-o     0.044    0.218    0.338    0.466
                 SynC struct     0.043    0.199    0.308    0.433
3-hop            ConSE           0.013    0.057    0.090    0.135
                 SynC o-vs-o     0.012    0.070    0.119    0.186
                 SynC struct     0.013    0.066    0.110    0.170
All              ConSE           0.007    0.030    0.048    0.073
                 SynC o-vs-o     0.006    0.034    0.059    0.097
                 SynC struct     0.007    0.033    0.056    0.090
Fig. 5.5 contrasts in detail ConSE to SynC, the two known methods for large-scale ZSL. When accuracy is measured in Flat hit@1 (i.e., multi-class classification accuracy), neither method dominates the other, suggesting different trade-offs by the two methods. However, when we measure hit rates in the top K > 1, SynC dominates ConSE. Table 5.4 gives a summarized comparison in AUSUC between the two methods on the ImageNet dataset.
Figure 5.5: Comparison between ConSE and SynC of their performances on the task of GZSL for ImageNet, where the unseen classes are within 2 tree-hops from seen classes. Each panel plots A_S→T against A_U→T. AUSUC values—Flat hit@1: ConSE 0.042, SynC o-vs-o 0.044, SynC struct 0.043; Flat hit@5: ConSE 0.168, SynC o-vs-o 0.218, SynC struct 0.199; Flat hit@10: ConSE 0.247, SynC o-vs-o 0.338, SynC struct 0.308; Flat hit@20: ConSE 0.347, SynC o-vs-o 0.466, SynC struct 0.433.
We observe that SynC
in general outperforms ConSE except when Flat hit@1 is used, in which case the two methods’
performances are nearly indistinguishable.
5.5 Summary
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting where test data are
assumed to come from unseen classes only. In this chapter, we advocate studying the problem
of generalized zero-shot learning (GZSL) where the test data’s class memberships are uncon-
strained. We show empirically that naively using the classifiers constructed by ZSL approaches
does not perform well in the generalized setting. Motivated by this, we propose a simple but
effective calibration method that can be used to balance two conflicting forces: recognizing data
from seen classes versus those from unseen ones. We develop a performance metric to character-
ize such a trade-off and examine the utility of this metric in evaluating various ZSL approaches. Among the compared approaches, SynC performs the best. Since this work was published [28], much new work has been dedicated to the generalized setting, ranging from visual object recognition to video-based action recognition [98, 43, 204, 114, 33, 13, 215, 191, 25, 200, 15, 229, 224].
Chapter 6
From Zero-Shot Learning to Conventional Supervised Learning
The generalized zero-shot learning (GZSL) setting, approach, and evaluation metric introduced in the previous chapter allow us to realistically and fairly compare zero-shot learning with conventional supervised learning, in which for any class c ∈ T we have labeled training data. This comparison is extremely important for understanding how far the current development of zero-shot learning techniques is from the performance that can be achieved by conventional supervised learning if we put more effort into collecting and labeling data.
To this end, we conduct a large-scale study including 1,000 seen and 1,000 or over 20,000 unseen classes. Our analysis shows a large gap between the GZSL approaches (using the existing semantic representations) and multi-class classification. We then demonstrate that by improving the representations to incorporate domain cues—e.g., peeking at a few instances of each category and treating the average features as the representations—such a gap can be largely reduced even when using the same ZSL approaches, suggesting the next step to advance ZSL.
In the following, we start with the comparison among zero-shot, few-shot, and the conven-
tional supervised learning paradigms. We then describe our experimental setup, present the re-
sults, and provide the key insights.
6.1 Comparisons among different learning paradigms
Conventional supervised learning for classification assumes that for all the categories of interest (i.e., T), sufficient training examples are accessible. Zero-shot learning (ZSL), on the other hand, separates T into two disjoint subsets S and U, where for classes in U no training examples are accessible. Such a separation can be applied to other learning paradigms like one-shot or few-shot learning [186, 171] as well, where for classes in U only one or a few training examples are accessible. See Fig. 6.1 for an illustration. In this case, conventional supervised learning can also be viewed as many-shot learning.
To construct classifiers for T, in supervised learning we can directly train a multi-class classifier using the one-versus-other or Crammer-Singer loss mentioned in Section 4.2. For zero-shot learning, we leverage the class semantic representations to transfer the classifiers or discriminative knowledge from S to U. For example, our SynC algorithm learns a mechanism (from training data of S) to synthesize the classifier of any class given its semantic representation.
We note that SynC (and many other ZSL algorithms) is not designed for a specific type of semantic representations. That is, it can be applied to few-shot or many-shot learning as long as we have class semantic representations. While semantic representations are mostly not provided in these learning paradigms, we can indeed construct them using visual features—for example, by taking the average visual features of the images of each class. In the following we experiment with this idea to connect multiple learning paradigms and analyze the performance gap.
Figure 6.1: The comparison of zero-shot, few-shot, and conventional supervised learning (i.e., many-shot learning). For all the paradigms, categories of interest can be separated into two portions: one with many training examples per class, and one with zero, few, or again many examples. For ZSL, the first (second) portion is called seen (unseen) classes, and extra class semantic representations a_c are provided. In our SynC algorithm, we learn a mechanism h to synthesize the classifier w_c given the corresponding a_c. We can actually learn and apply the same mechanism to the other paradigms if we have a_c: for example, constructing a_c by averaging visual features.
6.2 Empirical studies
6.2.1 Overview
Zero-shot learning, in either the conventional or the generalized setting, is a challenging problem as there is no labeled data for the unseen classes. The performance of ZSL methods depends
on at least two factors: (1) how seen and unseen classes are related; (2) how effectively the
relation can be exploited by learning algorithms to generate models for the unseen classes. For
generalized zero-shot learning, the performance further depends on how classifiers for seen and
unseen classes are combined to classify new data into the joint label space.
Despite extensive study in ZSL, several questions remain understudied. For example, given a
dataset and a split of seen and unseen classes, what is the best possible performance of any ZSL
method? How far are we from there? What is the most crucial component we can improve in
order to reduce the gap between the state-of-the-art and the ideal performances? In this section,
we empirically analyze ZSL methods in detail and shed light on some of those questions.
Figure 6.2: We contrast the performances of GZSL to multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K, plotting A_S→T against A_U→T under Flat hit@1 and Flat hit@5. GZSL uses WORD2VEC (in red) and the idealized visual features (G-attr, in black) as semantic representations. AUSUC values—Flat hit@1: multi-class 0.352, GZSL (word2vec) 0.038, GZSL (G-attr) 0.251; Flat hit@5: multi-class 0.657, GZSL (word2vec) 0.170, GZSL (G-attr) 0.578.
6.2.2 Setup
As ZSL methods do not use labeled data from unseen classes for training classifiers, one reason-
able estimate of their best possible performance is to measure the performance on a multi-class
classification task where annotated data on the unseen classes are provided.
Concretely, to construct the multi-class classification task, on AwA and CUB, we randomly
select 80% of the data along with their labels from all classes (seen and unseen) to train classifiers.
The remaining 20% will be used to assess both the multi-class classifiers and the classifiers from
ZSL. Note that, for ZSL, only the seen classes from the 80% are used for training—the portion
belonging to the unseen classes are not used.
On ImageNet, to reduce the computational cost (of constructing multi-class classifiers which
would involve 20,345-way classification), we subsample another 1,000 unseen classes from its
original 20,345 unseen classes. We call this new dataset ImageNet-2K (including the 1K seen
classes from ImageNet). Out of those 1,000 unseen classes, we randomly select 50 samples per
class and reserve them for testing and use the remaining examples (along with their labels) to
train 2000-way classifiers.
For ZSL methods, we use either attribute vectors or word vectors (WORD2VEC) as semantic representations. Since SynC o-vs-o performs well on a range of datasets and settings, we focus on this method. For multi-class classification, we train one-versus-other SVMs. Once we obtain the classifiers for both seen and unseen classes, we use the calibrated stacking decision rule to combine them (as in generalized ZSL) and vary the calibration factor γ to obtain the Seen-Unseen accuracy Curve, exemplified in Fig. 5.3.
6.2.3 Results
How far are we from the ideal performance? Fig. 6.2 displays the Seen-Unseen accuracy Curves for ImageNet-2K; similar trends are observed for AwA and CUB. Clearly, there is a large gap between the performances of GZSL using the default WORD2VEC semantic representations and the ideal performance indicated by the multi-class classifiers.
Table 6.1: Comparison of performances measured in AUSUC between GZSL (using WORD2VEC
and G-attr) and multi-class classification on ImageNet-2K. Few-shot results are averaged over
100 rounds. GZSL with G-attr improves upon GZSL with WORD2VEC significantly and quickly
approaches multi-class classification performance.
Method                                 Flat hit@K
                                       1             5             10            20
GZSL   WORD2VEC                        0.04          0.17          0.27          0.38
       G-attr from 1 image             0.08±0.003    0.25±0.005    0.33±0.005    0.42±0.005
       G-attr from 10 images           0.20±0.002    0.50±0.002    0.62±0.002    0.72±0.002
       G-attr from 100 images          0.25±0.001    0.57±0.001    0.69±0.001    0.78±0.001
       G-attr from all images          0.25          0.58          0.69          0.79
Multi-class classification             0.35          0.66          0.75          0.82
Note that the cross marks indicate the results of direct stacking. The multi-class classifiers not only dominate GZSL over the whole range (and thus have high AUSUCs) but also are capable of learning classifiers that are well-balanced (such that direct stacking works well).
How much can idealized semantic representations help? We hypothesize that a large portion
of the gap between GZSL and multi-class classification can be attributed to the weak semantic
representations used by the GZSL approach.
We investigate this by using a form of idealized semantic representations. As the success of zero-shot learning relies heavily on how accurately semantic representations capture the visual similarity among classes, we examine the idea of using visual features as semantic representations. Concretely, for each class, a semantic representation can be obtained by averaging the visual features of images belonging to that class. We call these representations G-attr as we derive the visual features from GoogLeNet. Note that, for unseen classes, we only use the reserved training examples to derive the semantic representations; we do not use their labels to train classifiers.
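Constructing G-attr amounts to a per-class average of (a subset of) the GoogLeNet features. A minimal sketch, assuming features is an N x 1024 array and labels the corresponding class indices:

import numpy as np

def g_attr(features, labels, n_per_class=None, seed=0):
    # Average visual features per class to form idealized semantic representations.
    # If n_per_class is given, only that many randomly sampled images per class are
    # used (the few-shot budget); their labels are never used to train classifiers.
    rng = np.random.RandomState(seed)
    reps = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if n_per_class is not None:
            idx = rng.choice(idx, size=min(n_per_class, len(idx)), replace=False)
        reps[c] = features[idx].mean(axis=0)
    return reps  # class index -> 1024-dimensional G-attr vector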
Fig. 6.2 shows the performance of GZSL using G-attr—the gaps to the multi-class classification performances are significantly reduced compared to those of GZSL using WORD2VEC. In some cases, GZSL can almost match the performance of multi-class classifiers without using any labels from the unseen classes!
How much labeled data to improve GZSL’s performance? Imagine we are given a budget to
label data from unseen classes, how much those labels can improve GZSL’s performance?
Table 6.1 contrasts the AUSUCs obtained by GZSL to those from mutli-class classification
on ImageNet-2K, where GZSL is allowed to use visual features as embeddings—those features
can be computed from a few labeled images from the unseen classes, a scenario we can refer to
as “few-shot” learning. Using about (randomly sampled) 100 labeled images per class, GZSL
can quickly approach the performance of multi-class classifiers, which use about 1,000 labeled
images per class. Moreover, those G-attr visual features as semantic representations improve
upon WORD2VEC more significantly under Flat hit@K = 1 than when K> 1.
Table 6.2: Comparison of performances measured in AUSUC between GZSL with WORD2VEC
and GZSL with G-attr on the full ImageNet with over 20,000 unseen classes. Few-shot results
are averaged over 20 rounds.
Method                         Flat hit@K
                               1              5              10             20
WORD2VEC                       0.006          0.034          0.059          0.096
G-attr from 1 image            0.018±0.0002   0.071±0.0007   0.106±0.0009   0.150±0.0011
G-attr from 10 images          0.050±0.0002   0.184±0.0003   0.263±0.0004   0.352±0.0005
G-attr from 100 images         0.065±0.0001   0.230±0.0002   0.322±0.0002   0.421±0.0002
G-attr from all images         0.067          0.236          0.329          0.429
We further examine the whole ImageNet with 20,345 unseen classes in Table 6.2, where we keep 80% of the unseen classes' examples to derive G-attr and test on the rest, and observe similar trends. Specifically, on Flat hit@1, the performance of G-attr from merely 1 image is three times that of WORD2VEC, while G-attr from 100 images achieves over ten times.
6.3 Summary
The studies show a large gap between the GZSL approaches and multi-class classification (by
conventional supervised learning)—the latter achieves three times better AUSUC than the for-
mer. Under the hit@5 metric, the multi-class classifier achieves 0.66 AUSUC while the GZSL
approach by SynC achieves only 0.17 (the maximum is 1.0).
We hypothesize that the gap is largely attributable to the weak class semantic representations. As the success of ZSL relies heavily on how accurately semantic representations capture the visual similarity among classes, we examine the idea of using visual features as semantic representations.
The performance is encouraging—by treating the average visual features (of both seen and unseen classes) over only a few examples as the semantic representations, the gap can already be significantly reduced, even when using the same zero-shot learning algorithm.
Such a way to build up semantic representations, however, is not realistic for zero-shot
learning—we should not observe any labeled example of unseen classes. In the next chapter,
we develop an algorithm to improve semantic representations without seeing examples of unseen
classes, according to the insights of this chapter.
Chapter 7
Improving Semantic Representations by Predicting Visual
Exemplars (EXEM)
The insights from the previous chapter suggest that designing high-quality semantic representations or improving the existing ones should be a focus of zero-shot learning research. To this end, we propose to learn a mapping from the original semantic representations to the average visual features (called visual exemplars), using the seen classes' data—visual exemplars are exactly the representations used in the previous chapter to replace the original ones. The resulting mapping is then used to obtain improved representations, which can be either plugged into any ZSL approach or treated as the (single) training instances for unseen classes so that supervised algorithms like nearest neighbors can be applied. Fig. 7.1 shows the conceptual diagram.
Figure 7.1: Given the semantic information and visual features of the seen classes, our method learns a kernel-based regressor ψ(·) such that the semantic representation a_c of class c (e.g., a_House Wren, a_Cardinal, a_Cedar Waxwing, a_Gadwall, a_Mallard) can predict well its class exemplar (center) v_c in the PCA-projected visual feature space, i.e., ψ(a_c) ≈ v_c, which characterizes the clustering structure. The learned ψ(·) can be used to predict the visual feature vectors ψ(a_u) of the unseen classes for nearest-neighbor (NN) classification, or to improve the semantic representations for existing ZSL approaches.
7.1 Approach
Our approach is based on the structural constraint that takes advantage of the clustering structure
assumption in the semantic embedding space. The constraint forces the semantic representations
to be predictive of their visual exemplars (i.e., cluster centers). In this section, we describe how
55
we achieve this goal. First, we describe how we learn a function to predict the visual exemplars
from the semantic representations. Second, given a novel semantic representation, we describe
how we apply this function to perform zero-shot learning.
7.1.1 Learning to predict the visual exemplars from the semantic representations
For each class c, we would like to find a transformation function ψ(·) such that ψ(a_c) ≈ v_c, where v_c ∈ R^d is the visual exemplar for the class. In this chapter, we create the visual exemplar of a class by averaging the PCA projections of the data belonging to that class. That is, we consider v_c = (1 / |I_c|) Σ_{n ∈ I_c} M x_n, where I_c = {n : y_n = c} and M ∈ R^{d×D} is the PCA projection matrix computed over the training data of the seen classes. We note that M is fixed for all data points (i.e., not class-specific) and is used in Eq. (7.1).
Given the training visual exemplars and semantic representations, we learn d support vector regressors (SVR) with the RBF kernel—each of them predicts one dimension of the visual exemplars from the corresponding semantic representations. Specifically, for each of the d dimensions, we use the ν-SVR formulation [164]. Details are in Section 7.2.
Note that the PCA step is introduced for both the computational and statistical benefits. In
addition to reducing dimensionality for faster computation, PCA decorrelates the dimensions of
visual features such that we can predict these dimensions independently rather than jointly.
See Section 7.3 for analysis on applying SVR and PCA.
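A minimal sketch of constructing the visual exemplars is given below (illustrative: it uses scikit-learn's PCA in place of the exact projection used in our implementation, and the projected dimensionality d is a placeholder):

import numpy as np
from sklearn.decomposition import PCA

def compute_exemplars(X_seen, y_seen, d=500):
    # Fit the PCA projection (the role of M) on seen-class features, then average
    # the projected features within each class to obtain its visual exemplar v_c.
    pca = PCA(n_components=d)
    Z = pca.fit_transform(X_seen)                     # N x d projected features
    exemplars = {c: Z[y_seen == c].mean(axis=0) for c in np.unique(y_seen)}
    return pca, exemplars                             # pca is reused in eq. (7.1)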
7.1.2 Zero-shot learning based on the predicted visual exemplars
Now that we have learned the transformation function ψ(·), how do we use it to perform zero-shot classification? We first apply ψ(·) to all semantic representations a_u of the unseen classes. We consider two main approaches that depend on how we interpret these predicted exemplars ψ(a_u).
7.1.2.1 Predicted exemplars as training data
An obvious approach is to use ψ(a_u) as data directly. Since there is only one data point per class, a natural choice is to use a nearest neighbor classifier. Then, the classifier outputs the label of the closest exemplar for each novel data point x that we would like to classify:

    ŷ = argmin_u  dis_NN(M x, ψ(a_u)),    (7.1)

where we adopt the (standardized) Euclidean distance as dis_NN in the experiments.
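A sketch of this classification step is given below; pca and the stacked predicted exemplars are assumed to come from hypothetical helpers such as the ones above, and passing per-dimension standard deviations switches to the standardized Euclidean distance used by EXEM (1NNs) in Section 7.3:

import numpy as np

def nn_predict(X_test, pca, exemplars, unseen_labels, std=None):
    # EXEM (1NN): assign each test point to the label of the nearest predicted
    # exemplar (eq. (7.1)). `exemplars` is a U x d array of psi(a_u), row-aligned
    # with `unseen_labels`; `std` optionally holds per-dimension standard deviations
    # (averaged over seen classes) for the standardized Euclidean distance.
    Z = pca.transform(X_test)                                   # project with M
    if std is not None:
        Z, exemplars = Z / std, exemplars / std
    dists = np.linalg.norm(Z[:, None, :] - exemplars[None, :, :], axis=2)
    return unseen_labels[np.argmin(dists, axis=1)]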
7.1.2.2 Predicted exemplars as the ideal semantic representations
The other approach is to use ψ(a_u) as the ideal semantic representations ("ideal" in the sense that they have knowledge about visual features) and plug them into any existing zero-shot learning framework. We provide two examples.
In the method of convex combination of semantic embeddings (ConSE) [135], their original semantic embeddings are replaced with the corresponding predicted exemplars, while the combining coefficients remain the same. In the method of synthesized classifiers (SynC), the predicted exemplars are used to define the similarity values between the unseen classes and the bases, which in turn are used to compute the combination weights for constructing classifiers. In particular, their similarity measure is of the form exp{−dis(a_c, b_r)} / Σ_{r=1}^R exp{−dis(a_c, b_r)}, where dis is the (scaled) Euclidean distance and the b_r's are the semantic representations of the base classes. In this case, we simply need to change this similarity measure to exp{−dis(ψ(a_c), ψ(b_r))} / Σ_{r=1}^R exp{−dis(ψ(a_c), ψ(b_r))}.
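The modified similarity can be written compactly as below (a sketch; psi stands for the learned exemplar predictor, B for the base classes' semantic representations, and the scaling of the distance is omitted):

import numpy as np

def sync_similarity(a_c, B, psi):
    # Softmax over negative Euclidean distances between predicted exemplars,
    # replacing the original semantic representations in SynC's similarity.
    d = np.array([np.linalg.norm(psi(a_c) - psi(b_r)) for b_r in B])
    w = np.exp(-d)
    return w / w.sum()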
7.1.3 Comparison to related approaches
One appealing property of our approach is its scalability: we learn and predict at the exemplar (class) level, so the runtime and memory footprint of our approach depend only on the number of seen classes rather than the number of training data points. This is much more efficient than other ZSL algorithms that learn at the level of each individual training instance [44, 105, 138, 5, 212, 50, 172, 135, 82, 127, 6, 156, 220, 221, 123].
Several methods propose to learn visual exemplars (see footnote 1) by preserving structures obtained in the semantic space [189, 119]. However, our approach predicts them with a regressor, so that they may or may not strictly follow the structure in the semantic space; they are thus more flexible and could even better reflect similarities between classes in the visual feature space.
Similar in spirit to our work, [129] proposes using nearest class mean classifiers for ZSL. The
Mahalanobis metric learning in this work could be thought of as learning a linear transformation
of semantic representations (their “zero-shot prior” means, which are in the visual feature space).
Our approach learns a highly non-linear transformation. Moreover, our EXEM (1NNs) (cf. Sec-
tion 7.3) learns a (simpler, i.e., diagonal) metric over the learned exemplars. Finally, the main
focus of [129] is on incremental, not zero-shot, learning settings (see also [152, 147]).
[218] proposes to use a deep feature space as the semantic embedding space for ZSL. Though
similar to ours, they do not compute averages of visual features (exemplars) but train neural networks to predict all visual features from their semantic representations. Their model learning takes significantly longer than ours. Neural networks are also more prone to overfitting and give
inferior results (cf. Section 7.3). Additionally, we provide empirical studies on much larger-scale
datasets for zero-shot learning, and analyze the effect of PCA.
7.2 Other details
SVR formulation for predicting visual exemplars Given semantic representation-visual ex-
emplar pairs of the seen classes, we learn d support vector regressors (SVR) with RBF kernel.
¹ Exemplars are used loosely here and do not necessarily mean class-specific feature averages.
Specifically, for each dimension d = 1, ..., d of v_c, SVR is learned based on the ν-SVR formulation [164]:

    min_{w, ξ, ξ′, ε}   (1/2) wᵀw + λ (ν ε + (1/S) Σ_{c=1}^{S} (ξ_c + ξ′_c))
    s.t.   wᵀ φ_rbf(a_c) − v_c ≤ ε + ξ_c                                    (7.2)
           v_c − wᵀ φ_rbf(a_c) ≤ ε + ξ′_c
           ξ_c ≥ 0,  ξ′_c ≥ 0,

where φ_rbf is an implicit nonlinear mapping based on our kernel. We have dropped the subscript d for aesthetic reasons but readers are reminded that each regressor is trained independently with its own target values (i.e., v_cd) and parameters (i.e., w_d). We found that the regression error is not sensitive to λ and set it to 1 in all experiments. We jointly tune ν ∈ (0, 1] and the kernel bandwidth and finally apply the same set of hyper-parameters for all the d regressors. Details on hyper-parameter tuning can be found in Section 7.3. The resulting ψ(·) = [w_1ᵀ φ_rbf(·), ..., w_dᵀ φ_rbf(·)]ᵀ, where w_d is from the d-th regressor.
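A minimal sketch of this per-dimension regression, assuming scikit-learn's NuSVR (whose C and gamma parameters roughly play the roles of λ and the RBF bandwidth above), is given below; A holds the seen classes' semantic representations and V the corresponding PCA-projected visual exemplars.

# A minimal sketch: fit one nu-SVR per exemplar dimension and stack their predictions.
import numpy as np
from sklearn.svm import NuSVR

def fit_exemplar_regressors(A, V, nu=0.5, gamma=1.0):
    """A: (S, da) seen-class semantic representations; V: (S, d) seen-class exemplars."""
    return [NuSVR(kernel="rbf", nu=nu, C=1.0, gamma=gamma).fit(A, V[:, j])
            for j in range(V.shape[1])]

def predict_exemplar(regressors, a):
    """psi(a): concatenate the d regressors' predictions for one semantic vector a."""
    a = np.asarray(a).reshape(1, -1)
    return np.array([r.predict(a)[0] for r in regressors])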
7.3 Empirical studies
We conduct extensive empirical studies of our approach EXEM for both the conventional and
generalized settings on four benchmark datasets—Animal with Attributes (AwA) [105], CUB-
200-2011 Birds (CUB) [187], SUN Attribute (SUN) [142], and the full ImageNet Fall 2011
dataset [38] with more than 20,000 unseen classes. Despite its simplicity, our approach outper-
forms other existing ZSL approaches in most cases, demonstrating the potential of improving
semantic representations towards visual exemplars.
7.3.1 Setup
Datasets, features, and semantic representations Please refer to Section 4.5 for details.
We use the GoogLeNet deep features. For ImageNet, we further derive 21,632 dimensional
semantic vectors of the class names using multidimensional scaling (MDS) on the WordNet hier-
archy, as in [123]. We normalize the class semantic representations to have unit ℓ2 norms.
7.3.2 Implementation details
Variants of our ZSL models given predicted exemplars The main step of our method is
to predict visual exemplars that are well-informed about visual features. How we proceed to
perform zero-shot classification (i.e., classifying test data into the label space of unseen classes)
based on such exemplars is entirely up to us. In this chapter, we consider the following zero-shot
classification procedures that take advantage of the predicted exemplars:
• EXEM (ZSL method): ZSL method with predicted exemplars as semantic representations,
where ZSL method = ConSE [135], LatEm [199], and SynC.
• EXEM (1NN): 1-nearest neighbor classifier with the Euclidean distance to the exemplars.
• EXEM (1NNs): 1-nearest neighbor classifier with the standardized Euclidean distance
to the exemplars, where the standard deviation is obtained by averaging the intra-class
standard deviations of all seen classes.
EXEM (ZSL method) regards the predicted exemplars as the ideal semantic representations (Section 7.1.2.2). On the other hand, EXEM (1NN) treats predicted exemplars as data prototypes (Section 7.1.2.1). The standardized Euclidean distance in EXEM (1NNs) is introduced as a way
to scale the variance of different dimensions of visual features. In other words, it helps reduce the
effect of collapsing data that is caused by our usage of the average of each class’ data as cluster
centers.
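For concreteness, the per-dimension scaling used by EXEM (1NNs) can be estimated as in the following sketch, where features are the (PCA-projected) visual features of the seen-class training data.

# A minimal sketch: average the intra-class standard deviations over all seen classes.
import numpy as np

def averaged_intra_class_std(features, labels):
    """features: (N, d) seen-class data; labels: (N,) class ids; returns a (d,) divisor."""
    stds = [features[labels == c].std(axis=0) for c in np.unique(labels)]
    return np.mean(stds, axis=0)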
Hyper-parameter tuning There are several hyper-parameters to be tuned in our experiments:
(a) the projected dimensionality d for PCA and (b) λ, ν, and the RBF-kernel bandwidth in SVR. For (a), we found that the ZSL performance is not sensitive to d and thus set d = 500 for all experiments. For (b), we perform class-wise cross-validation (CV), with one exception: we found that λ = 1 works robustly on all datasets for zero-shot learning.
The class-wise CV can be done as follows. We hold out data from a subset of seen classes as
pseudo-unseen classes, train our models on the remaining folds (which belong to the remaining
classes), and tune hyper-parameters based on a certain performance metric on the held-out fold.
This scenario simulates the ZSL setting and has been shown to outperform the conventional CV
in which each fold contains a portion of training examples from all classes.
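A minimal sketch of this class-wise splitting (holding out whole seen classes as pseudo-unseen classes, rather than holding out examples within every class) could look as follows.

# A minimal sketch of class-wise cross-validation folds over the seen classes.
import numpy as np

def class_wise_folds(num_seen_classes, num_folds=5, seed=0):
    """Yield (train_classes, heldout_classes) index arrays for each fold."""
    rng = np.random.RandomState(seed)
    classes = rng.permutation(num_seen_classes)
    folds = np.array_split(classes, num_folds)
    for k in range(num_folds):
        heldout = folds[k]
        train = np.concatenate([folds[j] for j in range(num_folds) if j != k])
        yield train, heldout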
We consider the following two performance metrics. The first one minimizes the distance
between the predicted exemplars and the ground-truth (average of PCA-projected validation data
of each class) in R^d. We use the Euclidean distance in this case. We term this measure CV-distance. This approach does not assume the downstream task at training and aims to measure the quality of predicted exemplars by their faithfulness.
The other approach maximizes the zero-shot classification accuracy on the validation set. This
measure can easily be obtained for EXEM (1NN) and EXEM (1NNS), which use simple decision
rules that have no further hyper-parameters to tune. Empirically, we found that CV-accuracy
generally leads to slightly better performance. The results reported in the following for these two
approaches are thus based on this measure.
On the other hand, EXEM (SYNC o-vs-o), EXEM (SYNC struct), EXEM (CONSE), and EXEM (LATEM) require further hyper-parameter tuning. For computational purposes, we use CV-distance for tuning hyper-parameters of the regressors, followed by the hyper-parameter tuning for SYNC and CONSE using the predicted exemplars. Since SYNC and CONSE construct their classifiers based on the distance values between class semantic representations, we do not expect a significant performance drop in this case. (We remind the reader that, in EXEM (SYNC o-vs-o), EXEM (SYNC struct), EXEM (CONSE), and EXEM (LATEM), the predicted exemplars are used as semantic representations.)
7.3.3 Predicted visual exemplars
We first show that predicted visual exemplars better reflect visual similarities between classes
than semantic representations. Let D_au be the pairwise Euclidean distance matrix between unseen classes computed from semantic representations (i.e., U by U), D_ψ(au) the distance matrix
Table 7.1: We compute the Euclidean distance matrix between the unseen classes based on semantic representations (D_au), predicted exemplars (D_ψ(au)), and real exemplars (D_vu). Our method leads to D_ψ(au) that is better correlated with D_vu than D_au is. See text for more details.

Dataset    Correlation to D_vu
           Semantic distances (D_au)    Predicted exemplar distances (D_ψ(au))
AwA        0.862                        0.897
CUB        0.777 ± 0.021                0.904 ± 0.026
SUN        0.784 ± 0.022                0.893 ± 0.019
Figure 7.2: t-SNE [184] visualization of randomly selected real images (crosses) and predicted
visual exemplars (circles) for the unseen classes on (from left to right) AwA, CUB, SUN, and
ImageNet. Different colors of symbols denote different unseen classes. Perfect predictions of
visual features would result in well-aligned crosses and circles of the same color. Plots for CUB
and SUN are based on their first splits. Plots for ImageNet are based on randomly selected 48
unseen classes from 2-hop and word vectors as semantic representations. Best viewed in color.
computed from predicted exemplars, and D_vu the distance matrix computed from real exemplars (which we do not have access to). Table 7.1 shows that the correlation between D_ψ(au) and D_vu is much higher than that between D_au and D_vu. Importantly, we improve this correlation without access to any data of the unseen classes.
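The comparison in Table 7.1 can be reproduced in spirit with the following sketch; the chapter does not pin down the exact correlation variant, so Pearson correlation over the upper-triangular entries is used here as one natural choice.

# A minimal sketch: correlation between two pairwise-distance matrices over the same classes.
import numpy as np
from scipy.spatial.distance import cdist

def distance_matrix_correlation(X, Y):
    """X, Y: (U, d) class representations; returns corr of their pairwise distance entries."""
    Dx, Dy = cdist(X, X), cdist(Y, Y)
    iu = np.triu_indices(len(X), k=1)        # off-diagonal upper-triangular entries
    return np.corrcoef(Dx[iu], Dy[iu])[0, 1]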
We then show some t-SNE [184] visualization of predicted visual exemplars of the unseen
classes. Ideally, we would like them to be as close to their corresponding real images as possible.
In Fig. 7.2, we demonstrate that this is indeed the case for many of the unseen classes; for those unseen classes (each of which is denoted by a color), their real images (crosses) and our predicted
visual exemplars (circles) are well-aligned.
The quality of predicted exemplars (in this case based on the distance to the real images)
depends on two main factors: the predictive capability of semantic representations and the number
of semantic representation-visual exemplar pairs available for training, which in this case is equal
to the number of seen classes S. On AwA where we have only 40 training pairs, the predicted
exemplars are surprisingly accurate, mostly either placed in their corresponding clusters or at least
closer to their clusters than predicted exemplars of the other unseen classes. Thus, we expect them
to be useful for discriminating among the unseen classes. On ImageNet, the predicted exemplars
are not as accurate as we would have hoped, but this is expected since the word vectors are purely
learned from text.
We also observe relatively well-separated clusters in the semantic representation space (in our
case, also the visual feature space since we only apply PCA projections to the visual features),
confirming our assumption about the existence of clustering structures. On CUB, we observe
that these clusters are more mixed than on other datasets. This is not surprising given that it is a
fine-grained classification dataset of bird species.
7.3.4 Results on the conventional setting
7.3.4.1 Main results
Table 7.2 summarizes our results in the form of multi-way classification accuracies on all datasets.
We significantly outperform recent state-of-the-art baselines when using GoogLeNet features.
We note that, on AwA, several recent methods obtain higher accuracies due to using a more
optimistic evaluation metric (per-sample accuracy) and new types of deep features [221, 220].
These higher numbers have been shown to be difficult to replicate (cf. Table 2 in [198]).
Our alternative approach of treating predicted visual exemplars as the ideal semantic repre-
sentations significantly outperforms taking semantic representations as given. EXEM (SYNC),
EXEM (CONSE), EXEM (LATEM) outperform their corresponding base ZSL methods relatively
by 5.9-6.8%, 11.4-27.6%, and 1.1-17.1%, respectively. This again suggests improved quality of
semantic representations (on the predicted exemplar space).
Furthermore, we find that there is no clear winner between using predicted exemplars as ideal
semantic representations or as data prototypes. The former seems to perform better on datasets
with fewer seen classes. Nonetheless, we remind that using 1-nearest-neighbor classifiers clearly
scales much better than zero-shot learning methods; EXEM (1NN) and EXEM (1NNS) are more
efficient than EXEM (SYNC), EXEM (CONSE), and EXEM (LATEM).
Finally, we find that in general using the standardized Euclidean distance instead of the Eu-
clidean distance for nearest neighbor classifiers helps improve the accuracy, especially on CUB,
suggesting there is a certain effect of collapsing actual data during training. The only exception
Table 7.2: Comparison between existing ZSL approaches in multi-way classification accuracies (in %) on four benchmark datasets. For each dataset, we mark the best in red and the second best in blue. Italic numbers denote per-sample accuracy instead of per-class accuracy. On ImageNet, we report results for both types of semantic representations: word vectors (wv) and MDS embeddings derived from WordNet (hie). All the results are based on GoogLeNet features [175].

Approach               AwA     CUB     SUN     ImageNet (wv)   ImageNet (hie)
CONSE^y [135]          63.3    36.2    51.9    1.3             -
BIDILEL [189]          72.4    49.7^x  -       -               -
LATEM^z [199]          72.1    48.0    64.5    -               -
CCA [123]              -       -       -       -               1.8
SYNC o-vs-o            69.7    53.4    62.8    1.4             2.0
SYNC struct            72.9    54.5    62.7    1.5             -
EXEM (CONSE)           70.5    46.2    60.0    -               -
EXEM (LATEM)^z         72.9    56.2    67.4    -               -
EXEM (SYNC o-vs-o)     73.8    56.2    66.5    1.6             2.0
EXEM (SYNC struct)     77.2    59.8    66.1    -               -
EXEM (1NN)             76.2    56.3    69.6    1.7             2.0
EXEM (1NNS)            76.5    58.5    67.3    1.8             2.0

^x: on a particular split of seen/unseen classes.
^y: our implementation.
^z: based on the code of [199], averaged over 5 different initializations.
is on SUN. We suspect that the standard deviation values computed on the seen classes on this
dataset may not be robust enough as each class has only 20 images.
7.3.4.2 Large-scale zero-shot classification results
We then provide expanded results for ImageNet, following evaluation protocols in the literature.
In Tables 7.3 and 7.4, we provide results based on the exemplars predicted by word vectors and MDS features derived from WordNet, respectively. We consider SYNC o-vs-o, rather than SYNC struct, as the former shows better performance on ImageNet. Regardless of the types of metrics used,
our approach outperforms the baselines significantly when using word vectors as semantic rep-
resentations. For example, on 2-hop, we are able to improve the F@1 accuracy by 2% over
the state-of-the-art. However, we note that this improvement is not as significant when using
MDS-WordNet features as semantic representations.
We observe that the 1-nearest-neighbor classifiers perform better than using predicted exem-
plars as more powerful semantic representations. We suspect that, when the number of classes
is very high, zero-shot learning methods (CONSE or SYNC) do not fully take advantage of the
meaning provided by each dimension of the exemplars.
Table 7.3: Comparison between existing ZSL approaches on ImageNet using word vectors of the class names as semantic representations. For both metrics (in %), the higher the better. The best is in red. The numbers of unseen classes are listed in parentheses. ^y: our implementation.

Test data        Approach              Flat Hit@K                        Hierarchical precision@K
                                       K=1   2     5     10    20        2     5     10    20
2-hop (1,509)    CONSE^y [135]         8.3   12.9  21.8  30.9  41.7      21.5  23.8  27.5  31.3
                 SYNC o-vs-o           10.5  16.7  28.6  40.1  52.0      25.1  27.7  30.3  32.1
                 EXEM (SYNC o-vs-o)    11.8  18.9  31.8  43.2  54.8      25.6  28.1  30.2  31.6
                 EXEM (1NN)            11.7  18.3  30.9  42.7  54.8      25.9  28.5  31.2  33.3
                 EXEM (1NNS)           12.5  19.5  32.3  43.7  55.2      26.9  29.1  31.1  32.0
3-hop (7,678)    CONSE^y [135]         2.6   4.1   7.3   11.1  16.4      6.7   21.4  23.8  26.3
                 SYNC o-vs-o           2.9   4.9   9.2   14.2  20.9      7.4   23.7  26.4  28.6
                 EXEM (SYNC o-vs-o)    3.4   5.6   10.3  15.7  22.8      7.5   24.7  27.3  29.5
                 EXEM (1NN)            3.4   5.7   10.3  15.6  22.7      8.1   25.3  27.8  30.1
                 EXEM (1NNS)           3.6   5.9   10.7  16.1  23.1      8.2   25.2  27.7  29.9
All (20,345)     CONSE^y [135]         1.3   2.1   3.8   5.8   8.7       3.2   9.2   10.7  12.0
                 SYNC o-vs-o           1.4   2.4   4.5   7.1   10.9      3.1   9.0   10.9  12.5
                 EXEM (SYNC o-vs-o)    1.6   2.7   5.0   7.8   11.8      3.2   9.3   11.0  12.5
                 EXEM (1NN)            1.7   2.8   5.2   8.1   12.1      3.7   10.4  12.1  13.5
                 EXEM (1NNS)           1.8   2.9   5.3   8.2   12.2      3.6   10.2  11.8  13.2
Table 7.4: Comparison between existing ZSL approaches on ImageNet (with 20,842 unseen classes) using MDS embeddings derived from WordNet [123] as semantic representations. The higher, the better (in %). The best is in red.

Test data       Approach              Flat Hit@K
                                      K=1   2     5     10    20
All (20,842)    CCA [123]             1.8   3.0   5.2   7.3   9.7
                SYNC o-vs-o           2.0   3.4   6.0   8.8   12.5
                EXEM (SYNC o-vs-o)    2.0   3.3   6.1   9.0   12.9
                EXEM (1NN)            2.0   3.4   6.3   9.2   13.1
                EXEM (1NNS)           2.0   3.4   6.2   9.2   13.2
Table 7.5: Accuracy of EXEM (1NN) on AwA, CUB, and SUN when predicted exemplars are from original visual features (No PCA) and PCA-projected features (PCA with d = 1024 and d = 500).

Dataset    No PCA (d = 1024)    PCA (d = 1024)    PCA (d = 500)
AwA        77.8                 76.2              76.2
CUB        55.1                 56.3              56.3
SUN        69.2                 69.6              69.6
Table 7.6: Comparison between EXEM (1NN) with support vector regressors (SVR) and with 2-layer multi-layer perceptron (MLP) for predicting visual exemplars. Results on CUB are for the first split. Each number for MLP is an average over 3 random initializations.

Dataset    How to predict exemplars    No PCA (d = 1024)    PCA (d = 1024)    PCA (d = 500)
AwA        SVR                         77.8                 76.2              76.2
           MLP                         76.1 ± 0.5           76.4 ± 0.1        75.5 ± 1.7
CUB        SVR                         57.1                 59.4              59.4
           MLP                         53.8 ± 0.3           54.2 ± 0.3        53.8 ± 0.5
7.3.4.3 Analysis
PCA or not? Table 7.5 investigates the effect of PCA. In general, EXEM (1NN) performs com-
parably with and without PCA. Moreover, decreasing PCA projected dimension d from 1024 to
500 does not hurt the performance. Clearly, a smaller PCA dimension leads to faster computation
due to fewer regressors to be trained.
Kernel regression vs. Multi-layer perceptron We compare two approaches for predicting
visual exemplars: kernel-based support vector regressors (SVR) and 2-layer multi-layer percep-
tron (MLP) with ReLU nonlinearity. MLP weights are ℓ2-regularized, and we cross-validate the
regularization constant.
Table 7.6 shows that SVR performs more robustly than MLP. One explanation is that MLP is
prone to overfitting due to the small training set size (the number of seen classes) as well as the
model selection challenge imposed by ZSL scenarios. SVR also comes with other benefits; it is
more efficient and less susceptible to initialization.
7.3.5 Results on the generalized setting
We evaluate our methods and baselines using the Area Under Seen-Unseen accuracy Curve
(AUSUC) and report the results in Table 7.7. Following the same evaluation procedure as be-
fore, our approach again outperforms the baselines on all datasets.
Recently, Xian et al. [198] propose to unify the evaluation protocol in terms of image fea-
tures, class semantic representations, data splits, and evaluation criteria for conventional and
generalized zero-shot learning. In their protocol, GZSL is evaluated by the harmonic mean of
Table 7.7: Generalized ZSL results in Area Under Seen-Unseen accuracy Curve (AUSUC) on AwA, CUB, and SUN. For each dataset, we mark the best in red and the second best in blue. All approaches use GoogLeNet as the visual features and calibrated stacking to combine the scores for seen and unseen classes.

Approach               AwA      CUB      SUN
DAP^y [106]            0.366    0.194    0.096
IAP^y [106]            0.394    0.199    0.145
CONSE^y [135]          0.428    0.212    0.200
ESZSL^y [156]          0.449    0.243    0.026
SYNC o-vs-o^y          0.568    0.336    0.242
SYNC struct^y          0.583    0.356    0.260
EXEM (SYNC o-vs-o)     0.553    0.365    0.265
EXEM (SYNC struct)     0.587    0.397    0.288
EXEM (1NN)             0.570    0.318    0.284
EXEM (1NNS)            0.584    0.373    0.287

^y: our implementation.
seen and unseen classes’ accuracies. Technically, AUSUC provides a more complete picture of a zero-shot learning method’s performance, but it is less simple than the harmonic mean.
7.4 Summary
We developed a novel approach by learning a mapping from the original semantic representations
to the average visual features, using the seen classes’ data. The resulting mapping is then used to
obtain improved representations, which can be either plugged into any ZSL approaches or treated
as the (single) training instances for unseen classes so that supervised algorithms like nearest
neighbors can be applied. While extremely simple, the latter way leads to promising results, even
outperforming SynC on the large-scale zero-shot learning task.
Part III
Domain Generalization for Visual Question Answering
Chapter 8
Introduction to Visual Question Answering and Its Challenges
So far we have talked about how to recognize unseen objects and differentiate them from seen
ones with the help of external class semantic representations, under the learning paradigm called
zero-shot learning (cf. Part II). To focus on object recognition there, we made an assumption—
the question to an intelligent system is always “What is the animal (or object, scene, etc.)?”—so
that we can ignore the information from questions. In this part, we will reconsider questions and
focus on visual question answering (Visual QA), a much more difficult task than object recognition. Specifically, given an image, the system needs to understand the question and then come up
with the answers, rather than just outputting all the object names within that image. (See Fig. 8.1
for an illustration.) The questions can begin with words other than “what”. The corresponding
answers thus may go beyond object names to further include counts (e.g., three), time (e.g., at
night), or even relationships among objects (e.g., to the left of the truck). In other words, Visual
QA requires comprehending and reasoning with both visual and language information, which is
an essential functionality for general artificial intelligence (AI).
To master Visual QA, we need not only novel learning algorithms (including model architec-
tures) and faithful evaluation metrics, but also new data collections to provide learning signal and
test environment. Moreover, as humans are known to have remarkable ability in generalizing and
adapting their intelligence to new environments (i.e., in the wild), it is thus crucial to investigate
whether the learned models have acquired such an ability. See Fig. 8.2 for an illustration.
Figure 8.1: The visual question answering (Visual QA) task [14]: given an image, an intelligent
system needs to answer questions related to the image.
Figure 8.2: To learn a Visual QA system, we need to collect multiple question-answer pairs
for an image (in black color). However, human language is extremely flexible—there can be
exponentially many distinct questions or answers with respect to the vocabulary size and the text
length. Moreover, people can have different language styles. As a consequence, there can be
many ways of phrasing questions (or answers) of the same semantic meaning (in gray color). It
is thus desirable to have a system capable of dealing with such unfamiliar language usage.
In this chapter, we first review existing work on Visual QA¹, and then discuss the remaining
challenges. We conclude with the outline of our work to be presented in the following chapters.
8.1 Review of existing work on Visual QA
8.1.1 Datasets
In merely the last four years, more than a dozen datasets have been released for Visual QA [89,
195, 72, 67, 88, 2]. In all the datasets, there is a collection of images (I). Most of them use natural images from large-scale common image databases (e.g., MSCOCO [117]), while some are
based on synthetic ones. Usually for each image, multiple questions (Q) and their corresponding
“correct” answers (T), referred to as targets, are generated. This can be achieved either by human
annotators, or with an automatic procedure that uses captions or question templates and detailed
image annotations. In our work, we will focus on VQA [14], Visual7W [230], Visual Genome
(VG) [100], COCOQA [150], and VQA2 [67], which are among the most widely-used datasets
in the literature.
¹ Please also refer to [195, 89] for overviews of the status quo of the Visual QA task.
Besides the pairs of questions and correct answers, VQA [14], Visual7W [230], and visual
Madlibs [213] provide “negative” candidate answers (D), referred to as decoys, for each pair so that the task can be evaluated by multiple-choice selection accuracy.
8.1.2 Evaluation
While ideally a Visual QA system can generate free-form answers [59], evaluating the answers
is challenging and not amenable to automatic evaluation. Thus, so far a convenient paradigm is
to evaluate machine systems using multiple-choice (MC) based Visual QA [14, 230, 80]. The
machine is presented the correct answer, along with several decoys and the aim is to select the
right one. The evaluation is then automatic: one just needs to record the accuracy of selecting the
right answer. Alternatively, the open-ended (OE) setting is to select one from the top frequent
answers and compare it to multiple human-annotated ones [10, 14, 18, 56, 67, 88, 122, 203, 209,
211, 227], avoiding constructing decoys that are too easy such that the performance is artificially
boosted [14, 67].
8.1.3 Algorithms
As summarized in [89, 195, 72], in open-ended Visual QA one popular framework of algorithms is
to learn a joint image-question embedding and perform multi-way classification (for predicting
top-frequency answers) on top [227, 10, 18, 56, 209, 122]. Though lacking the ability to gener-
ate novel answers beyond the training set, this framework has been shown to outperform other methods dedicated to free-form answer generation [195, 89].
Different from this line of research, in the multiple-choice setting, algorithms are usually
designed to learn a scoring function with the image, question, and a candidate answer as the
input [80, 56, 168]. Even a simple multi-layer perceptron (MLP) model achieves the state of the
art [80, 56, 168]. Such methods can take advantage of answer semantics but fail to scale up inference as the number of answer candidates increases.
8.1.4 Analysis
In [48], Ferraro et al. surveyed several existing image captioning and Visual QA datasets in terms
of their linguistic patterns. They proposed several metrics including perplexity, part of speech
distribution, and syntactic complexity to characterize those datasets, demonstrating the existence
of the reporting bias—the frequency that annotators write about actions, events, or states does not
reflect the real-world frequencies. However, they do not explicitly show how such a bias affects
the downstream tasks (i.e., Visual QA and captioning).
Specifically for Visual QA, there have been several works discussing the bias within a single dataset [67, 219, 80, 87]. For example, [67, 219] argue for the existence of priors on answers given the question types and for the correlation between the questions and answers (without images) in VQA [14]. They propose to augment the original datasets with additional IQT (i.e., image-question-target) triplets to resolve such issues. [80, 2] study biases across datasets and show the difficulty of transferring learned knowledge across datasets.
8.2 Challenges
While Visual QA has attracted significant attention, together with seemingly remarkable progress,
there are still many challenges to resolve to ensure that we are on the right track towards AI.
• What kind of knowledge does a Visual QA system actually learn? Does it truly understand the multi-modal information, or does it simply rely on and over-fit to incidental statistics or correlations?
• The current experimental setup mainly focuses on training and testing within the same
dataset. It is unclear how well the learned system generalizes to real environments where both the visual and language data might exhibit a distribution mismatch.
• State-of-the-art systems for different evaluation metrics are designed differently. It would
be desirable to have a unified system or algorithm to simultaneously master both metrics.
8.3 Contributions and outline of Part III
In my thesis, I strive to conduct comprehensive studies to answer the above questions. Based on the issues uncovered, we then develop corresponding solutions to advance Visual QA.
Chapter 9 We started with multiple-choice Visual QA, which can be evaluated by the selection
accuracy without considering the semantic ambiguity. Through careful analysis, we showed the
design of negative candidate answers (i.e., decoys) has a significant impact on how and what the
models learn from the existing datasets. In particular, the resulting model can ignore the visual
information, the question, or both while still doing well on the task. We developed automatic pro-
cedures to remedy such design deficiencies by re-constructing decoy answers. Empirical studies
show that the deficiencies have been alleviated in the remedied datasets and the performance on
them is likely a more faithful indicator of the difference among learning models.
Chapter 10 We then studied cross-dataset generalization as a proxy to evaluate how the learned
models can be applied to real-world environment, reminiscent of the seminal work by Torralba
and Efros [181] on object recognition. We showed that the language components contain strong
dataset characteristics (e.g., phrasing styles)—by looking at them alone, a machine can detect the
origin of a Visual QA instance (i.e., an image-question-answer triplet). We performed so far the
most comprehensive analysis to show that such characteristics significantly prevent cross-dataset
generalization, evaluated among five popular datasets (see Fig. 8.3). In other words, current
Visual QA models cannot effectively handle unfamiliar language in new environments.
To this end, we developed a novel domain adaptation algorithm for Visual QA so that we can
properly transfer the learned knowledge (see Fig. 8.4). We introduced a framework by adapting
the unfamiliar language usage (target domain) to what the learned Visual QA model has been
trained on (source domain) so that we can re-use the model without re-training. Our algorithm
minimizes the domain mismatch while ensuring the consistency among different modalities (i.e.,
images, questions, and answers), given only a limited amount of data from the target domain.
Figure 8.3: We experiment with knowledge transfer across five popular datasets: VQA [14], Vi-
sual7W [230], Visual Genome (VG) [100], COCOQA [150], and VQA2 [67]. We train a model
on one dataset and investigate how well it can perform on the others.
Figure 8.4: We introduced a framework by adapting the unfamiliar language usage (target do-
main) to what the learned Visual QA model has been trained on (source domain) so that we can
re-use the model without re-training.
Figure 8.5: Denoting i as an image, q as a question, and c as a candidate answer, we aim to learn a scoring function f(i, q, c) that gives a high score if c is the target answer of the (i, q) pair. We factorize f(i, q, c) into h(i, q) and g(c), in which we can take advantage of existing joint embeddings of vision and language for h(i, q). Moreover, g(c) can effectively capture the answer semantics ignored in many state-of-the-art models. The scoring function is learned to maximize the likelihood of outputting the target answer from a set of stochastically sampled candidates.
Chapter 11 We further developed a probabilistic and factorization framework of Visual QA
algorithms that can be applied to both the multiple-choice and open-ended settings. Our frame-
work effectively leverages the answer semantics and can directly account for out-of-vocabulary
instances (see Fig. 8.5), drastically increasing the transferability. More importantly, the work in both Chapters 10 and 11 can be applied to existing models so that we can stand on the shoulders of their insightful architecture designs in learning joint vision and language embeddings.
Chapter 9
Creating Better Visual Question Answering Datasets
9.1 Overview
In this chapter, we study how to design high-quality multiple choices for the Visual QA task. In
this task, the machine (or the human annotator) is presented with an image, a question and a list
of candidate answers. The goal is to select the correct answer through a consistent understanding
of the image, the question and each of the candidate answers. As in any multiple-choice based
tests (such as the GRE), designing what should be presented as negative answers (we refer to them as decoys) is as important as deciding the questions to ask. We all have had the experience of
exploiting the elimination strategy: This question is easy—none of the three answers could be
right so the remaining one must be correct!
While a clever strategy for taking exams, such “shortcuts” prevent us from studying faithfully
how different learning algorithms comprehend the meanings in images and languages (e.g., the
quality of the embeddings of both images and languages in a semantic space). It has been noted
that machines can achieve very high accuracies of selecting the correct answer without the visual
input (i.e., the image), the question, or both [80, 14]. Clearly, the learning algorithms have over-fit
on incidental statistics in the datasets. For instance, if the decoy answers have rarely been used
as the correct answers (to any questions), then the machine can rule out a decoy answer with a
binary classifier that determines whether the answers are in the set of the correct answers—note
that this classifier does not need to examine the image and it just needs to memorize the list of
the correct answers in the training dataset. See Fig. 9.1 for an example, and Section 9.3 for a more detailed analysis.
We focus on minimizing the impacts of exploiting such shortcuts. We suggest a set of prin-
ciples for creating decoy answers. In light of the amount of human efforts in curating existing
datasets for the Visual QA task, we propose two procedures that revise those datasets such that
the decoy answers are better designed. In contrast to some earlier works, the procedures are fully
automatic and do not incur additional human annotator efforts. We apply the procedures to revise
both Visual7W [230] and VQA [14]. Additionally, we create new multiple-choice based datasets
from COCOQA [150] and the recently released VQA2 [67] and Visual Genome datasets [100].
The one based on Visual Genome becomes the largest multiple-choice dataset for the Visual QA
task, with more than one million image-question-candidate-answer triplets.
We conduct extensive empirical and human studies to demonstrate the effectiveness of our
procedures in creating high-quality datasets for the Visual QA task. In particular, we show that
machines need to use all three sources of information (images, questions, and answers) to perform well; any
Figure 9.1: An illustration of how the shortcuts in the Visual7W dataset [230] should be remedied.
Question: What vehicle is pictured?
Candidate answers (the numbers in brackets are probability scores computed using eq. (9.2)):
  Image only Unresolvable (IoU): a. Overcast. (0.5455)  b. Daytime. (0.4941)  c. A building. (0.4829)  d. A train. (0.5363)
  Question only Unresolvable (QoU): a. A bicycle. (0.2813)  b. A truck. (0.5364)  c. A boat. (0.4631)  d. A train. (0.5079)
  Original: a. A car. (0.2083)  b. A bus. (0.6151)  c. A cab. (0.5000)  d. A train. (0.7328)
In the original dataset, the correct answer “A train” is easily selected by a machine as it is far more often used as the correct answer than the other decoy (negative) answers. Our two procedures, QoU and IoU (cf. Section 9.4), create alternative decoys such that both the correct answer and the decoys are highly likely by examining either the image or the question alone. In these cases, machines make mistakes unless they consider all information together. Thus, the alternative decoys suggested by our procedures are better designed to gauge how well a learning algorithm can understand all information equally well.
missing information induces a large drop in performance. Furthermore, we show that humans
dominate machines in the task. However, given that the revised datasets likely reflect the true gap between human and machine understanding of multimodal information, we expect that advances in learning algorithms will focus more on the task itself instead of overfitting to the idiosyncrasies in the datasets.
The rest of the chapter is organized as follows. In Section 9.2, we describe related work. In
Section 9.3, we analyze and discuss the design deficiencies in existing datasets. In Section 9.4, we
describe our automatic procedures for remedying those deficiencies. In Section 9.5 we conduct
experiments and analysis. We conclude the chapter in Section 9.6.
9.2 Related work
In VQA [14], the decoys consist of human-generated plausible answers as well as high-frequency
and random answers from the datasets. In Visual7W [230], the decoys are all human-generated
plausible ones. Note that, humans generate those decoys by only looking at the questions and the
correct answers but not the images. Thus, the decoys might be unrelated to the corresponding
images. A learning algorithm can potentially examine the image alone and be able to identify the
correct answer. In visual Madlibs [213], the questions are generated with a limited set of question
templates and the detailed annotations (e.g., objects) of the images. Thus, similarly, a learning
model can examine the image alone and deduce the correct answer.
Our work is inspired by the experiments in [80] where they observe that machines without
looking at images or questions can still perform well on the Visual QA task. Others have also
reported similar issues [67, 219, 87, 1, 88, 2], though not in the multiple-choice setting. Our work
extends theirs by providing more detailed analysis as well as automatic procedures to remedy
those design deficiencies.
Besides Visual QA, VisDial [36] and Ding et al. [39] also propose automatic ways to generate decoys for the tasks of multiple-choice visual dialog and captioning, respectively.
9.3 Analysis of decoy answers’ effects
In this section, we examine in detail the dataset Visual7W [230], a popular choice for the Vi-
sual QA task. We demonstrate how the deficiencies in designing decoy questions impact the
performance of learning algorithms.
In multiple-choice Visual QA datasets, a training or test example is a triplet that consists of an image I, a question Q, and a candidate answer set A. The set A contains a target T (the correct answer) and K decoys (incorrect answers) denoted by D. An IQA triplet is thus {I, Q, A = {T, D_1, ..., D_K}}. We use C to denote either the target or a decoy.
9.3.1 Visual QA models
We investigate how well a learning algorithm can perform when supplied with different modalities
of information. We concentrate on the one-hidden-layer MLP model proposed in [80], which has achieved state-of-the-art results on the dataset Visual7W. The model computes a scoring function f(c, i):

    f(c, i) = σ(U max(0, W g(c, i)) + b),    (9.1)

over a candidate answer c and the multimodal information i, where g is the joint feature of (c, i) and σ(x) = 1 / (1 + exp(−x)). The information i can be null, the image (I) alone, the question (Q) alone, or the combination of both (I+Q).
Given an IQA triplet, we use the penultimate layer of ResNet-200 [73] as visual features to
represent I and the average WORD2VEC embeddings [131] as text features to represent Q and C.
To form the joint feature g(c, i), we simply concatenate the features together. The candidate c ∈ A that has the highest f(c, i) score is selected as the model output.
We use the standard training, validation, and test splits of Visual7W, where each contains
69,817, 28,020, and 42,031 examples respectively. Each question has 4 candidate answers. The
parameters of f(c, i) are learned by minimizing the binary logistic loss of predicting whether or not a candidate c is the target of an IQA triplet. Details are in Section 9.5.
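A minimal PyTorch sketch of this scoring model (not the authors' exact implementation) is shown below; the feature dimensions follow the text (2,048-d ResNet visual features and 300-d average word2vec features for the question and the candidate answer).

# A minimal sketch of Eq. (9.1): one hidden layer with ReLU, sigmoid output score.
import torch
import torch.nn as nn

class MLPScorer(nn.Module):
    def __init__(self, in_dim=2048 + 300 + 300, hidden=8192):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden)   # plays the role of W (bias folded in)
        self.out = nn.Linear(hidden, 1)           # plays the role of U and b

    def forward(self, answer_feat, context_feat):
        g = torch.cat([answer_feat, context_feat], dim=-1)   # joint feature g(c, i)
        return torch.sigmoid(self.out(torch.relu(self.hidden(g)))).squeeze(-1)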
Information used Machine Human
random 25.0 25.0
A 52.9 -
I + A 62.4 75.3
Q + A 58.2 36.4
I + Q + A 65.7 88.4
Table 9.1: Accuracy of selecting the right answers out of 4 choices (%) on the Visual QA task on
Visual7W.
9.3.2 Analysis results
Machines find shortcuts Table 9.1 summarizes the performance of the learning models, to-
gether with the human studies we performed on a subset of 1,000 triplets (c.f. Section 9.5 for
details). There are a few interesting observations.
First, in the row of “A” where only the candidate answers (and whether they are right or
wrong) are used to train a learning model, the model performs significantly better than random
guessing and humans (52.9% vs. 25%)—humans will deem each of the answers equally likely
without looking at both the image and the question! Note that in this case, the information i in eq. (9.1) contains nothing. The model learns the specific statistics of the candidate answers in
the dataset and exploits those. Adding the information about the image (i.e., the row of “I+A”),
the machine improves significantly and gets close to the performance when all information is
used (62.4% vs. 65.7%). There is a weaker correlation between the question and the answers as
“Q+A” improves over “A” only modestly. This is expected. In the Visual7W dataset, the decoys
are generated by human annotators as plausible answers to the questions without being shown the
images—thus, many decoy answers do not have visual groundings. For instance, a question of
“what animal is running?” elicits equally likely answers such as “dog”, “tiger”, “lion”, or “cat”,
while an image of a dog running in the park will immediately rule out all but “dog”; see
Fig. 9.1 for a similar example. Thus, the performance of “I+A” implies that many IQA triplets
can be solved by object, attribute or concept detection on the image, without understanding the
questions. This is indeed the case also for humans—humans can achieve 75.3% by considering
“I+A” and not “Q”. Note that the difference between machine and human on “I+A” is likely due
to their difference in understanding visual information.
Note that humans improve significantly from “I+A” to “I+Q+A” with “Q” added, while the machine does so only marginally. The difference can be attributed to the difference in understanding the question and correlating it with the answers between the two. Since each image corresponds to multiple questions or has multiple objects, solely relying on the image itself will not work well in principle. Such a difference clearly indicates that in the Visual QA model, the language component is weak: the model cannot fully exploit the information in “Q”, making a smaller relative improvement of 5.3% (from 62.4% to 65.7%) whereas humans improve relatively by 17.4%.
Shortcuts are due to design deficiencies We probe deeper into how the decoy answers have
impacted the performance of learning models.
As explained above, the decoys are drawn from all plausible answers to a question, irrespec-
tive of whether they are visually grounded or not. We have also discovered that the targets (i.e.,
correct answers) are infrequently used as decoys.
Specifically, among the 69,817 training samples, there are 19,503 unique correct answers and
each one of them is used about 3.6 times as the correct answer to a question. However, among all the 69,817 × 3 ≈ 210K decoys, each correct answer appears 7.2 times on average, far below the chance level of 10.7 times (210K / 19,503 ≈ 10.7). This disparity exists in the test samples too.
Consequently, the following rule, which computes each answer's likelihood of being correct,

    P(correct | C) = 0.5,  if C is never seen in training;
    P(correct | C) = (# times C as target) / (# times C as target + (# times C as decoys) / K),  otherwise,    (9.2)

should perform well. Essentially, it measures how evenly C is used as the target versus as a decoy. Indeed, it attains an accuracy of 48.73% on the test data, far better than random guessing and close to the learning model using the answers’ information only (the “A” row in Table 9.1).
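The rule in eq. (9.2) amounts to the following simple counting procedure over the training set, sketched here with a hypothetical data structure (a list of (target, decoys) pairs).

# A minimal sketch of the answer-only rule in Eq. (9.2).
from collections import Counter

def answer_prior(train_triplets, K):
    """train_triplets: iterable of (target, decoy_list); returns a scoring function."""
    as_target, as_decoy = Counter(), Counter()
    for target, decoys in train_triplets:
        as_target[target] += 1
        for d in decoys:
            as_decoy[d] += 1

    def p_correct(c):
        if as_target[c] == 0 and as_decoy[c] == 0:
            return 0.5                      # candidate never seen in training
        return as_target[c] / (as_target[c] + as_decoy[c] / K)

    return p_correct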
Good rules for designing decoys Based on our analysis, we summarize the following guidance
rules to design decoys: (1) Question only Unresolvable (QoU). The decoys need to be equally
plausible to the question. Otherwise, machines can rely on the correlation between the question
and candidate answers to tell the target from decoys, even without the images. Note that this is a
principle that is being followed by most datasets. (2) Neutrality. The decoy answers should be equally likely to be used as the correct answers. (3) Image only Unresolvable (IoU). The decoys need
to be plausible to the image. That is, they should appear in the image, or there exist questions so
that the decoys can be treated as targets to the image. Otherwise, Visual QA can be resolved by
objects, attributes, or concepts detection in images, even without the questions.
Ideally, each decoy in an IQA triplet should meet the three principles. Neutrality is compa-
rably easier to achieve by reusing terms in the whole set of targets as decoys. On the contrary, a
decoy can hardly meet QoU and IoU simultaneously¹. However, as long as all decoys of an IQA
triplet meet Neutrality and some meet QoU and others meet IoU, the triplet as a whole achieves
the three principles—a machine ignoring either images or questions will likely perform poorly.
9.4 Creating better Visual QA datasets
In this section, we describe our approaches of remedying deficiencies in the existing datasets for
the Visual QA task. We introduce two automatic and widely-applicable procedures to create new
decoys that can prevent learning models from exploiting incident statistics in the datasets.
9.4.1 Methods
Main ideas Our procedures operate on a dataset that already contains image-question-target
(IQT) triplets, i.e., we do not assume it has decoys already. For instance, we have used our
¹ E.g., in Fig. 9.1, for the question “What vehicle is pictured?”, the only answer that meets both principles is “train”, which is the correct answer instead of being a decoy.
procedures to create a multiple-choice dataset from the Visual Genome dataset, which has no decoys. We assume that each image in the dataset is coupled with “multiple” QT pairs, which is
the case in nearly all the existing datasets. Given an IQT triplet (I, Q, T), we create two sets of
decoy answers.
• QoU-decoys. We search among all other triplets that have similar questions to Q. The tar-
gets of those triplets are then collected as the decoys for T. As the targets to similar ques-
tions are likely plausible for the question Q, QoU-decoys likely follow the rules of Neutral-
ity and Question only Unresolvable (QoU). We compute the average WORD2VEC [131]
to represent a question, and use the cosine similarity to measure the similarity between
questions.
• IoU-decoys. We collect the targets from other triplets of the same image to be the decoys
for T. The resulting decoys thus definitely follow the rules of Neutrality and Image only
Unresolvable (IoU).
We then combine the triplet (I, Q, T) with QoU-decoys and IoU-decoys to form an IQA triplet as
a training or test sample.
Resolving ambiguous decoys One potential drawback of automatically selected decoys is that
they may be semantically similar, ambiguous, or rephrased terms of the target [230]. We utilize two filtering steps to alleviate this. First, we perform string matching between a decoy and the
target, deleting those decoys that contain or are covered by the target (e.g., “daytime” vs. “during
the daytime” and “ponytail” vs. “pony tail”).
Secondly, we utilize the WordNet hierarchy and the Wu-Palmer (WUP) score [196] to elimi-
nate semantically similar decoys. The WUP score measures how similar two word senses are (in the range of [0, 1]), based on their depths in the taxonomy and that of their least common subsumer. We compute the similarity of two strings according to the WUP scores in a similar manner to [124], in which the WUP score is used to evaluate Visual QA performance. We eliminate decoys that have a high WUP-based similarity to the target. We use the NLTK toolkit [21]
to compute the similarity.
Other details For QoU-decoys, we sort and keep for each triplet the top N (e.g., 10,000) similar triplets from the entire dataset according to the question similarity. Then for each triplet, we compute the WUP-based similarity of each potential decoy to the target successively, and accept those with similarity below 0.9 until we have K decoys. We choose 0.9 according to [124]. We
also perform such a check among selected decoys to ensure they are not very similar to each
other. For IoU-decoys, the potential decoys are sorted randomly. The WUP-based similarity with
a threshold of 0.9 is then applied to remove ambiguous decoys.
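Putting the pieces together, the following sketch (simplified relative to the full WUPS-style string similarity of [124]) illustrates how QoU-decoys can be selected from similar questions while filtering out candidates that are too close to the target; question vectors are assumed to be average word2vec embeddings.

# A minimal, simplified sketch of QoU-decoy selection with a WUP-based filter.
# Requires the WordNet corpus (nltk.download('wordnet')).
import numpy as np
from nltk.corpus import wordnet as wn

def wup(word_a, word_b):
    """Max Wu-Palmer similarity over the two words' WordNet senses (0 if no sense found)."""
    scores = [a.wup_similarity(b) or 0.0
              for a in wn.synsets(word_a) for b in wn.synsets(word_b)]
    return max(scores, default=0.0)

def too_similar(decoy, target, threshold=0.9):
    # string containment check, then a token-level WUP check (a simplification)
    if decoy in target or target in decoy:
        return True
    return any(wup(d, t) >= threshold for d in decoy.split() for t in target.split())

def qou_decoys(question_vec, target, other_triplets, K=3):
    """other_triplets: list of (question_vec, answer); rank by cosine similarity of questions."""
    sims = [(np.dot(question_vec, q) /
             (np.linalg.norm(question_vec) * np.linalg.norm(q) + 1e-12), a)
            for q, a in other_triplets]
    decoys = []
    for _, answer in sorted(sims, reverse=True):
        if answer != target and not too_similar(answer, target):
            decoys.append(answer)
        if len(decoys) == K:
            break
    return decoys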
9.4.2 Comparison to other datasets
Several authors have noticed the design deficiencies in the existing databases and have proposed
“fixes” [14, 213, 230, 36]. No dataset has used a procedure to generate IoU-decoys. We empirically show how the IoU-decoys significantly remedy the design deficiencies in the datasets.
Several previous efforts have generated decoys that are similar in spirit to our QoU-decoys.
Yu et al. [213], Das et al. [36], and Ding et al. [39] automatically find decoys from similar ques-
tions or captions based on question templates and annotated objects, tri-grams and GLOVE em-
beddings [143], and paragraph vectors [108] and linguistic surface similarity, respectively. The
later two are for different tasks from Visual QA, and only Ding et al. [39] consider removing
semantically ambiguous decoys like ours. Antol et al. [14] and Zhu et al. [230] ask humans to
create decoys, given the questions and targets. As shown earlier, such decoys may disobey the
rule of Neutrality.
Goyal et al. [67] augment the VQA dataset [14] (by human efforts) with additional IQT
triplets to eliminate the shortcuts (language prior) in the open-ended setting. Their effort is com-
plementary to ours on the multiple-choice setting. Note that an extended task of Visual QA, visual
dialog [36], also adopts the latter setting.
9.5 Empirical studies
9.5.1 Dataset
We examine our automatic procedures for creating decoys on five datasets. Table 9.2 summarizes
the characteristics of the three datasets—VQA, Visual7W, and Visual Genome—we focus on.
VQA Real [14] The dataset uses images from MSCOCO [117] under the same splits for train-
ing/validation/testing to construct IQA triplets. In total, 614,163 IQA triplets are generated for 204,721 images. Each question has 18 candidate answers: in general 3 decoys are human-generated, 4 are randomly sampled, and 10 are randomly sampled frequently-occurring targets.
As the test set does not indicate the targets, our studies focus on the training and validation sets.
Visual7W Telling (Visual7W) [230] The dataset uses 47,300 images from MSCOCO [117]
and contains 139,868 IQA triplets. Each has 3 decoys generated by humans.
Visual Genome (VG) [100] The dataset uses 101,174 images from MSCOCO [117] and con-
tains 1,445,322 IQT triplets. No decoys are provided. Human annotators are asked to write
diverse pairs of questions and answers freely about an image or with respect to some regions of
it. On average an image is coupled with 14 question-answer pairs. We divide the dataset into
non-overlapping 50%/20%/30% portions for training/validation/testing. Additionally, we partition it such that each portion is a “superset” of the corresponding one in Visual7W.
COCOQA [150] This dataset contains in total 117,684 auto-generated IQT triplets with no
decoy answers. Therefore, we create decoys using our proposed approach and follow the original data split, leading to a training set and a testing set with 78,736 and 38,948 IQA triplets, respectively.
VQA2 [67] VQA2 is a successor dataset of VQA, which pairs each IQT triplet with a comple-
mentary one to reduce the correlation between questions and answers. There are 443,757 training
IQT triplets and 214,354 validation IQT triplets, with no decoys. We generate decoys using our
Dataset      # of Images              # of triplets              # of decoys
             train   val    test      train   val    test        per triplet
VQA          83k     41k    81k       248k    121k   244k        17
Visual7W     14k     5k     8k        69k     28k    42k         3
VG           49k     19k    29k       727k    283k   433k        -

Table 9.2: Summary of Visual QA datasets.
approach and follow the original data split to organize the data. We do not consider the test split
as it does not indicate the targets (correct answers).
Creating decoys We create 3 QoU-decoys and 3 IoU-decoys for every IQT triplet in each
dataset, following the steps in Section 9.4.1. In the cases where we cannot find 3 decoys, we
include random ones from the original set of decoys for VQA and Visual7W; for other datasets,
we randomly include those from the top 10 frequently-occurring targets.
9.5.2 Setup
Visual QA models We utilize the MLP models mentioned in Section 9.3 for all the experiments.
We denote MLP-A, MLP-QA, MLP-IA, MLP-IQA as the models using A (Answers only),
Q+A (Question plus Answers), I+A (Image plus Answers), and I+Q+A (Image, Question and
Answers) for multimodal information, respectively. The hidden-layer has 8,192 neurons. We use
a 200-layer ResNet [73] to compute visual features which are 2,048-dimensional. The ResNet
is pre-trained on ImageNet [157]. The WORD2VEC feature [131] for questions and answers are
300-dimensional, pre-trained on Google News². The parameters of the MLP models are learned
by minimizing the binary logistic loss of predicting whether or not a candidate answer is the
target of the corresponding IQA triplet.
We further experiment with a variant of the spatial memory network (denoted as Attention)
[203] and the HieCoAtt model [122] adjusted for the multiple-choice setting. Both models utilize
the attention mechanism.
Optimization We train all our models using a stochastic gradient-based optimization method with a mini-batch size of 100, momentum of 0.9, and a stepped learning rate policy: the learning rate is divided by 10 after every M mini-batches. We set the initial learning rate to be 0.01 (we
further consider 0.001 for the case of fine-tuning). For each model, we train with at most 600,000
iterations. We treat M and the number of iterations as hyper-parameters of training. We tune the
hyper-parameters on the validation set.
Within each mini-batch, we sample 100 IQA triplets. For each triplet, we randomly choose
to use QoU-decoys or IoU-decoys when training on IoU +QoU, or QoU-decoys or IoU-decoys
or Orig when training on All. We then take the target and 3 decoys for each triplet to train
the binary classifier (i.e., minimize the logistic loss). Specifically on VQA, which has 17 Orig
decoys for a triplet, we randomly choose 3 decoys out of them. That is, 100 triplets in the mini-
batch correspond to 400 examples with binary labels. This procedure is to prevent unbalanced
² We experiment with using different features in Section 9.5.4.
Method        Orig    IoU     QoU     IoU+QoU    All
MLP-A         52.9    27.0    34.1    17.7       15.6
MLP-IA        62.4    27.3    55.0    23.6       22.2
MLP-QA        58.2    84.1    40.7    37.8       31.9
MLP-IQA       65.7    84.1    57.6    52.0       45.1
HieCoAtt*     63.9    -       -       51.5       -
Attention*    65.9    -       -       52.8       -
Human         88.4    -       -       84.1       -
Random        25.0    25.0    25.0    14.3       10.0
*: based on our implementation or modification.

Table 9.3: Test accuracy (%) on Visual7W.
training, where machines simply learn to predict the dominant label, as suggested by Jabri et
al. [80]. In all the experiments, we use the same type of decoy sets for training and testing.
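A minimal sketch of the resulting training signal, assuming the MLPScorer sketched in Section 9.3.1, is given below; each sampled triplet contributes its target and 3 decoys as four binary examples.

# A minimal sketch: binary logistic loss over (1 target + 3 decoys) per triplet.
import torch
import torch.nn.functional as F

def triplet_batch_loss(model, answer_feats, context_feats, labels):
    """answer_feats: (4B, 300); context_feats: (4B, 2348) for I+Q; labels: (4B,) in {0, 1}."""
    scores = model(answer_feats, context_feats)      # sigmoid scores in (0, 1)
    return F.binary_cross_entropy(scores, labels.float())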
Evaluation metric For VQA and VQA2, we follow their protocols by comparing the picked
answer to 10 human-generated targets. The accuracy is computed based on the number of exactly
matched targets (divided by 3 and clipped at 1). For others, we compute the accuracy of picking
the target from multiple choices.
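For reference, the VQA-style per-question accuracy can be computed as in the following sketch.

# A minimal sketch of the VQA evaluation: credit is min(#matching human answers / 3, 1).
def vqa_accuracy(picked_answer, human_answers):
    matches = sum(1 for a in human_answers if a == picked_answer)
    return min(matches / 3.0, 1.0)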
Decoy sets to compare For each dataset, we derive several variants: (1) Orig: the original de-
coys from the datasets, (2) QoU: Orig replaced with ones selected by our QoU-decoys generating
procedure, (3) IoU: Orig replaced with ones selected by our IoU-decoys generating procedure,
(4) QoU +IoU: Orig replaced with ones combining QoU and IoU, (5) All: combining Orig,
QoU, and IoU.
User studies Automatic decoy generation may lead to ambiguous decoys as mentioned in Sec-
tion 9.4 and [230]. We conduct a user study via Amazon Mechanical Turk (AMT) to test humans’ performance on the datasets after they are remedied by our automatic procedures. We select 1,000 IQA triplets from each dataset. Each triplet is answered by three workers, and in total 169 workers are involved. The total cost is $215 (the rate for every 20 triplets is $0.25). We report the average
human performance and compare it to the learning models’.
9.5.3 Main results
We present the main results on VQA, Visual7W, and Visual Genome. The performances of
learning models and humans on the 3 datasets are reported in Tables 9.3, 9.4, and 9.5³.
³ We note that in Table 9.3, the 4.3% drop in human performance on IoU+QoU, compared to Orig, is likely due to the fact that IoU+QoU has more candidates (7 per question). Besides, the human performance on qaVG cannot be directly compared to that on the other datasets, since the questions in qaVG tend to focus on local image regions and are considered harder.
Method        Orig     IoU     QoU     IoU+QoU    All
MLP-A         31.2     39.9    45.7    31.2       27.4
MLP-IA        42.0     39.8    55.1    34.1       28.7
MLP-QA        58.0     84.7    55.1    54.4       50.0
MLP-IQA       64.6     85.2    65.4    63.7       58.9
HieCoAtt*     63.0     -       -       63.7       -
Attention*    66.0     -       -       66.7       -
Human         88.5^y   -       -       89.0       -
Random        5.6      25.0    25.0    14.3       4.2
*: based on our implementation or modification.
^y: taken from [14].

Table 9.4: Accuracy (%) on the validation set in VQA.
Effectiveness of new decoys A better set of decoys will force learning models to integrate all 3 pieces of information (images, questions, and answers) to make the correct selection from the multiple choices. In particular, they should prevent learning algorithms from exploiting shortcuts
such that partial information is sufficient for performing well on the Visual QA task.
Table 9.3 clearly indicates that those goals have been achieved. With the Orig decoys, the
relatively small gain from MLP-IA to MLP-IQA suggests that the question information can be
ignored while still attaining good performance. However, with the IoU-decoys, which require the questions to resolve (as the image alone is inadequate), the gain is substantial (from 27.3% to 84.1%). Likewise, with the QoU-decoys (the question alone is inadequate), including image information improves the accuracy substantially from 40.7% (MLP-QA) to 57.6% (MLP-IQA). Note
that with the Orig decoys, this gain is smaller (58.2% vs. 65.7%).
It is expected that MLP-IA matches better QoU-decoys but not IoU-decoys, and MLP-QA
is the other way around. Thus it is natural to combine these two decoys. What is particularly
appealing is that MLP-IQA improves noticeably over models learned with partial information on
the combined IoU +QoU-decoys (and “All” decoys
4
). Furthermore, using answer information
only (MLP-A) attains about the chance-level accuracy.
On the VQA dataset (Table 9.4), the same observations hold, though to a lesser degree. On
any of the IoU or QoU columns, we observe substantial gains when the complementary infor-
mation is added to the model (such as MLP-IA to MLP-IQA). All these improvements are much
more visible than those observed on the original decoy sets.
Combining Tables 9.3 and 9.4, we notice that the improvements from MLP-QA to MLP-IQA tend to be lower when facing IoU-decoys. This is also expected, as it is difficult to find decoys that are simultaneously both IoU and QoU; such answers tend to be the target answers. Nonetheless, we deem this a future direction to explore.
Footnote 4: We note that the decoys in Orig are not trivial, which can be seen from the gap between All and IoU+QoU. Our main concern with Orig is that for those questions that machines can accurately answer, they mostly rely on only partial information. This will thus hinder designing machines that fully comprehend and reason from multimodal information. We further experiment on random decoys, which achieve Neutrality but not the other two principles, to demonstrate the effectiveness of our methods in Section 9.5.4.
Method IoU QoU IoU+QoU
MLP-A 29.1 36.2 19.5
MLP-IA 29.5 60.2 25.2
MLP-QA 89.3 45.6 43.9
MLP-IQA 89.2 64.3 58.5
HieCoAtt* - - 57.5
Attention* - - 60.1
Human - - 82.5
Random 25.0 25.0 14.3
*: based on our implementation or modification
Table 9.5: Test accuracy (%) on qaVG.
Differences across datasets Contrasting Visual7W with VQA (on the column IoU+QoU), we notice that Visual7W tends to have bigger improvements in general. This is due to the fact that VQA has many questions with "Yes" or "No" as the target: the only valid decoy to the target "Yes" is "No", and vice versa. As such decoys are already captured by Orig of VQA ("Yes" and "No" are both among the most frequently occurring targets), adding other decoy answers will not make any noticeable improvement. In Section 9.5.4, however, we show that once we remove such question/answer pairs, the degree of improvement increases substantially.
Comparison of Visual QA models As presented in Tables 9.3 and 9.4, MLP-IQA is on par with or even outperforms Attention and HieCoAtt on the Orig decoys, showing how the shortcuts make it difficult to compare different models. By eliminating the shortcuts (i.e., on the combined IoU+QoU-decoys), the advantage of using sophisticated models becomes obvious (Attention outperforms MLP-IQA by 3% in Table 9.4), indicating the importance of designing advanced models to achieve human-level performance on Visual QA.
For completeness, we include the results on the Visual Genome dataset in Table 9.5. This dataset has no "Orig" decoys, and we have created a multiple-choice based dataset, qaVG, from it for the task; it has over 1 million triplets, the largest dataset for this task to our knowledge. On the combined IoU+QoU-decoys, we again clearly see that machines need to use all the information to succeed.
With qaVG, we also investigate whether it can help improve the multiple-choice perfor-
mances on the other two datasets. We use the MLP-IQA trained on qaVG with both IoU and
QoU decoys to initialize the models for the Visual7W and VQA datasets. We report the ac-
curacies before and after fine-tuning, together with the best results learned solely on those two
datasets. As shown in Table 9.6, fine-tuning largely improves the performance, corroborating the finding of Fukui et al. [56].
9.5.4 Additional results and analysis
Results on VQA w/o QA pairs that have Yes/No as the targets The validation set of VQA contains 45,478 QA pairs (out of 121,512 pairs in total) that have Yes or No as the correct answer. The only reasonable decoy to Yes is No, and vice versa; any other decoy could easily be recognized in principle. Since both of them are among the top 10 most frequently occurring answers, they are already included in the Orig decoys, so our IoU-decoys and QoU-decoys can hardly make a noticeable improvement. We thus remove all those pairs (denoted as Yes/No QA pairs) to investigate the improvement on the remaining pairs, for which having multiple choices makes sense. We denote this subset of VQA as VQA− (we remove the Yes/No pairs in both the training and validation sets).
Dataset Decoys Best w/o qaVG qaVG model (initial) qaVG model (fine-tuned)
Visual7W Orig 65.7 60.5 69.1
Visual7W IoU+QoU 52.0 58.1 58.7
Visual7W All 45.1 48.9 51.0
VQA Orig 64.6 42.2 65.6
VQA IoU+QoU 63.7 47.9 64.1
VQA All 58.9 37.5 59.4
Table 9.6: Using models trained on qaVG to improve Visual7W and VQA (accuracy in %).
Method Orig IoU QoU IoU+QoU All
MLP-A 28.8 42.9 34.5 23.6 15.8
MLP-IA 43.0 44.8 53.2 35.5 28.5
MLP-QA 45.8 80.7 39.3 38.2 31.9
MLP-IQA 55.6 81.8 56.6 53.7 46.5
HieCoAtt* 54.8 - - 55.6 -
Attention* 58.5 - - 58.6 -
Human-IQA - - - 85.5 -
Random 5.6 25.0 25.0 14.3 4.2
*: based on our implementation or modification
Table 9.7: Accuracy (%) on VQA−-2014val, which contains 76,034 triplets.
We conduct experiments on VQA−, and Table 9.7 summarizes the machines' as well as the humans' results. Compared to Table 9.4, most of the results drop, which is expected as the removed Yes/No pairs are the simpler and easier ones; their effective random chance is 50%. The exception is the MLP-IA models, which perform roughly the same or even better on VQA−, suggesting that Yes/No pairs are somewhat difficult for MLP-IA. This, however, makes sense: without the questions (e.g., those starting with "Is there a ..." or "Does the person ..."), a machine cannot directly tell whether the correct answer falls into Yes, No, or others.
We see that on VQA−, the improvement from our IoU-decoys and QoU-decoys becomes significant. The gain brought by images on QoU (from 39.3% to 56.6%) is much larger than that on Orig (from 45.8% to 55.6%). Similarly, the gain brought by questions on IoU (from 44.8% to 81.8%) is much larger than that on Orig (from 43.0% to 55.6%). After combining the IoU-decoys and QoU-decoys as in IoU+QoU and All, the improvement from either including images in MLP-QA or including questions in MLP-IA is noticeably higher than that on Orig. Moreover, even with only 6 decoys, the performance of MLP-A on IoU+QoU is already lower than that on Orig, which has 17 decoys, demonstrating the effectiveness of our decoys in preventing machines from overfitting to incidental statistics. These observations together demonstrate how our proposed ways of creating decoys improve the quality of multiple-choice Visual QA datasets.
Method IoU QoU IoU +QoU
MLP-A 70.3 31.7 26.6
MLP-IA 73.4 73.3 60.7
MLP-QA 91.5 52.5 51.4
MLP-IQA 93.1 78.3 75.9
Random 25.0 25.0 14.3
Table 9.8: Test accuracy (%) on COCOQA.
Method IoU QoU IoU +QoU
MLP-A 37.7 41.9 27.7
MLP-IA 37.9 54.4 30.5
MLP-QA 84.2 48.3 48.1
MLP-IQA 86.3 63.0 61.1
Random 25.0 25.0 14.3
Table 9.9: Test accuracy (%) on VQA2-2017val.
Results on COCOQA and VQA2 For both datasets, we conduct experiments using the MLP-based models. As shown in Table 9.8, we clearly see that with only answers visible to the model (MLP-A), the performance is close to random (on the column of IoU+QoU-decoys) and far from that of observing all three sources of information (MLP-IQA). Meanwhile, models that observe either images and answers (MLP-IA) or questions and answers (MLP-QA) fail to predict as well as the model that observes all three sources of information. The results in Table 9.9 show a similar trend. These empirical observations meet our expectations and again verify the effectiveness of our proposed methods for creating decoys.
We also perform a more in-depth experiment on VQA2, removing triplets with Yes/No as the target. We name this subset VQA2−. Table 9.10 shows the experimental results on VQA2−. Compared to Table 9.9, we see that the overall performance of each model decreases, as the dataset becomes more challenging on average. Specifically, the model that observes questions and answers performs much worse on VQA2− than on VQA2 (37.2% vs. 48.1%).
Analysis on different question and answer embeddings We consider GLOVE [143] and the embedding learned from translation [125] for both question and answer embeddings. The results on Visual7W (IoU+QoU, compared to Table 9.3 that uses WORD2VEC) are in Table 9.11. We do not observe significant differences among the embeddings, likely because both the questions and answers are short (on average, 7 words for questions and 2 for answers).
Analysis on random decoys We analyze sampling random decoys, instead of our IoU-decoys and QoU-decoys, on Visual7W. We collect 6 additional random decoys for each Orig IQA triplet so that the answer set contains 10 candidates, the same as All in Table 9.3. We consider two strategies: (A) uniformly random decoys drawn from the set of unique correct answers, and (B) random decoys weighted w.r.t. their frequencies. The results are in Table 9.12. We see that the two random strategies lead to drastically different results. Moreover, compared to the All column in Table 9.3, our methods lead to a larger relative gap between MLP-IQA and both MLP-IA and MLP-QA than either random strategy, demonstrating the effectiveness of our methods in creating decoys.
Method IoU QoU IoU+QoU
MLP-A 39.8 33.7 21.3
MLP-IA 40.3 53.0 31.0
MLP-QA 84.8 37.6 37.2
MLP-IQA 85.9 56.1 53.8
Random 25.0 25.0 14.3
Table 9.10: Test accuracy (%) on VQA2−-2017val, which contains 134,813 triplets.
Method GLOVE Translation WORD2VEC
MLP-A 18.0 18.0 17.7
MLP-IA 23.6 23.2 23.6
MLP-QA 38.1 38.3 37.8
MLP-IQA 52.5 51.4 52.0
Random 14.3 14.3 14.3
Table 9.11: Test accuracy (%) on Visual7W, comparing different embeddings for questions and
answers. The results are reported for the IoU +QoU-decoys.
9.5.5 Qualitative results
In Fig. 9.2, we present examples of image-question-target triplets from V7W, VQA, and VG,
together with our IoU-decoys (A, B, C) and QoU-decoys (D, E, F). G is the target. The predictions
by the corresponding MLP-IQA are also included. Ignoring information from images or questions
makes it extremely challenging to answer the triplet correctly, even for humans.
Our automatic procedures do fail on some triplets, resulting in decoys that are ambiguous with the targets. See Fig. 9.3 for examples. We categorize those failure cases into two situations.
• Our filtering steps in Section 9.4 fail, as observed in the top example. The WUP-based similarity relies on the WordNet hierarchy. For some semantically similar words like "lady" and "woman", the similarity is only 0.632, much lower than the 0.857 between "cat" and "dog". This issue can be alleviated by considering alternative semantic measures based on WORD2VEC or those used in [36, 39] for searching similar questions.
• The question is ambiguous to answer. In the bottom example in Fig. 9.3, both candidates D and F seem valid as a target. Another representative case is when asking about the background of an image: in images that contain sky and mountains in the distance, both terms can be valid.
Method (A) (B) All
MLP-A 39.6 11.6 15.6
MLP-IA 53.4 40.3 22.2
MLP-QA 52.3 50.3 31.9
MLP-IQA 61.5 60.2 45.1
Random 10.0 10.0 10.0
Table 9.12: Test accuracy (%) on Visual7W, comparing different random decoy strategies to our
methods: (A) Orig + uniformly random decoys from unique correct answers, (B) Orig + weighted
random decoys w.r.t. their frequencies, and All (Orig+IoU +QoU).
Figure 9.2: Example image-question-target triplets from Visual7W, VQA, and VG, together with our IoU-decoys (A, B, C) and QoU-decoys (D, E, F); G is the target. Machine's selections are denoted by green ticks (correct) or red crosses (wrong). The six examples are:
• "What is the train traveling over?" A. Yes. B. Blue. C. Tracks. D. Train. E. South. F. Forward. G. Bridge.
• "What is the color of his wetsuit?" A. When waves are bigger. B. It is not soft and fine. C. It is a picture of nature. D. Green. E. Blue. F. Red. G. It is black.
• "Where do the stairs lead?" A. A parking lot. B. The building. C. The windows. D. From the canal to the bridge. E. Up. F. To the building. G. To the plane.
• "What is the man on the right holding?" A. Brown. B. The man on the right. C. Four. D. A bottle. E. A surfboard. F. Cellphone. G. A bat.
• "What is the man wearing?" A. Black. B. Mountains. C. The beach. D. Board shorts. E. He wears white shoes. F. A white button down shirt and a black tie. G. Wetsuit.
• "What are these people about to do?" A. Yellow. B. Yes. C. Four. D. Surf. E. Fly kite. F. Play frisbee. G. Ski.
Figure 9.3: Ambiguous examples from our IoU-decoys (A, B, C) and QoU-decoys (D, E, F); G is the target. Ambiguous decoys (F) are marked. The two examples are:
• (top) "Who is wearing glasses?" A. Certificate. B. Garland. C. Three. D. The man. E. Person in chair. F. The lady. G. The woman.
• (bottom) "Where are several trees?" A. Trees. B. Clear and sunny. C. Basement windows. D. On both sides of road. E. To left of truck. F. On edge of the sidewalk. G. In front of the building.
9.6 Summary
We perform a detailed analysis of existing datasets for multiple-choice Visual QA. We find that the design of decoys can inadvertently provide "shortcuts" for machines to exploit to perform well on the task. We describe several principles for constructing good decoys and propose automatic procedures to remedy existing datasets and create new ones. We conduct extensive empirical studies to demonstrate the effectiveness of our methods in creating better Visual QA datasets. The remedied datasets and the newly created ones are released and available at http://www.teds.usc.edu/website_vqa/.
Chapter 10
Cross-dataset Adaptation
10.1 Overview
In this chapter, we study the cross-dataset performance gap. Specifically, can a machine learn knowledge well enough on one dataset so as to adeptly answer questions from another dataset? Such a study highlights the similarities and differences among datasets and guides the development of future ones. It also sheds light on how well learning machines can understand visual and textual information in their generality, instead of learning and reasoning with dataset-specific knowledge.
Studying the performance gap across datasets is reminiscent of the seminal work by Torralba and Efros [181]. There, the authors study the bias in image datasets for object recognition. They showed that the idiosyncrasies in the data collection process cause domain mismatch such that classifiers learned on one dataset degrade significantly on another dataset [64, 62, 63, 95, 126, 180, 74, 179].
The language data in Visual QA datasets introduces an additional layer of difficulty on top of the bias in the visual data (see Fig. 10.1). For instance, [48] analyzes several datasets and illustrates their differences in syntactic complexity as well as within- and cross-dataset perplexity. As such, examples in Visual QA datasets are likely even more telling of the datasets from which they come.
To validate this hypothesis, we designed a Name That Dataset! experiment, similar to the one in [181] for comparing visual object images. We show that the two popular Visual QA datasets VQA [14] and Visual7W [230] are almost completely distinguishable using either the question or the answer data. See Section 10.2 for the details of this experiment.
Thus, Visual QA systems that are optimized on one of those datasets can focus on dataset-specific knowledge, such as the types of questions and how the questions and answers are phrased. This type of bias exploitation hinders cross-dataset generalization and does not result in AI systems that can reason well over vision and text information with different or new characteristics.
In this chapter, we investigate the issue of cross-dataset generalization in Visual QA. We assume that there is a source domain with a sufficiently large amount of annotated data such that a strong Visual QA model can be built, albeit one well adapted to the characteristics of the source domain. However, we are interested in using the learned system to answer questions from another (target) domain. The target domain does not provide enough data to train a Visual QA system from scratch. We show that in this domain-mismatch setting, directly applying the system learned on the source to the target domain results in poor performance.
Figure 10.1: An illustration of the dataset bias in visual question answering. Given the same image, Visual QA datasets like VQA [14] (right) and Visual7W [230] (left) provide different styles of questions, correct answers (marked in red), and candidate answer sets, each of which can contribute to the bias that prevents cross-dataset generalization. The two example questions and candidate sets are: "Who leads the parade?" (The mayor / The governor / The clowns / Motorcycle cop) and "What type of bike is this?" (No / Bike for two / Kingfish / Motorcycle).
Table 10.1: Results of Name That Dataset!
Information I Q T D Q + T Q + D T + D Q + T + D Random
Accuracy 52.3% 76.3% 74.7% 95.8% 79.8% 97.5% 97.4% 97.5% 50.00%
We thus propose a novel adaptation algorithm for Visual QA. Our method has two components. The first reduces the difference in statistical distributions by transforming the feature representation of the data in the target dataset. We use an adversarial type of loss to measure the degree of difference; the transformation is optimized such that it is difficult to detect the origin of the transformed features. The second component maximizes the likelihood of answering questions (in the target dataset) correctly using the Visual QA model trained on the source dataset. This ensures that the transformation learned by optimizing the domain match retains the semantic understanding encoded in the Visual QA model learned on the source domain.
The rest of this chapter is organized as follows. In Section 10.2, we analyze the dataset bias via the game Name That Dataset! In Section 10.3, we define the tasks of domain adaptation for Visual QA and describe the proposed domain adaptation algorithm (Section 10.3.2); we leave further details of our algorithm to Section 10.5. In Section 10.4, we conduct extensive experimental studies and further analysis. Section 10.6 concludes.
10.2 Visual QA and bias in the datasets
In what follows, we describe a simple experiment, Name That Dataset!, to illustrate the biases in Visual QA datasets: questions and answers are constructed so idiosyncratically that a classifier can easily tell one dataset apart from the other by using them as inputs. We then discuss how those biases give rise to poor cross-dataset generalization.
Figure 10.2: An illustration of the MLP-based model for multiple-choice Visual QA. Given an IQA triplet with $A = \{C_1, \dots, C_K\}$, the MLP takes I, Q, and C_k as input and computes the score M(I, Q, C_k) for each candidate answer C_k. The candidate answer with the highest score (i.e., $\arg\max_k M(I, Q, C_k)$) is selected as the model's answer.
10.2.1 Visual QA
In Visual QA datasets, a training or test example is an IQT triplet that consists of an image I, a question Q, and a (ground-truth) correct answer T (see footnote 1). During evaluation or testing, given a pair of I and Q, a machine needs to generate an answer that matches exactly or is semantically similar to T.
In this chapter, we focus on multiple-choice based Visual QA, since the two most widely studied datasets, VQA [14] and Visual7W [230], both consider such a setting. In this setting, the correct answer T is accompanied by a set of K "negative" candidate answers, resulting in a candidate answer set A consisting of a single T and K decoys denoted by D. An IQA triplet is thus $\{I, Q, A = \{T, D_1, \dots, D_K\}\}$. We use C to denote an element of A. During testing, given I, Q, and A, a machine needs to select T from A. Multiple-choice based Visual QA has the benefit of a simplified evaluation procedure and has been popularly studied [80, 211, 56, 168, 94]. Note that in recent datasets like VQA2 [67], the candidate set A is expanded to include the most frequent answers from the whole training set, instead of the smaller subset typically used in earlier datasets. Despite this subtle difference, we do not lose generality by studying cross-dataset generalization with multiple-choice based Visual QA datasets.
We follow Chapter 9 to train one-hidden-layer MLP models for multiple-choice based Visual QA. The MLP M takes the concatenated features of an IQC triplet as input and outputs a compatibility score M(I, Q, C) ∈ [0, 1], measuring how likely C is the correct answer to the IQ pair. During training, M is learned by minimizing the binary cross-entropy loss, where each IQC triplet is labeled with 1 if C is the correct answer and 0 otherwise. During testing, given an IQA triplet, the C ∈ A that leads to the highest score is selected as the model's answer. We use the penultimate layer of ResNet-200 [73] as the visual features to represent I and the average WORD2VEC embeddings [131] as the text features to represent Q and C, as in [80]. See Fig. 10.2 for an illustration.
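For illustration, here is a minimal PyTorch-style sketch of such an MLP scorer over concatenated image/question/answer features. The class name, feature dimensions, and hidden size are assumptions for the sketch, not the exact implementation used in this thesis.

```python
import torch
import torch.nn as nn

class MLPScorer(nn.Module):
    """Scores an (image, question, candidate answer) triplet with a
    one-hidden-layer MLP on the concatenated features."""
    def __init__(self, img_dim=2048, txt_dim=300, hidden=8192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_i, f_q, f_c):
        # f_i: ResNet feature of I; f_q, f_c: average word2vec of Q and C.
        x = torch.cat([f_i, f_q, f_c], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)  # score in [0, 1]

# At test time, pick the candidate with the highest score; at training time,
# the score is compared to the binary label with a cross-entropy loss.
model = MLPScorer()
f_i, f_q = torch.randn(1, 2048), torch.randn(1, 300)
candidates = torch.randn(4, 300)  # e.g., 1 target + 3 decoys
scores = model(f_i.expand(4, -1), f_q.expand(4, -1), candidates)
predicted = scores.argmax().item()
```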
10.2.2 Bias in the datasets
We use the term "bias" to refer to any idiosyncrasies in the datasets that learning algorithms can overfit to, causing poor cross-dataset generalization.
Footnote 1: Some datasets provide multiple correct answers to accommodate ambiguity in the answers.
Name That Dataset! To investigate the degree and the cause of the bias, we construct a game, Name That Dataset!, similar to the one described in [181] for object recognition datasets. In this game, the machine has access to the examples (i.e., either IQT or IQA triplets) and needs to decide which dataset those examples belong to. We experiment on two popular datasets, Visual7W [230] and VQA [14]. We use the same visual and text features described in Section 10.2.1 to represent I, Q, T, and D (see footnote 2). We then concatenate these features to form the joint feature. We examine different combinations of I, Q, T, and D as the input to a one-hidden-layer MLP for predicting the dataset from which the sample comes. We sample 40,000, 5,000, and 20,000 triplets from each dataset and merge them to form the training, validation, and test sets.
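A minimal sketch of this origin classifier is shown below using scikit-learn; the feature-loading helper is a hypothetical placeholder, and in the actual experiment the inputs are the concatenated I/Q/T/D features described above, fed to a one-hidden-layer MLP.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def load_features(dataset_name, split):
    """Hypothetical placeholder: returns an (N, d) array of concatenated
    features (some combination of I, Q, T, D) for sampled triplets."""
    rng = np.random.RandomState(0)
    n = {"train": 40000, "val": 5000, "test": 20000}[split]
    return rng.randn(n, 600)  # e.g., Q + T average word2vec features

X_train = np.vstack([load_features("VQA", "train"),
                     load_features("Visual7W", "train")])
y_train = np.array([0] * 40000 + [1] * 40000)  # 0: VQA, 1: Visual7W

clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=20)
clf.fit(X_train, y_train)

X_test = np.vstack([load_features("VQA", "test"),
                    load_features("Visual7W", "test")])
y_test = np.array([0] * 20000 + [1] * 20000)
print("Name That Dataset! accuracy:", clf.score(X_test, y_test))
```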
As shown in Table 10.1, all components but images lead to strong detection of the data origin, with the decoys contributing the most (i.e., 95.8% alone). Combining multiple components further improves the detection accuracy, suggesting that the datasets contain different correlations or relationships among the components. Concatenating all the components results in nearly 100% classification accuracy. In other words, the images, questions, and answers in each dataset are constructed characteristically: their distributions (in the joint space) are sufficiently distant from each other. Thus, one would not expect a Visual QA system trained on one dataset to work well on the other datasets. See below for results validating this observation.
Question Type is just one biasing factor Question type is an obvious culprit of the bias. In Visual7W, questions mostly fall into the 6W categories (i.e., what, where, how, when, why, who). On the other hand, the VQA dataset contains additional questions whose correct answers are either Yes or No; those questions barely start with the 6W words. We create a new dataset called VQA− by removing the Yes or No questions from the original VQA dataset.
We reran Name That Dataset! (after retraining on the new dataset). The accuracies of using Q or Q+T dropped from 76.3% and 79.8% to 69.7% and 73.8%, respectively, which are still noticeably higher than the 50% chance level. This indicates that the questions or correct answers may be phrased differently between the two datasets (e.g., in length or vocabulary). Combining them with the decoys (i.e., Q+T+D) raises the accuracy to 96.9%, again nearly distinguishing the two datasets completely. This reflects that the incorrect answers must be created very differently across the two datasets (in most cases, decoys are freely selected by the data collectors; being incorrect answers to the questions allows the data collectors to sample from unconstrained spaces of possible words and phrases).
Poor cross-dataset generalization Using the model described in Section 10.2.1, we obtain Visual QA accuracies of 65.7% and 55.6% on Visual7W and VQA−, respectively, when training and testing on the same dataset. However, when the learned models are applied to the other dataset, the performance drops significantly to 53.4% (trained on VQA− but applied to Visual7W) and 28.1% (trained on Visual7W but applied to VQA−). See Table 10.3 for the details.
We further evaluate a variant of the spatial memory network [203], a more sophisticated Visual QA model. A similar performance drop is observed. See Table 10.7 for details.
Footnote 2: Visual7W [230] has 3 decoys per triplet and VQA [14] has 17 decoys. For a fair comparison, we subsample 3 decoys for VQA. We then average the WORD2VEC embeddings of each decoy to form the feature of the decoys.
Table 10.2: Settings for cross-dataset adaptation. The source domain always provides I, Q, and A (T+D), while the target domain provides the same only during testing.
Shorthand Data from the target at training
Setting[Q] Q
Setting[Q+T] (or [Q+T+D]) Q, T (or Q, T+D)
Setting[T] (or [T+D]) T (or T+D)
10.3 Cross-dataset adaptation
We propose to overcome the cross-dataset bias (and the poor cross-dataset generalization) with
the idea of domain adaptation. Similar ideas have been developed in the past to overcome the
dataset bias for object recognition [159, 64].
10.3.1 Main idea
We assume that we have a source domain (or dataset) with plenty of annotated data in the form of Image-Question-Candidate Answers (IQA) triplets, such that we can build a strong Visual QA system. We are then interested in applying this system to the target domain. However, we do not assume there is any annotated data (i.e., IQA/IQT triplets) from the target domain, so neither re-training (either using the target domain alone or jointly with the source domain) nor fine-tuning [137, 194] the system is feasible (see footnote 3).
Instead, the target domain provides unsupervised data. The target domain could provide images; images and questions (without either correct or incorrect answers); questions; questions with correct or incorrect answers or both; or simply a set of candidate answers (correct, incorrect, or both). The last two scenarios are particularly interesting (see footnote 4): from the results in Table 10.1, the discrepancy in textual information is a major contributor to domain mismatch, cf. the columns involving Q.
Given the target domain data, it is not feasible to train an "in-domain" model on that data (as it is incomplete and unsupervised). We thus need to jointly model the source domain supervised data and the target domain data so as to account for the distribution mismatch. Table 10.2 lists the settings we work on.
10.3.2 Approach
Our approach has two components. In the first part, we match features encoding questions and/or
answers across two domains. In the second part, we ensure the correct answers from the target
domain have higher likelihood in the Visual QA model trained on the source domain. Note that
we do not re-train the Visual QA model as we do not have access to complete data on the target
domain.
Footnote 3: Annotated data from the target domain, if any, can be easily incorporated into our method as a supervised discriminative loss.
Footnote 4: Most existing datasets are derived from MSCOCO; thus there are limited discrepancies between images, as shown in column I of Table 10.1. Our method can also be extended to handle large discrepancies in images. Alternatively, existing methods of domain adaptation for visual recognition could be applied to the images first to reduce the discrepancy.
Matching domain The main idea is to transform the features computed on the target domain (TD) to match the features computed on the source domain (SD). To this end, let g_q(·) and g_a(·) denote the transformations for the features of the questions and of the answers, respectively. We also use f_q, f_t, f_d, and f_c to denote the feature representations of a question, a correct answer, an incorrect decoy, and a candidate answer, respectively. In the Visual QA model, all these features are computed as the average WORD2VEC embeddings of the words.
The matching is computed as the Jensen-Shannon Divergence (JSD) between the two empirical distributions across the datasets. For the Setting[Q], the matching is
$m(\text{TD} \rightarrow \text{SD}) = \text{JSD}\big(\hat{p}_{\text{SD}}(f_q),\ \hat{p}_{\text{TD}}(g_q(f_q))\big),$   (10.1)
where the first argument is the empirical distribution of the questions in the source domain and the second is the empirical distribution of the questions in the target domain after being transformed with g_q(·).
The JSD between two distributions P and P' is computed as
$\text{JSD}(P, P') = \frac{1}{2}\,\text{KL}\Big(P \,\Big\|\, \frac{P+P'}{2}\Big) + \frac{1}{2}\,\text{KL}\Big(P' \,\Big\|\, \frac{P+P'}{2}\Big),$   (10.2)
where KL is the KL divergence between two distributions. The JSD is closely related to discriminating two distributions with a binary classifier [66], but it is difficult to compute directly. We thus use an adversarial loss to approximate it. See Section 10.5 for details.
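Concretely, the approximation can be sketched as follows: a binary domain classifier is trained to separate source features from transformed target features, and its (negated) binary cross-entropy serves as a surrogate for the JSD up to additive and multiplicative constants, following the GAN argument of [66]. The code below is an illustrative sketch with assumed shapes, not the thesis implementation.

```python
import torch
import torch.nn as nn

def jsd_surrogate(discriminator, f_src, f_tgt_transformed):
    """Adversarial surrogate for JSD(p_SD, p_TD o g): at the optimal
    discriminator, this value equals 2*JSD - 2*log(2)."""
    p_src = discriminator(f_src)              # predicted prob. of "source"
    p_tgt = discriminator(f_tgt_transformed)  # predicted prob. of "source"
    value = (torch.log(p_src + 1e-8).mean()
             + torch.log(1 - p_tgt + 1e-8).mean())
    return value  # larger when the two distributions are easier to tell apart

# Tiny usage example: a sigmoid-output MLP discriminator on 300-d features.
disc = nn.Sequential(nn.Linear(300, 128), nn.ReLU(),
                     nn.Linear(128, 1), nn.Sigmoid())
f_src = torch.rand(64, 300)
f_tgt = torch.rand(64, 300) + 0.5
print(jsd_surrogate(disc, f_src, f_tgt).item())

# The discriminator is trained to *increase* this value, while the
# transformations g_q, g_a are trained to *decrease* it (fool the classifier).
```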
For both the Setting[Q+T] and the Setting[Q+T+D], the matching is
$m(\text{TD} \rightarrow \text{SD}) = \text{JSD}\big(\hat{p}_{\text{SD}}(f_q, f_t),\ \hat{p}_{\text{TD}}(g_q(f_q), g_a(f_t))\big),$   (10.3)
with the empirical distributions computed over both the questions and the correct answers. Note that even when the decoy information is available, we deliberately ignore it in computing the domain mismatch. This is because the decoys can be designed very differently even for the same IQT triplet. Matching the distributions of D can thus cause an undesired mismatch of T, since they share the same transformation during testing (see footnote 5).
Footnote 5: Consider the following highly contrived example. To answer the question "what is in the cup?", the annotators in the source domain could answer with "water" as the correct answer and "coffee" and "juice" as decoys, while the annotators in the target domain could answer with "sparkling water" (as that is the correct answer), then "cat" (as in cupcats) and "cake" (as in cupcakes) as decoys. While it is intuitive to match the distribution of correct answers, it makes less sense to match the distributions of the decoys, as they are much more dispersed.
For the Setting[T] and Setting[T+D], the matching is
$m(\text{TD} \rightarrow \text{SD}) = \text{JSD}\big(\hat{p}_{\text{SD}}(f_t),\ \hat{p}_{\text{TD}}(g_a(f_t))\big),$   (10.4)
where the empirical distributions are computed over the correct answers only.
Leverage Source Domain for Discriminative Learning In the Setting[Q+T], Setting[Q+T+D], Setting[T], and Setting[T+D], the learner has access to the correct answers T (and the incorrect answers D) from the target domain. As we intend to use the transformed features g_q(f_q) and g_a(f_c) with the Visual QA model trained on the source domain, we would like those transformed features to have a high likelihood of being correct (or incorrect).
To this end, we can leverage the source domain's data, which always contain both T and D. The main idea is to construct a Visual QA model on the source domain using the same partial information as in the target domain, and then to assess how likely the transformed features remain correct (or incorrect).
In the following, we use the Setting[Q+T+D] as an example (the other settings can be formulated similarly). Let h_SD(q, c) be a model trained on the source domain that tells us the likelihood that an answer c is correct with respect to a question q. Without loss of generality, we assume h_SD(q, c) is the output of a binary logistic regression.
To use this model on the target data, we compute the following loss for every pair of question and candidate answer:
$\ell(q, c) = \begin{cases} -\log h_{\text{SD}}\big(g_q(f_q),\, g_a(f_c)\big) & \text{if } c \text{ is correct,} \\ -\log\big(1 - h_{\text{SD}}(g_q(f_q),\, g_a(f_c))\big) & \text{otherwise.} \end{cases}$
The intuition is to raise the likelihood of the correct answers and lower the likelihood of the incorrect ones. Thus, even though we do not have complete data for training models on the target domain discriminatively, we have found a surrogate to minimize,
$\hat{\ell}_{\text{TD}} = \sum_{(q,c) \in \text{TD}} \ell(q, c),$   (10.5)
which measures, over all the data provided in the target domain, how likely they are to be correct or incorrect.
10.3.3 Joint optimization
We learn the feature transformations by jointly balancing the domain matching and the discriminative loss surrogate:
$\arg\min_{g_q, g_a}\ m(\text{TD} \rightarrow \text{SD}) + \lambda\, \hat{\ell}_{\text{TD}}.$   (10.6)
We select the trade-off coefficient λ to be large while still allowing m(TD → SD) to decrease during optimization: λ is 0.5 for Setting[Q+T+D] and Setting[T+D], and 0.1 for the other experiments. The learning objective can be constructed similarly when the target domain provides Q and T, T, or T+D, as explained above. If the target domain only provides Q, we omit the term $\hat{\ell}_{\text{TD}}$.
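A hedged sketch of this joint objective for Setting[Q+T+D] is given below; which_domain is the (already-trained) domain discriminator approximating the matching term, h_sd is the frozen source-domain model, and all names and shapes are illustrative assumptions rather than the thesis code.

```python
import torch

def adaptation_loss(g_q, g_a, which_domain, h_sd, f_q, f_t, f_d, lam=0.5):
    """Loss for learning g_q, g_a on target-domain (Q, T, D) features,
    mirroring Eq. (10.6): fool the domain classifier + discriminative surrogate."""
    tq, tt, td = g_q(f_q), g_a(f_t), g_a(f_d)

    # Domain-matching term: make transformed target features look like "source".
    match = torch.log(1 - which_domain(torch.cat([tq, tt], dim=-1)) + 1e-8).mean()

    # Discriminative surrogate: targets should score high and decoys low
    # under the frozen source-domain model h_SD.
    ell = -(torch.log(h_sd(tq, tt) + 1e-8).mean()
            + torch.log(1 - h_sd(tq, td) + 1e-8).mean())

    return match + lam * ell

# One SGD step would be: loss = adaptation_loss(...); loss.backward(); opt.step()
```

In practice the domain classifier and the transformations are updated alternately, as detailed in Algorithm 1 (Section 10.5).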
Once the feature transformations are learned, we use the Visual QA model trained on the source domain, M_SD (trained using images, questions, and answers all together), to make an inference on an IQA triplet (i, q, A) from the target domain:
$\hat{t} = \arg\max_{c \in A} M_{\text{SD}}\big(f_i,\ g_q(f_q),\ g_a(f_c)\big),$
where we identify the best candidate answer from the pool A of the correct answer and its decoys using the source domain's model. See Sections 10.4.2 and 10.5 for the parameterization of g_q(·) and g_a(·) and for the details of the algorithm.
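The inference rule above can be sketched in a few lines; m_sd stands for the frozen source-domain scorer of Fig. 10.2, and the function is an illustrative sketch rather than the thesis code.

```python
import torch

def answer_with_adapted_features(m_sd, g_q, g_a, f_i, f_q, candidate_feats):
    """Pick the candidate whose transformed features score highest under the
    source-domain model M_SD (M_SD itself is never re-trained)."""
    tq = g_q(f_q)
    scores = torch.stack([m_sd(f_i, tq, g_a(f_c)) for f_c in candidate_feats])
    return int(torch.argmax(scores))
```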
10.3.4 Related work on domain adaptation
Extensive prior work has addressed the domain mismatch between datasets [182, 58, 183, 34, 65, 63], mostly for visual recognition, whereas we study the new task of Visual QA. One popular approach is to learn a transformation that aligns the source and target domains according to a certain criterion. Inspired by the recent flourishing of Generative Adversarial Networks [66], many algorithms [58, 183, 34, 208] train a domain discriminator as such a criterion for learning the transformation. Our method applies a similar approach, but aims to perform adaptation simultaneously on data with multiple modalities (i.e., images, questions, and answers). To this end, we leverage the Visual QA knowledge learned from the source domain to ensure that the transformed features are semantically aligned. Moreover, in contrast to most existing methods, we learn the transformation from the target domain to the source one, similar to [174, 183] (see footnote 6), which enables applying the Visual QA model learned on the source domain without re-training.
Footnote 6: Most DA algorithms, when given a target domain, adjust the features of both domains and retrain the source model on the adjusted features; they thus need to retrain the model when facing a new target domain. Note that [174, 183] do not incorporate the learned source-domain knowledge as we do.
10.4 Empirical studies
10.4.1 Dataset
We first evaluate our algorithm on the domain adaptation settings defined in Section 10.3 between Visual7W [230] and VQA [14]. Experiments are conducted on both the original datasets and the revised versions presented in Chapter 9. We then include Visual Genome [100], COCOQA [150], and VQA2 [67] with the decoys created in Chapter 9, leading to a comprehensive study of cross-dataset generalization. See Chapter 9 for more details.
Evaluation metric For Visual7W, VG, and COCOQA, we compute the accuracy of picking the correct answer from the multiple choices. For VQA and VQA2, we follow their protocol to compute accuracy, comparing the picked answer to the 10 human-annotated correct answers. The accuracy is computed based on the number of exact matches among the 10 answers (divided by 3 and clipped at 1).
10.4.2 Experimental setup
Visual QA model In all our experiments, we use a one-hidden-layer MLP model (with 8,192 hidden nodes and ReLU) to perform binary classification on each input IQC (image, question, candidate answer) triplet, following the setup in [80] and Chapter 9. Please see Fig. 10.2 and Section 10.2.1 for an explanation. The candidate C ∈ A that has the largest score is then selected as the answer of the model. Such a simple model has achieved state-of-the-art results on Visual7W and comparable results on VQA.
For images, we extract the convolutional activations from the last layer of a 200-layer Residual Network [73]; for questions and answers, we extract the 300-dimensional WORD2VEC [131] embedding of each word in a question/answer and compute their average as the feature. We then concatenate these features to form the input to the MLP model. Besides the Visual QA model that takes I, Q, and C as input, we also train two models that use only Q + C and C alone as the input. These two models can serve as h_SD, described in Section 10.3.2.
Using simple models like MLPs and average WORD2VEC embeddings adds credibility to our studies: if models with limited capacity can latch onto the bias, models with higher capacity can only do better at memorizing the bias.
Domain adaptation model We parameterize the transformations g_q(·), g_a(·) as one-hidden-layer MLP models (with 128 hidden nodes and ReLU) with residual connections directly from input to output. Such a design choice is due to the fact that the target embedding can already serve as a good starting point for the transformation. We approximate the m(TD → SD) measure by adversarially learning a one-hidden-layer MLP model (with 8,192 hidden nodes and ReLU) for binary classification between the source and the transformed target domain data, following the same architecture as the classifier in the Name That Dataset! game.
For all our experiments on training g_q(·), g_a(·) and approximating m(TD → SD), we use Adam [96] for stochastic gradient-based optimization.
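A minimal sketch of such a residual transformation is given below; the 300-dimensional input (average WORD2VEC features), the class name, and the learning rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualTransform(nn.Module):
    """One-hidden-layer MLP with a residual (identity) connection, so the
    untransformed target embedding is the starting point of the transform."""
    def __init__(self, dim=300, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.mlp(x)

g_q, g_a = ResidualTransform(), ResidualTransform()
optimizer = torch.optim.Adam(list(g_q.parameters()) + list(g_a.parameters()),
                             lr=1e-4)
```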
Domain adaptation settings As mentioned in Section 10.2, VQA (as well as VQA2) has around 30% of its IQA triplets with "Yes" or "No" as the correct answer. On the other hand, Visual7W, COCOQA, and VG barely have triplets with such correct answers. Therefore, we remove those triplets from VQA and VQA2, leading to the reduced datasets VQA− and VQA2−, which have 153,047/76,034 and 276,875/133,813 training/validation triplets, respectively.
We learn the Visual QA model using the training split of the source dataset and learn the domain adaptation transform using the training splits of both datasets.
Other implementation details Questions in Visual7W, COCOQA, VG, VQA−, and VQA2− mostly start with the 6W words. Their frequencies, however, vary among datasets. To encourage g_q to focus on matching the phrasing style rather than transforming one question type into another, when training the binary classifier for m(TD → SD) with Adam, we perform weighted sampling instead of uniform sampling from the source domain; the weights are determined by the ratio of the frequency of each of the 6W question types between the target and source domains. This trick makes our algorithm more stable.
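The weighted-sampling trick can be sketched as follows; the question-type heuristic and helper names are illustrative assumptions, not the exact implementation.

```python
import random
from collections import Counter

SIX_W = ("what", "where", "how", "when", "why", "who")

def question_type(question):
    # Illustrative heuristic: use the first word if it is one of the 6W words.
    first = question.lower().split()[0]
    return first if first in SIX_W else "other"

def sampling_weights(source_questions, target_questions):
    src = Counter(question_type(q) for q in source_questions)
    tgt = Counter(question_type(q) for q in target_questions)
    # Weight each source question by the target/source frequency ratio
    # of its question type.
    return [(tgt[question_type(q)] / len(target_questions)) /
            max(src[question_type(q)] / len(source_questions), 1e-8)
            for q in source_questions]

src_qs = ["What is the man wearing?", "When was this taken?", "Who leads the parade?"]
tgt_qs = ["When do the stairs open?", "When is it?", "What is this?"]
weights = sampling_weights(src_qs, tgt_qs)
batch = random.choices(src_qs, weights=weights, k=2)  # weighted mini-batch sampling
```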
10.4.3 Experimental results on Visual7W and VQA
We experiment on the five domain adaptation (DA) settings introduced in Section 10.3 using the proposed algorithm. We also compare with ADDA [183] and CORAL [174], two DA algorithms that can also learn transformations from the target to the source domain and achieve comparable results on many benchmark datasets. Specifically, we learn two transformations to match the (joint) distribution of the questions and target answers. We report only the best performance among the five settings for ADDA and CORAL. Tables 10.3 and 10.4 summarize the results on the original and revised datasets, together with Direct transfer without any domain adaptation and the Within-domain performance, where the Visual QA model is learned using the supervised data (i.e., IQA triplets) of the target domain. Such supervised data is inaccessible in the adaptation settings we consider.
Table 10.3: Domain adaptation (DA) results (in %) on the original VQA [3] and Visual7W [230]. Direct: direct transfer without DA. [174]: CORAL. [183]: ADDA. Within: apply models trained on the target domain if supervised data is provided. (best DA result in bold)
VQA− → Visual7W
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within
53.4 53.4 54.1 53.6 54.5 55.7 55.2 58.5 65.7
Visual7W → VQA−
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within
28.1 26.9 29.2 28.1 29.7 33.6 29.4 35.2 55.6
Table 10.4: Domain adaptation (DA) results (in %) on the revised VQA and Visual7W from Chapter 9. (best DA result in bold)
VQA− → Visual7W
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within
46.1 47.2 47.8 46.2 47.6 47.6 48.4 49.3 52.0
Visual7W → VQA−
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within
45.6 45.3 45.9 45.9 45.9 47.8 45.8 48.1 53.7
Domain mismatch hurts cross-dataset generalization The significant performance drop when comparing the Within-domain and Direct transfer performance suggests that the learned Visual QA models indeed exploit certain domain-specific biases that may not exist in the other datasets. The drop is much more severe between the original datasets than between the revised datasets. Note that the two versions of the datasets differ only in the decoys, and the revised datasets create the decoys for both datasets with the same automatic procedure. This observation, together with the finding from the Name That Dataset! game, indicates that decoys contribute the most to the domain mismatch in Visual QA.
Comparison of domain adaptation algorithms Our domain adaptation algorithm outperforms Direct transfer in all cases. On the contrary, CORAL [174], which aims to match the first- and second-order statistics between domains, fails in several cases, indicating that for domain adaptation in Visual QA it is crucial to consider higher-order statistics.
We also examine setting λ in Eq. (10.6) to 0 for the [T] and [Q+T] settings (see footnote 7), which is essentially ADDA [183] extended to multiple modalities; this leads to a drop of 1%, demonstrating the effectiveness of leveraging the source domain for discriminative learning. See Section 10.4.5 for more details.
Different domain adaptation settings Among the five settings, we see that [T] generally gives a larger improvement over Direct than [Q], suggesting that the domain mismatch in answers hinders cross-dataset generalization more.
Footnote 7: When λ = 0, D has no effect (i.e., [Q+T+D] is equivalent to [Q+T]).
Table 10.5: DA results (in %) on the original datasets, with the target data sub-sampled by 1/16. FT: fine-tuning. (best DA result in bold)
VQA− → Visual7W
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within FT
53.4 52.6 54.0 53.6 54.4 56.3 55.1 58.2 53.9 60.1
Visual7W → VQA−
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within FT
28.1 26.5 28.8 28.1 29.3 33.4 29.2 35.2 44.1 47.9
Table 10.6: DA results (in %) on the revised datasets, with the target data sub-sampled by 1/16. FT: fine-tuning. (best DA result in bold)
VQA− → Visual7W
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within FT
46.1 45.6 47.8 46.1 47.5 47.6 48.3 49.1 39.7 48.3
Visual7W → VQA−
Direct [174] [183] [Q] [T] [T+D] [Q+T] [Q+T+D] Within FT
45.6 44.8 45.6 46.0 45.9 47.8 45.8 48.0 43.1 48.2
Extra information on top of [T] or [Q] generally benefits the domain adaptation performance, with [Q+T+D] giving the best performance. Note that different settings correspond to different objectives in Eq. (10.6) for learning the transformations g_q and g_a. Comparing [T] to [T+D], we see that adding D helps take more advantage of the source domain's Visual QA knowledge, leading to a g_a that better differentiates the correct answers from the decoys. On the other hand, adding T to [Q], or vice versa, helps construct a better measure for matching the feature distributions between the domains.
Domain adaptation using a subset of data The domain adaptation results presented in Tables 10.3 and 10.4 are based on learning the transformations using all the training examples of the source and target domains. We further investigate the robustness of the proposed algorithm under a limited number of target examples. We present the results using only 1/16 of them in Tables 10.5 and 10.6. The proposed algorithm can still learn the transformations well under such a scenario, with only a slight drop in performance (i.e., < 0.5%). In contrast, learning Visual QA models from scratch with the same limited amount of target data (assuming the IQA triplets are accessible) leads to a significant performance drop. We also include the results of fine-tuning, which is infeasible in any setting of Table 10.2 but can serve as an upper bound.
We further consider domain adaptation (under Setting[Q+T+D] with λ = 0.1) between Visual7W [230] and VQA− [14], for both the original and revised decoys, using 1/2^a of the training data of the target domain, where a ∈ {0, 1, ..., 6}. The results are shown in Fig. 10.3. Note that the Within results are from models trained on the same sub-sampled size using the supervised IQA triplets from the target domain.
Figure 10.3: Domain adaptation (DA) results (in %) with limited target data, under Setting[Q+T+D] with λ = 0.1. A sub-sampling rate a means using 1/2^a of the target data. The four panels plot Visual QA accuracy (in %) against the sub-sampling rate (in log2 scale) for Within, Direct, and DA on: VQA− to Visual7W (original), Visual7W to VQA− (original), VQA− to Visual7W (revised), and Visual7W to VQA− (revised).
As shown, our domain adaptation (DA) algorithm is highly robust to the accessible data size from the target domain. On the other hand, the Within results, from models trained from scratch, degrade significantly when the data size decreases. Except for the case Visual7W → VQA− (original), domain adaptation using our algorithm outperforms the Within results beyond a certain sub-sampling rate. For example, in the case VQA− → Visual7W (revised), DA already outperforms Within with 1/4 of the target data.
Results on sophisticated Visual QA models We further investigate a variant of the spatial memory network (SMem) [203] and the HieCoAtt model [122] for Visual QA, which utilize the question to guide the visual attention on certain parts of the image to extract better visual features. The results are shown in Tables 10.7 and 10.8, where a similar trend of improvement is observed.
Qualitative analysis We show in Fig. 10.4 the results on each question type (out of the 6W words) when transferring from VQA to Visual7W in Table 10.3 (on the original datasets). DA ([Q+T+D]) outperforms Direct on all question types. The question type that improves the most from Direct to DA is "When" (from 41.8 to 63.4, while Within is 80.3); the other types improve by 1.0 to 5.0. This is because the "When"-type question is scarcely seen in VQA, and our DA algorithm, together with the weighted sampling trick, significantly reduces the mismatch in question/answer phrasing for this type.
Table 10.7: DA results (in %) on VQA and Visual7W (both original and revised) using a variant of the SMem model [203].
original VQA− → Visual7W: Direct 56.3, [Q+T+D] 61.0, Within 65.9; Visual7W → VQA−: Direct 27.5, [Q+T+D] 34.1, Within 58.5
revised VQA− → Visual7W: Direct 48.6, [Q+T+D] 51.2, Within 52.8; Visual7W → VQA−: Direct 46.6, [Q+T+D] 48.4, Within 58.6
Table 10.8: DA results (in %) on VQA and Visual7W (both original and revised) using a variant of the HieCoAtt model [122].
original VQA− → Visual7W: Direct 51.5, [Q+T+D] 56.2, Within 63.9; Visual7W → VQA−: Direct 27.2, [Q+T+D] 33.1, Within 54.8
revised VQA− → Visual7W: Direct 46.4, [Q+T+D] 48.2, Within 51.5; Visual7W → VQA−: Direct 44.5, [Q+T+D] 46.3, Within 55.6
10.4.4 Experimental results across five datasets
We perform a more comprehensive study on transferring the learned Visual QA models across five different datasets. We use the revised candidate answers for all of them to reduce the mismatch in how the decoys are constructed. We consider the [Q+T+D] setting and limit the disclosed target data to 1/16 of its training split size. The models for Within are also trained on such a size, using the supervised IQA triplets. Table 10.9 summarizes the results, where rows/columns correspond to the source/target domains.
On almost all (source, target) pairs, domain adaptation (DA) outperforms Direct, demonstrating the wide applicability and robustness of our algorithm. The exception is (VQA−, VQA2−), where DA degrades by 0.1%. This is likely due to the fact that these two datasets are constructed similarly and thus no performance gain can be achieved. A similar case can be seen between Visual7W and VG. Specifically, domain adaptation is only capable of transferring the knowledge learned in the source domain; it cannot acquire new knowledge from the target domain.
The reduced training size significantly limits the performance of training from scratch (Within). In many cases Within is outperformed by DA, or even by Direct, showing the essential need to leverage source domain knowledge. Among the five datasets, Visual QA models trained on VG seem to generalize the best: the DA results on any target domain outperform the corresponding Within, indicating the good quality of VG.
Figure 10.4: Qualitative comparison on different types of questions (how, what, why, who, where, when) when transferring from VQA to Visual7W (on the original datasets). The bars show the accuracy (from 0 to 1) of Direct, [Q+T+D], and Within for each question type.
Table 10.9: Transfer results (in %) across different datasets (the decoys are generated according to Chapter 9). The setting for domain adaptation (DA) is [Q+T+D], using 1/16 of the training examples of the target domain. Rows are the source (training) domains and columns the target (testing) domains; each cell lists Direct / DA / Within.
Training\Testing: Visual7W | VQA− | VG | COCOQA | VQA2−
Visual7W: 52.0 / - / - | 45.6 / 48.0 / 43.1 | 49.1 / 49.4 / 48.0 | 58.0 / 63.1 / 65.2 | 43.9 / 45.5 / 43.6
VQA−: 46.1 / 49.1 / 39.7 | 53.7 / - / - | 44.8 / 47.4 / 48.0 | 59.0 / 63.4 / 65.2 | 50.7 / 50.6 / 43.6
VG: 58.1 / 58.3 / 39.7 | 52.6 / 54.6 / 43.1 | 58.5 / - / - | 65.5 / 68.8 / 65.2 | 50.1 / 51.3 / 43.6
COCOQA: 30.1 / 35.5 / 39.7 | 35.1 / 40.4 / 43.1 | 29.1 / 33.1 / 48.0 | 75.8 / - / - | 33.3 / 37.5 / 43.6
VQA2−: 48.8 / 50.8 / 39.7 | 55.2 / 55.3 / 43.1 | 47.3 / 49.1 / 48.0 | 60.3 / 64.9 / 65.2 | 53.8 / - / -
In contrast, Visual QA models trained on COCOQA can hardly transfer to other datasets: none of its DA results on other datasets is higher than the corresponding Within. It is also interesting to see that none of the DA results from other source domains (except VG) to COCOQA outperforms COCOQA's Within. This is, however, not surprising given how differently COCOQA is constructed; i.e., its questions and answers are automatically generated from the captions in MSCOCO. Such a significant domain mismatch can also be witnessed in the gap between Direct and DA on any pair that involves COCOQA. The performance gain of DA over Direct is on average over 4.5%, larger than the gain on any other pair, further demonstrating the effectiveness of our algorithm in reducing the mismatch between domains.
10.4.5 Additional experimental results
The effect of the discriminative loss surrogate We provide in Table 10.10 the domain adaptation results for the [T] and [Q+T] settings when λ is set to 0 (cf. Eq. (10.6)), which corresponds to omitting the discriminative loss surrogate $\hat{\ell}_{\text{TD}}$. In most of the cases, the results with λ = 0.1 outperform those with λ = 0, showing the effectiveness of leveraging the source domain for discriminative learning. Also note that when D is provided for the target domain (i.e., [T+D] or [Q+T+D]), it is the $\hat{\ell}_{\text{TD}}$ term that utilizes the information of D, leading to better results than [T] or [Q+T], respectively.
Table 10.10: Domain adaptation (DA) results (in %) with or without the discriminative loss surrogate term.
original: VQA− → Visual7W / Visual7W → VQA−
Setting [T] [Q+T] [T] [Q+T]
λ = 0 54.1 54.1 29.2 28.8
λ = 0.1 54.5 55.2 29.7 29.4
revised: VQA− → Visual7W / Visual7W → VQA−
Setting [T] [Q+T] [T] [Q+T]
λ = 0 47.8 47.8 45.9 45.7
λ = 0.1 47.6 48.4 45.9 45.8
Figure 10.5: Results of varying λ on the original VQA− and Visual7W datasets, for both the [Q+T] and [Q+T+D] settings. Each panel plots Visual QA accuracy against λ (from 0 to 0.4): one for Visual7W to VQA− and one for VQA− to Visual7W.
We further experiment with different values of λ, as shown in Fig. 10.5. For [Q+T], we achieve consistent improvement for λ up to 0.1. For [Q+T+D], we can obtain even better results by choosing a larger λ (e.g., λ = 0.5).
Open-ended (OE) results We apply the Visual QA models learned in the multiple-choice setting to the open-ended one (i.e., selecting an answer from the top frequent answers, or from the set of all possible answers in the training data). The results for transferring from VQA− to COCOQA are in Table 10.11. Our adaptation algorithm still helps the transfer.
Experimental results across five datasets: using the whole target domain data Table 10.12 summarizes the results of the same study, except that now all the training examples of the target domain are used. The models for Within are also trained on this size, using the supervised IQA triplets. Compared to Table 10.9, we see that the performance drop of DA from using all the training examples of the target domain to using 1/16 of them is very small (mostly smaller than 0.3%), demonstrating the robustness of our algorithm under limited training data. On the other hand, the drop of Within is much more significant: for most of the (source, target) pairs, the drop is at least 10%. For most of the (source, target) pairs shown in Table 10.12, Within outperforms Direct and DA. The notable exceptions are (VG, Visual7W) and (VQA2−, VQA−). This is likely due to the fact that VG and Visual7W are constructed similarly while VG has more training examples than Visual7W; the same applies to VQA2− and VQA−. Therefore, the Visual QA model learned on the source domain can be directly applied to the target domain and leads to better results than Within.
Table 10.11: OE results (VQA− → COCOQA, sub-sampled by 1/16).
Direct [Q+T+D] Within
16.7 24.0 26.9
Table 10.12: Transfer results (in %) across datasets (the decoys are generated according to Chapter 9). The setting for domain adaptation (DA) is [Q+T+D], using all the training examples of the target domain. Rows are the source (training) domains and columns the target (testing) domains; each cell lists Direct / DA / Within.
Training\Testing: Visual7W [230] | VQA− [14] | VG [100] | COCOQA [150] | VQA2− [67]
Visual7W [230]: 52.0 / - / - | 45.6 / 48.1 / 53.7 | 49.1 / 49.6 / 58.5 | 58.0 / 63.0 / 75.8 | 43.9 / 45.6 / 53.8
VQA− [14]: 46.1 / 49.3 / 52.0 | 53.7 / - / - | 44.8 / 47.9 / 58.5 | 59.0 / 64.7 / 75.8 | 50.7 / 50.6 / 53.8
VG [100]: 58.1 / 58.4 / 52.0 | 52.6 / 54.4 / 53.7 | 58.5 / - / - | 65.5 / 68.8 / 75.8 | 50.1 / 51.5 / 53.8
COCOQA [150]: 30.1 / 34.4 / 52.0 | 35.1 / 40.2 / 53.7 | 29.1 / 33.4 / 58.5 | 75.8 / - / - | 33.3 / 37.9 / 53.8
VQA2− [67]: 48.8 / 51.0 / 52.0 | 55.2 / 55.3 / 53.7 | 47.3 / 49.6 / 58.5 | 60.3 / 65.2 / 75.8 | 53.8 / - / -
10.5 Details on the proposed domain adaptation algorithm
10.5.1 Approximating the JSD divergence
As mentioned in Section 10.3.2, we use the Jensen-Shannon Divergence (JSD) to measure the domain mismatch between the two domains according to their empirical distributions. Depending on the domain adaptation (DA) setting, the empirical distribution is computed on the (transformed) questions, the (transformed) correct answers, or both.
Since the JSD is hard to compute, we approximate it by training a binary classifier WhichDomain(·) to detect the domain of a question Q, a correct answer T, or a QT pair, following the idea of the Generative Adversarial Network [66]. The architecture of WhichDomain(·) is exactly the same as that used for Name That Dataset!, except that the input features of examples from the target domain are first passed through the transformations g_q(·) and g_a(·).
10.5.2 Details on the proposed algorithm
We summarize the proposed domain adaptation algorithm for Visual QA under Setting[Q+T+D]
in Algorithm 1. Algorithms of the other settings can be derived by removing the parts corre-
sponding to the missing information.
10.6 Summary
We study cross-dataset adaptation for visual question answering. We first analyze the causes of bias in existing datasets. We then propose to reduce the bias via domain adaptation so as to improve cross-dataset knowledge transfer. To this end, we propose a novel domain adaptation algorithm that minimizes the domain mismatch while leveraging the source domain's Visual QA knowledge. Through experiments on knowledge transfer among five popular datasets, we demonstrate the effectiveness of our algorithm, even under limited and fragmentary target domain information.
Notations: Denote the features of Q, T, and D by f_q, f_t, and f_d, respectively. The D here stands for one decoy.
Goal: Learn transformations g_q(·), g_a(·) and a binary domain classifier WhichDomain(·), where θ_q, θ_a, and φ are their parameters, respectively. WhichDomain(·) gives the conditional probability of being from the source domain.

for number of training iterations do
    Initialize the parameters of WhichDomain(·);
    for k steps do
        Sample a mini-batch of m pairs {Q_SD^(j), T_SD^(j)}_{j=1}^m from SD;
        Sample a mini-batch of m pairs {Q_TD^(j), T_TD^(j)}_{j=1}^m from TD;
        Update WhichDomain(·) by ascending its stochastic gradient
            ∇_φ { (1/m) Σ_{j=1}^m [ log WhichDomain({f_{q,SD}^(j), f_{t,SD}^(j)})
                                     + log(1 − WhichDomain({g_q(f_{q,TD}^(j)), g_a(f_{t,TD}^(j))})) ] };
    end
    for l steps do
        Sample a mini-batch of m triplets {Q_TD^(j), T_TD^(j), D_TD^(j)}_{j=1}^m from TD;
        Update the transformations by descending their stochastic gradients
            ∇_{θ_q, θ_a} { (1/m) Σ_{j=1}^m [ log(1 − WhichDomain({g_q(f_{q,TD}^(j)), g_a(f_{t,TD}^(j))}))
                                             + ℓ({g_q(f_{q,TD}^(j)), g_a(f_{t,TD}^(j))})
                                             + ℓ({g_q(f_{q,TD}^(j)), g_a(f_{d,TD}^(j))}) ] };
    end
end
Algorithm 1: The proposed domain adaptation algorithm for Setting [Q+T+D]. D_TD^(j) denotes a single decoy. When the decoys of the target domain are not provided (i.e., Setting [Q+T]), the ℓ term related to D_TD^(j) is ignored.
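For concreteness, a minimal PyTorch-style sketch of the training loop in Algorithm 1 follows. All names (which_domain, g_q, g_a, vqa_loss, src_loader, tgt_loader), the optimizer choice, and the learning rate are placeholders and assumptions, not the exact implementation used in our experiments; Algorithm 1 also re-initializes WhichDomain(·) at every outer iteration, which is omitted here for brevity.

# Minimal sketch of Algorithm 1 (Setting [Q+T+D]).  Loaders are assumed to yield
# pre-extracted features: (f_q, f_t) for the source and (f_q, f_t, f_d) for the target.
import itertools
import torch
import torch.nn.functional as F

def adapt(which_domain, g_q, g_a, vqa_loss, src_loader, tgt_loader,
          iters=1000, k=5, l=5, lr=1e-4):
    opt_d = torch.optim.Adam(which_domain.parameters(), lr=lr)
    opt_g = torch.optim.Adam(itertools.chain(g_q.parameters(), g_a.parameters()), lr=lr)
    src_iter, tgt_iter = itertools.cycle(src_loader), itertools.cycle(tgt_loader)

    for _ in range(iters):
        # k steps: train WhichDomain() to tell source pairs from transformed target pairs.
        for _ in range(k):
            fq_sd, ft_sd = next(src_iter)
            fq_td, ft_td, _ = next(tgt_iter)
            src_logit = which_domain(torch.cat([fq_sd, ft_sd], dim=1))
            tgt_logit = which_domain(torch.cat([g_q(fq_td), g_a(ft_td)], dim=1))
            # Ascending log D(source) + log(1 - D(target)) equals descending this BCE loss.
            loss_d = F.binary_cross_entropy_with_logits(src_logit, torch.ones_like(src_logit)) \
                   + F.binary_cross_entropy_with_logits(tgt_logit, torch.zeros_like(tgt_logit))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # l steps: update g_q, g_a to fool WhichDomain() while keeping the Visual QA loss low.
        for _ in range(l):
            fq_td, ft_td, fd_td = next(tgt_iter)
            q_emb, a_emb, d_emb = g_q(fq_td), g_a(ft_td), g_a(fd_td)
            # log(1 - sigmoid(x)) == logsigmoid(-x); descending it pushes D toward "source".
            confusion = F.logsigmoid(-which_domain(torch.cat([q_emb, a_emb], dim=1))).mean()
            # vqa_loss stands for the two ell terms in Algorithm 1, evaluated with the frozen
            # source Visual QA model; the is_correct flag is our assumption on how ell
            # distinguishes the correct answer from a decoy.
            loss_g = confusion + vqa_loss(q_emb, a_emb, is_correct=True) \
                               + vqa_loss(q_emb, d_emb, is_correct=False)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()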
Chapter 11
Learning Answer Embedding for Visual Question Answering
In this chapter we propose a novel probabilistic model for Visual QA. The key idea is to infer
two sets of embeddings: one for the image and the question jointly and the other for the answers.
The learning objective is to learn the best parameterization of those embeddings such that the
correct answer has higher likelihood among all possible answers. In contrast to several existing
approaches of treating Visual QA as multi-way classification, the proposed approach takes the
semantic relationships (as characterized by the embeddings) among answers into consideration,
instead of viewing them as independent ordinal numbers. Thus, the learned embedding function can be used to embed answers unseen in the training dataset. These properties make the approach particularly appealing for transfer learning for open-ended Visual QA, where the source dataset on which the model is learned has limited overlap with the target dataset in the space
of answers. We have also developed large-scale optimization techniques for applying the model
to datasets with a large number of answers, where the challenge is to properly normalize the
proposed probabilistic models. We validate our approach on several Visual QA datasets and in-
vestigate its utility for transferring models across datasets. The empirical results have shown that
the approach performs well not only on in-domain learning but also on transfer learning.
11.1 Introduction
In Visual QA, the machine is presented with an image and a related question and needs to output a
correct answer. There are several ways of “outputting”, though. One way is to ask the machine to
generate a piece of free-form text [59]. However, this often requires humans to decide whether
the answer is correct or not. Thus, scaling this type of evaluation to assess a large amount of data
(on a large number of models) is challenging.
Automatic evaluation procedures have the advantage of scaling up. There are two major
paradigms. One is to use multiple-choice based Visual QA [230, 3, 150]. In this setup, for each
pair of image and question, a correct answer is mixed with a set of incorrect answers and the
learner is optimized to select the correct one. While popular, it is difficult to design good incorrect answers that are free of shortcuts which learners can exploit (cf. Chapter 9).
The other paradigm that is amenable to automatic evaluation revises the pool of possible answers to be the same for any pair of image and question [67, 14], i.e., open-ended Visual QA. In particular, the pool is composed of the K most frequent answers in the training dataset. This has the advantage of framing the task as multi-way classification: a classifier outputs one of the K categories, with the image and the question as its input.
Figure 11.1: Conceptual diagram of our approach. We learn two embedding functions to transform an image-question pair (i, q) and a (possible) answer a into a joint embedding space. The distance (by inner products) between the embedded (i, q) and a is then measured and the closest a is selected as the output answer.
However, while alleviating the bias of introducing incorrect answers that are image- and question-specific, open-ended Visual QA approaches also suffer from several problems. First, treating the answers as independent categories (as entailed by multi-way classification) removes the semantic relationship between answers. For example, the answers "running" and "jogging" (to the question "what is the woman in the picture doing?") are semantically close, so one would naturally infer that the corresponding images are visually similar. However, treating "running" and "jogging" as independent categories "choice i" and "choice j" does not automatically regularize the learner to ensure that the classifier's outputs for visually similar images and semantically similar questions are semantically close. In other words, we would like the outputs of the Visual QA model to express semantic proximities aligned with the visual and semantic proximities at the inputs. Such alignment puts a strong prior on what the models can learn and prevents them from exploiting biases in the datasets, thus making them more robust.
Secondly, Visual QA models learned on one dataset do not transfer to another dataset unless the two datasets share the same space of top-K answers—if there is a difference between the two spaces (for example, as "trivial" as changing the frequency order of the answers), the classifier will make a substantial number of errors. This is particularly alarming: unless we construct a system a priori to map one set of answers to another set, we are likely to have very poor transfer across datasets and would have to train a new Visual QA model whenever we encounter a new dataset. In fact, for two popular Visual QA datasets, about 10% of the answers are shared, and of the top-K answers (where K < 10,000), only 50% are shared. We refer readers to Section 11.3.5 and Table 11.6 for more results.
In this chapter, we propose a new learning model to address these challenges. Our main idea
is to learn also an embedding of the answers. Together with the (joint embedding) features of
image and question in some spaces, the answer embeddings parameterize a probabilistic model
describing how the answers are similar to the image and question pair. We learn the embeddings
for the answers as well as the images and the questions to maximize the correct answers’ likeli-
hood. The learned model thus aligns the semantic similarity of answers with the visual/semantic
similarity of the image and question pair. Furthermore, the learned model can also embed any unseen answers, and can thus generalize from one dataset to another. Fig. 11.1 illustrates the main idea of our approach.
Our method needs to learn embeddings of hundreds of thousands of answers. Thus, to optimize our probabilistic model, we overcome the challenge by introducing a computationally efficient way of adaptively sampling negative examples in a minibatch.
Our model also has the computational advantage that for each pair of image and question, we only need to compute the joint embedding of the image and question once, irrespective of how many candidate answers one has to examine. In contrast, models such as [80, 56], which learn a joint embedding of the triplet (image, question, and answer), need to compute a number of embeddings linear in the number of candidate answers. When the number of candidate answers needs to be large (to obtain better coverage), such models do not scale up easily.
While our approach is motivated by addressing challenges in open-ended Visual QA, the proposed approach trivially includes multiple-choice based Visual QA as a special case and is thus equally applicable. We extensively evaluated our approach on several existing datasets, including Visual7W [230], VQA2 [67], and Visual Genome [100]. We show the gain in performance by our approach over existing approaches that are based on multi-way classification. We also show the effectiveness of our approach in transferring models trained on one dataset to another. To the best of our knowledge, we are likely the first to examine the challenging issue of transferability in the open-ended Visual QA task.^1
The rest of the chapter is organized as follows. Section 11.2.1 introduces the notation and
problem setup. Section 11.2.2 presents our proposed methods. Section 11.3 shows our empirical
results on multiple Visual QA datasets.
11.2 Methods
In what follows, we describe our approach in detail. We start by describing a general setup for
Visual QA and introducing necessary notations. We then introduce the main idea, followed by
detailed descriptions of the method and important steps to scale the method to handle hundreds of thousands of negative samples.
11.2.1 Setup and notations
In the Visual QA task, the machine is given an image i and a question q, and is asked to generate an answer a. In this work, we focus on the open-ended setting where a is a member of a set A. This set of candidate answers is intuitively "the universe of all possible answers". However, in practice, it is approximated by the top K most frequent correct answers in a training set [122, 56, 209], plus all the incorrect answers in the dataset (if any). Another popular setting is multiple-choice based: for each pair (i, q), the set A is different (this set is either automatically generated (cf. Chapter 9) or manually generated [230, 3]). Without loss of generality, however, we use A to represent both. Whenever necessary, we clarify the special handling needed for (i, q)-specific candidate sets.
^1 Our work focuses on the transferability across datasets with different question and answer spaces. We leave visual transferability (e.g., by domain adaptation) as future work.
We distinguish two subsets in A with respect to a pair (i, q): T and D = A \ T. The set T contains all the correct answers for (i, q)—it could be a singleton or, in some cases, contain multiple answers semantically similar to the correct answer (e.g., "policeman" to "police officer"), depending on the dataset. The set D contains all the incorrect (or undesired) answers.
A training dataset is thus denoted by a set of N distinctive triplets D = {(i_n, q_n, T_n)} when only the correct answers are given, or D = {(i_n, q_n, A_n = T_n ∪ D_n)} when both the correct and incorrect answers are given.
Note that by i, q, or a, we refer to their "raw" formats (an image in pixel values, and a question or an answer in its textual form).
11.2.2 Main idea
Our main idea is motivated by two deficiencies in the current approaches for open-ended Visual QA [3]. In those methods, it is common to construct a K-way classifier so that for each (i, q), the classifier outputs the index k that corresponds to the correct answer (i.e., the k-th element in A is the correct answer).
However, this classification paradigm cannot capture all the information encoded in the dataset for us to derive better models. First, by equating two different answers a_k and a_l with the ordinal numbers k and l, we lose the semantic kinship between the two. If there are two triplets (i_m, q_m, a_k ∈ T_m) and (i_n, q_n, a_l ∈ T_n) with similar visual appearance between i_m and i_n and similar semantic meaning between q_m and q_n, we would expect a_k and a_l to have some degree of semantic similarity. In a classification framework, such an expectation cannot be fulfilled, as the assignment of the ordinal numbers k and l to a_k and a_l can be arbitrary, such that the difference between k and l does not preserve the similarity between a_k and a_l. However, observing such similarity at both the inputs and the outputs of the classifier is beneficial and adds robustness to learning.
The second flaw with the multi-way classification framework is that it does not lend itself to generalization across two datasets with little or no overlap in the candidate answer sets A. Unless there is a predefined mapping between the two sets, the classifier trained on one dataset is not applicable to the other dataset.
We propose a new approach to overcome those deficiencies. The key idea is to learn embeddings of all the data. The embedding functions, when properly parameterized and learned, will preserve similarity and will generalize to answers unseen in the training data.
Embeddings We first define a joint embedding function f_θ(i, q) to generate the joint embedding of the pair i and q. We also define an embedding function g_φ(a) to generate the embedding of an answer a. We will explain later why we do not learn a function that generates a joint embedding of the triplet.
The embedding functions are parameterized by θ and φ, respectively. In this work, we use deep learning models such as the multi-layer perceptron (MLP) and the Stacked Attention Network (SAN) [209, 93] (after removing the classifier at the last layer). In principle, any representation network can be used—our focus is on how to use the embeddings.
Probabilistic Model of Compatibility (PMC) Given a triplet (i_n, q_n, a ∈ T_n) where a is a correct answer, we define the following probabilistic model

    p(a | i_n, q_n) = exp( f_θ(i_n, q_n)^⊤ g_φ(a) ) / Σ_{a' ∈ A} exp( f_θ(i_n, q_n)^⊤ g_φ(a') )        (11.1)
Discriminative Learning with Weighted Likelihood Given the probabilistic model, it is natural to learn the parameters to maximize its likelihood. In our work, we have found the following weighted likelihood to be more effective:

    ℓ = Σ_{n=1}^{N} Σ_{a ∈ T_n} Σ_{d ∈ A} α(a, d) log P(d | i_n, q_n),        (11.2)

where the weighting function α(a, d) measures how much the answer d could contribute to the objective function. A natural design is

    α(a, d) = I[a = d],        (11.3)

where I[·] is the binary indicator function, taking the value 1 if the condition is true and 0 otherwise. In this case, the objective function reduces to the standard cross-entropy loss if T_n is a singleton. However, in Section 11.2.4, we discuss several different designs.
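A minimal sketch of this objective is given below, assuming f_θ and g_φ are arbitrary modules producing embeddings of the same dimensionality and that the weights α(a, d) for each training example have been collected into a matrix; all names are illustrative.

# Minimal sketch of the PMC objective, eqs. (11.1)-(11.3).
import torch
import torch.nn.functional as F

def pmc_loss(f_iq, answer_emb, weights):
    """Weighted negative log-likelihood (the negation of eq. (11.2)) for one mini-batch.

    f_iq:       (B, D) joint image-question embeddings f_theta(i, q).
    answer_emb: (A, D) embeddings g_phi(a) of every candidate answer in the universe.
    weights:    (B, A) weights alpha(a, d), summed over the correct answers of each example.
    """
    logits = f_iq @ answer_emb.t()            # compatibility f_theta(i,q)^T g_phi(a)
    log_prob = F.log_softmax(logits, dim=1)   # log p(a | i, q), eq. (11.1)
    return -(weights * log_prob).sum(dim=1).mean()

With a one-hot weight row, this reduces to the standard cross-entropy of eq. (11.3); with the multi-hot design of Section 11.2.4, each row simply sums the indicator over the ten annotations.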
11.2.3 Large-scale stochastic optimization
The optimization of eq. (11.2) is very challenging on real Visual QA datasets. There, the size of A can be as large as hundreds of thousands.^2 Thus, computing the normalization term of the probabilistic model is a daunting task.
We use a minibatch-based stochastic gradient descent procedure to optimize the weighted likelihood. Specifically, we choose B triplets randomly from D (the training dataset defined in Section 11.2.1) and compute the gradient of the weighted likelihood.
Within a minibatch (i_b, q_b, T_b) or (i_b, q_b, T_b ∪ D_b) for b = 1, 2, ..., B, we construct a minibatched universe

    A_B = ∪_{b=1}^{B} (T_b ∪ D_b)        (11.4)

Namely, all the possible answers in the minibatch are used.
However, this "mini-universe" might not be a representative sampling of the true "universe" A. Thus, we augment it with negative sampling. First we compute the set

    Ā_B = A \ A_B        (11.5)

and sample M samples from this set. These samples (denoted as A_o) are mixed with A_B to increase the exposure to incorrect answers (i.e., negative samples) encountered by the triplets in a minibatch. In short, we use A_o ∪ A_B in lieu of A when computing the posterior probability p(a | i, q) and the likelihood.
^2 In the Visual Genome dataset [100], for example, we have more than 201,000 possible answers.
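A minimal sketch of this sampling step, with illustrative names and Python sets standing in for the answer universes, is given below.

# Minimal sketch of building the per-mini-batch answer universe, eqs. (11.4)-(11.5).
import random

def build_mini_universe(batch_answer_sets, full_universe, num_negatives):
    """batch_answer_sets: list of sets, each T_b (or T_b | D_b) for one triplet.
    full_universe: the global candidate answer set A.
    Returns A_o U A_B, used in place of A when normalizing eq. (11.1)."""
    a_b = set().union(*batch_answer_sets)              # eq. (11.4)
    outside = list(full_universe - a_b)                # eq. (11.5)
    a_o = random.sample(outside, min(num_negatives, len(outside)))
    return list(a_b) + a_o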
11.2.4 Defining the weighting function
We can take advantage of the weighting function α(a, d) to incorporate external or prior semantic knowledge. For example, α(a, d) can depend on semantic similarity scores between a and d. Using the WUPS score [196, 124], we define the following rule

    α(a, d) = 1 if WUPS(a, d) > τ, and 0 otherwise,        (11.6)

where τ is a threshold (e.g., 0.9 as in [124]). α(a, d) can also be used to scale triplets with many semantically similar answers in T (for instance, "apple", "green apple", "small apple", or "big apple" are all good answers to "what is on the table?"):

    α(a, d) = I[a = d] / |T|        (11.7)

such that each of these similar answers only contributes a fraction of the likelihood to the objective function. The idea of eq. (11.7) has been exploited in several recent works [226, 79, 93] to boost the performance on VQA [14] and VQA2 [67].
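The three designs can be summarized by the small sketch below; the WUPS scorer itself is assumed to be provided externally (e.g., following [196, 124]) rather than implemented here, and the function names are illustrative.

# Minimal sketch of the weighting designs in eqs. (11.3), (11.6), and (11.7).
def alpha_indicator(a, d):
    """Eq. (11.3): 1 if d equals the correct answer a, else 0."""
    return 1.0 if a == d else 0.0

def alpha_wups(a, d, wups, tau=0.9):
    """Eq. (11.6): 1 if the externally provided WUPS score exceeds the threshold tau."""
    return 1.0 if wups(a, d) > tau else 0.0

def alpha_scaled(a, d, correct_set):
    """Eq. (11.7): split the weight evenly among the |T| correct answers."""
    return (1.0 if a == d else 0.0) / len(correct_set)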
11.2.5 Prediction
During testing, given the learned f_θ and g_φ, for the open-ended setting we can apply the following decision rule

    a* = argmax_{a ∈ A} f_θ(i, q)^⊤ g_φ(a)        (11.8)

to identify the answer to the pair (i, q).
Note that we have the freedom to choose A again: it can be the same as the "universe of answers" constructed for training (i.e., the collection of the most frequent answers), or a union with all the answers in the validation or testing set. This flexibility is afforded by using the embedding function g_φ to embed any text. Note that in existing open-ended Visual QA, the set A is constrained to the most frequent answers, reflecting the limitation of using multi-way classification as a framework for Visual QA tasks.
This decision rule readily extends to the multiple-choice setting, where we just need to set A to include the correct answer and the incorrect answers in each testing triplet.
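A minimal sketch of this decision rule is given below, with f_theta, g_phi, and the candidate answer features as placeholders; the point is that g_φ(a) can be computed once for all candidates and reused across (i, q) pairs (cf. Section 11.3.9).

# Minimal sketch of the decision rule in eq. (11.8).
import torch

@torch.no_grad()
def predict(f_theta, g_phi, image, question, candidate_answers):
    answer_emb = g_phi(candidate_answers)                 # (A, D), computed once and reusable
    scores = f_theta(image, question) @ answer_emb.t()    # (1, A) inner products
    return scores.argmax(dim=1)                           # index of a* in the candidate set

For the multiple-choice setting, candidate_answers is simply restricted to the per-question choices.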
11.2.6 Comparison to existing algorithms
Most existing Visual QA algorithms (mostly working on the open-ended setting of VQA [14] and VQA2 [67]) train a multi-way classifier on top of the f_θ embedding. The number of classes is set to the top 1,000 most frequent correct answers for VQA [56] and around the top 3,000 for VQA2 [56, 226, 93]. These top-frequent answers cover over 90% of the training and 88% of the training and validation examples.
Table 11.1: Summary statistics of the Visual QA datasets.

Dataset          | # of Images (train/val/test) | # of (i, q, T) triplets (train/val/test) | (|T|, |D|) per tuple
VQA2 [67]        | 83K / 41K / 81K              | 443K / 214K / 447K                       | (10, 0)
Visual7W [230]   | 14K / 5K / 8K                | 69K / 28K / 42K                          | (1, 3)
V7W (Chapter 9)  | 14K / 5K / 8K                | 69K / 28K / 42K                          | (1, 6)
qaVG (Chapter 9) | 49K / 19K / 29K              | 727K / 283K / 433K                       | (1, 6)
Those training examples whose correct answers are not among the top-K frequent ones are simply disregarded during training.
Some algorithms instead learn a tri-variable compatibility function h(i, q, a) [80, 56, 168], and the correct answer is inferred by identifying the a* such that h(i, q, a*) is the highest. This type of learning is particularly suitable for multiple-choice based Visual QA: since the number of candidate answers is small, enumerating all possible a is feasible. However, for open-ended Visual QA tasks, the number of possible answers is very large—computing the function h(·) for every one of them is costly.
Note that our decision rule relies on computing f_θ(i, q)^⊤ g_φ(a), a factorized form of the more generic function h(i, q, a). Precisely due to this factorization, we only need to compute f_θ(i, q) once for every pair (i, q). For g_φ(a), as long as the model is sufficiently simple, enumerating over many possible a is less demanding than what a generic (and more complex) function h(i, q, a) requires. Indeed, in practice we only need to compute g_φ(a) once for any possible a.^3 See Section 11.3.9 for details.
11.3 Empirical studies
We validate our approach on several Visual QA datasets. We start by describing these datasets
and the empirical setups. We then report our results. The proposed approach performs very well.
It outperforms the corresponding multi-way classification-based approaches where the answers
are modeled as independent ordinal numbers. Moreover, it outperforms those approaches in
transferring models learned on one dataset to another one.
11.3.1 Datasets
We apply the proposed approach to four datasets. Table 11.1 summarizes their characteristics.
We call the revised Visual7W from Chapter 9 V7W, and call the multiple-choice version of Visual Genome (VG) qaVG. Note that each (i_n, q_n) pair in VQA2 is answered by 10 human annotators (i.e., |T_n| = 10). The most frequent one is selected as the single correct answer t_n. Please see Chapter 9 for more details.
^3 The answer embeddings g_φ(a) for all possible answers (say, 100,000) can be pre-computed. At inference we only need to compute the embedding f_θ(i, q) once for an (i, q) pair and perform 100,000 inner products. In contrast, methods like [80, 56, 168] need to compute h(i, q, a) 100,000 times. Even if such a function is parameterized with a simple MLP, the computation is much more intensive than an inner product when one has to perform it 100,000 times.
Table 11.2: The answer coverage of each dataset.

Dataset  | # of unique answers (train/val/test/all) | Triplets covered by top K = 1,000 | 3,000 | 5,000
VQA2     | 22K / 13K /  -  / 29K                    | 88%                               | 93%   | 96%
Visual7W | 63K / 31K / 43K / 108K                   | 57%                               | 68%   | 71%
VG       | 119K / 57K / 79K / 201K                  | 61%                               | 72%   | 76%
Answer Coverage within Each Dataset. In Table 11.2, we show the number of unique answers in each dataset on each split, together with the portion of question and answer pairs covered by the top-K frequent correct answers from the training set. We observe that qaVG contains the largest number of answers, followed by Visual7W and VQA2. In terms of coverage, we see that the distribution of answers on VQA2 is the most skewed: over 88% of the training and validation triplets are covered by the top-1,000 frequent answers. On the other hand, Visual7W and qaVG need more than the top-5,000 frequent answers to achieve a similar coverage.
Thus, a priori, Visual7W and qaVG are "harder" datasets, where a multi-way classification-based open-ended Visual QA model will not perform well unless the number of categories is significantly higher (say, 5,000) in order to be able to cover less frequent answers in the test portion of the dataset—the answers simply have a long-tail distribution.
11.3.2 Experimental setup
Our Model. We use two different models to parameterize the embedding function f_θ(i, q) in our experiments—the Multi-layer Perceptron [80] (MLP) and the Stacked Attention Network [209, 93] (SAN). For both models, we first represent each token in the question by its 300-dimensional GloVe vector [143], and use ResNet-152 [73] to extract the visual features following the exact setting of [93]. Detailed specifications of each model are as follows.
• Multi-layer Perceptron (MLP): We represent an image by the 2,048-dimensional vector from the top layer of the ResNet-152 pre-trained on ImageNet [157], and a question by the average of its GloVe vectors after a linear transformation followed by a tanh non-linearity and dropout. We then concatenate the two features (2,348 dimensions in total) and feed them into a one-layer MLP (4,096 hidden nodes and intermediate dropout), with an output dimensionality of 1,024.
• Stacked Attention Network (SAN): We represent an image by a 14×14×2048-dimensional tensor, extracted from the second-to-last layer of the ResNet-152 pre-trained on ImageNet [157]; see [209] for details. We represent a question by a one-layer bidirectional LSTM over GloVe word embeddings. Image and question features are then fed into the SAN structure for fusion. Specifically, we follow a network architecture very similar to that presented in [93], with an output dimensionality of 1,024.
For parameterizing the answer embedding function g_φ(a), we adopt two architectures: 1) a one-layer MLP on the averaged GloVe embeddings of the answer sequence, with an output dimensionality of 1,024; 2) a two-layer bidirectional LSTM (bi-LSTM) on top of the GloVe embeddings of the answer sequence. We use the MLP answer embedding by default, and denote methods with the bi-LSTM answer embedding with a postfix ? (e.g., SAN?).
In the following, we denote our factorized model optimized with PMC as fPMC (cf. eq. (11.1)). We consider variants of fPMC with different architectures (e.g., MLP, SAN) for computing f_θ(i, q) and g_φ(a), named fPMC(MLP), fPMC(SAN), and fPMC(SAN?).
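A minimal sketch of the MLP variant (fPMC(MLP)) is given below. The layer sizes follow the description above, but the activation function, dropout rate, and the hidden width of the answer MLP are assumptions rather than the exact configuration.

# Minimal sketch of the two embedding modules used in fPMC(MLP).
import torch
import torch.nn as nn

class JointEmbeddingMLP(nn.Module):          # f_theta(i, q)
    def __init__(self, img_dim=2048, word_dim=300, hidden=4096, out_dim=1024, p=0.5):
        super().__init__()
        # linear + tanh + dropout on the averaged GloVe question vector
        self.q_proj = nn.Sequential(nn.Linear(word_dim, word_dim), nn.Tanh(), nn.Dropout(p))
        # one hidden layer of 4,096 units with intermediate dropout (activation assumed ReLU)
        self.mlp = nn.Sequential(nn.Linear(img_dim + word_dim, hidden),
                                 nn.ReLU(), nn.Dropout(p), nn.Linear(hidden, out_dim))

    def forward(self, img_feat, q_glove_avg):
        return self.mlp(torch.cat([img_feat, self.q_proj(q_glove_avg)], dim=1))

class AnswerEmbeddingMLP(nn.Module):         # g_phi(a)
    def __init__(self, word_dim=300, hidden=1024, out_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(word_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, a_glove_avg):
        return self.mlp(a_glove_avg)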
Competing Methods. We compare our model to multi-way classification-based (CLS) models which take either MLP or SAN as f_θ. We denote them as CLS(MLP) or CLS(SAN). We set the number of output classes for the CLS models to the top-3,000 most frequent training answers for VQA2, and the top-5,000 for Visual7W and VG. This is a common setup for open-ended Visual QA [3].
Meanwhile, we also re-implement approaches that learn a scoring function h(i, q, a) with (i_n, q_n, T_n) triplets as input [80]. As such methods are initially designed for multiple-choice datasets, the calibration between positive and negative samples needs to be carefully tuned, and it is challenging to adapt them to the open-ended setting, where the number of negative answers scales up. Therefore, we adapt them to also utilize our PMC framework for training, which optimizes a stochastic multi-class cross-entropy with negative answer sampling. We name such methods uPMC (un-factorized PMC) and call their variants uPMC(MLP) and uPMC(SAN). We also compare to reported results from other state-of-the-art methods.
Evaluation Metrics The evaluation metric for each dataset is different. For VQA2, the standard metric is to compare the selected answer a* of an (i, q) pair to the ten corresponding human-annotated answers T = {s_1, ..., s_10}. The performance on such an (i, q) pair is

    acc(a*, T) = min( 1, (Σ_l I[a* = s_l]) / 3 ).        (11.9)

We report the average performance over examples in the validation split and test split.
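A minimal sketch of this metric is given below; the official VQA evaluation additionally normalizes answer strings and averages over subsets of nine annotators, which we omit here.

# Minimal sketch of the VQA2 accuracy in eq. (11.9): full credit if at least three of the
# ten annotators gave the predicted answer.
def vqa2_accuracy(predicted, human_answers):
    """predicted: a string; human_answers: the ten annotated answers s_1..s_10."""
    matches = sum(1 for s in human_answers if s == predicted)
    return min(1.0, matches / 3.0)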
For Visual7W (or V7W), the performance is measured by the proportion of correct answers selected by the Visual QA model from the candidate answer set. The chance of random guessing is 25% (or 14.3%). For VG, we focus on the multiple-choice evaluation (on qaVG). We follow the settings in Chapter 9 and measure multiple-choice accuracy. The chance of random guessing is 14.3%.
11.3.3 Results on individual Visual QA datasets
Table 11.3 gives a comprehensive evaluation of most state-of-the-art approaches on four different settings over VQA2 (test-dev), Visual7W, V7W, and qaVG.^4 Among all those settings, our proposed fPMC model outperforms the corresponding classification model by a noticeable margin. Meanwhile, fPMC outperforms uPMC in all settings. Compared to other state-of-the-art methods, we show competitive performance against most of them.
In Table 11.3, note that there are differences in the experimental setups in many of the comparisons to state-of-the-art methods. For instance, MLP [80] used either better text embeddings
^4 The omitted entries are missing in the corresponding work. In fact, most existing work only focuses on one or two datasets.
Table 11.3: Results (%) on Visual QA with different settings: open-ended (Top-K) and multiple-choice (MC) based, for different datasets. Omitted entries are missing in the corresponding work.

Method          | Visual7W MC [230] | V7W MC (Chapter 9) | VQA2 Top-3k [67] | qaVG MC (Chapter 9)
LSTM [230]      | 55.6              | -                  | -                | -
MLP (Chapter 9) | 65.7              | 52.0               | -                | 58.5
MLP [80]        | 67.1              | -                  | -                | -
C+LSTM [67]     | -                 | -                  | 54.1             | -
MCB [67]        | 62.2              | -                  | 62.3             | -
MFB [227]       | -                 | -                  | 65.0             | -
BUTD [10]       | -                 | -                  | 65.6             | -
MFH [226]       | -                 | -                  | 66.8             | -
Multi-way Classification Based Models (CLS)
CLS(MLP)        | 51.6              | 40.9               | 53.5             | 46.9
CLS(SAN)        | 53.7              | 43.6               | 62.4             | 53.0
Our Probabilistic Model of Compatibility (PMC)
uPMC(MLP)       | 62.4              | 51.6               | 51.4             | 54.5
uPMC(SAN)       | 65.3              | 55.2               | 56.0             | 61.3
fPMC(MLP)       | 63.1              | 52.4               | 59.3             | 57.7
fPMC(SAN)       | 65.6              | 55.4               | 63.2             | 62.6
fPMC(SAN?)      | 66.0              | 55.5               | 63.9             | 63.4
or more advanced visual features, which benefits their results on Visual7W significantly. Under the same configuration, our model still obtains improvements. Besides, most of the state-of-the-art methods on VQA2 fall into the category of classification models that accommodate specific Visual QA settings. They usually explore better architectures for extracting rich visual information [230, 10], or better fusion mechanisms across multiple modalities [67, 227, 226]. We note that our proposed PMC model is orthogonal to all those recent advances in multi-modal fusion and neural architectures. More advanced deep learning models can be adopted into our framework as f_θ(i, q) (e.g., fPMC(MFH)) to achieve superior performance across different settings. This is particularly exemplified by the dominance of SAN over the vanilla MLP model. We leave this for future work.
11.3.4 Ablation studies
Importance of Negative Sampling Our approach is probabilistic, requiring a properly normalized probability over the space of all possible answers. (In contrast, classification-based models limit their output spaces to a pre-determined number, at the risk of not being able to handle unseen answers.)
In Section 11.2.3, we describe a large-scale optimization technique that allows us to approximate the likelihood by performing negative sampling. Within each mini-batch, we create a mini-universe of all possible answers as the union of all the correct answers (i.e., A_B). Additionally, we randomly sample M answers from the union of all answers outside of the mini-batch, creating
Table 11.4: The effect of negative sampling (M = 3,000) on fPMC. The numbers are the accuracy for each question type on VQA2 (val).

Method | Mini-Universe | Y/N  | Number | Other | All
MLP    | A_B           | 70.1 | 33.0   | 38.7  | 49.8
SAN    | A_B           | 78.2 | 37.1   | 45.7  | 56.7
MLP    | A_o ∪ A_B     | 76.6 | 36.1   | 43.9  | 55.2
SAN    | A_o ∪ A_B     | 79.0 | 38.0   | 51.3  | 60.0
Figure 11.2: Detailed analysis of the size of negative sampling for fPMC(MLP) and fPMC(SAN) at each mini-batch (x-axis: size of A_o from 0 to 3,000; y-axis: accuracy (%) on VQA2 (val)).
"an other world" of all possible answers, A_o. The set A_o provides richer negative samples beyond A_B and is important to the performance of our model, as shown in Table 11.4.
We further conducted a detailed analysis of the effect of the negative sample size, as shown in Fig. 11.2. As the number of negative samples increases from 0 to 3,000 per mini-batch, we observe an increasing trend in the validation accuracy. A significant performance boost is obtained when comparing a small number of negative samples to no additional negative samples. The gain then becomes marginal once the size of A_o exceeds 2,000.
The Effect of Incorporating Semantic Knowledge in the Weighted Likelihood In Section 11.2.2, we introduced the weighting function α(a, d) to measure how much an incorrect answer d should contribute to the overall objective function. In particular, this weighting function can be used to incorporate prior semantic knowledge about the relationship between a correct answer a and an incorrect answer d.
We report in Table 11.5 an ablation study on using different weighting functions α(a, d) in the weighted likelihood formulation (cf. eq. (11.2)). We compare three different types of α(a, d) on VQA2:
• one-hot: Denote t_n as the dominant answer in T_n. We set T_n ← {t_n} (i.e., T_n now becomes a singleton) and apply

    α(a, d) = I[a = d] (cf. eq. (11.3)).
Table 11.5: Detailed analysis of different α(a, d) for the weighted likelihood. The reported number is the accuracy on VQA2 (validation).

Method    | Weighting Criterion | Acc.
fPMC(SAN) | one-hot             | 58.0
fPMC(SAN) | multi-hot           | 60.0
fPMC(SAN) | WUPS                | 57.8
Table 11.6: The number of common answers across datasets (training sets).

Dataset pair    | Top-1K | Top-3K | Top-5K | Top-10K | All | Total # of unique answers
VQA2, Visual7W  | 451    | 1,262  | 2,015  | 3,585   | 10K | 137K
VQA2, qaVG      | 495    | 1,328  | 2,057  | 3,643   | 11K | 149K
Visual7W, qaVG  | 657    | 1,890  | 3,070  | 5,683   | 27K | 201K
In this case, only one answer is considered positive for an (i, q) pair. No extra semantic relationship is encoded.
• multi-hot: We keep the given T_n (the ten user annotations collected by VQA2, i.e., |T_n| = 10) and apply

    α(a, d) = I[a = d] (cf. eq. (11.3))

to obtain a multi-hot weight Σ_{a ∈ T_n} α(a, d) for soft weighting, leading to a loss similar to [93, 79].
• WUPS: We again consider T_n ← {t_n}, but utilize the WUPS score [196, 124] (whose range is [0, 1]) together with eq. (11.6) to define α(a, d). We set τ = 0.9 and give a d with WUPS(a, d) = 1 a larger weight (i.e., 8).
The results suggest that the multi-hot vector computed from multiple user annotations pro-
vides the best semantic knowledge among answers for learning the model.
11.3.5 Transfer learning across datasets
One important advantage of our method is its ability to cope with answers unseen in the training dataset. This is in stark contrast to multi-way classification based models, which have to skip those answers since the output categories are selected as the top-K most frequent answers from the training dataset.
Thus, classification based models for Visual QA are not amenable to transfer across datasets where there is a large gap between the spaces of answers. Table 11.6 illustrates the severity by computing the number of common answers across datasets. On average, about 7% to 10% of the unique answers are shared across datasets. If we restrict the answers under consideration to the top 1,000, about 50% to 65% of the answers are shared. However, the top-1,000 most frequent answers are in general not enough to cover all the questions in any dataset. Hence, we arrive at the unexciting observation—we can transfer, but we can only answer a few questions!
Table 11.7: Results of cross-dataset transfer using either classification-based models or our models (PMC) for Visual QA (f_θ = SAN). Each cell lists CLS / uPMC / fPMC / fPMC?; the upward arrows signify improvement.

Training \ Testing | Visual7W                  | VQA2                      | qaVG
Visual7W           | 53.7 / 65.3 / 65.6 / 66.0↑ | 19.1 / 18.5 / 19.8↑ / 19.1 | 42.8 / 52.2 / 54.8↑ / 54.3
VQA2               | 45.8 / 56.8 / 60.2 / 61.7↑ | 59.4 / 56.0 / 60.0 / 60.9↑ | 37.6 / 51.5 / 54.8 / 56.8↑
qaVG               | 58.9 / 66.0 / 68.4 / 69.5↑ | 25.6 / 23.6 / 25.8 / 26.4↑ | 53.0 / 61.2 / 62.6 / 63.4↑
Table 11.8: Transfer is improved on the VQA2 dataset without Yes/No answers (and the corresponding questions) (f_θ = SAN).

Dataset  | CLS  | uPMC | fPMC  | fPMC?
Visual7W | 31.7 | 29.5 | 33.1↑ | 32.0
qaVG     | 42.6 | 39.3 | 43.0  | 43.4↑
In Table 11.7, we report the results of transferring a learned Visual QA model from one dataset (row) to another (column). For VQA2, we evaluate the open-ended accuracy using the top-3,000 frequent answer candidates on the validation set. We evaluate multiple-choice accuracy on the test sets of Visual7W and qaVG.
The classification models (CLS) clearly fall behind our methods (uPMC and fPMC)—the upward arrows signify improvement. In some pairs the improvement is significant (e.g., from 42.8% to 54.8% when transferring from Visual7W to qaVG). Furthermore, we notice that fPMC outperforms uPMC in all transfer settings.
However, VQA2 seems to be a particularly difficult dataset to transfer to, from either V7W or qaVG; the improvement from CLS to fPMC is generally small. This is because VQA2 contains a large number of Yes/No answers, for which learning embeddings is not advantageous, as there is little semantic meaning to extract from them.
We perform another study by removing those answers (and the associated questions) from VQA2 and report the transfer learning results in Table 11.8. In general, both CLS and fPMC transfer better. Moreover, fPMC improves over CLS by a larger margin than in Table 11.7.
11.3.6 Analysis with seen/unseen answers
To gain a deeper understanding of which component brings the advantage in transfer learning, we performed additional experiments to analyze the difference on seen and unseen answers. Specifically, we study the transfer learning results from VQA2 and qaVG to Visual7W. Here, seen (S) refers to those multiple-choice questions where at least one candidate answer is seen in the training vocabulary, and unseen (U) refers to those where none of the candidate answers is observed in the training vocabulary. As shown in Table 11.9, our fPMC model performs better than the CLS model on both the seen and unseen answer sets. While the CLS model obtains random performance (the random chance is 25%) on the unseen answer set, our fPMC model achieves at least 20% (in absolute value) better performance. In general, uPMC also works well compared to CLS. This performance improvement is gained mostly by taking the answer semantics from the word vectors into account.
Table 11.9: Analysis of cross-dataset performance on seen (S) and unseen (U) answers using either CLS or PMC for Visual QA (target: Visual7W). Each cell lists S / U / All.

Source | CLS(SAN)           | uPMC(SAN)          | fPMC(SAN)          | fPMC(SAN?)
VQA2   | 59.8 / 25.0 / 45.8 | 57.4 / 54.6 / 56.8 | 60.7 / 58.5 / 60.2 | 61.7 / 59.4 / 62.5
qaVG   | 63.4 / 25.0 / 58.9 | 66.7 / 45.3 / 66.0 | 69.1 / 47.7 / 68.4 | 70.2 / 46.9 / 69.5
Table 11.10: Results for the baseline method that fixes the answer embedding to GloVe (we show results with SAN as f_θ(i, q)). Each cell lists Fixed / Learning.

Source \ Target | VQA2        | Visual7W    | qaVG
VQA2            | 57.5 / 60.0 | 47.5 / 60.2 | 37.6 / 54.8
11.3.7 Visualization on answer embeddings
We provide a t-SNE visualization of the answer embedding. To better demonstrate the effectiveness of learning the answer embedding, we re-train the answer embedding model with randomly initialized answer vectors. We visualize both the initial answer embedding and the learned answer embedding, to reflect the preservation of semantics and syntax in the learned embedding.
According to Fig. 11.3, we can observe that a clear structure emerges in our learned answer embedding. While the random initialization of the embedding remains chaotic, our learned embedding successfully captures both semantic and syntactic similarities between answers. For example, semantically similar answers such as "airplane" and "motorcycle" are close to each other, and syntactically similar answers like "in an office" and "on the porch" are close. Besides, we also observe that answers are clustered according to their majority question type, which meets our expectation for the answer embedding's structure. We take the majority here because one answer can be used for multiple questions of different types.
11.3.8 Analysis on answer embeddings
We provide results for an additional baseline algorithm where f_θ(i, q) directly maps to the fixed space of averaged GloVe answer representations. Here we need to keep the GloVe embedding fixed to enable transferability. Table 11.10 shows the results on the VQA2 dataset. We compare its performance to our approach of learning the answer embedding with an MLP as g_φ(a), in terms of both in-domain and transfer learning performance—learning answer embeddings outperforms this simple baseline in all cases. Together with the previous visualization results, we can conclude that learning the answer embedding effectively captures the semantic relationship between answers and image-question pairs, while obtaining superior within-domain and transfer learning performance.
Figure 11.3: t-SNE visualization of (a) the randomly initialized answer embedding and (b) the learned answer embedding. We randomly select 1,000 answers from Visual7W and visualize them in the initial and learned answer embeddings. Each answer is marked with a different color according to its question type (e.g., when, how, who, where, why, what). To keep the figure readable, we randomly sub-sampled the text shown among those 1,000 answers.
Table 11.11: Efficiency study among CLS(MLP), uPMC(MLP), and fPMC(MLP). The reported numbers are the average inference time of a mini-batch of 128 (|T| = 1,000).

Method    | CLS(MLP) | uPMC(MLP) | fPMC(MLP)
Time (ms) | 22.01    | 367.62    | 22.14
Figure 11.4: Inference time vs. mini-batch index. fPMC(MLP) and CLS(MLP) are about 10x faster than uPMC(MLP) (PyTorch v0.2.0 + Titan XP + CUDA 8 + cuDNN v5).
11.3.9 Inference efficiency
Next we study the inference efficiency of the proposed fPMC and uPMC models (i.e., triplet-based approaches [80, 56, 168] trained with PMC) against the CLS model. For a fair comparison, we use the one-hidden-layer MLP model for all approaches, and keep |T| = 1,000 and the mini-batch size at 128 (the uPMC-based approach is memory-consuming; more candidates would require reducing the mini-batch size). We evaluate the models on the VQA2 validation set (2,200 mini-batches) and report the average mini-batch inference time. Fig. 11.4 and Table 11.11 show that fPMC(MLP) obtains an inference time similar to CLS(MLP), while being at least 10 times faster than uPMC(MLP).
11.4 Summary
We propose a novel approach of learning answer embeddings for the visual question answering
(Visual QA) task. The main idea is to learn embedding functions to capture the semantic re-
lationship among answers, instead of treating them as independent categories as in multi-way
classification-based models. Besides improving Visual QA results on single datasets, another
significant advantage of our approach is to enable better model transfer. The empirical studies on
several datasets have validated our approach.
Our approach is also “modular” in the sense that it can exploit any joint modeling of images
and texts (in this case, the questions). An important future direction is to discover stronger multi-
modal modeling for this purpose.
Part IV
Conclusion
Chapter 12
Conclusion
My thesis is towards developing intelligent systems for vision and language understanding in the wild. Many recent successes on vision and language understanding are based on statistical machine learning, which is founded on the assumption that the data of the training and test environments must be from the same feature and label spaces and have the same distribution. This assumption, however, will not lead to systems that can perform well in the wild—to recognize object categories not seen during training (e.g., rare or newly-defined objects), and to answer unfamiliar visual questions (e.g., different language styles used by users). In my thesis, I thus strive to develop transfer learning algorithms that leverage external information so that the learned models can be properly transferred and generalized across categories and users (domains).
To recognize unseen objects, we work on a transfer learning paradigm called zero-shot learn-
ing (ZSL), which aims to transfer discriminative knowledge learned from seen objects, of which
we have labeled training data, to unseen ones with the help of external class semantic represen-
tations. My thesis provides a comprehensive set of insights and techniques to improve zero-shot
learning (ZSL)—from effectively leveraging the semantic representations in relating classes (cf.
Chapter 4), to revisiting and revising the ZSL settings towards real-world applications (cf. Chap-
ter 5), to unifying ZSL with one-shot and few-shot learning (cf. Chapter 6), and to improving the
class semantic representations by incorporating domain knowledge (cf. Chapter 7).
To answer unfamiliar questions, we work on domain adaptation, another transfer learning
paradigm to match the data distributions of training (source) and testing (target) domains. We
present a framework that adapts users' language styles to what the learned Visual QA model
has been trained on so that we can re-use the model without re-training (cf. Chapter 10). My
thesis further revisits and revises existing datasets (cf. Chapter 9) and introduces a probabilistic
and factorization framework to leverage answer semantics (cf. Chapter 11), providing a series of
analyses and techniques to improve knowledge transfer across domains for Visual QA.
12.1 Remarks on future work
In this section, I discuss my plans for future research towards advanced transfer learning. In the short term, I hope to advance zero-shot learning by exploring more informative class semantic representations and establishing its theoretical foundation. On the other hand, I want to improve the performance of visual question answering, and to generalize the solutions to benefit general AI tasks. In the long run, I will strive to unify different concepts and algorithms of transfer learning and advance the area with brand new thinking by introducing more principled frameworks and approaches instead of relying on ad hoc decisions or domain-specific insights.
12.1.1 Advanced zero-shot learning
The performance of ZSL algorithms largely depends on the quality of class semantic represen-
tations. Human-annotated attributes (e.g., colors, shapes) convey more domain (e.g., visual) in-
formation than word vectors and lead to better performance. However, they are costly to define
and acquire, especially on a massive number of categories. I will develop algorithms to auto-
matically mine attributes from documents that are able to faithfully describe and discriminate a
large amount of categories. Moreover, I will develop algorithms that can incorporate multiple
sources of semantic representations even beyond the vectorized forms, e.g., taxonomies, knowl-
edge graphs and bases, so as to take the best advantage of available information.
Besides algorithm design, I hope to establish the theoretical foundation that so far has been
missing in zero-shot learning. Theories underlying other transfer learning paradigms such as
domain adaptation have been gradually developed and provide novel insights to inspire algorithm
design. For example, the idea of adversarial domain adaptation [58] has been indicated in the
theories by Ben-David et al. [17]. I believe that similar benefits can be brought to ZSL. While
theories are usually founded on simple models (e.g., linear models or nearest neighbors), the
fact that simple models do work for ZSL [26, 28, 27] suggests their immediate applicability and
impact once developed. As a starting point, I am dedicated to establishing theoretical guarantees—under what relationships between the data and the class semantic representations ZSL algorithms will succeed or fail—to guide the exploration of better semantic representations.
12.1.2 Advanced transfer learning for AI
My future plans on transfer learning for AI tasks are threefold. First, I notice that Visual QA datasets that use real images and collect questions and answers from humans usually end up with little need for reasoning (e.g., "who is in the image?"). On the other hand, datasets that synthesize images and questions often require a higher level of reasoning (e.g., "what is the color of the big ball to the left of the small red triangle?"). I will thus develop transfer learning algorithms that take advantage of synthetic datasets to learn high-level reasoning for real environments.
Secondly, I plan to extend the insights in advancing datasets and cross-dataset transfer for
Visual QA to other AI tasks that involve information of multiple modalities. For example, the
above line of research has potential to benefit autonomous driving and reinforcement learning,
in which collecting real data is extremely costly. Moreover, I hope to draw the community's attention to the systematic and rigorous design of datasets and evaluation metrics so that research progress will not deviate from practice.
Finally, I hope to build powerful AI systems to help advance other vision tasks that currently require human studies for evaluation. One example is video summarization—a good video summary should convey sufficient information about the original video. We can apply learned Visual QA models to the summary, together with questions relevant to the original video, to judge the quality of the summary, replacing the currently used metrics that simply measure visual or temporal overlap with a human-generated summary.
12.1.3 Principled frameworks for transferable machine learning
While transfer learning has been studied for decades, it still remains a fundamental challenge of machine learning and lacks proper organization. Different transfer learning paradigms, such as zero-shot learning, few-shot learning, domain adaptation, and multi-task learning, are usually developed independently, lacking strong connections to inspire and benefit each other. On the other hand, learning-based models, instead of rule-based ones, are believed to be the most promising way to develop intelligent systems for real-world applications, built upon the fact that they have achieved human-level performance in many constrained tasks of computer vision and natural language processing. To effectively bring this in-laboratory success to reality, we need to advance transfer learning with brand new and more principled thinking to account for various types of mismatch in the environment.
From my point of view, the core of transfer learning is to identify “why”, “what”, and “how”
to transfer. “Why” corresponds to different situations of environmental mismatch. “What” cor-
responds to different sources that can be transferred, ranging from data to feature representa-
tion [139] to models (architectures and parameters) [158, 11] and to meta information like ini-
tialization, hyper-parameters [49], or even the optimizers [12]. Note that these sources are where
algorithms of different paradigms share common concepts. Finally, “how” corresponds to the
ways to execute the transfer, in which the standard one is to determine a static objective and op-
timize it on currently available data. While increasing effort has been put into the community, so
far it is we humans that identify the combination of “why”, “what”, and “how” for the problem
at hand, mostly according to domain-specific insights or even ad hoc decisions. Can we have a
principled transfer learning framework in which a machine can automatically and dynamically
adjust the combination and its model complexity by interacting with the real-world environment?
I plan to approach this goal via Bayesian approach. As mentioned earlier, Bayesian approach
can intrinsically incorporate prior knowledge (i.e., belief) and uncertainty—corresponding to the
past experience and the current interaction with the environment, respectively. More concretely,
I will incorporate and take advantage of the streaming [77, 23], nonparametric [22], and hierar-
chical [177] methods of Bayesian approach for transfer learning.
• Streaming: Streaming (or more broadly stochastic and online) methods allow Bayesian
approach to dynamically update the posterior with respect to the current environment and
provide the up-to-date belief for the future one.
• Nonparametric: Nonparametric methods can automatically adjust the model complexity
with respect to the data and environment the machine encounters.
• Hierarchical: Different sources in “what” to transfer can be arranged into a hierarchy—
data and feature representation at the bottom, models at the middle, and meta-information
at the top.
By treating the sources in "what" to transfer as nodes in a probabilistic graphical model, together with the streaming and nonparametric methods, the Bayesian approach can systematically achieve automatic and dynamic transfer learning: (1) The observability, space, and distribution of the data (and the corresponding labels) can identify "why" to transfer. (2) The likelihood of the data and labels can quantify their uncertainty, providing cues to adjust the objective (i.e., "how" to transfer) so as to focus on hard instances. (3) Given the data, we can infer and update the relationships among nodes. The posterior probability over the variables of each node given the data in a new environment can indicate the transferability of the corresponding source, making identifying "what" to transfer an automatic, data-driven process without ad hoc decisions.
Part V
Bibliography
Bibliography
[1] Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. Analyzing the behavior of visual
question answering models. In EMNLP, 2016.
[2] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just
assume; look and answer: Overcoming priors for visual question answering. In CVPR,
2018.
[3] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C Lawrence Zitnick,
Devi Parikh, and Dhruv Batra. Vqa: Visual question answering. IJCV, 2016.
[4] Zeynep Akata, Mateusz Malinowski, Mario Fritz, and Bernt Schiele. Multi-cue zero-shot
learning with strong supervision. In CVPR, 2016.
[5] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-
embedding for attribute-based classification. In CVPR, 2013.
[6] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of
output embeddings for fine-grained image classification. In CVPR, 2015.
[7] Ziad Al-Halah and Rainer Stiefelhagen. How to transfer? zero-shot object recognition via
hierarchical transfer of semantic attributes. In WACV, 2015.
[8] Ziad Al-Halah and Rainer Stiefelhagen. Automatic discovery, association estimation and
learning of semantic attributes for a thousand categories. In CVPR, 2017.
[9] Ziad Al-Halah, Makarand Tapaswi, and Rainer Stiefelhagen. Recovering the missing link:
Predicting class-attribute associations for unsupervised zero-shot learning. In CVPR, pages
5975–5984, 2016.
[10] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen
Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and vqa.
In CVPR, 2018.
[11] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module net-
works. In CVPR, 2016.
[12] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau,
Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient
descent. In Advances in Neural Information Processing Systems (NIPS), 2016.
[13] Yashas Annadani and Soma Biswas. Preserving semantic relations for zero-shot learn-
ing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2018.
[14] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra,
C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
[15] Gundeep Arora, Vinay Kumar Verma, Ashish Mishra, and Piyush Rai. Generalized zero-
shot learning via synthesized examples. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, 2018.
[16] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and
data representation. Neural computation, 15(6):1373–1396, 2003.
[17] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jen-
nifer Wortman Vaughan. A theory of learning from different domains. Machine learning,
79(1):151–175, 2010.
[18] Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In ICCV, 2017.
[19] Tamara L Berg, Alexander C Berg, and Jonathan Shih. Automatic attribute discovery and
characterization from noisy web data. In ECCV, 2010.
[20] Luca Bertinetto, João F. Henriques, Jack Valmadre, Philip H.S. Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[21] Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
[22] David M Blei, Michael I Jordan, et al. Variational inference for dirichlet process mixtures.
Bayesian analysis, 1(1):121–143, 2006.
[23] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jor-
dan. Streaming variational bayes. In Advances in Neural Information Processing Systems
(NIPS), pages 1727–1735, 2013.
[24] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In ECCV, pages 730–746. Springer, 2016.
[25] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Generating visual representations for zero-shot classification. In ICCV Workshop, 2017.
[26] Soravit Changpinyo*, Wei-Lun Chao*, Boqing Gong, and Fei Sha. Synthesized classifiers
for zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2016.
[27] Soravit Changpinyo, Wei-Lun Chao, and Fei Sha. Predicting visual exemplars of unseen
classes for zero-shot learning. In IEEE Internatioanl Conference on Computer Vision
(ICCV), 2017.
[28] Wei-Lun Chao*, Soravit Changpinyo*, Boqing Gong, and Fei Sha. An empirical study and
analysis of generalized zero-shot learning for object recognition in the wild. In European
Conference on Computer Vision (ECCV), 2016.
[29] Wei-Lun Chao*, Boqing Gong*, Kristen Grauman, and Fei Sha. Large-margin determi-
nantal point processes. In Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
[30] Wei-Lun Chao*, Hexiang Hu*, and Fei Sha. Being negative but constructively: Lessons
learnt from creating better visual question answering datasets. In NAACL, 2018.
[31] Wei-Lun Chao*, Hexiang Hu*, and Fei Sha. Cross-dataset adaptation for visual question
answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2018.
[32] Wei-Lun Chao, Justin Solomon, Dominik L Michels, and Fei Sha. Exponential integration
for hamiltonian monte carlo. In International Conference on Machine Learning (ICML),
2015.
[33] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual
recognition using semantics-preserving adversarial embedding network. In CVPR, 2018.
[34] Tseng-Hung Chen, Yuan-Hong Liao, Ching-Yao Chuang, Wan-Ting Hsu, Jianlong Fu, and
Min Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In
ICCV, 2017.
[35] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass
kernel-based vector machines. JMLR, 2:265–292, 2002.
[36] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos´ e MF Moura,
Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017.
[37] Berkan Demirel, Ramazan Gokberk Cinbis, and Nazli Ikizler Cinbis. At-
tributes2classname: A discriminative model for attribute-based unsupervised zero-shot
learning. In ICCV, 2017.
[38] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In CVPR, 2009.
[39] Nan Ding, Sebastian Goodman, Fei Sha, and Radu Soricut. Understanding image and
text simultaneously: a dual vision-language machine comprehension task. arXiv preprint
arXiv:1612.07833, 2016.
[40] Zhengming Ding, Ming Shao, and Yun Fu. Low-rank embedded ensemble semantic dic-
tionary for zero-shot learning. In CVPR, 2017.
[41] Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. Improving zero-shot learning by
mitigating the hubness problem. arXiv preprint arXiv:1412.6568, 2014.
[42] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot
learning using purely textual descriptions. In ICCV, 2013.
129
[43] Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, and Ahmed Elgammal. Link the head to the
beak: Zero shot learning from noisy text description at part precision. In CVPR, 2017.
[44] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their
attributes. In CVPR, 2009.
[45] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. TPAMI,
28:594–611, 2006.
[46] Christiane Fellbaum. WordNet. Wiley Online Library, 1998.
[47] Vittorio Ferrari and Andrew Zisserman. Learning visual attributes. In NIPS, 2008.
[48] Francis Ferraro, Nasrin Mostafazadeh, Lucy Vanderwende, Jacob Devlin, Michel Galley,
Margaret Mitchell, et al. A survey of current datasets for vision and language research. In
EMNLP, 2015.
[49] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast
adaptation of deep networks. In International Conference on Machine Learning (ICML),
2017.
[50] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, MarcAurelio
Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In
NIPS, 2013.
[51] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Zhenyong Fu, and Shaogang Gong.
Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV,
2014.
[52] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong. Learning multimodal
latent attributes. TPAMI, 36(2):303–316, 2014.
[53] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. Transductive multi-
view zero-shot learning. TPAMI, 2015.
[54] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In CVPR,
2016.
[55] Zhenyong Fu, Tao Xiang, Elyor Kodirov, and Shaogang Gong. Zero-shot object recogni-
tion by semantic manifold distance. In CVPR, 2015.
[56] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus
Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual
grounding. In EMNLP, 2016.
[57] Chuang Gan, Yi Yang, Linchao Zhu, Deli Zhao, and Yueting Zhuang. Recognizing an
action using its name: A knowledge-based approach. IJCV, pages 1–17, 2016.
[58] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle,
Franc ¸ois Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial train-
ing of neural networks. JMLR, 17(59):1–35, 2016.
130
[59] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you
talking to a machine? dataset and methods for multilingual image question. In NIPS, pages
2296–2304, 2015.
[60] Efstratios Gavves, Thomas Mensink, T. Tommasi, Cees Snoek, and Tinne Tuytelaars. Ac-
tive transfer learning with zero-shot priors: Reusing past datasets for future tasks. In ICCV,
2015.
[61] Boqing Gong*, Wei-Lun Chao*, Kristen Grauman, and Fei Sha. Diverse sequential subset
selection for supervised video summarization. In Advances in Neural Information Pro-
cessing Systems (NIPS), 2014.
[62] Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Dis-
criminatively learning domain-invariant features for unsupervised domain adaptation. In
ICML, 2013.
[63] Boqing Gong, Kristen Grauman, and Fei Sha. Reshaping visual datasets for domain adap-
tation. In NIPS, 2013.
[64] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsu-
pervised domain adaptation. In CVPR, 2012.
[65] Mingming Gong, Kun Zhang, Tongliang Liu, Dacheng Tao, Clark Glymour, and Bernhard
Sch¨ olkopf. Domain adaptation with conditional transferable components. In ICML, pages
2839–2848, 2016.
[66] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[67] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making
the v in vqa matter: Elevating the role of image understanding in visual question answer-
ing. In CVPR, 2017.
[68] Kristen Grauman, Fei Sha, and Sung Ju Hwang. Learning a tree of metrics with disjoint
visual features. In NIPS, pages 621–629, 2011.
[69] Yuchen Guo, Guiguang Ding, Jungong Han, and Yue Gao. Synthesizing samples fro zero-
shot learning. In IJCAI, 2017.
[70] Yuchen Guo, Guiguang Ding, Jungong Han, and Yue Gao. Zero-shot learning with trans-
ferred samples. IEEE Transactions on Image Processing, 2017.
[71] Yuchen Guo, Guiguang Ding, Xiaoming Jin, and Jianmin Wang. Transductive zero-shot
recognition via shared model space learning. In AAAI, volume 3, page 8, 2016.
[72] Akshay Kumar Gupta. Survey of visual question answering: Datasets and techniques.
arXiv preprint arXiv:1705.03865, 2017.
[73] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In CVPR, 2016.
131
[74] Luis Herranz, Shuqiang Jiang, and Xiangyang Li. Scene recognition with cnns: Objects,
scales and dataset bias. In CVPR, 2016.
[75] Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander
Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving
zero-shot transfer in reinforcement learning. In ICML, 2017.
[76] Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In NIPS, 2002.
[77] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational
inference. Journal of Machine Learning Research (JMLR), 14(1):1303–1347, 2013.
[78] Hexiang Hu*, Wei-Lun Chao*, and Fei Sha. Learning answer embeddings for visual
question answering. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2018.
[79] I. Ilievski and J. Feng. A simple loss function for improving the convergence and accuracy
of visual question answering models. In CVPR Workshop, 2017.
[80] Allan Jabri, Armand Joulin, and Laurens van der Maaten. Revisiting visual question an-
swering baselines. In ECCV, 2016.
[81] Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multi-class open set recognition
using probability of inclusion. In ECCV, pages 393–409, 2014.
[82] Dinesh Jayaraman and Kristen Grauman. Zero-shot recognition with unreliable attributes.
In NIPS, 2014.
[83] Dinesh Jayaraman, Fei Sha, and Kristen Grauman. Decorrelating semantic visual attributes
by resisting the urge to share. In CVPR, 2014.
[84] Zhong Ji, Yuzhong Xie, Yanwei Pang, Lei Chen, and Zhongfei Zhang. Zero-shot learning
with multi-battery factor analysis. Signal Processing, 138:265–272, 2017.
[85] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir-
shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast
feature embedding. In ACM Multimedia, 2014.
[86] Huajie Jiang, Ruiping Wang, Shiguang Shan, Yi Yang, and Xilin Chen. Learning discrimi-
native latent attributes for zero-shot classification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 4223–4232, 2017.
[87] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zit-
nick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and
elementary visual reasoning. In CVPR, 2017.
[88] Kushal Kafle and Christopher Kanan. An analysis of visual question answering algorithms.
In ICCV, 2017.
[89] Kushal Kafle and Christopher Kanan. Visual question answering: Datasets, algorithms,
and future challenges. Computer Vision and Image Understanding, 163:3–20, 2017.
132
[90] Pichai Kankuekul, Aram Kawewong, Sirinart Tangruamsub, and Osamu Hasegawa. Online
incremental attribute-based zero-shot learning. In CVPR, 2012.
[91] Ken Kansky, Tom Silver, David A M´ ely, Mohamed Eldawy, Miguel L´ azaro-Gredilla,
Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George.
Schema networks: Zero-shot transfer with a generative causal model of intuitive physics.
In ICML, 2017.
[92] Nour Karessli, Zeynep Akata, Andreas Bulling, and Bernt Schiele. Gaze embeddings for
zero-shot image classification. In CVPR, 2017.
[93] Vahid Kazemi and Ali Elqursh. Show, ask, attend, and answer: A strong baseline for visual
question answering. arXiv preprint arXiv:1704.03162, 2017.
[94] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and
Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016.
[95] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba.
Undoing the damage of dataset bias. In ECCV, 2012.
[96] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR,
2015.
[97] Elyor Kodirov, Tao Xiang, Zhenyong Fu, and Shaogang Gong. Unsupervised domain
adaptation for zero-shot learning. In ICCV, 2015.
[98] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learn-
ing. In CVPR, 2017.
[99] Hans-Peter Kriegel, Peer Kr¨ oger, Erich Schubert, and Arthur Zimek. LoOP: local outlier
probabilities. In CIKM, 2009.
[100] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and
Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense
image annotations. IJCV, 2017.
[101] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with
deep convolutional neural networks. In NIPS, 2012.
[102] Neeraj Kumar, Alexander Berg, Peter N Belhumeur, and Shree Nayar. Describable visual
attributes for face verification and image search. TPAMI, 2011.
[103] Neeraj Kumar, Alexander C Berg, Peter N Belhumeur, and Shree K Nayar. Attribute and
simile classifiers for face verification. In CVPR. IEEE, 2009.
[104] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. One-shot learning by
inverting a compositional causal process. In NIPS, 2013.
[105] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen
object classes by between-class attribute transfer. In CVPR, 2009.
133
[106] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classifi-
cation for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014.
[107] Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. Hubness and pollution: Delv-
ing into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages
270–280, 2015.
[108] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents.
In ICML, 2014.
[109] Chung-Wei Lee, Wei Fang, Chih-Kuan Yeh, and Yu-Chiang Frank Wang. Multi-label
zero-shot learning with structured knowledge graphs. In CVPR, 2018.
[110] Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, and Honglak Lee. Hi-
erarchical novelty detection for visual object recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018.
[111] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, and Ruslan Salakhutdinov. Predicting deep
zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[112] Xin Li and Yuhong Guo. Max-margin zero-shot learning for multi-class classification. In
AISTATS, 2015.
[113] Xin Li, Yuhong Guo, and Dale Schuurmans. Semi-supervised zero-shot classification with
label representation learning. In ICCV, 2015.
[114] Yan Li, Junge Zhang, Jianguo Zhang, and Kaiqi Huang. Discriminative learning of latent
features for zero-shot recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018.
[115] Yanan Li, Donghui Wang, Huanhang Hu, Yuetan Lin, and Yueting Zhuang. Zero-shot
recognition using dual visual-semantic mapping paths. In CVPR, 2017.
[116] Zhenyang Li, Efstratios Gavves, Thomas Mensink, and Cees GM Snoek. Attributes make
sense on segmented objects. In ECCV. Springer, 2014.
[117] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Doll´ ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
ECCV, 2014.
[118] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in
the wild. In ICCV, 2015.
[119] Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. From
zero-shot learning to conventional supervised classification: Unseen visual data synthesis.
In CVPR, 2017.
134
[120] Yang Long and Ling Shao. Describing unseen classes by exemplars: Zero-shot learning
using grouped simile ensemble. In Applications of Computer Vision (WACV), 2017 IEEE
Winter Conference on, pages 907–915. IEEE, 2017.
[121] Jiang Lu, Jin Li, Ziang Yan, and Changshui Zhang. Zero-shot learning by generating
pseudo feature representations. arXiv preprint arXiv:1703.06389, 2017.
[122] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-
attention for visual question answering. In NIPS, pages 289–297, 2016.
[123] Yao Lu. Unsupervised learning of neural network outputs. In IJCAI, 2016.
[124] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering
about real-world scenes based on uncertain input. In NIPS, 2014.
[125] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in trans-
lation: Contextualized word vectors. In NIPS, 2017.
[126] Niall McLaughlin, Jesus Martinez Del Rincon, and Paul Miller. Data-augmentation for
reducing dataset bias in person re-identification. In Advanced Video and Signal Based
Surveillance (AVSS), 2015 12th IEEE International Conference on, 2015.
[127] Thomas Mensink, Efstratios Gavves, and Cees G.M. Snoek. Costa: Co-occurrence statis-
tics for zero-shot classification. In CVPR, 2014.
[128] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning
for large scale image classification: Generalizing to new classes at near-zero cost. In
ECCV, 2012.
[129] Thomas Mensink, Jakob Verbeek, Florent Perronnin, and Gabriela Csurka. Distance-based
image classification: Generalizing to new classes at near-zero cost. TPAMI, 35(11):2624–
2637, 2013.
[130] Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. In ICLR Workshops, 2013.
[131] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeffrey Dean. Distributed
representations of words and phrases and their compositionality. In NIPS, 2013.
[132] George A Miller. Wordnet: a lexical database for english. Communications of the ACM,
38(11):39–41, 1995.
[133] Pedro Morgado and Nuno Vasconcelos. Semantically consistent regularization for zero-
shot recognition. In CVPR, 2017.
[134] Li Niu, Ashok Veeraraghavan, and Ashu Sabharwal. Webly supervised learning meets
zero-shot learning: A hybrid approach for fine-grained classification. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
135
[135] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, An-
drea Frome, Greg S. Corrado, and Jeffrey Dean. Zero-shot learning by convex combination
of semantic embeddings. In ICLR, 2014.
[136] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task general-
ization with multi-task deep reinforcement learning. In ICML, 2017.
[137] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring
mid-level image representations using convolutional neural networks. In CVPR, 2014.
[138] Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. Zero-shot
learning with semantic output codes. In NIPS, 2009.
[139] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on
knowledge and data engineering, 22(10):1345–1359, 2010.
[140] Devi Parikh and Kristen Grauman. Relative attributes. In ICCV, 2011.
[141] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide
Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot
visual imitation. In ICLR, 2018.
[142] Genevieve Patterson, Chen Xu, Hang Su, and James Hays. The sun attribute database:
Beyond categories for deeper scene understanding. IJCV, 108(1-2):59–81, 2014.
[143] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors
for word representation. In EMNLP, 2014.
[144] Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Less is more:
zero-shot learning from online textual documents with noise suppression. In CVPR, pages
2249–2257, 2016.
[145] Jie Qin, Yunhong Wang, Li Liu, Jiaxin Chen, and Ling Shao. Beyond semantic attributes:
Discrete latent attributes learning for zero-shot recognition. IEEE signal processing letters,
23(11):1667–1671, 2016.
[146] Santhosh K Ramakrishnan, Ambar Pal, Gaurav Sharma, and Anurag Mittal. An empirical
evaluation of visual question answering for novel objects. In CVPR, 2017.
[147] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H. Lampert. iCaRL: In-
cremental classifier and representation learning. In CVPR, 2017.
[148] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations
of fine-grained visual descriptions. In CVPR, 2016.
[149] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Honglak Lee, and Bernt
Schiele. Generative adversarial text to image synthesis. In ICML, 2016.
[150] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image
question answering. In NIPS, 2015.
136
[151] Danilo Jimenez Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wier-
stra. One-shot generalization in deep generative models. arXiv preprint arXiv:1603.05106,
2016.
[152] Marko Ristin, Matthieu Guillaumin, Juergen Gall, and Luc Van Gool. Incremental learning
of random forests for large-scale image classification. TPAMI, 38(3):490–503, 2016.
[153] Marcus Rohrbach, Sandra Ebert, and Bernt Schiele. Transfer learning in a transductive
setting. In NIPS, 2013.
[154] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and
zero-shot learning in a large-scale setting. In CVPR, 2011.
[155] Marcus Rohrbach, Michael Stark, Gy¨ orgy Szarvas, Iryna Gurevych, and Bernt Schiele.
What helps where–and why? semantic relatedness for knowledge transfer. In CVPR,
2010.
[156] Bernardino Romera-Paredes and Philip H. S. Torr. An embarrassingly simple approach to
zero-shot learning. In ICML, 2015.
[157] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252,
2015.
[158] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirk-
patrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural net-
works. arXiv preprint arXiv:1606.04671, 2016.
[159] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category
models to new domains. ECCV, 2010.
[160] Ruslan Salakhutdinov, Antonio Torralba, and Josh Tenenbaum. Learning to share visual
appearance for multiclass object detection. In CVPR, 2011.
[161] Jorge S´ anchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classifi-
cation with the fisher vector: Theory and practice. IJCV, 105(3):222–245, 2013.
[162] Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult.
Toward open set recognition. TPAMI, 35(7):1757–1772, 2013.
[163] Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set
recognition. TPAMI, 36(11):2317–2324, 2014.
[164] Bernhard Sch¨ olkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New
support vector algorithms. Neural computation, 12(5):1207–1245, 2000.
[165] Bernhard Sch¨ olkopf and Alexander J Smola. Learning with kernels: support vector ma-
chines, regularization, optimization, and beyond. MIT press, 2002.
137
[166] Yaxin Shi, Donna Xu, Yuangang Pan, and Ivor W Tsang. Multi-context label embedding.
arXiv preprint arXiv:1805.01199, 2018.
[167] Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. Ridge
regression, hubness, and zero-shot learning. In Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, pages 135–151. Springer, 2015.
[168] Kevin J Shih, Saurabh Singh, and Derek Hoiem. Where to look: Focus regions for visual
question answering. In CVPR, 2016.
[169] Seyed Mohsen Shojaee and Mahdieh Soleymani Baghshah. Semi-supervised zero-shot
learning by a clustering-based approach. arXiv preprint arXiv:1605.09016, 2016.
[170] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2015.
[171] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot
learning. In NIPS, 2017.
[172] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot
learning through cross-modal transfer. In NIPS, 2013.
[173] Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. Transductive unbi-
ased embedding for zero-shot learning. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, 2018.
[174] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adapta-
tion. In AAAI, 2016.
[175] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In CVPR, 2015.
[176] Kevin D Tang, Marshall F Tappen, Rahul Sukthankar, and Christoph H Lampert. Optimiz-
ing one-shot recognition with micro-set learning. In CVPR, 2010.
[177] Yee Whye Teh and Michael I Jordan. Hierarchical bayesian nonparametric models with
applications. Bayesian nonparametrics, 1, 2010.
[178] Damien Teney and Anton van den Hengel. Zero-shot visual question answering. arXiv
preprint arXiv:1611.05546, 2016.
[179] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at
dataset bias. In German Conference on Pattern Recognition, 2015.
[180] Tatiana Tommasi and Tinne Tuytelaars. A testbed for cross-dataset analysis. In ECCV
Workshop, 2014.
[181] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
138
[182] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer
across domains and tasks. In ICCV, 2015.
[183] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative
domain adaptation. In CVPR, 2017.
[184] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 9(2579-
2605):85, 2008.
[185] Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney,
Trevor Darrell, and Kate Saenko. Captioning images with diverse objects. In CVPR, 2017.
[186] Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan
Wierstra. Matching networks for one shot learning. In NIPS, 2016.
[187] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-
200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technol-
ogy, 2011.
[188] Donghui Wang, Yanan Li, Yuetan Lin, and Yueting Zhuang. Relational knowledge transfer
for zero-shot learning. In AAAI, 2016.
[189] Qian Wang and Ke Chen. Zero-shot visual recognition via bidirectional latent embedding.
arXiv preprint arXiv:1607.02104, 2016.
[190] Wenlin Wang, Y Pu, VK Verma, K Fan, Y Zhang, C Chen, P Rai, and L Carin. Zero-shot
learning via class-conditioned deep generative models. In AAAI Conference on Artificial
Intelligence (AAAI-18), Louisiana, USA, volume 5, 2018.
[191] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic em-
beddings and knowledge graphs. In CVPR, 2018.
[192] Xiaoyang Wang and Qiang Ji. A unified probabilistic approach modeling relationships
between attributes and objects. In ICCV, 2013.
[193] Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning
to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
[194] Georg Wiese, Dirk Weissenborn, and Mariana L. Neves. Neural domain adaptation for
biomedical question answering. In CoNLL, 2017.
[195] Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den
Hengel. Visual question answering: A survey of methods and datasets. Computer Vision
and Image Understanding, 163:21–40, 2017.
[196] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In ACL, 1994.
[197] Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, and Leonid Sigal. Harnessing object and scene
semantics for large-scale video understanding. In CVPR, 2016.
139
[198] Yongqin Xian, Zeynep Akata, and Bernt Schiele. Zero-shot learning – the Good, the Bad
and the Ugly. arXiv preprint arXiv:1703.04394, 2017.
[199] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt
Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
[200] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating net-
works for zero-shot learning. arXiv preprint arXiv:1712.00981, 2017.
[201] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning-the good, the bad and
the ugly. In CVPR, 2017.
[202] Jianxiong Xiao, James Hays, Krista Ehinger, Aude Oliva, and Antonio Torralba. Sun
database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[203] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial
attention for visual question answering. In ECCV, 2016.
[204] Xing Xu, Fumin Shen, Yang Yang, Dongxiang Zhang, Heng Tao Shen, and Jingkuan Song.
Matrix tri-factorization with manifold regularizations for zero-shot learning. In CVPR,
2017.
[205] Xun Xu, Timothy Hospedales, and Shaogang Gong. Transductive zero-shot action recog-
nition by word-vector embedding. International Journal of Computer Vision, 123(3):309–
333, 2017.
[206] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional
image generation from visual attributes. In ECCV, 2016.
[207] Yongxin Yang and Timothy M Hospedales. A unified perspective on multi-domain and
multi-task learning. In ICLR, 2015.
[208] Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W Cohen. Semi-supervised
qa with generative domain-adaptive nets. In ACL, 2017.
[209] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention
networks for image question answering. In CVPR, 2016.
[210] Meng Ye and Yuhong Guo. Zero-shot classification with discriminative semantic represen-
tation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7140–7148, 2017.
[211] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual
question answering. In CVPR, 2017.
[212] Felix X. Yu, Liangliang Cao, Rog´ erio Schmidt Feris, John R. Smith, and Shih-Fu Chang.
Designing category-level attributes for discriminative visual recognition. In CVPR, 2013.
[213] Licheng Yu, Eunbyung Park, Alexander C Berg, and Tamara L Berg. Visual madlibs: Fill
in the blank description generation and question answering. In ICCV, 2015.
140
[214] Xiaodong Yu and Yiannis Aloimonos. Attribute-based transfer learning for object catego-
rization with zero/one training example. In ECCV, 2010.
[215] Hongguang Zhang and Piotr Koniusz. Zero-shot kernel learning. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[216] Ke Zhang*, Wei-Lun Chao*, Fei Sha, and Kristen Grauman. Summary transfer: Exemplar-
based subset selection for video summarization. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2016.
[217] Ke Zhang*, Wei-Lun Chao*, Fei Sha, and Kristen Grauman. Video summarization with
long short-term memory. In European Conference on Computer Vision (ECCV), 2016.
[218] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-
shot learning. In CVPR, 2017.
[219] Peng Zhang, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Yin and
yang: Balancing and answering binary visual questions. In CVPR, 2016.
[220] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity em-
bedding. In ICCV, 2015.
[221] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity
embedding. In CVPR, 2016.
[222] Ziming Zhang and Venkatesh Saligrama. Zero-shot recognition via structured prediction.
In ECCV, 2016.
[223] Bo Zhao, Yanwei Fu, Rui Liang, Jiahong Wu, Yonggang Wang, and Yizhou Wang. A large-
scale attribute dataset for zero-shot learning. arXiv preprint arXiv:1804.04314, 2018.
[224] Bo Zhao, Xinwei Sun, Yuan Yao, and Yizhou Wang. Zero-shot learning via shared-
reconstruction-graph pursuit. arXiv preprint arXiv:1711.07302, 2017.
[225] Bo Zhao, Botong Wu, Tianfu Wu, and Yizhou Wang. Zero-shot learning posed as a missing
data problem. In Proceedings of ICCV Workshop, pages 2616–2622, 2017.
[226] Yu Zhou, Yu Jun, Xiang Chenchao, Fan Jianping, and Tao Dacheng. Beyond bilinear: Gen-
eralized multi-modal factorized high-order pooling for visual question answering. arXiv
preprint arXiv:1708.03619, 2017.
[227] Yu Zhou, Yu Jun, Fan Jianping, and Tao Dacheng. Multi-modal factorized bilinear pooling
with co-attention learning for visual question answering. In ICCV, 2017.
[228] Xiangxin Zhu, Dragomir Anguelov, and Deva Ramanan. Capturing long-tail distributions
of object subcategories. In CVPR, 2014.
[229] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, and Ahmed Elgammal. A generative
adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.
[230] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question
answering in images. In CVPR, 2016.
141
Abstract
Developing intelligent systems for vision and language understanding has long been a central part of how people envision the future. In the past few years, with access to large-scale data and advances in machine learning algorithms, vision and language understanding has made significant progress in constrained environments. It remains challenging, however, in unconstrained environments in the wild, where an intelligent system must handle unseen objects and unfamiliar language usage that it has not been trained on. Transfer learning, which aims to transfer and adapt knowledge learned in the training environment to a different but related test environment, has thus emerged as a promising framework to remedy this difficulty.

In my thesis, I focus on two challenging paradigms of transfer learning: zero-shot learning and domain adaptation. I begin with zero-shot learning (ZSL), which aims to expand knowledge learned from seen objects, for which we have training data, to unseen objects, for which we have none. I present an algorithm, SynC, that can construct the classifier of any object class from its semantic representation, even without training data, followed by a comprehensive study of how to apply it in different environments. The study further suggests directions for improving the semantic representation, leading to an algorithm, EXEM, that broadly benefits existing ZSL algorithms.

I then describe an adaptive visual question answering (Visual QA) framework that builds upon the insight of zero-shot learning and can further adapt its knowledge to new environments given limited information. Along the way, we also revisit and revise existing Visual QA datasets to ensure that a learned model faithfully comprehends and reasons over both the visual and the language information, rather than relying on incidental statistics to perform the task.

For both zero-shot learning for object recognition and domain adaptation for visual question answering, we conduct extensive empirical studies on multiple (large-scale) datasets and experimental settings to demonstrate the superior performance and applicability of our proposed algorithms toward developing intelligent systems in the wild.
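To make the zero-shot setting concrete, the sketch below illustrates the general idea of constructing classifiers for unseen classes from their semantic representations: each unseen-class classifier is formed as a similarity-weighted combination of classifiers trained on seen classes, with similarities measured in the semantic (attribute) space. This is a minimal, hypothetical NumPy sketch of the underlying idea, not the exact SynC formulation; the random data, dimensions, and bandwidth are illustrative assumptions only.

import numpy as np

# Hypothetical toy setup: D-dimensional image features, A-dimensional class attributes.
rng = np.random.default_rng(0)
D, A = 64, 16
num_seen, num_unseen = 8, 4

W_seen = rng.normal(size=(num_seen, D))      # one trained weight vector per seen class
S_seen = rng.normal(size=(num_seen, A))      # semantic (attribute) vectors of seen classes
S_unseen = rng.normal(size=(num_unseen, A))  # semantic vectors of unseen classes

def synthesize_classifiers(S_query, S_base, W_base, bandwidth=1.0):
    """Form each query-class classifier as a similarity-weighted combination of
    base-class classifiers, with similarity computed in the semantic space."""
    d2 = ((S_query[:, None, :] - S_base[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    sim = np.exp(-d2 / (2 * bandwidth ** 2))                        # RBF similarities
    sim /= sim.sum(axis=1, keepdims=True)                           # normalize rows to weights
    return sim @ W_base                                             # combine base classifiers

W_unseen = synthesize_classifiers(S_unseen, S_seen, W_seen)

# Classify a test image among the unseen classes by the largest score.
x = rng.normal(size=D)
pred = int(np.argmax(W_unseen @ x))
print("predicted unseen class index:", pred)

In practice, the base classifiers and the combination weights are learned jointly from labeled seen-class data, and the semantic vectors come from attribute annotations or word embeddings rather than random values; the sketch only conveys how semantic similarity lets knowledge transfer from seen to unseen classes.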