Modeling, Learning, and Leveraging Similarity

by

Soravit Changpinyo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2018

Copyright 2018 Soravit Changpinyo

Acknowledgments

I have learned and grown tremendously over the past six years. I will be forever grateful for the people, the places, the things, the opportunities, and the challenges that have made me the very person I am today.

First of all, I would like to thank my advisor, Fei Sha. Among endless pieces of his wisdom and advice, what has stuck with me the most are the values Fei placed on consistently striving for improvement and growth, conviction, planning, hard work, clear communication, and the courage to be honest with ourselves and our work. I still remember our first meeting on the USC Visit Day, during which his enthusiasm (and, later in the day, his derivation of dual decomposition in a lab meeting) made me excited. Since then, countless hours of interactions have shaped me into a much more competent thinker and doer. I also appreciate the immense effort Fei put into creating a fantastic environment for doing research, along with special treats here and there that made all the lab members feel so privileged. Thank you so much, Fei, for always having my back.

I would also like to thank Professors Kevin Knight, C.-C. Jay Kuo, Hao Li, and Nora Ayanian for serving on my Language-Vision-Graphics-Robotics super committee. Special thanks to Kevin for his kindness and for teaching my favorite class at USC. Kevin also fueled my interest in NLP and gave me invaluable advice on how to do research.

It had all started at the first lecture of Professor Erik Sudderth's "Introduction to Machine Learning" at Brown University. Later I found myself TAing for that class, giving presentations on variational inference for Dirichlet and Indian Buffet processes in another class of his, and doing an Honors thesis with him. With Erik's guidance and patience, I stepped into the machine learning research world for the first time.

I am incredibly fortunate to have had opportunities to learn from great mentors during the summers of 2016 and 2017: Mark Sandler and Andrey Zhmoginov at Google Research, Mountain View; Richard Zens, Yin-Wen Chang, and Kishore Papineni at Google Research, NYC. My short visit to Inria Lille could not have gone so smoothly, or been so fun, without the afternoon coffee and foosball with the MAGNET/LEGO team. Those experiences have provided me with different sets of skills and perspectives, both related and unrelated to research, that I would not have gotten otherwise.

My PhD life and the research described in this thesis have benefited from past and present collaborators, ShaLab members, as well as fellow PhD Trojans. Boqing Gong set an example of how to efficiently and effectively manage research projects from start to finish. He also introduced me to the problem of zero-shot learning. Harry Chao's attention to detail and ability to get any experiments done in time are beyond superior, but more than that he is one of the most caring persons I know. Zhiyun Lu is a symbol of candidness and over the years has made me better at speaking my mind. Despite our struggles as PhD beginners, Kuan Liu was an excellent collaborator during our time on Similarity Component Analysis. Frank Hu's enthusiasm brought a lot of positive energy into the lab.
His coding, system, and figure creation skills proved extremely useful during our collaboration period. I would also like to thank Tomer Levinboim, Erica Greene, Yuan Shi, Dingchao Lu, Alireza Bagheri Garakani, Ke Zhang, Franziska Meier, Aurélien Bellet, Chao-Kai Chiang, Aaron Chan, Yiming Yan, Shariq Iqbal, Jeremy Hsu, Yury Zemlyanskiy, Ivy Xiao, Liyu Chen, Bowen Zhang, Séb Arnold, Melissa Ailem, Rishi Rawat, Wenzhe Li, Joseph Veloce, Caitlyn Clabaugh, Alana Shine, Hang Ma, Dong Guo, Anqi Wu, Anand Kumar Narayanan, James Preiss, Shao-Hua Sun, and Liam Li. Space does not permit me to narrate all the stories, but I feel extremely grateful to be surrounded by such smart and kind individuals on a daily basis.

I could not have asked for more supportive people from the USC Department of Computer Science and the USC Viterbi School of Engineering. Lizsl De Leon's helpfulness, efficiency, laughter, and "Email of the Day" will always be appreciated and remembered. I would also like to thank Jennifer Gerson, Tracy Charles, Lifeng (Mai) Lee, and Nina Shilling for assisting me with administrative matters.

I would not have survived the PhD journey without being physically well. I thank Dr. Shinada for his reliable care over the years and for enabling me to have more good days than bad ones. Active members of the badminton club made sure that at least once a month I fully took my mind off research. I learned from Coach Luis Paulo Oliveira and Coach John Jessee how I could be a more skilled soccer and tennis player. Special thanks to Coach Luis for the little encouraging things he said every time we met, which to me are far from little.

I thank the Thai community for their friendship (and for tolerating, at times, my random schedule and withdrawals from life). I thank my roommates Dave, Toey, Toy, Joke, and Nam for their help, random chats, eating out, and more. Recently, Jelly has become my role model on how to approach life with the right attitude. YernClub members have provided an online space for stream of consciousness. Special thanks to Namwan for being such a considerate, supportive, and funny-without-even-trying person; I am amazed by your ability to perceive and see through the human mind. I would also like to thank the Google Intern and SGI groups for teaching me how to better support others. To my fellow Brunonians, TS, and other friends: thank you for your support and childhood influences on maturity and proactivity, as well as your eagerness to catch up whenever you are in town. It takes a village and I have one. Thank you! Additionally, I thank all strangers, TV shows, books, tweets, alternative music, TED talks, cafes (especially ones with free internet and great cold brew), and Roger Federer for bringing me joy and sometimes even courage.

The PhD milestone and my other successes as a student are largely due to the unconditional love and support from my parents and sisters. I cannot thank them enough for always being there and for everything they did and continue to do for me. I hope this thesis has made them proud.

Table of Contents Acknowledgments ii List of Tables viii List of Figures xii Abstract xvii 1 Introduction 1 1.1 Modeling, Learning, and Leveraging Similarity: Motivation and Methods . . . . 2 1.2 Overview of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 I Background 7 2 Similarity Learning 8 2.1 Metric Learning . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . 8 2.1.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.2 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.3 Mahalanobis Metric Learning . . . . . . . . . . . . . . . . . . . . . . . 9 3 Learning with limited data 13 3.1 Zero-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.1.1 Semantic Representations for Relating Seen and Unseen Classes . . . . . 14 3.1.1.1 Visual Attributes . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.1.1.2 Word Vector Representations of Class Names . . . . . . . . . 16 3.1.1.3 Other Types of Semantic Representations . . . . . . . . . . . . 20 3.1.2 Zero-shot Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . 20 3.1.2.1 Two-Stage Approaches . . . . . . . . . . . . . . . . . . . . . 20 3.1.2.2 Unified Approaches . . . . . . . . . . . . . . . . . . . . . . . 21 3.2 Transfer and Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 23 iv II Modeling and Learning Similarity 26 4 Similarity Component Analysis 27 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.1 Probabilistic Model of Similarity . . . . . . . . . . . . . . . . . . . . . 29 4.2.2 Inference and Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3.2 Multiway Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3.3 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 III Learning and Leveraging Semantic Similarity for Zero-Shot Visual Recog- nition 41 5 Algorithms 42 5.1 Synthesized Classifiers for Zero-Shot Learning . . . . . . . . . . . . . . . . . . 42 5.1.1 Main Idea: Manifold Learning . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.2 Learning Phantom Classes . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1.3 Extension: Learning Metrics for Computing Similarities Between Se- mantic Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.1.4 Classification With Synthesized Classifiers . . . . . . . . . . . . . . . . 46 5.2 Predicting Visual Exemplars of Unseen Classes for Zero-Shot Learning . . . . . 47 5.2.1 Main Idea: The Auxiliary Task of Predicting Visual Exemplars . . . . . . 48 5.2.2 Learning A Function To Predict Visual Exemplars From Semantic Rep- resentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2.3 Zero-Shot Learning Based On Predicted Visual Exemplars . . . . . . . . 49 5.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 6 Generalized Zero-Shot Learning 51 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 6.2 Generalized Zero-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.3.1 Calibrated Stacking . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 53 6.3.2 Area Under Seen-Unseen Accuracy Curve (AUSUC) . . . . . . . . . . . 54 6.3.3 Alternative Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.3.3.1 Novelty Detection Algorithms . . . . . . . . . . . . . . . . . . 55 6.3.3.2 Relation to Calibrated Stacking . . . . . . . . . . . . . . . . . 56 6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 v 7 Zero-Shot Learning Experiments 57 7.1 General Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 7.1.2 Semantic Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 7.1.3 Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 7.1.4 Evaluation Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 7.1.5 Summary of Variants of Our Methods . . . . . . . . . . . . . . . . . . . 61 7.2 Performances of Our Proposed Algorithms . . . . . . . . . . . . . . . . . . . . . 62 7.2.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7.2.2 Large-Scale Zero-Shot Classification Results . . . . . . . . . . . . . . . 63 7.2.3 Generalized Zero-Shot Learning Results . . . . . . . . . . . . . . . . . . 64 7.3 Detailed Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 7.3.1 Detailed Results and Analysis on SYNC . . . . . . . . . . . . . . . . . . 66 7.3.2 Detailed Results and Analysis on EXEM . . . . . . . . . . . . . . . . . 68 7.3.3 Detailed Results and Analysis on GZSL . . . . . . . . . . . . . . . . . . 72 IV Learning and Leveraging Task Similarity for Sequence Tagging 77 8 Multi-Task Learning for Sequence Tagging 78 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 8.2 Multi-Task Learning for Sequence Tagging . . . . . . . . . . . . . . . . . . . . 79 8.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 8.3.1 Datasets and Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 8.3.2 Metrics and Score Comparison . . . . . . . . . . . . . . . . . . . . . . . 84 8.3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 8.3.4 Various Settings for Learning from Multiple Tasks . . . . . . . . . . . . 84 8.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 8.4.1 Pairwise MTL Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 8.4.2 All MTL Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 8.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 8.6 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 V Concluding Remarks and Future Work 97 9 Conclusion and Future Directions 98 9.1 Rethinking Similarity Modeling and Learning . . . . . . . . . . . . . . . . . . . 98 9.1.1 Case Study: Modeling non-metric similarity in unsupervised learning with applications to word sense discovery . . . . . . . . . . . . . . . . . 98 9.2 Correcting and Reliably Leveraging Similarity Graphs for Effective Learning . . 102 9.2.1 Case Study: Improving the quality of semantic representations . . . . . . 103 9.2.2 Case Study: Generalized few-shot learning . . . . . . . . . . . . . . . . 104 9.3 Applications . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . 105 9.3.1 Case Study: Beyond image classification in computer vision . . . . . . . 105 vi 9.3.2 Case Study: Natural language processing and understanding . . . . . . . 106 Bibliography 106 Appendices 120 A Zero-shot learning experiments 121 A.1 Expanded Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.2 Expanded Large-Scale Zero-Shot Classification Results . . . . . . . . . . . . . . 124 A.3 Supplementary Material on SYNC . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.4 Supplementary Material on EXEM . . . . . . . . . . . . . . . . . . . . . . . . 124 A.5 From Zero-Shot to Few-Shot Learning . . . . . . . . . . . . . . . . . . . . . . . 129 A.5.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 A.5.2 Detailed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.6 Supplementary Material on GZSL . . . . . . . . . . . . . . . . . . . . . . . . . 137 A.7 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 B Multi-Task Learning Experiments 144 B.1 Comparison Between Different MTL Approaches . . . . . . . . . . . . . . . . . 144 B.2 Additional Results on All-but-one Settings . . . . . . . . . . . . . . . . . . . . . 144 B.3 Detailed Results Separated By the Tasks Being Tested On . . . . . . . . . . . . . 144 vii List of Tables 4.1 Similarity prediction accuracies and standard errors (%) on the synthetic dataset . . . . 34 4.2 Misclassification rates (%) on the MNIST recognition task . . . . . . . . . . . . . . . 34 4.3 Section names for NIPS 1987 – 1999 . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.4 Link prediction accuracies and their standard errors (%) on a network of scientific papers NIPS 0-12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5 Link prediction accuracies and their standard errors (%) on a network of scientific papers NIPS-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.6 Top five ToW features for each metric . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.7 Top three ToP features for each metric . . . . . . . . . . . . . . . . . . . . . . . . . 40 6.1 Classification accuracies (%) on conventional ZSL (A U!U ), multi-class classification for seen classes (A S!S ), and GZSL (A S!T andA U!T ), on AwA and CUB. Significant drops are observed fromA U!U toA U!T . . . . . . . . . . . . . . . . . . . . . . . . 53 7.1 Key characteristics of studied datasets . . . . . . . . . . . . . . . . . . . . . . . . . 58 7.2 Comparison between existing ZSL approaches in multi-way classification accuracies (in %) on four benchmark datasets. For each dataset, we mark the best in red and the second best in blue. Italic numbers denote per-sample accuracy instead of per-class accuracy. On ImageNet, we report results for both types of semantic representations: Word vectors (wv) and MDS embeddings derived from WordNet (hie). All the results are based on GoogLeNet features [234]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 7.3 Comparison between existing ZSL approaches on ImageNet using word vectors of the class names as semantic representations. For both metrics (in %), the higher the better. The best is in red. The numbers of unseen classes are listed in parentheses. . . . . . . . 
63 7.4 Comparison between existing ZSL approaches on ImageNet (with 20,842 unseen classes) using MDS embeddings derived from WordNet [154] as semantic representations. The higher, the better (in %). The best is in red. . . . . . . . . . . . . . . . . . . . . . . 63 7.5 Generalized ZSL results in Area Under Seen-Unseen accuracy Curve (AUSUC) on AwA, CUB, and SUN. For each dataset, we mark the best in red and the second best in blue. All approaches use GoogLeNet as the visual features and calibrated stacking to combine the scores for seen and unseen classes. . . . . . . . . . . . . . . . . . . . . . . . . . 64 7.6 Detailed analysis of various methods: the effect of feature and attribute types on multi-way classification accuracies (in %). Within each column, the best is in red and the 2nd best is in blue. We cite both previously published results (numbers in bold italics) and results from our implementations of those competing methods (numbers in normal font) to enhance comparability and to ease analysis (see texts for details). We use the shallow features provided by [137], [109], [192] for AwA, CUB, SUN, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 viii 7.7 Effect of types of semantic representations on AwA. . . . . . . . . . . . . . . . . . . 66 7.8 Effect of learning semantic representations . . . . . . . . . . . . . . . . . . . . . . . 67 7.9 Performance of our method under different numberS of seen classes on CUB. The num- ber of unseen classesU is fixed to be 50. . . . . . . . . . . . . . . . . . . . . . . . . 68 7.10 Effect of learning metrics for computing semantic similarity on AwA. . . . . . . . . . 68 7.11 We compute the Euclidean distance matrix between the unseen classes based on semantic representations (D au ), predicted exemplars (D (au) ), and real exem- plars (D vu ). Our method leads toD (au) that is better correlated withD vu than D au is. See text for more details. . . . . . . . . . . . . . . . . . . . . . . . . . . 69 7.12 ZSL results in the per-class multi-way classification accuracies (in %) on AwA using word vectors as semantic representations. We use the 1,000-dimensional word vectors in [72]. All approaches use GoogLeNet as the visual features. . . . 71 7.13 Accuracy of EXEM (1NN) on AwA, CUB, and SUN when predicted exemplars are from original visual features (No PCA) and PCA-projected features (PCA withd = 1024 andd = 500). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 7.14 Comparison between EXEM (1NN) with support vector regressors (SVR) and with 2-layer multi-layer perceptron (MLP) for predicting visual exemplars. Results on CUB are for the first split. Each number for MLP is an average over 3 random initialization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 7.15 Performances measured in AUSUC of several methods for Generalized Zero-Shot Learning on AwA and CUB. The higher the better (the upper bound is 1). . . . . 72 7.16 Performance measured in AUSUC for novelty detection (Gaussian and LoOP) and calibrated stacking on AwA and CUB. Hyper-parameters are cross-validated to maximize accuracies. Calibrated stacking outperforms Gaussian and LoOP in all cases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 7.17 Performances measured in AUSUC by different zero-shot learning approaches on GZSL on ImageNet, using our method of calibrated stacking. . . . . . . . . . . 
75 8.1 Datasets used in our experiments, as well as their key characteristics and their corre- sponding tasks. / is used to separate statistics for training data only and those for all subsets of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 8.2 Examples of input-output pairs of the tasks in consideration . . . . . . . . . . . . . . 83 8.3 F1 scores for MULTI-DEC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 8.4 F1 scores for TEDEC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 8.5 F1 scores for TEENC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 ix 8.6 F1 scores for MULTI-DEC. We compare All with All-but-one settings (All -hTASKi). We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 9.1 Nearest words to the target words under different senses . . . . . . . . . . . . . . . . 100 9.2 Evaluation of multiple embeddings on the task of word similarity. is the Spearman rank correlation with human judgements. The higher, the better. . . . . . . . . . . . . . . . 101 9.3 Evaluation of multiple embeddings on the contextual word similarity task on the SCWS dataset. is the Spearman rank correlation with human judgements. The higher, the better. 102 A.1 Expanded comparison (cf. Table 7.2) to existing ZSL approaches in the multi-way clas- sification accuracies (in %) on AwA, CUB, and SUN. For each dataset, we mark the best in red and the second best in blue. We include results of recent ZSL methods with other types of deep features (VGG by [225] and GoogLeNet V2 by [104]) and/or different evaluation metrics. See text for details on how to interpret these results. . . . . . . . . . 122 A.2 Expanded comparison (cf. Table 7.3) to existing ZSL approaches on ImageNet using word vectors of the class names as semantic representations. For both types of metrics (in %), the higher the better. The best is in red. AlexNet is by [128]. The number of actual unseen classes are given in parentheses. . . . . . . . . . . . . . . . . . . . . . 123 A.3 Overlap of k-nearest classes (in %) on AwA, CUB, SUN. We measure the overlap be- tween those searched by real exemplars and those searched by semantic representations (i.e., attributes) or predicted exemplars. We set k to be 40 % of the number of unseen classes. See text for more details. . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.4 Comparison of performances measured in AUSUC between GZSL (using WORD2VEC and G-attr) and multi-class classification on ImageNet-2K. Few-shot results are av- eraged over 100 rounds. 
GZSL with G-attr improves upon GZSL with WORD2VEC significantly and quickly approaches multi-class classification performance. . . . . . . . 138 A.5 Comparison of performances measured in AUSUC between GZSL with WORD2VEC and GZSL with G-attr on the full ImageNet with 21,000 unseen classes. Few-shot results are averaged over 20 rounds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 A.6 Comparison of performance measured in AUSUC between GZSL (using (human-defined) visual attributes and G-attr) and multi-class classification on AwA and CUB. Few-shot results are averaged over 1,000 rounds. GZSL with G-attr improves upon GZSL with visual attributes significantly. On CUB, the performance of GZSL with visual attributes almost matches that of multi-class classification. . . . . . . . . . . . . . . . . . . . . 140 B.1 Comparison between MTL approaches . . . . . . . . . . . . . . . . . . . . . . . . 145 B.2 F1 scores for TEDEC. We compare All with All-but-one settings (All -hTASKi). We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 B.3 F1 scores for TEENC. We compare All with All-but-one settings (All -hTASKi). We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 B.4 F1 score tested on the task UPOS in different training scenarios . . . . . . . . . . . . . 147 B.5 F1 score tested on the task XPOS in different training scenarios . . . . . . . . . . . . . 147 B.6 F1 score tested on the task CHUNK in different training scenarios . . . . . . . . . . . . . 148 B.7 F1 score tested on the task NER in different training scenarios . . . . . . . . . . . . . . 148 B.8 F1 score tested on the task MWE in different training scenarios . . . . . . . . . . . . . . 149 x B.9 F1 score tested on the task SEM in different training scenarios . . . . . . . . . . . . . . 149 B.10 F1 score tested on the task SEMTR in different training scenarios . . . . . . . . . . . . . 150 B.11 F1 score tested on the task SUPSENSE in different training scenarios . . . . . . . . . . . 150 B.12 F1 score tested on the task COM in different training scenarios . . . . . . . . . . . . . . 151 B.13 F1 score tested on the task FRAME in different training scenarios . . . . . . . . . . . . . 151 B.14 F1 score tested on the task HYP in different training scenarios . . . . . . . . . . . . . . 152 xi List of Figures 1.1 Visual similarity. Figure taken from [252]. . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 t-SNE visualization of Wikipedia articles. Each article is converted into a vector by Doc2Vec [140]. Figure taken from Christopher Olah’s blog on Visualizing Representa- tions: Deep Learning and Human Beings. . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 A learned similarity graph of computer vision tasks. Figure taken from [274]. . . . . . . 4 2.1 Illustration of large margin nearest neighbors (LMNN) metric learning. In this example, x j andx k , respectively, are the target and imposter examples ofx i . LMNN pullsx j to x i and pushesx k fromx i during the learning process. Figure taken from [15] which in turn is an adaptation of the similar figure in [253]. . . . . . . . . . . . . . . . . . . . 9 2.2 Illustration of multiple-metric large margin nearest neighbors (mm-LMNN) metric learn- ing. 
4 metrics are learned for clusters corresponding to handwritten digits four, two, one, and zero. Figure taken from [253]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Visual attributes of animals from the aPascal & aYahoo [59], Animal with Attributes (AwA) [137], and Caltech UCSD 2011 Birds (CUB) [247] datasets. Binary attributes provide the information on presence or absence of specific properties of object categories. 15 3.2 Scene attributes from the SUN attribute dataset [192]. Continuous-valued attributes are provided for each image to describe diverse and rich descriptions of scenes. . . . . . . . 16 3.3 Face attributes from [132] and [229]. (Top left) Attribute values for two images of the same person. (Top right) Attribute values for images of two different people. (Bottom) Sample images from different datasets, ordered according to the predicted value of their respective attribute. Relative attributes are a useful alternative to absolute continuous- valued attributes, as the latter may not be consistently agreed upon among human raters. . 17 3.4 The architecture of the skip-gram model by Mikolov et al. [168, 169], where each word is trained to predict each of its context words. The word vectors are essentially the model’s parameters that transform the one-hot vector of word w to the hidden representation (i.e., blue parameters). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Traditional learning with a separate system for each task, transfer learning in which a system is aid by the source task and built for the target task, and multi-task learning with a single system for multiple tasks. Figure adapted from [186]. . . . . . . . . . . . . . 22 3.6 Taxonomy of multi-task learning based on the types of input-output pairs. Figures taken from [156]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.7 Hard and soft parameter sharing for multi-task learning. . . . . . . . . . . . . . . . . 23 3.8 Hard sharing with tasks supervise at the same or different layers. . . . . . . . . . . . . 24 3.9 A Joint Many-Task Model [94] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xii 4.1 Similarity Component Analysis and its application to the example of CENTAUR, MAN and HORSE. SCA has K latent components which give rise to local similarity valuess k conditioned on a pair of datax m andx n . The model’s outputs is a combination of all local values through an OR model (straightforward to extend to a noisy-OR model). k is the parameter vector forp(s k jx m ;x n ). See texts for details. . . . . . . . . . . . . . 28 4.2 On synthetic datasets, SCA successfully identifies the sparse structures and (non)overlapping patterns of ground-truth metrics. See texts for details. Best viewed in color. . . . . . . . 33 4.3 Edge component analysis. Representing network links with local similarity values re- veals interesting structures, such as nearly one-to-one correspondence between latent components and sections, as well as clusters. However, representing articles in LDA’s topics does not reveal useful clustering structures such that links can be inferred. See texts for details. Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.4 Average component-wise dissimilarity values of edges between different sections. Darker indicates higher dissimilarity values. . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.5 The normalized diagonal values of metrics forK = 9 . . . . . . . . . . . . . . . . . . 
38 5.1 Illustration of SYNC for zero-shot learning. Object classes live in two spaces. They are characterized in the semantic space with semantic representations (a s ) such as attributes or word vectors of their names. They are also represented as models for visual recognition (w s ) in the model space. In both spaces, those classes form weighted graphs. The main idea behind our approach is that these two spaces should be aligned. In particular, the coordinates in the model space should be the projection of the graph vertices from the semantic space to the model space — preserving class relatedness encoded in the graph. We introduce adaptable phantom classes (b andv) to connect seen and unseen classes — classifiers for the phantom classes are bases for synthesizing classifiers for real classes. In particular, the synthesis takes the form of convex combination. . . . . . . . . . . . . 43 5.2 Illustration of our method EXEM for improving semantic representations as well as for zero-shot learning. Given semantic information and visual features of the seen classes, we learn a kernel-based regressor () such that the semantic representationa c of class c can predict well its visual exemplar (center)z c that characterizes the clustering struc- ture. The learned () can be used to predict the visual feature vectors of unseen classes for nearest-neighbor (NN) classification, or to improve the semantic representations for existing ZSL approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.1 The Seen-Unseen accuracy Curve (SUC) obtained by varying in the calibrated stacking classification rule eq. (6.2). The AUSUC summarizes the curve by computing the area under it. We use the method SYNC O-VS-O on the AwA dataset, and tune hyper-parameters as in Table 6.1. The red cross denotes the accuracies by direct stacking. . . . . . . . . . 54 7.1 We vary the number of phantom classesR as a percentage of the number of seen classes S and investigate how much that will affect classification accuracy (the vertical axis cor- responds to the ratio with respect to the accuracy when R = S). The base classifiers are learned with SYNC O-VS-O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 7.2 Percentages of basis components required to capture 95% of variance in classifier matri- ces for AwA and CUB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 7.3 Performance of our method under different numbers of unseen classes on CUB. The number of seen classes is fixed to be 50. . . . . . . . . . . . . . . . . . . . . . . . . 69 xiii 7.4 t-SNE [242] visualization of randomly selected real images (crosses) and predicted vi- sual exemplars (circles) for the unseen classes on (from left to right) AwA, CUB, SUN, and ImageNet. Different colors of symbols denote different unseen classes. Perfect pre- dictions of visual features would result in well-aligned crosses and circles of the same color. Plots for CUB and SUN are based on their first splits. Plots for ImageNet are based on randomly selected 48 unseen classes from 2-hop and word vectors as semantic representations. Best viewed in color. . . . . . . . . . . . . . . . . . . . . . . . . . 70 7.5 Seen-Unseen accuracy Curves (SUC) for Gaussian (Gauss), LoOP, and calibrated stack- ing (Cal Stack) for all zero-shot learning approaches on AwA. Hyper-parameters are cross-validated based on accuracies (top) and AUSUC (bottom). Calibrated stacking outperforms both Gaussian and LoOP in all cases. . . . . . . . . . . . . . . . . 
. . . 73 7.6 Comparison between several ZSL approaches on the task of GZSL for AwA and CUB. . 73 7.7 Comparison between CONSE and SYNC of their performances on the task of GZSL for ImageNet where the unseen classes are within 2 tree-hops from seen classes. . . . . . . 74 7.8 Comparison of performance measured in AUSUC between different zero-shot learning approaches on the four splits of CUB. . . . . . . . . . . . . . . . . . . . . . . . . . 75 7.9 Comparison of performance measured in AUSUC between different zero-shot learning approaches on ImageNet-3hop (left) and ImageNet-All (right). . . . . . . . . . . . . 76 8.1 Different settings for learning from multiple tasks considered in our experiments . . . . 80 8.2 Summary of our results for MTL methods MULTI-DEC (top), TEDEC (middle), and TEENC (bottom) on various settings with different types of sharing. The vertical axis is the relative improvement over STL. See texts for details. Best viewed in color. . . . . 86 8.3 Pairwise MTL relationships (benefit vs. harm) using MULTI-DEC (left), TEDEC (middle), and TEENC (right). Solid green (red) directed edge from s to t denotes s benefiting (harming)t. Dashed Green (Red) edges betweens andt denote they ben- efiting (harming) each other. Dotted edges denote asymmetric relationship: benefit in one direction but harm in the reverse direction. Absence of edges denotes neutral rela- tionships. Best viewed in color and with a zoom-in. . . . . . . . . . . . . . . . . . . 87 8.4 t-SNE visualization of the embeddings of the 11 tasks that are learned from TEDEC . 94 A.1 Qualitative results of our method (SYNC STRUCT ) on AwA. (Top) We list the 10 unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) into each class and its ground-truth class label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.2 Qualitative results of our method (SYNC STRUCT ) on CUB. (Top) We list a subset of unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) into each class and its ground-truth class label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 A.3 Qualitative results of our method (SYNC O-VS-O ) on SUN. (Top) We list a subset of unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) into each class and its ground-truth class label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 xiv A.4 Success/failure case analysis of our method (SYNC STRUCT ) on AwA: (Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of such unseen class, and (Right) the top-3 predicted unseen classes. The green text corresponds to the correct label. . . . . . . . . . . . . . 127 A.5 Success/failure case analysis of our method (SYNC STRUCT ) on CUB. 
(Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of such unseen classes, and (Right) the top-3 predicted unseen class. The green text corresponds to the correct label. . . . . . . . . . . . . . . 127 A.6 Success/failure case analysis of our method (SYNC O-VS-O ) on SUN. (Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of such unseen classes, and (Right) the top-3 predicted unseen class. The green text corresponds to the correct label. . . . . . . . . . . . . . . 127 A.7 Qualitative zero-shot learning results on AwA (top) and SUN (bottom). For each row, we provide a class name, three attributes with highest strength, and the nearest image to the predicted exemplar (projected back to the original visual feature space). . . . . . . . . . 128 A.8 Data split for zero-to-few-shot learning on ImageNet . . . . . . . . . . . . . . . . . 129 A.9 Accuracy vs. the number of peeked unseen classes for EXEM (top) and SYNC (bottom) across different subset selection methods. Evaluation metrics are F@1 (left) and F@20 (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 A.10 Accuracy vs. the number of peeked unseen classes for EXEM and SYNC for heavy- toward-seen class selection strategy. Evaluation metrics are F@1 (left) and F@20 (right). 132 A.11 Combined accuracyA K U!U vs. the number of peeked unseen classes for EXEM (1NN). The “squares” correspond to the upperbound (UB) obtained by EXEM (1NN) on real exemplars. F@K=1, 5, and 20. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 A.12 (Top) Weighted peeked unseen accuracyw P A K P!U vs. the number of peeked unseen classes. (Bottom) Weighted remaining unseenw R A K R!U accuracy vs. the number of peeked unseen classes. The weightw P (w R ) is the number of test instances belonging toP (R) divided by the total number of test instances. The evaluation metrics are F@1 (left) and F@20 (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 A.13 Accuracy on test instances from the remaining unseen classes when classifying into the label space of remaining unseen classes onlyA K R!R vs. the number of peeked unseen classes. ZSL trains on labeled data from the seen classes only while PZSL (ZSL with peeked unseen classes) trains on the the labeled data from both seen and peeked unseen classes. The evaluation metrics are F@1 (left) and F@20 (right). . . . . . . . . 135 A.14 Accuracy on test instances from the remaining unseen classes when classifying into the label space of unseen classesA K R!U vs. the number of peeked unseen classes. ZSL trains on labeled data from the seen classes only while PZSL (ZSL with peeked unseen classes) trains on the the labeled data from both seen and peeked unseen classes. Note that these plots are the unweighted version of those at the bottom row of Fig. A.12. The evaluation metrics are F@1 (left) and F@20 (right). . . . . . . . . . . . . . . . . . . 135 A.15 Combined per-image accuracy vs. the number of peeked unseen classes for EXEM (1NN), EXEM (1NNS), and SYNC. The evaluation metrics are, from left to right, Flat Hit@1 ,2 ,5, 10, and 20. Five subset selection approaches are considered. . . . . . . . . 136 xv A.16 Combined per-class accuracy vs. the number of peeked unseen classes for EXEM (1NN), EXEM (1NNS), and SYNC. 
The evaluation metrics are, from left to right, Flat Hit@1, 2, 5, 10, and 20. Five subset selection approaches are considered. . . . . . . . . 136 A.17 We contrast the performances of GZSL to multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL uses WORD2VECTOR (in red color) and the idealized visual features (G-attr) as semantic representations (in black color). . . . . . . . . 138 A.18 Comparison between GZSL and multi-class classifiers trained with labeled data from both seen and unseen classes on the datasets AwA and CUB. GZSL uses visual attributes (in red color) or G-attr (in black color) as semantic representations. . . . . . . . . 140 A.19 Comparison between GZSL and multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL uses WORD2VEC (in red color) or G-attr (in black color) as semantic representations. . . . . . . . . 141 A.20 Data splitting for different cross-validation (CV) strategies: (a) the seen-unseen class splitting for zero-shot learning, (b) the sample-wise CV, (c) the class-wise CV . . . . . 141

Abstract

Measuring similarity between any two entities is an essential component in most machine learning tasks. This thesis describes a set of techniques revolving around the notion of similarity. The first part involves modeling and learning similarity. We introduce Similarity Component Analysis (SCA), a Bayesian network for modeling instance-level similarity that does not observe the triangle inequality. Such a modeling choice avoids the transitivity bias in most existing similarity models, making SCA intuitively more aligned with the human perception of similarity. The second part involves learning and leveraging similarity for effective learning with limited data, with applications in computer vision and natural language processing. We first leverage incomplete and noisy similarity graphs in different modalities to aid the learning of object recognition models. In particular, we propose two novel zero-shot learning algorithms that utilize class-level semantic similarities as a building block, establishing state-of-the-art performance on the large-scale benchmark with more than 20,000 categories. As for natural language processing, we employ multi-task learning (MTL) to leverage unknown similarities between sequence tagging tasks. This study leads to insights regarding the benefit of joint MTL of more than two tasks, task selection strategies, as well as the nature of task relationships.

Chapter 1
Introduction

During the past few years, gaps between machine-level and human-level performances on diverse sets of tasks have been reduced significantly. We have started to see systems that accurately perform pattern recognition, an important building block that will potentially allow them to predict and make decisions as our representatives. Besides such stand-alone human-imitative systems, intelligence can also augment humans' capabilities and environments. For example, systems equipped with the ability to quickly find relevant information from a huge database of knowledge (e.g., navigation apps, factoid question answering systems, and augmented reality) would eliminate the need for us to memorize or search through the database ourselves.
Regardless of the types of intelligence they possess, the systems described above generally employ technologies that take in data and learn a program from them. In some cases, data are abundant; many types of data are digitized and stored at an unprecedented rate, much more rapidly than we can ever hope to consume them. Note that while data are considered as resources, they are different from traditional computational resources like time or space, for which getting more is always merrier. This paradox leads to the invention of "algorithmic weakening," a scenario in which a learning algorithm adjusts its efficiency according to the amount of data it obtains, while still maintaining or achieving a specific accuracy [29].

In other cases, however, the amount of data is unavoidably limited. For example, while malaria, meningitis, and hepatitis C affect more than one billion people around the world, amyotrophic lateral sclerosis (ALS), or Lou Gehrig's disease, affects approximately only 420,000.¹ How do we help those who suffer from such a rare (but devastating) disease? More generally, how could we possibly learn from small data? Perhaps one possible solution is to consider the opposite notion of "algorithmic weakening," say "algorithmic strengthening." In other words, to deal with small data, we should look at them more deeply, extracting richer knowledge from the same bits of information. Another possible solution is "data enhancing," in which extra supervision is incorporated. In particular, can we leverage data that are not entirely related but relevant, together with algorithmic strengthening?

¹ https://qz.com/743231/the-ice-bucket-challenge-worked-theres-been-a-breakthrough-in-als-research/

This thesis takes concrete steps towards both algorithmic strengthening and data enhancing. We tackle our challenges and narrate our stories around different notions of similarity. The core component of learning algorithms is the ability to measure similarity between a set of instances at hand. In particular, similarity allows us to characterize statistical patterns among data. For example, a subset of similar instances can be grouped together in a clustering task, or assigned a label in a classification task.

Figure 1.1: Visual similarity. Figure taken from [252].

In the following section, we provide a global view of our research work. It generally consists of two parts. In the first part, we model and learn a notion of similarity between data instances that is richer than previous approaches. In the second part, we describe how to learn and leverage similarity for effective learning from a small amount of data, with real-world application scenarios in computer vision and natural language processing.

1.1 Modeling, Learning, and Leveraging Similarity: Motivation and Methods

Perhaps the most basic question we could ask about data is: how do we represent data and their relationships? In our case, we consider similarity as a simplified version of relationship. One way to define similarity in a more precise language is this: we say two objects are similar if we can establish a high degree of asymmetric correspondences between them. As we can imagine, even with this definition, the notion of similarity can be ambiguous and largely depends on context. For example, consider the similarities between the three images in Fig. 1.1. Depending on the level or aspect of image semantics one would like to consider, it is not clear which pair of images is the most similar.
In the context of machine learning algorithms, the example above raises an important question. Do we want our computer vision systems to detect low-level features (e.g., color histograms) or high-level ones (e.g., categories)? Suppose that the goal is to build attribute detectors rather than object detectors: how should we measure similarity among a collection of images? What if our goal is to do both tasks equally well?

Metric learning [15, 131] is widely used in the literature as a proxy for similarity learning (see, e.g., [150, 221, 262], in which the two terms are synonymous). Unfortunately, metric learning has one major limitation: it forces the transitivity of similarity. This undesirably violates what we know about the world. In Fig. 1.1, metric learning would force the white dog to be similar to the black horse simply because the white dog is similar to the black dog and the black dog is similar to the black horse. We directly tackle the issue of transitivity of similarity in this thesis. Our probabilistic graphical model, called Similarity Component Analysis, is an example of algorithmic strengthening: a method that can look more deeply into the data by providing greater flexibility.

Figure 1.2: t-SNE visualization of Wikipedia articles. Each article is converted into a vector by Doc2Vec [140]. Figure taken from Christopher Olah's blog on Visualizing Representations: Deep Learning and Human Beings.

So far we have provided an overview of our research on modeling and learning similarity. To motivate the other component, let us consider a scenario where similarity information is given. Fig. 1.2 shows a t-SNE visualization of Wikipedia articles, where each article is converted into a vector using an unsupervised log-linear model called Paragraph Vectors or Doc2Vec [140].² Given the vectors of Wikipedia articles, one could construct a similarity graph of concepts in which the negative distances between them correspond to similarity degrees. On the left, we see that articles on "albums" are generally more similar to articles on "films" than to articles on "species." Furthermore, on the right we generally see small clusters or subclusters within a cluster of the same type ("tennis" articles are similar and constitute a subcluster, whereas "bollywood" articles have their own cluster far away from the cluster of general films). Note that the similarity graph we consider here is easy to obtain, as no label information is needed.

² http://colah.github.io/posts/2015-01-Visualizing-Representations

How could such similarity graphs benefit us? Let us consider an intelligent agent equipped with a visual recognition system. Having never been to Peru, the agent has never seen alpacas, or has seen only a few of them. Can we use the similarity graph to help the agent learn about alpacas quickly and effectively? Is it possible to ask the agent to read the Wikipedia page of Alpaca so that it can recognize alpacas in the wild right away? For example, the fact that alpacas are more similar to sheep than to birds intuitively should be useful. More generally, how would leveraging similarity benefit learning with limited data?

Our methods for zero-shot visual object recognition aim to do just that. We propose a visual recognition system that can recognize a new object category even when its exemplars are not provided. The key to our idea is to leverage similarity graphs in other modalities and transfer the similarity knowledge into the visual space.
We also propose a method that can automatically improve external similarity graphs. Finally, we also advocate the "generalized" zero-shot learning setting, in which we can more realistically evaluate zero-shot visual recognition systems. We show that our approaches are effective in both the conventional and the newly proposed generalized settings. At a high level, we demonstrate the power of data enhancing using similarity graphs.

One natural question that arises from the similarity graph example is: what if we do not have a good way to come up with a similarity graph for our objects of interest, but we suspect that we can leverage their similarity? This is especially challenging when our objects of interest are abstract concepts. For example, in many areas related to artificial intelligence, the end goal of having general-purpose intelligent systems is so far-reaching that we take a divide-and-conquer approach. We break a big task into smaller ones that we can hope to solve. However, it is not entirely clear what we mean by similarity between tasks and how tasks interact with each other. Nevertheless, as in zero-shot learning, we do not have to have perfect representations or similarities of tasks to be able to leverage them; learning paradigms such as transfer or multi-task learning allow us to leverage task similarities or relatedness to achieve better generalization. Intuitively, learning from multiple tasks helps because each task provides a different data enhancing strategy or inductive bias [27], which is particularly useful when we have limited data. Clearly, learning methods must be able to perform given tasks and relate those tasks, either explicitly or implicitly. The work of [274] is one example of explicit task similarity learning in computer vision; they learn a taxonomy of multiple tasks such as Autoencoding, Colorization, Context Encoding, Curvature Estimation, Depth Estimation, Edge Detection, Keypoint Detection, Relative Camera Pose Estimation, Segmentation, and Room Layout Estimation. One instance of the learned taxonomy is shown in Fig. 1.3.

Figure 1.3: A learned similarity graph of computer vision tasks. Figure taken from [274].

In this thesis, we show how we can learn and leverage similarity between tasks in the context of natural language processing, in particular sequence tagging. We show that multi-task learning can be a generally useful learning method for leveraging task similarities. Furthermore, one of our approaches discovers task similarities that correspond well with linguistic theory. In particular, the visualization of task embeddings reveals a clustering of syntactic and semantic tasks.

By looking at this rich concept of similarity from multiple perspectives, we can learn better. In other words, as similarity underlies learning algorithms, better models of similarity, as well as better methods for leveraging similarity, will translate to more effective learning. This thesis takes steps in that direction.

1.2 Overview of Contributions

Methodologically, we provide a set of techniques for modeling, learning, and leveraging similarity. Our probabilistic model of similarity, Similarity Component Analysis (SCA), seamlessly combines two rich areas. One is probabilistic graphical models (specifically Bayesian networks) and approximate inference for those models [122, 248]. The other is metric learning [15, 131].
The method of Synthesized Classifiers (SynC) is a robust and efficient approach for aligning two similarity graphs, one of which is incomplete. It is inspired largely by the convex multi-task feature learning framework for supervised learning [9] and by Laplacian Eigenmaps for dimensionality reduction and data representation [14]. The method of Predicted Exemplars (EXEM) provides a simple method for improving a similarity graph given another. It mainly uses tools from kernel regression [30, 218, 226] and is motivated by our previous work on generalized zero-shot learning (GZSL) [35], which showed the effectiveness of deep representations as idealized semantic representations. The Area Under Seen-Unseen accuracy Curve (AUSUC) developed in our GZSL work takes inspiration from popular machine learning and information retrieval metrics: Area Under (ROC/PR) Curves (AUC) [61]. Finally, our approaches to multi-task learning provide a method for automatically learning task characteristic vectors (and hence their similarities), and are greatly influenced by Google's machine translation system [113].

We consider a wide range of applications and tasks. Similarity Component Analysis is evaluated on digit recognition and link prediction in a network of documents. Our zero-shot learning algorithms are applied to object recognition. We provide a comprehensive evaluation of our approaches. Our systems achieved state-of-the-art performance at the time of publication and remain among the strongest systems, especially on the large-scale benchmark ImageNet [258, 260]. Our GZSL work has also spurred the field to consider a more realistic evaluation scenario for zero-shot learning. Our multi-task learning frameworks for NLP are applied to 11 sequence tagging tasks simultaneously, while almost all multi-task learning work for NLP considers just two. We acknowledge recent advances in tools and benchmarks for deep representation learning [16] as one of the main factors contributing to the effectiveness of our zero-shot and multi-task learning models in the applications of our interest.

1.3 Thesis Organization

The remainder of this thesis is organized as follows.

Chapter 1 describes the scope of the problems considered in this thesis and provides high-level descriptions of methods, contributions, and organization.

Chapter 2 is a survey on similarity learning. We focus on metric learning and the methods most relevant to our work. To provide a broader picture of the field, we also briefly describe less relevant but related work toward the end of the chapter.

Chapter 3 is a survey on learning with limited data. We put an emphasis on zero-shot learning, which constitutes a large part of this thesis. We then touch upon few-shot, transfer, and multi-task learning. In each of these settings, we again focus on the background most relevant to this thesis.

Chapter 4 describes Similarity Component Analysis (SCA) [34], a probabilistic graphical model for richer modeling of similarity.

Chapter 5 describes two zero-shot learning algorithms: Synthesized Classifiers (SynC) [31], a model that leverages similarity graphs from two modalities to perform zero-shot learning (ZSL) for image classification, and Predicted Exemplars (EXEM) [32], a method that improves similarity graphs by further utilizing clustering structures. EXEM can also be used to improve other existing zero-shot learning methods.
Chapter 6 describes Generalized Zero-Shot Learning (GZSL) [35], a more realistic setting for ZSL methods that we advocate.

Chapter 7 demonstrates the effectiveness of the two proposed ZSL methods in various settings, including GZSL.

Chapter 8 describes our empirical study on multi-task learning for sequence tagging [33], leveraging the similarity of loosely related tasks to improve tagging performance, using three multi-task learning frameworks (MULTI-DEC, TEDEC, and TEENC).

Chapter 9 provides conclusions and future directions.

Part I
Background

Chapter 2
Similarity Learning

Measuring similarity is crucial to many machine learning tasks. In this chapter, we briefly review the most dominant paradigm for learning to measure similarity: metric learning [15, 131]. Our focus will be on basic background and on the algorithms most relevant to the model proposed in this thesis.

2.1 Metric Learning

2.1.1 Metrics

A metric d on a set X is a distance function such that, for all x, y, z \in X, the following properties are satisfied:

d(x, y) \geq 0   [non-negativity]   (2.1)
d(x, y) = 0 \iff x = y   [identity of indiscernibles]   (2.2)
d(x, y) = d(y, x)   [symmetry]   (2.3)
d(x, z) \leq d(x, y) + d(y, z)   [triangle inequality]   (2.4)

When the identity of indiscernibles property fails, d becomes a pseudometric. As in the metric learning literature, we will refer to pseudometrics as metrics and only make the distinction when necessary.

2.1.2 Problem Description

Suppose that we have a set of data instances \{x_n\}_{n=1}^{N} in R^D. The goal of metric learning is to learn a metric d on such data points such that the metric is useful to the task(s) at hand. Normally, the learning signals or constraints that guide metric learning algorithms come in one, or a combination, of the following three forms.

• Similarity constraints: S = \{(x_m, x_n) : x_m and x_n are similar\}
• Dissimilarity constraints: D = \{(x_m, x_n) : x_m and x_n are dissimilar\}
• Relative similarity constraints: R = \{(x_m, x_n, x_o) : x_m is more similar to x_n than to x_o\}

2.1.3 Mahalanobis Metric Learning

One popular approach is to parameterize the distance function or metric as a Mahalanobis distance:

d_M(x, y) = \sqrt{(x - y)^\top M (x - y)},   (2.5)

where x, y \in R^D and M \in R^{D \times D} is a positive semidefinite matrix. When M is the identity matrix, the distance function reduces to the Euclidean distance. Mahalanobis metric learning aims to learn the metric M from data annotated with such constraints. It was popularized by the pioneering work of Xing et al. [262]. In what follows, we give specific examples of approaches to supervised Mahalanobis metric learning. We focus on methods that are popular and/or most relevant to the content of this thesis.
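To make this concrete, the following minimal sketch (Python/NumPy; the data, labels, and the low-rank factor L are synthetic placeholders) computes the Mahalanobis distance of Eq. (2.5) using the common factorization M = L^T L, which guarantees positive semidefiniteness, and builds toy versions of the three constraint sets of Sect. 2.1.2 from class labels. It is an illustration of the problem setup, not any particular published algorithm.

```python
import numpy as np

def mahalanobis_dist(x, y, L):
    """Pseudometric d_M(x, y) = sqrt((x-y)^T M (x-y)) with M = L^T L,
    positive semidefinite by construction (cf. Eq. (2.5))."""
    diff = L @ (x - y)
    return float(np.sqrt(diff @ diff))

def constraints_from_labels(y):
    """Toy construction of the constraint sets of Sect. 2.1.2: same-label
    pairs form S, different-label pairs form D, and each similar pair paired
    with an impostor of a different class yields a relative constraint in R."""
    idx = range(len(y))
    S = [(m, n) for m in idx for n in idx if m < n and y[m] == y[n]]
    D = [(m, n) for m in idx for n in idx if m < n and y[m] != y[n]]
    R = [(m, n, o) for (m, n) in S for o in idx if y[o] != y[m]]
    return S, D, R

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))        # 8 toy points in R^5
y = rng.integers(0, 2, size=8)     # toy binary labels
L = rng.normal(size=(3, 5))        # M = L^T L has rank at most 3
S, D, R = constraints_from_labels(y)
print(len(S), len(D), len(R), mahalanobis_dist(X[0], X[1], L))
```

The factorized parameterization M = L^T L reappears later in this thesis (e.g., for optimizing SCA's metrics), since it turns a constrained problem over positive semidefinite matrices into an unconstrained one over L.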
Figure 2.1: Illustration of large margin nearest neighbors (LMNN) metric learning. In this example, x_j and x_k are the target and imposter examples of x_i, respectively. LMNN pulls x_j toward x_i and pushes x_k away from x_i during the learning process. Figure taken from [15], which in turn is an adaptation of a similar figure in [253].

Large Margin Nearest Neighbors (LMNN) [254] The main idea behind LMNN is to impose the constraint that each instance should be closer to target instances of the same class than to imposter instances of different classes. LMNN first defines N(x) as the K nearest neighbors of x based on the Euclidean distance. The definition of the neighborhood N, combined with the label information of the instances, then defines similarity and relative similarity constraints:

S_{LMNN} = \{(x_m, x_n) : y_m = y_n \text{ and } x_n \in N(x_m)\},   (2.6)
R_{LMNN} = \{(x_m, x_n, x_o) : (x_m, x_n) \in S_{LMNN} \text{ and } y_m \neq y_o\}.   (2.7)

Figure 2.2: Illustration of multiple-metric large margin nearest neighbors (mm-LMNN) metric learning. Four metrics are learned for clusters corresponding to the handwritten digits four, two, one, and zero. Figure taken from [253].

Using the above constraints, LMNN formulates the problem as the following semidefinite program:

\min_{M, \xi}  (1 - \lambda) \sum_{(x_m, x_n) \in S_{LMNN}} d^2_M(x_m, x_n) + \lambda \sum_{m,n,o} \xi_{mno}   (2.8)
s.t.  d^2_M(x_m, x_o) - d^2_M(x_m, x_n) \geq 1 - \xi_{mno}   \forall (x_m, x_n, x_o) \in R_{LMNN},   (2.9)
      \xi \geq 0,   (2.10)
      M \succeq 0,   (2.11)

where \lambda \in [0, 1] is a weighting parameter balancing the forces of pulling the targets and pushing the imposters.

Extension: Multi-metric LMNN (mm-LMNN) In some cases, learning a single global linear transformation does not yield a model flexible enough for the rich relationships induced by the data. This motivates learning multiple locally linear transformations. In this setting, the data are first partitioned into disjoint clusters using k-means, spectral clustering, or label information. Then, we learn a metric for each cluster, while coupling the metrics of different clusters in the objective. In particular, denoting by M_1, ..., M_C the metrics for the C clusters of data, we solve the following semidefinite program:

\min_{M_1, ..., M_C, \xi}  (1 - \lambda) \sum_{(x_m, x_n) \in S_{LMNN}} d^2_{M_{y_n}}(x_m, x_n) + \lambda \sum_{m,n,o} \xi_{mno}   (2.12)
s.t.  d^2_{M_{y_o}}(x_m, x_o) - d^2_{M_{y_n}}(x_m, x_n) \geq 1 - \xi_{mno}   \forall (x_m, x_n, x_o) \in R_{LMNN},   (2.13)
      \xi \geq 0,   (2.14)
      M_c \succeq 0,   \forall c = 1, ..., C,   (2.15)

where \lambda \in [0, 1] is again the weighting parameter balancing the pulling and pushing forces.

Information-Theoretic Metric Learning (ITML) [50] The main idea behind ITML is to cast metric learning as an optimization problem over a probability distribution associated with positive semidefinite matrices, subject to linear constraints. The approach leverages the bijection between Mahalanobis distances and multivariate Gaussian distributions, which makes two metrics comparable. This leads to the use of LogDet divergence regularization to constrain the problem. In particular, we consider the following optimization problem:

\min_{M}  KL(p(x; \mu, M_0) \,\|\, p(x; \mu, M))   (2.16)
s.t.  d^2_M(x_m, x_n) \leq u   \forall (x_m, x_n) \in S,   (2.17)
      d^2_M(x_m, x_n) \geq v   \forall (x_m, x_n) \in D,   (2.18)
      M \succeq 0,   (2.19)

where u and v are parameters controlling the feasible set; M_0 is a fixed D \times D positive semidefinite matrix; and p(x; \mu, A) = \frac{1}{Z} \exp(-\frac{1}{2} d_A(x, \mu)) is a multivariate Gaussian distribution with mean \mu, covariance A^{-1}, and normalization constant Z. In other words, the Gaussian distributions in the optimization problem above are parameterized by the D \times D matrices M and M_0.

We make two adjustments to the above optimization problem. First, [49] shows the equivalence of the KL divergence and the LogDet divergence in this optimization problem — the differential relative entropy between two multivariate Gaussian distributions can be expressed as a convex combination of a Mahalanobis distance between the mean vectors and the Log-Determinant (LogDet) divergence between the covariance matrices. Since the means coincide here, this means

2\, KL(p(x; \mu, M_0) \,\|\, p(x; \mu, M))   (2.20)
  = D_{\ell d}(M, M_0)   (2.21)
  := \mathrm{trace}(M M_0^{-1}) - \log \det(M M_0^{-1}) - D   (2.22)
  = \sum_{m,n} \frac{\lambda_m}{\theta_n} (v_m^\top u_n)^2 - \sum_m \log\frac{\lambda_m}{\theta_m} - D,   (2.23)

where M = V \Lambda V^\top and M_0 = U \Theta U^\top. Second, to make the above optimization feasible, slack variables are introduced.
In the end, the optimization problem becomes

\min_{M, \xi}  D_{\ell d}(M, M_0) + \gamma \sum_{m,n} \xi_{mn}   (2.24)
s.t.  d^2_M(x_m, x_n) \leq u + \xi_{mn}   \forall (x_m, x_n) \in S,   (2.25)
      d^2_M(x_m, x_n) \geq v - \xi_{mn}   \forall (x_m, x_n) \in D,   (2.26)
      M \succeq 0,   (2.27)

where \gamma is a weighting parameter controlling the trade-off between satisfying the constraints and minimizing D_{\ell d}(M, M_0). In practice, and in our experiments, M_0 is set to the identity matrix.

Local approaches The approaches described so far learn a global metric. Another popular line of work learns local metrics [69, 70, 95, 161, 254]. Although these methods also learn multiple distance measures, they differ from our model in that they learn local distance measures instead of global ones. In other words, each of their distance metrics or functions is associated with a different class label, a region of the input space, or even a single data point. In SCA, the learned metrics are global and can be applied to any pair of data points in the input space.

Chapter 3
Learning with Limited Data

3.1 Zero-Shot Learning

Visual recognition and classification have made significant progress due to the widespread use of deep learning architectures [128, 235] that are optimized on large-scale datasets of human-labeled images [212]. Despite these exciting advances, recognizing objects "in the wild" remains a daunting challenge. In particular, a large amount of annotation is vital for deep learning architectures to discover and exploit powerful discriminating visual features.

It is well known, however, that object frequencies in natural images follow long-tailed distributions [213, 232, 282]. Some categories, such as certain animal or plant species, are simply rare by nature (about a hundred northern hairy-nosed wombats are alive in the wild). Moreover, we live in a rapidly evolving world where newly defined visual concepts or products are introduced every day (e.g., images of futuristic products such as Tesla's Model S). As a result, brand new categories may emerge with zero or few labeled images. In contrast to common objects such as household items, these "tail" objects do not occur frequently enough, making the process of collecting and labeling their representative exemplar images laborious and costly.

These challenges are especially crippling for fine-grained object recognition, such as classifying species of birds, designer products, etc. Suppose we want to carry out a visual search for "Chanel Tweed Fantasy Flap Handbag". While handbag, flap, tweed, and Chanel are a popular accessory, style, fabric, and brand, respectively, their combination is rare — the query generates about 55,000 results on Google search, with only a small number of images. The amount of labeled images is thus far from enough for directly building a high-quality classifier, unless we treat this category as a composition of attributes, for each of which more training data can be easily acquired [137].

In all the cases above, not only the amount of labeled training images but also the statistical variation among them is limited. These restrictions do not lead to robust systems for recognizing such objects. In this real-world setting, it would be desirable for computer vision systems to be able to recognize instances of those rare classes while demanding minimal human effort and few labeled examples. In the extreme case, we want the systems to recognize even objects that they have never seen before. Zero-shot learning (ZSL) has long been believed to hold the key to the above problem of recognition in the wild.
Unlike conventional supervised learning, ZSL distinguishes between two types of classes, seen and unseen, where labeled examples are available for the seen classes only. The main goal of zero-shot learning is to expand classifiers and the space of possible labels beyond seen objects to unseen ones. To this end, we need to address two key interwoven challenges [185]: (1) how to relate unseen classes to seen ones, and (2) how to attain optimal discriminative performance on the unseen classes even though we do not have their labeled data.

3.1.1 Semantic Representations for Relating Seen and Unseen Classes

In zero-shot learning, labeled training examples of the unseen classes are not provided. Instead, zero-shot learners have access to some form of high-level semantics that spans across object categories — we call these semantic representations of objects. In other words, the learners have access to a shared semantic space that embeds all categories. The semantic space enables transferring and adapting classifiers trained on seen classes to unseen ones. In this section, we mainly review related work on the two most popular semantic representations: visual attributes and word vector representations of class names. Other forms of semantic representations are also briefly discussed.

3.1.1.1 Visual Attributes

Traditional work on object recognition focuses on naming objects. However, the ability to name objects is hardly generalizable across categories. This has motivated a stream of research on visual attributes, where the goal is to describe objects instead [59]. More concretely, visual attributes are human-understandable properties used to describe images, such as black, striped, or four-legged. These properties can range over texture, character, color, parts, activity, nutrition, habitat, behavior, shape, and so on. Figs. 3.1, 3.2, and 3.3 show exemplar images and their annotated visual attribute descriptions in the domains of animals and birds, scenes, and faces, drawn from publicly available datasets. While initial work on attributes focused on binary attributes — the presence or absence of attributes in the images (Fig. 3.1 and Fig. 3.2) — attributes can be considered in a richer and broader context (Fig. 3.3), such as the absolute strength of attributes, which corresponds to extending the attribute space to the real-valued space, or the strength of attributes in an image relative to that in other images [189].

Attribute information is usually annotated and collected by human experts [59, 136, 192, 247] so that its semantics are aligned with what humans understand and perceive. By enhancing human-machine communication, attributes enable diverse applications in computer vision such as face recognition and verification [39, 132, 216], action recognition [149, 268], fine-grained recognition [54], image retrieval [53, 123, 224], image description and caption generation [38, 59, 64, 189], novelty detection [246], as well as transfer and zero-shot learning [136]. We will provide more details on related work for zero-shot learning in a later section.

Despite their usefulness, attributes have their problems. First, attribute annotations of a category can be costly to obtain. Though not as costly as collecting the category's representative labeled images, this step could prevent scaling attribute-based systems up to an extremely large number of classes. Second, attributes are difficult to get right. It is not clear how to obtain a list of human-understandable attributes that are also discriminative [188, 272]. Even with such a list, the attributes are not always machine detectable, because they are often correlated with one another ("brown" and "wooden") [110] and possibly not category-independent ("fluffy" animals vs. "fluffy" towels) [36]. Despite these difficulties, learned attributes (instead of attributes manually defined by human experts) have been shown to be useful for relating seen and unseen classes [6, 272].
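As a toy illustration of how attribute annotations relate classes even when a class has no labeled images, the sketch below matches a predicted attribute vector against per-class attribute signatures. The class names, attribute values, and the nearest-signature rule are all made up for illustration; this is a schematic, not any specific published method.

```python
import numpy as np

# Hypothetical binary class-attribute matrix: each class is described by
# attributes such as "striped", "four-legged", "black".
attributes = {
    "zebra":   np.array([1, 1, 0]),
    "horse":   np.array([0, 1, 0]),
    "panther": np.array([0, 1, 1]),
}

def most_similar_class(predicted_attr, class_attr):
    """Assign the class whose attribute signature best matches a predicted
    attribute vector (cosine similarity); candidate classes need only
    attribute annotations, not labeled images."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(class_attr, key=lambda c: cos(predicted_attr, class_attr[c]))

# e.g., an attribute predictor trained on seen classes outputs soft scores:
print(most_similar_class(np.array([0.9, 0.8, 0.1]), attributes))   # -> "zebra"
```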
Figure 3.1: Visual attributes of animals from the aPascal & aYahoo [59], Animals with Attributes (AwA) [137], and Caltech UCSD 2011 Birds (CUB) [247] datasets. Binary attributes provide information on the presence or absence of specific properties of object categories.

Figure 3.2: Scene attributes from the SUN attribute dataset [192]. Continuous-valued attributes are provided for each image, giving diverse and rich descriptions of scenes.

Figure 3.3: Face attributes from [132] and [229]. (Top left) Attribute values for two images of the same person. (Top right) Attribute values for images of two different people. (Bottom) Sample images from different datasets, ordered according to the predicted value of their respective attribute. Relative attributes are a useful alternative to absolute continuous-valued attributes, as the latter may not be consistently agreed upon among human raters.

3.1.1.2 Word Vector Representations of Class Names

One way to scale the process of obtaining semantic representations to a large number of categories is to learn them from unlabeled text data. For this particular application, the goal is to represent category names with dense real-valued vectors. In the natural language processing community, the common paradigm for deriving such representations is based on the distributional hypothesis of Harris [93], which states that words in similar contexts have similar meanings. Under this assumption, words are no longer merely discrete units, allowing their similarity and dissimilarity to be measured. This, unsurprisingly, opens doors to many applications in natural language processing and understanding.

More specifically, there are two main directions for learning dense real-valued word embedding vectors: count-based and prediction-based methods. Though they are strikingly different, neither consistently dominates the other in terms of performance on natural language tasks [144]. We review related work on each of them below.

Count-based methods The count-based approach can be described in terms of a word-context matrix M in which each row i corresponds to a word, each column j corresponds to a context in which the word appears, and each matrix entry M_{ij} corresponds to some association measure between the word and the context. Words are then represented as rows of M. Since the number of columns is in general very high (at least as high as the vocabulary size), it is often preferable to represent words as rows of a dimensionality-reduced matrix derived from M. For this reason, count-based methods are also referred to as matrix factorization methods when low-rank approximations are used to decompose large matrices that capture statistical information about a corpus. This approach dates back to Latent Semantic Indexing (LSI) [51]. LSI uses a singular value decomposition of the word-document tf-idf matrix, which corresponds to identifying a linear subspace in the space of tf-idf features that captures most of the variance in the collection.
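A minimal sketch of this LSI-style pipeline — a word-document matrix, tf-idf weighting, and a truncated SVD — is given below. The toy counts, the weighting, and the dimensionality choice are illustrative placeholders.

```python
import numpy as np

# Toy word-document count matrix (rows: words, columns: documents).
vocab = ["neural", "network", "metric", "distance"]
M = np.array([[3, 0, 2, 0],
              [2, 0, 1, 0],
              [0, 4, 0, 3],
              [0, 3, 0, 2]], dtype=float)

# tf-idf weighting: down-weight words that appear in many documents.
idf = np.log(M.shape[1] / np.count_nonzero(M, axis=1))
M_tfidf = M * idf[:, None]

# Truncated SVD: keep the top-k singular directions; rows of U_k * S_k
# serve as low-dimensional word vectors.
U, s, Vt = np.linalg.svd(M_tfidf, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]
print(dict(zip(vocab, word_vecs.round(2))))
```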
Hyperspace Analogue to Language (HAL) [155] applies a similar idea to the word co-occurrence matrix (i.e., a word-word matrix instead of a word-document matrix). One problem with HAL and related methods is that raw word co-occurrences allow frequent words to contribute disproportionately to the semantics of the word embeddings. To alleviate this problem, a variety of techniques have been proposed to transform the co-occurrence matrix, including entropy- and correlation-based transformations [206], positive pointwise mutual information (PPMI) [26], Shifted PPMI [143], the shifted logarithmic transformation [195], and Hellinger PCA [141]. A comprehensive comparison with additional transformations can be found in [231].

Prediction-based methods The prediction-based approach learns word representations that facilitate making predictions within local context windows. Most methods resort to neural networks for learning high-quality nonlinear predictive functions. Since the seminal work of [17], which introduced a simple neural network architecture for language models, there have been many attempts to improve the architectures and the objectives, as well as the optimization techniques for dealing with large vocabularies. For example, [175] propose three new architectures based on Restricted Boltzmann Machines (RBMs) and log-bilinear models. [44, 45] learn their word embeddings using convolutional neural networks, which are trained jointly on labeled data from multiple downstream tasks as well as on unlabeled data for optimizing the language model.

Recently, several methods have demonstrated that learning word embeddings can be done very efficiently as a result of novel training techniques and simple objectives. [176, 177] use noise-contrastive estimation (NCE) [91] for training their log-bilinear models, vLBL and ivLBL. [168, 169] propose the skip-gram and continuous bag-of-words (CBOW) models. They also propose a technique similar to NCE, called negative sampling, for speeding up the skip-gram. As for the objectives, vLBL and CBOW are trained to predict a word from its context, while ivLBL and the skip-gram are trained to predict a word's context given the word itself. Interestingly, the resulting word vectors have been shown to capture word relatedness as well as other syntactic and semantic regularities [169, 170, 176].

Focus: The skip-gram model We will focus on the skip-gram as the main method for learning vector representations of class names. We therefore describe it in detail below. In what follows, we use W and C to denote the vocabularies of words and contexts, respectively. In this thesis, C is derived from the neighboring words around the word of interest; thus, C = W. We differentiate two types of embeddings: v_w for the embedding of the word w in W and u_c for the word c in C. We refer to them as word vectors and context vectors, respectively. We use u \cdot v to denote the inner product between two vectors.

The skip-gram model is based on the assumption that words that have similar surrounding words in their contexts should have similar vector representations [168]. Hence, a word ought to be able to predict the words in its context. Thus, for each word-context pair (w, c), the objective is to maximize the probability Pr(c|w), which is parameterized by the softmax function:

Pr(c \mid w) \propto \exp(v_w \cdot u_c).   (3.1)

Figure 3.4: The architecture of the skip-gram model by Mikolov et al. [168, 169], where each word is trained to predict each of its context words.
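For concreteness, the toy sketch below implements the softmax parameterization of Eq. (3.1) with randomly initialized word and context vectors; the vocabulary size and embedding dimensionality are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 6, 4                             # toy vocabulary and embedding sizes
word_vecs = rng.normal(size=(V, dim))     # v_w, one row per word in W
context_vecs = rng.normal(size=(V, dim))  # u_c, one row per context word in C

def p_context_given_word(w):
    """Eq. (3.1): Pr(c | w) proportional to exp(v_w . u_c), normalized by a
    softmax over the whole context vocabulary."""
    scores = context_vecs @ word_vecs[w]  # v_w . u_c for every c
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

print(p_context_given_word(w=2))          # a distribution over the 6 contexts
```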
The word vectors are essentially the model's parameters that transform the one-hot vector of word w into the hidden representation. Fig. 3.4 summarizes the shallow neural network architecture of the skip-gram model. First, the one-hot vector of the word (one at the index of the word in the vocabulary and zeros elsewhere) is encoded into an intermediate representation. Then, this intermediate representation is decoded to predict each of the one-hot vectors of the word's contexts. We call the encoding parameters the word vectors and the decoding parameters the context vectors.

To reduce the computational burden, Mikolov et al. [169] introduce the technique of negative sampling, where logistic regression is used to approximate the softmax objective:

\log Pr(c \mid w) \approx \log Pr(D = 1 \mid w, c) + k\, \mathbb{E}_{c' \sim Pr_D(c)} \log(1 - Pr(D = 1 \mid w, c')),   (3.2)

where Pr(D = 1 \mid w, c) = \sigma(v_w \cdot u_c) is the sigmoid function predicting the co-occurrence of the word and context pair, and k is the number of "negative" samples of such pairs, where c' is drawn from a "background" distribution of context words Pr_D(c) — in practice, a smoothed empirical unigram distribution is found to be effective. Intuitively, the first term aims to maximize the likelihood of word w and context c co-occurring in the corpus, while the second term is a mechanism to prevent all vectors from collapsing to the same value, by disallowing some random negative word-context pairs [86].

The skip-gram model is clearly a prediction-based method by design. However, Levy and Goldberg show that the skip-gram with negative sampling performs a weighted factorization of the shifted PPMI matrix of word co-occurrences [143], thereby proving that it is a count-based method as well. This result provides a theoretical understanding of the statistics it captures.

Since obtaining unlabeled text corpora is cheap and convenient, this approach has the potential to scale to large vocabularies. Weston et al. [255] propose to learn a joint embedding of images and their annotation representations. Frome et al. [68] and Socher et al. [227] then simultaneously apply this approach to object classification in the context of zero-shot learning. Since then, many researchers have explored this connection between visual and language embeddings as a scalable way to connect seen and unseen concepts. Expectedly, however, as these semantic representations are derived from text, they carry very little visual information and thus can be noisier than visual attributes.

3.1.1.3 Other Types of Semantic Representations

We conclude the section on semantic representations by briefly mentioning other types of semantic representations that have been proposed in the literature. These embeddings include textual descriptions of objects [55, 142] and knowledge mined from the Web [55, 165, 207, 208]. Combinations of several embeddings have also been found to be beneficial [5, 71, 74]. Finally, in contrast to most work that takes semantic representations as they are [4, 68, 109, 137, 142, 146, 183, 209, 227], recent work proposes to transform existing semantic representations so that they become more robust at relating seen and unseen classes. Examples of transformation functions are Canonical Correlation Analysis (CCA) [71, 72, 154], linear (or bilinear) transformations [4, 142, 209], and sparse coding [121, 279, 280].
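Before moving on to zero-shot learning algorithms, it is worth making the negative-sampling objective of Eq. (3.2) in Sect. 3.1.1.2 concrete. The sketch below evaluates that objective for a single word-context pair; the embeddings and the k sampled negatives are random placeholders standing in for trained vectors and for draws from the background distribution Pr_D(c).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def negative_sampling_objective(v_w, u_c, neg_context_vecs):
    """Eq. (3.2): log sigma(v_w . u_c) plus, for each of the k sampled
    negative contexts c', log(1 - sigma(v_w . u_c')); the k samples stand in
    for k times the expectation over Pr_D(c)."""
    pos = np.log(sigmoid(v_w @ u_c))
    neg = np.sum(np.log(1.0 - sigmoid(neg_context_vecs @ v_w)))
    return pos + neg

rng = np.random.default_rng(0)
dim, k = 4, 5
v_w, u_c = rng.normal(size=dim), rng.normal(size=dim)
negatives = rng.normal(size=(k, dim))     # embeddings of k sampled negatives
print(negative_sampling_objective(v_w, u_c, negatives))
```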
3.1.2 Zero-Shot Learning Algorithms

Given semantic representations, how do we design zero-shot learning algorithms with highly discriminative performance even in the absence of labeled data for the unseen classes? In this section, we focus on relevant work on zero-shot learning algorithms. We stress that, in most cases, the design of these algorithms is orthogonal to the semantic representations they use. As a result, the plethora of work on learning and improving the semantic representations themselves (discussed in the previous section) is complementary to the content of this section.

Morgado and Vasconcelos [178] distinguish "recognition using independent semantics (RIS)" from "recognition using semantic embeddings (RULE)." Wang and Chen [250] group ZSL algorithms into "Direct mapping", "Model parameter transfer", and "Common space learning" approaches. Fu et al. [73] argue that solving zero-shot recognition involves "embedding models" and "recognition models in the embedding space," where the embedding models can be further categorized into semantic embedding, Bayesian models, embedding into common spaces, or deep embedding approaches. Xian et al. [258] categorize 13 ZSL methods into "Learning Linear Compatibility", "Learning Nonlinear Compatibility", "Learning Intermediate Attribute Classifiers," and "Hybrid Models." To facilitate the discussion in our work, we divide zero-shot learning algorithms into two themes: two-stage approaches and unified approaches. This criterion is most similar to the one used by Wang and Chen.

3.1.2.1 Two-Stage Approaches

The theme of two-stage approaches is to identify and learn an intermediate subtask that is then used to infer the final prediction. Two popular subtasks are predicting the embeddings of images in the semantic space, and generating instances of each class given its corresponding semantic representation. It is possible that the selected intermediate subtask is trained jointly with the zero-shot recognition in a unified manner (Sect. 3.1.2.2), but this is not fully investigated in the literature and in some cases may lead to other technical difficulties.

Learning to predict semantic embeddings Given an image, one can project it into the semantic embedding space and then infer its class label by comparing the predicted semantic embedding to those of the unseen classes using a similarity measure. The projection mapping can be trained using standard classification or regression models on image-semantic embedding pairs from the seen classes. The semantic embedding space is usually chosen to be the one in which the given semantic representations live. As for label inference, there are two popular approaches. One is based on probabilistic models of class labels given semantic representations [136, 137, 183]. The other is based on nearest neighbor classifiers in the semantic space [59, 185, 227, 263].

If we assume that the semantic representations capture all the information one needs to predict the class labels (i.e., they are highly discriminative), then focusing on accurately predicting the semantic embeddings would solve zero-shot learning. In practice, however, this paradigm suffers from the unreliability of semantic representation predictions. Several techniques have been proposed as a result to alleviate this problem. Jayaraman and Grauman [109] propose a random forest based approach that takes this unreliability into account. Al-Halah and Stiefelhagen [6] construct a hierarchy of concepts underlying the attributes to improve reliability. Gan et al. [76] transform visual features to reduce the mismatch between attributes in different categories, thus enhancing reliability.
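A minimal instance of this "predict, then compare" paradigm is sketched below: a ridge regressor maps visual features to the semantic space using seen-class data, and a test image is labeled with its nearest unseen-class semantic vector. All inputs are random placeholders, and the regressor and nearest-neighbor inference are merely one simple instantiation of the family of methods discussed above.

```python
import numpy as np

def fit_embedding_regressor(X_seen, A_seen, lam=1.0):
    """Ridge regression W mapping visual features to the semantic space,
    trained only on seen-class images paired with their class semantic
    vectors (a minimal 'learning to predict semantic embeddings' model)."""
    d = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + lam * np.eye(d), X_seen.T @ A_seen)

def predict_unseen_class(x, W, unseen_class_attrs):
    """Project the image into the semantic space and label it with the
    nearest unseen class (nearest-neighbor label inference)."""
    a_hat = x @ W
    dists = np.linalg.norm(unseen_class_attrs - a_hat, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
X_seen = rng.normal(size=(100, 20))     # toy visual features of seen images
A_seen = rng.normal(size=(100, 6))      # semantic vector of each image's class
W = fit_embedding_regressor(X_seen, A_seen)
unseen_attrs = rng.normal(size=(3, 6))  # semantic vectors of 3 unseen classes
print(predict_unseen_class(rng.normal(size=20), W, unseen_attrs))
```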
Learning to generate instances of each class Recent advances in conditional generative models (e.g., [201, 264]) have led to interest in exploiting them for generating labeled data from the corresponding semantic representations. Once those examples are generated, one can employ any supervised learning technique to learn classifiers [25, 133, 259, 283]. Note that all of these approaches focus on directly generating features rather than image pixels.

3.1.2.2 Unified Approaches

The other type of ZSL approach focuses on the task of zero-shot classification directly. There are two main sub-themes, which differ in whether the emphasis is on learning a common space or compatibility, or on learning model parameters, though the distinction between the two is thin.

Common space or compatibility learning This approach learns a common representation to which visual features and semantic embeddings are projected, with the objective of maximizing the compatibility score of projected instances in this space. The methods in this category differ in their choices of common space or compatibility function. Linear or bilinear scoring functions are extensively used [4, 5, 68, 146, 209]. Some propose to use canonical correlation analysis (CCA) [71, 72, 154]. Nonlinear methods, such as dictionary learning and sparse coding [121, 280], are scarcer but have also been explored.

Model parameter learning One can also build the classifiers for unseen classes by relating them to seen ones via similarities computed from semantic representations [55, 79, 142, 165, 207, 208, 266, 279]. For example, Mensink et al. and Gan et al. [75, 165] propose to construct classifiers for unseen objects by combining classifiers of seen objects, where the combination coefficients are determined based on semantic relatedness. As we will see, our methods SYNC, EXEM (1NN), and EXEM (1NNS) all fall into the model parameter learning approach, but they differ in the details of how they construct classifiers/exemplars. EXEM (1NN) and EXEM (1NNS) can also be viewed as learning to generate one instance for each class — without modeling variations explicitly. EXEM (ZSL method) falls into the approach of learning to predict semantic embeddings, but we show that the (projected) space of visual features is an extremely effective semantic embedding space.

Figure 3.5: Traditional learning with a separate system for each task; transfer learning, in which a system is aided by the source task and built for the target task; and multi-task learning, with a single system for multiple tasks. Figure adapted from [186].

Figure 3.6: Taxonomy of multi-task learning based on the types of input-output pairs. Figures taken from [156].

Figure 3.7: Hard and soft parameter sharing for multi-task learning.

3.2 Transfer and Multi-Task Learning

Transfer and multi-task learning are two related frameworks often associated with learning with limited data. Fig. 3.5 depicts the different learning settings. In the traditional setting, one collects data for each task and trains a system on those data. Transfer learning and multi-task learning leverage data from multiple tasks.
However, generally speaking, the important difference is that transfer learning distinguishes between source and target tasks and aims to perform well on the target task, whereas multi-task learning considers multiple tasks simultaneously and puts emphasis on all of them. In this section, we provide an overview of multi-task learning (MTL), focusing on general themes and on some specific examples in natural language processing and understanding applications.

Interlude: Zero-shot learning as transfer learning Note that the zero-shot learning setting described in the previous section can be thought of as transfer learning. In this case, each task is to recognize a single object category, and the goal is to do well on the target unseen tasks.

What is shared between tasks? A common theme in multi-task learning is sharing between tasks. Suppose we define a task as a set of input-output pairs of the same type. Then, one way to characterize multi-task learning frameworks is to differentiate between one-to-many, many-to-one, and many-to-many cases. In the one-to-many (many-to-one) setting, all input (output) instances are of the same type, while in the many-to-many case, both inputs and outputs can be of different types. Fig. 3.6 shows an example of these settings in the context of sequence-to-sequence models [156].

One-to-many multi-task learning We now focus on the one-to-many MTL setting. For example, our input instances can be English sentences, and the goal for the system is to produce different output instances depending on the task assigned to it. We briefly review the most popular techniques for leveraging task similarity and relatedness in MTL.

The most popular approach to MTL is parameter sharing. In hard parameter sharing, a subset of the parameters of different tasks is constrained to be identical. In soft parameter sharing, a subset of the parameters of different tasks is constrained to be close according to a similarity metric. The difference is shown in Fig. 3.7, where we denote the constraint that parameters should be similar by dashed lines. Moreover, one could consider leveraging signals in a multi-scale manner. Fig. 3.8 illustrates hard-sharing MTL settings in which tasks provide supervision at the same or at different layers. For example, inspired by [228], the Joint Many-Task Model [94] uses this setting to combine supervision at the word level and at the sentence level.

Figure 3.8: Hard sharing with tasks supervising at the same or at different layers.

Figure 3.9: The Joint Many-Task Model [94].
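The sketch below (PyTorch) illustrates the hard-sharing idea of Figs. 3.7 and 3.8 for one-to-many sequence tagging: a shared embedding and encoder with one task-specific linear head per task. The tasks, tag-set sizes, and architecture are purely illustrative — this is not the MULTI-DEC, TEDEC, or TEENC model studied later in this thesis. Soft sharing would instead keep per-task encoders and add a penalty such as the squared distance between their parameters.

```python
import torch
import torch.nn as nn

class HardSharingTagger(nn.Module):
    """Hard parameter sharing: one shared encoder, one small task-specific
    classification head per task (task names and sizes are hypothetical)."""
    def __init__(self, task_sizes, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)               # shared
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)    # shared
        self.heads = nn.ModuleDict(                                  # task-specific
            {task: nn.Linear(hidden, n_tags) for task, n_tags in task_sizes.items()})

    def forward(self, tokens, task):
        h, _ = self.encoder(self.embed(tokens))
        return self.heads[task](h)      # per-token tag scores for this task

model = HardSharingTagger({"pos": 17, "ner": 9})
tokens = torch.randint(0, 1000, (2, 7))        # batch of 2 sentences, length 7
print(model(tokens, "pos").shape)              # torch.Size([2, 7, 17])
```

During training, batches from different tasks update the shared encoder jointly, which is the mechanism by which one task's inductive bias can benefit another.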
Focus: Multi-task learning in natural language processing Natural language processing and understanding have seen progress due to advances in automatic and end-to-end feature learning in recent years [85, 271]. However, the extent to which recent models understand natural language arguably remains limited. For example, it has been shown that existing well-performing question answering and natural language inference systems largely rely on superficial cues that exploit benchmark annotation artifacts or evaluation metrics [90, 111]. In fact, the task of complete language understanding is so herculean that the research community usually tackles it by first breaking it down into subtasks, such as sequence tagging and parsing. These subtasks are usually identified heuristically or motivated by linguistic theories of syntax, semantics, pragmatics, discourse analysis, and stylistics. Such a divide-and-conquer approach has so far been embraced and has resulted in NLP tasks often being tackled in isolation [44].

While we have mastered many of these subtasks (at least when we test our models on data similar to their training data), several open problems remain. First, what is the nature of the relationships between these tasks? Understanding this will aid knowledge transfer among these tasks themselves as well as from these tasks to others. For instance, part-of-speech and named entity tags are widely used as additional token features in top systems for high-level tasks like question answering [37, 100, 102], information extraction [174], and machine translation [181, 220]. At the same time, it is not clear how much is shared between the two tasks and whether other potentially useful token information exists. Second, can we do better than solving multiple tasks independently? This question motivates a strand of research on multi-task learning (MTL). The intuition is that tasks should benefit one another if they are indeed related. Additionally, MTL may provide inductive bias that leads to more robust and generalizable systems [28]. Finally, even if tasks are not closely related, it is still unsatisfactory that data and computational resources are wasted unnecessarily by having a separate model for each task.

Part II
Modeling and Learning Similarity

Chapter 4
Similarity Component Analysis

4.1 Introduction

Learning how to measure similarity (or dissimilarity) is a fundamental problem in machine learning. Arguably, if we had the right measure, we would be able to achieve perfect classification or clustering of data. If we parameterize the desired dissimilarity measure in the form of a metric function, the resulting learning problem is often referred to as metric learning. In the last few years, researchers have invented a plethora of such algorithms [50, 96, 108, 187, 249, 254]. Those algorithms have been successfully applied to a wide range of application domains.

However, the notion of (dis)similarity is much richer than what a metric is able to capture. Consider the classical example of CENTAUR, MAN and HORSE. MAN is similar to CENTAUR and CENTAUR is similar to HORSE. Metric learning algorithms that model the two similarities well would need to assign small distances to those two pairs. On the other hand, the algorithms would also need to strenuously battle against assigning a small distance between MAN and HORSE, due to the triangle inequality, so as to avoid the fallacy that MAN is similar to HORSE too! This example (and others [148]) thus illustrates important properties of (dis)similarity, such as non-transitivity and violation of the triangle inequality, that metric learning has not adequately addressed.

Representing objects as points in high-dimensional feature spaces, most metric learning algorithms assume that the same set of features contributes indistinguishably to assessing similarity. In particular, the popular Mahalanobis metric weights each feature (and their interactions) additively when calculating distances. In contrast, similarity can arise from a complex aggregation of comparisons of data instances on multiple subsets of features, to which we refer as latent components. For instance, there are multiple reasons to rate two songs as similar: being written by the same composer, being performed by the same band, or being of the same genre.
For an arbitrary pair of songs, we may rate the similarity between them based on any one of the components, or on an arbitrary subset of components, while ignoring the rest. Note that, in the learning setting, we observe only the aggregated results of those comparisons — which components are used is latent.

Multi-component similarity also exists in other types of data. Consider a social network where the network structure (i.e., the links) is a superposition of multiple networks in which people are connected for various organizational reasons: school, profession, or hobby. It is thus unrealistic to assume that the links exist due to a single cause. More appropriately, social networks are "multiplex" [65, 236].

We propose Similarity Component Analysis (SCA) to model richer similarity relationships beyond what current metric learning algorithms can offer. SCA is a Bayesian network, illustrated in Fig. 4.1. The similarity (node s) is modeled as a probabilistic combination of multiple latent components. Each latent component (s_k) assigns a local similarity value indicating whether or not two objects are similar, inferring it from only a subset (unknown in advance) of the features. The (local) similarity values of those latent components are aggregated with a (noisy-)OR model. Intuitively, two objects are likely to be similar if they are considered similar by at least one component, and likely to be dissimilar if none of the components voices up.

Figure 4.1: Similarity Component Analysis and its application to the example of CENTAUR, MAN and HORSE. SCA has K latent components which give rise to local similarity values s_k conditioned on a pair of data points x_m and x_n. The model's output s is a combination of all local values through an OR model (straightforward to extend to a noisy-OR model). \Theta_k is the parameter vector for p(s_k | x_m, x_n). See the text for details.

We derive an EM-based algorithm for fitting the model to data annotated with similarity relationships. The algorithm infers the intermediate similarity values of the latent components and identifies the parameters of the (noisy-)OR model, as well as each latent component's conditional distribution, by maximizing the likelihood of the training data.

We validate SCA on several learning tasks. On synthetic data where the ground truth is available, we confirm SCA's ability to discover latent components and their corresponding subsets of features. On a multiway classification task, we contrast SCA with state-of-the-art metric learning algorithms and demonstrate SCA's superior performance in classifying data samples. Finally, we use SCA to model the network link structure among research articles published in the NIPS proceedings. We show that SCA achieves the best link prediction accuracy among competitive algorithms. We also conduct extensive analysis of how the learned latent components effectively represent link structures.

In Sect. 4.2, we describe the SCA model and its inference and learning algorithms. We report our empirical findings in Sect. 4.3. We discuss related work in Sect. 4.4 and conclude in Sect. 4.5.

4.2 Approach

We start by describing in detail Similarity Component Analysis (SCA), a Bayesian network for modeling similarity between two objects.
We then describe the inference procedure and learning algorithm for fitting the model parameters to similarity-annotated data.

4.2.1 Probabilistic Model of Similarity

In what follows, let (u, v, s) denote a pair of D-dimensional data points u, v \in R^D and their associated similarity value s \in \{DISSIMILAR, SIMILAR\}, or \{0, 1\} accordingly. We are interested in modeling the process of assigning s to these two data points. To this end, we propose Similarity Component Analysis (SCA) to model the conditional distribution p(s|u, v), illustrated in Fig. 4.1. In SCA, we assume that p(s|u, v) is a mixture of multiple latent components' local similarity values. Each latent component evaluates its similarity value independently, using only a subset of the D features. Intuitively, there are multiple reasons for annotating two data instances as similar or not, and each reason focuses locally on one aspect of the data by restricting itself to a different subset of features.

Latent components Formally, let u_{[k]} denote the subset of features of u corresponding to the k-th latent component, where [k] \subseteq \{1, 2, ..., D\}. The similarity assessment s_k of this component alone is determined by the distance between u_{[k]} and v_{[k]}:

d_k = (u - v)^\top M_k (u - v),   (4.1)

where M_k \succeq 0 is a D \times D positive semidefinite matrix, used to measure the distance more flexibly than the standard Euclidean metric. We restrict M_k to be sparse; in particular, only the corresponding [k]-th rows and columns are nonzero. Note that in principle [k] needs to be inferred from data, which is generally hard. Nonetheless, we have found that, empirically, even without explicitly constraining M_k, we often obtain a sparse solution.

The distance d_k is transformed into the probability for the Bernoulli variable s_k according to

P(s_k = 1 \mid u, v) = (1 + e^{-b_k})\,[1 - \sigma(d_k - b_k)],   (4.2)

where \sigma(\cdot) is the sigmoid function \sigma(t) = (1 + e^{-t})^{-1} and b_k is a bias term. Intuitively, when the (biased) distance (d_k - b_k) is large, s_k is less likely to be 1 and the two data points are regarded as less similar. Note that the constraint that M_k be positive semidefinite is important, as it ensures that the probability is bounded above by 1.

Combining local similarities Assume that there are K latent components. How can we combine all the local similarity assessments? In this work, we use an OR-gate. Namely,

P(s = 1 \mid s_1, s_2, \ldots, s_K) = 1 - \prod_{k=1}^{K} \mathbb{I}[s_k = 0].   (4.3)

Thus, the two data points are similar (s = 1) if at least one of the aspects deems them so, corresponding to s_k = 1 for some particular k.

The OR-model can be extended to a noisy-OR model [193]. To this end, we model the non-deterministic effect of each component on the final similarity value,

P(s = 1 \mid s_k = 1) = \bar{\theta}_k = 1 - \theta_k,   P(s = 1 \mid s_k = 0) = 0.   (4.4)

In essence, the uncertainty comes from our probability of failure \theta_k (a false negative) to identify the similarity if we are only allowed to consider one component at a time. If we can consider all components at the same time, this failure probability should be reduced. The noisy-OR model captures precisely this notion:

P(s = 1 \mid s_1, s_2, \ldots, s_K) = 1 - \prod_{k=1}^{K} \theta_k^{\mathbb{I}[s_k = 1]},   (4.5)

where the more s_k = 1, the lower the false-negative rate after combination. Note that the noisy-OR model reduces to the OR-model eq. (4.3) when \theta_k = 0 for all k.
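A small numerical sketch of the model up to this point is given below: each component scores the pair with Eq. (4.2) using a PSD metric M_k = L_k^T L_k, and the component votes are combined with the (noisy-)OR; the closed-form marginal that the code evaluates is the one derived next in Eq. (4.6). All parameter values here are arbitrary placeholders, not learned quantities.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def local_similarity_prob(u, v, M_k, b_k):
    """Eqs. (4.1)-(4.2): p_k = (1 + e^{-b_k}) (1 - sigma(d_k - b_k)) with
    d_k = (u - v)^T M_k (u - v); M_k must be PSD so that d_k >= 0 and p_k <= 1."""
    diff = u - v
    d_k = diff @ M_k @ diff
    return (1.0 + np.exp(-b_k)) * (1.0 - sigmoid(d_k - b_k))

def prob_similar(u, v, Ms, bs, thetas):
    """(Noisy-)OR combination: component k 'votes' similar with probability
    p_k and its vote survives with probability 1 - theta_k (Eqs. (4.4)-(4.5));
    theta_k = 0 for all k recovers the plain OR model of Eq. (4.3).
    The product form is the marginal of Eq. (4.6) derived below."""
    p = np.array([local_similarity_prob(u, v, M, b) for M, b in zip(Ms, bs)])
    return 1.0 - np.prod(1.0 - (1.0 - np.array(thetas)) * p)

rng = np.random.default_rng(0)
D, K = 6, 3
u, v = rng.normal(size=D), rng.normal(size=D)
Ls = [rng.normal(size=(2, D)) for _ in range(K)]
Ms = [L.T @ L for L in Ls]          # PSD metric per latent component
print(prob_similar(u, v, Ms, bs=[1.0] * K, thetas=[0.1] * K))
```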
Similarity model Our desired model for the conditional probability p(s|u, v) is obtained by marginalizing over all possible configurations of the latent components s = \{s_1, s_2, \ldots, s_K\}:

P(s = 0 \mid u, v) = \sum_{s} P(s = 0 \mid s) \prod_k P(s_k \mid u, v)
  = \sum_{s} \prod_k \theta_k^{\mathbb{I}[s_k = 1]} P(s_k \mid u, v)
  = \prod_k [\theta_k p_k + 1 - p_k] = \prod_k [1 - \bar{\theta}_k p_k],   (4.6)

where p_k = p(s_k = 1 \mid u, v) is shorthand for eq. (4.2). Note that, despite the exponential number of configurations of s, the marginalized probability is tractable. For the OR-model, where \theta_k = 0, the conditional probability simplifies to P(s = 0 \mid u, v) = \prod_k [1 - p_k].

4.2.2 Inference and Learning

Given an annotated training dataset D = \{(x_m, x_n, s_{mn})\}, we learn the parameters, which include all the positive semidefinite matrices M_k, the biases b_k, and the false-negative rates \theta_k (if the noisy-OR is used), by maximizing the likelihood of D. Note that we assume K is known throughout this work. We develop an EM-style algorithm to find a local optimum of the likelihood.

Posterior The posteriors over the hidden variables are computationally tractable:

q_k = P(s_k = 1 \mid u, v, s = 0) = \frac{p_k \theta_k \prod_{l \neq k} [1 - \bar{\theta}_l p_l]}{P(s = 0 \mid u, v)},
r_k = P(s_k = 1 \mid u, v, s = 1) = \frac{p_k \left(1 - \theta_k \prod_{l \neq k} [1 - \bar{\theta}_l p_l]\right)}{P(s = 1 \mid u, v)}.   (4.7)

For the OR-model eq. (4.3), these posteriors simplify further, as all \theta_k = 0.

Note that these posteriors are sufficient to learn the parameters M_k and b_k. To learn the parameters \theta_k, however, we need to compute the expected likelihood with respect to the posterior P(s_1, \ldots, s_K \mid u, v, s). While this posterior is tractable, the expectation of the likelihood is not, and variational inference is needed [106]. We omit the derivation for brevity. In what follows, we focus on learning M_k and b_k.

For the k-th component, the relevant terms in the expected log-likelihood, given the posteriors, from a single similarity assessment s on (u, v), are

J_k = q_k^{1-s} r_k^{s} \log P(s_k = 1 \mid u, v) + (1 - q_k^{1-s} r_k^{s}) \log(1 - P(s_k = 1 \mid u, v)).   (4.8)

Learning the parameters Note that J_k is not jointly convex in b_k and M_k. Thus, we optimize them alternately. Concretely, fixing M_k, we grid search and optimize over b_k. Fixing b_k, maximizing J_k with respect to M_k is a convex optimization, as J_k is a concave function of M_k given the linear dependency of the distance eq. (4.1) on this parameter. We use the method of projected gradient ascent. Essentially, we take a gradient ascent step to update M_k iteratively. If the update violates the positive semidefinite constraint, we project back to the feasible region by setting all negative eigenvalues of M_k to zero. Alternatively, we have found that reparameterizing J_k with M_k = L_k^\top L_k is computationally more advantageous, as L_k is unconstrained. We use L-BFGS to optimize with respect to L_k and obtain faster convergence and better objective function values. (While this procedure only guarantees local optima, we observe no significant detrimental effect of arriving at those solutions.) We give the exact form of the gradients with respect to M_k and L_k below.

E-step: Objective function We have given the posterior distribution of the (hidden) local similarity value variables. Below, we derive the form of the expected complete-data log-likelihood conditioned on a pair (u, v).
\mathbb{E}_{p(s)}[\log P(s, s_1, \ldots, s_K \mid u, v)]
  = \mathbb{E}_{p(s)}\Big[\sum_{k=1}^{K} \log P(s_k \mid u, v) + \log P(s \mid s_1, s_2, \ldots, s_K)\Big]
  = \mathbb{E}_{p(s)}\Big[\sum_{k=1}^{K} \big(s_k \log p_k + (1 - s_k) \log(1 - p_k)\big) + s \log\Big(1 - \prod_{k=1}^{K} \theta_k^{\mathbb{I}[s_k = 1]}\Big) + (1 - s) \sum_{k=1}^{K} s_k \log \theta_k\Big]
  = \sum_{k=1}^{K} \big(q_k^{1-s} r_k^{s} \log p_k + (1 - q_k^{1-s} r_k^{s}) \log(1 - p_k)\big) + \mathbb{E}_{p(s)}\Big[s \log\Big(1 - \prod_{k=1}^{K} \theta_k^{\mathbb{I}[s_k = 1]}\Big)\Big] + (1 - s) \sum_{k=1}^{K} q_k^{1-s} r_k^{s} \log \theta_k,   (4.9)

where p(s) = \Pr(s_1, \ldots, s_K \mid u, v, s) = \prod_{k=1}^{K} \Pr(s_k \mid u, v, s), p_k = P(s_k = 1 \mid u, v) is given by eq. (4.2), and q_k = P(s_k = 1 \mid u, v, s = 0) and r_k = P(s_k = 1 \mid u, v, s = 1) are given by eq. (4.7). Note that the last equality uses the fact that \mathbb{E}_{p(s)}[s_k] = q_k^{1-s} r_k^{s}. The third term, \mathbb{E}_{p(s)}[s \log(1 - \prod_{k=1}^{K} \theta_k^{\mathbb{I}[s_k = 1]})], is not tractable; variational methods can be used, as described in [106].

M-step: Optimization We give the form of the gradients with respect to M_k and L_k in the (noisy-)OR model. As given in Sect. 4.2 and above, for the k-th aspect, the relevant terms in the expected log-likelihood given the posteriors, from a single similarity assessment, are

J_k = w_k \log p_k + (1 - w_k) \log(1 - p_k),   (4.10)

where p_k = P(s_k = 1 \mid u, v) = (1 + e^{-b_k})[1 - \sigma(d_k - b_k)] and w_k denotes q_k^{1-s} r_k^{s}. Taking the derivative with respect to d_k gives us

\frac{\partial J_k}{\partial d_k} = -w_k \sigma + (1 - w_k)\, \frac{\sigma (1 - \sigma)}{\sigma - (c_k - 1)/c_k},   (4.11)

where \sigma is short for \sigma(d_k - b_k) and c_k = 1 + e^{-b_k}. For the two parameterizations d_k = (u - v)^\top M_k (u - v) and d_k = (u - v)^\top L_k^\top L_k (u - v), we have

\frac{\partial d_k}{\partial M_k} = (u - v)(u - v)^\top,   \frac{\partial d_k}{\partial L_k} = 2 L_k (u - v)(u - v)^\top.   (4.12)

It follows that the gradients with respect to M_k and L_k are

\frac{\partial J_k}{\partial M_k} = \Big[-w_k \sigma + (1 - w_k)\, \frac{\sigma (1 - \sigma)}{\sigma - (c_k - 1)/c_k}\Big] (u - v)(u - v)^\top,
\frac{\partial J_k}{\partial L_k} = 2 \Big[-w_k \sigma + (1 - w_k)\, \frac{\sigma (1 - \sigma)}{\sigma - (c_k - 1)/c_k}\Big] L_k (u - v)(u - v)^\top.   (4.13)

4.2.3 Extensions

Variants of local similarity models The choice of the logistic-like function eq. (4.2) for modeling the local similarity of the latent components is orthogonal to how those similarities are combined in eq. (4.3) or eq. (4.5). Thus, it is relatively straightforward to replace eq. (4.2) with a more suitable one. For instance, in some of our empirical studies, we constrain M_k to be a diagonal matrix with nonnegative diagonal elements. This is especially useful when the feature dimensionality is extremely high. We view this flexibility as a modeling advantage.

Disjoint components We could also explicitly express the desideratum that latent components focus on non-overlapping features. To this end, we penalize the likelihood of the data with the following regularizer to promote disjoint components:

R(\{M_k\}) = \sum_{k \neq k'} \mathrm{diag}(M_k)^\top \mathrm{diag}(M_{k'}),   (4.14)

where \mathrm{diag}(\cdot) extracts the diagonal elements of the matrix. As the metrics are constrained to be positive semidefinite, the inner product attains its minimum of zero when the diagonal elements, which are nonnegative, are orthogonal to each other. This introduces zero elements on the diagonals of the metrics, which in turn deselects the corresponding feature dimensions, because the corresponding rows and columns of those elements are necessarily zero due to the positive semidefinite constraints. Thus, metrics that have orthogonal diagonal vectors use non-overlapping subsets of features.

4.3 Experiments

We validate the effectiveness of SCA in modeling similarity relationships on three tasks. In Sect. 4.3.1, we apply SCA to synthetic datasets where the ground truth is available, to confirm SCA's ability to identify the underlying parameters correctly. In Sect. 4.3.2, we apply SCA to a multiway classification task of recognizing images of handwritten digits, where similarity is equated with having the same class label.
SCA attains classification accuracy superior to state-of-the-art metric learning algorithms. In Sect. 4.3.3, we apply SCA to a link prediction problem for a network of scientific articles. On this task, SCA outperforms competing methods significantly as well.

Our baseline algorithms for modeling similarity are information-theoretic metric learning (ITML) [50] and large margin nearest neighbor (LMNN) [254]. Both are discriminative approaches in which a metric is optimized to reduce the distances between data points from the same label class (or similar data instances) and to increase the distances between data points from different classes (or dissimilar data instances). When possible, we also compare to multiple-metric LMNN (MM-LMNN) [254], a variant of LMNN in which multiple metrics are learned from data.

4.3.1 Synthetic Data

Figure 4.2: On synthetic datasets, SCA successfully identifies the sparse structures and (non)overlapping patterns of the ground-truth metrics: (a) disjoint ground-truth metrics; (b) overlapping ground-truth metrics. In each panel, the top row shows the true metrics and the bottom row the recovered metrics. See the text for details. Best viewed in color.

Data We generate a synthetic dataset according to the graphical model in Fig. 4.1. Specifically, the feature dimensionality is D = 30 and the number of latent components is K = 5. For each component k, the corresponding metric M_k is a D x D sparse positive semidefinite matrix in which only the elements of a 6 x 6 block on the diagonal are nonzero. Moreover, for different k, these block matrices do not overlap in row and column indices. In short, these metrics mimic a setup where each component focuses on its own 1/K-th of the features, disjoint from the others. The first row of Fig. 4.2(a) illustrates these 5 matrices, where the black background color indicates zero elements. The values of the nonzero elements are randomly generated, as long as they maintain the positive semidefiniteness of the metrics. We set the bias terms b_k to zero for all components. We sample N = 500 data points randomly from R^D. We select a random pair and compute their similarity according to eq. (4.6), thresholding at 0.5 to yield a binary label s in {0, 1}. We randomly select 74,850 pairs for training, 24,950 for development, and 24,950 for testing.

Method We use the OR-model eq. (4.3) to combine latent components. We evaluate SCA on two aspects: how well we can recover the ground-truth metrics (and biases), and how well we can use the learned parameters to predict similarities on the test set.

Results The second row of Fig. 4.2(a) contrasts the learned metrics with the ground truth (the first row). Clearly, these two sets of metrics have almost identical shapes and sparse structures. Note that for this experiment, we did not use the disjoint regularizer (described in Sect. 4.2.3) to promote sparsity and disjointness in the learned metrics. Yet, the SCA model is still able to
Yet, the SCA model is still able to 33 Table 4.1: Similarity prediction accuracies and standard errors (%) on the synthetic dataset BASELINES SCA ITML LMNN K = 1 K = 3 K = 5 K = 7 K = 10 K = 20 72.70.0 71.30.2 72.80.0 82.10.1 91.50.1 91.70.1 91.80.1 90.20.4 Table 4.2: Misclassification rates (%) on the MNIST recognition task BASELINES SCA D EUC. ITML LMNN MM-LMNN K = 1 K = 5 K = 10 25 21.6 15.1 20.6 20.2 17.7 0.9 16.0 1.5 14.5 0.6 50 18.7 13.35 16.5 13.6 13.8 0.3 12.0 1.1 11.4 0.6 100 18.1 11.85 13.4 9.9 12.1 0.1 10.8 0.6 11.1 0.3 identify those structures. For the biases, SCA identifies them as being close to zero (details are omitted for brevity). Table 4.1 contrasts the prediction accuracies by SCA to competing methods. Note that ITML, LMNN and SCA withK = 1 perform similarly. However, when the number of latent components increases, SCA outperforms other approaches by a large margin. Also note that when the number of latent components exceeds the ground-truthK = 5, SCA reaches a plateau until overfitting. In real-world data, “true metrics” may overlap, that is, it is possible that different components of similarity rely on overlapping set of features. To examine SCA’s effectiveness in this scenario, we create another synthetic data where true metrics heavily overlap, illustrated in the first row of Fig. 4.2(b). Nonetheless, SCA is able to identify the metrics correctly, as seen in the second row. 4.3.2 Multiway Classification For this task, we use the MNIST dataset, which consists of 10 classes of hand-written digit im- ages. We use PCA to reduce the original dimension from 784 toD = 25; 50 and 100, respectively. We use 4200 examples for training, 1800 for development and 2000 for testing. The data is in the format of (x n ;y n ) wherey n is the class label. We convert them into the format (x m ;x n ;s mn ) that SCA expects. Specifically, for every training data point, we select its 15 nearest neighbors among samples in the same class and formulate 15 similar relationships. For dissimilar relationships, we select its 80 nearest neighbors among samples from the rest classes. For testing, the labely ofx is determined by y = arg max c s c = arg max c X x 0 2Bc(x) P (s = 1jx;x 0 ) (4.15) where s c is the similarity score to the c-th class, computed as the sum of 5 largest similarity valuesB c to samples in that class. In Table 4.2, we show classification error rates for different values of D. For K > 1, SCA clearly outperforms single-metric based baselines. In addition, SCA performs well compared to MM-LMNN, achieving far better accuracy for smallD. 4.3.3 Link Prediction We evaluate SCA on the task of link prediction in a “social” network of scientific articles. We aim to demonstrate SCA’s power to model similarity/dissimilarity in “multiplex” real-world network 34 Table 4.3: Section names for NIPS 1987 – 1999 ID Names ID Names ID Names 1 Cognitive Science 4 Algo. and Arch. 7 Visual Processing 2 Neuroscience 5 Implementation 8 Applications 3 Learning Theory 6 Speech & Sign. Proc. 9 Ctrl., Navi., & Plan. data. In particular, we are interested in not only link prediction accuracies, but also the insights about data that we gain from analyzing the identified latent components. Setup We use the NIPS 0-12 dataset 1 to construct the aforementioned network. The dataset contains papers from the NIPS conferences between 1987 and 1999. The papers are organized into 9 sections (topics), shown in Table 4.3. We sample randomly 80 papers per section and use them to construct the network. 
Each paper is a vertex and two papers are connected with an edge and deemed as similar if both of them belong to the same section. We experiment three representations for the papers: (1) Bag-of-words (BoW) uses normalized occurrences (frequencies) of words in the documents. As a preprocessing step, we remove “rare” words that appear less than 75 times and appear more than 240. Those words are either too specialized (thus generalize poorly) or just functional words. After the removal, we obtain 1067 words. (2) Topic (ToP) uses the documents’ topic vectors (mixture weights of topics) after fitting the corpus to a 50-topic LDA [23]. (3) Topic-words (ToW) is essentially BoW except that we retain only 1036 frequent words used by the topics of the LDA model (top 40 words per topic). Methods We compare the proposed SCA extensively to several competing methods for link prediction. For BoW and ToW represented data, we compare SCA with diagonal metrics (SCA- DIAG, cf. Sect. 4.2.3) to Support Vector Machines (SVM) and logistic regression (LOGIT) to avoid high computational costs associated with learning high-dimensional matrices (the feature dimen- sionalityD 1000). To apply SVM/LOGIT, we treat the link prediction as a binary classification problem where the input is the absolute difference in feature values between the two data points. For 50-dimensional ToP represented data, we compare SCA (SCA) and SCA-DIAG to SVM/LOGIT, information-theoretical metric learning (ITML), and large margin nearest neighbor (LMNN). Note that while LMNN was originally designed for nearest-neighbor based classification, it can be adapted to use similarity information to learn a global metric to compute the distance be- tween any pair of data points. We learn such a metric and threshold on the distance to render a decision on whether two data points are similar or not (i.e., whether there is a link between them). On the other end, multiple-metric LMNN, while often having better classification perfor- mance, cannot be used for similarity and link prediction as it does not provide a principled way of computing distances between two arbitrary data points when there are multiple (local) metrics. Link or not? In Table 4.4, we report link prediction accuracies, which are averaged over several runs of randomly generated 70/30 splits of the data. SVM and LOGIT perform nearly identically so we report only SVM. For both SCA and SCA-DIAG, we report results when a single component is used as well as when the optimal number of components are used (under columnsK ). Both SCA-DIAG and SCA outperform the rest methods by a significant margin, especially when the number of latent components is greater than 1 (K ranges from 3 to 13, depending 1 http://www.stats.ox.ac.uk/ ˜ teh/data.html 35 Table 4.4: Link prediction accuracies and their standard errors (%) on a network of scientific papers NIPS 0-12 Feature BASELINES SCA-DIAG SCA type SVM ITML LMNN K = 1 K K = 1 K BoW 73.30.0 - - 64.8 0.1 87.0 1.2 - - ToW 75.30.0 - - 67.0 0.0 88.1 1.4 - - ToP 71.20.0 81.10.1 80.70.1 62.6 0.0 81.0 0.8 81.0 0.0 87.6 1.0 Table 4.5: Link prediction accuracies and their standard errors (%) on a network of scientific papers NIPS-3 Feature BASELINES SCA-DIAG SCA type SVM ITML LMNN K = 1 K K = 1 K BoW 73.30.0 - - 70.5 0.1 89.5 0.8 - - ToW 80.40.0 - - 76.5 0.1 87.8 1.1 - - ToP 75.10.0 83.80.2 75.40.3 70.5 0.1 84.3 0.9 83.6 0.1 89.6 1.1 on the methods and the feature types). 
The only exception is SCA-DIAG with one component (K = 1), which is an overly restrictive model as the diagonal metrics constrain features to be combined additively. This restriction is overcome by using a larger number of components. Besides the dataset described in the experiments above, we report link prediction accuracies on another network (we call NIPS-3), also sampled from NIPS 0-12, to further validate the ef- fectiveness of SCA. This network uses only 450 papers from NIPS 1997 to NIPS 1999 and has 16059 edges. In this dataset, the numbers of papers are not balanced across sections — there are significantly more papers in Learning Theory and Algorithms & Architectures than Implementa- tions and Speech & Signal Processing. Nonetheless, in Table 4.5, we see again that SCA achieves a significant improvement on the link prediction accuracies, similar to that in Table 4.4. Edge component analysis Why does learning latent components in SCA achieve superior link prediction accuracies? The (noisy-) OR model used by SCA is naturally inclined to favoring “positive” opinions — a pair of samples are regarded as being similar as long as there is one latent component strongly believing so. This implies that a latent component can be tuned to a specific group of samples if those samples rely on common feature characteristics to be similar. Fig. 4.3(a) confirms our intuition. The plot displays in relative strength — darker being stronger — how much each latent component believes a pair of articles from the same section should be similar. Concretely, after fitting a 9-component SCA (from documents in ToP fea- tures), we consider edges connecting articles in the same section and compute the average local similarity values assigned by each component. We observe two interesting sparse patterns: for each section, there is a dominant latent component that strongly supports the fact that the articles from that section should be similar (e.g., for section 1, the dominant one is the 9-th component). Moreover, for each latent component, it often strongly “voices up” for one section – the exception is the second component which seems to support both section 3 and 4. Nonetheless, the general picture is that, each section has a signature in terms of how similarity values are distributed across latent components. This notion is further illustrated, with greater details, in Fig. 4.3(b). While Fig. 4.3(a) depicts averaged signature for each section, the scatterplot displays 2D embeddings computed with the 36 Metric ID Section ID 2 4 6 8 1 2 3 4 5 6 7 8 9 (a) Averaged component- wise similarity values of edges within each section 1 2 3 4 5 6 7 8 9 (b) Embedding of links, represented with component- wise similarity values 1 2 3 4 5 6 7 8 9 (c) Embedding of net- work nodes (documents), represented in LDA’s topics Figure 4.3: Edge component analysis. Representing network links with local similarity values reveals in- teresting structures, such as nearly one-to-one correspondence between latent components and sections, as well as clusters. However, representing articles in LDA’s topics does not reveal useful clustering structures such that links can be inferred. See texts for details. Best viewed in color. t-SNE algorithm, on each individual edge’s signature — 9-dimensional similarity values inferred with the 9 latent components. The embeddings are very well organized in 9 clusters, colored with section IDs. 
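Assuming the fitted SCA model exposes its per-component (local) similarity values through a hypothetical `local_similarities(x_m, x_n)` method returning a length-K vector, the edge signatures behind Fig. 4.3(a) and (b) can be computed roughly as follows.

```python
import numpy as np
from sklearn.manifold import TSNE

def edge_signatures(model, X, edges):
    """One K-dimensional vector per edge: how strongly each latent component
    believes the two endpoints are similar."""
    return np.array([model.local_similarities(X[m], X[n]) for m, n in edges])

def section_signatures(signatures, edges, section_of, num_sections=9):
    """Average component-wise similarity over within-section edges (Fig. 4.3(a))."""
    sums = np.zeros((num_sections, signatures.shape[1]))
    counts = np.zeros(num_sections)
    for sig, (m, n) in zip(signatures, edges):
        if section_of[m] == section_of[n]:          # within-section edges only
            sums[section_of[m] - 1] += sig
            counts[section_of[m] - 1] += 1
    return sums / counts[:, None]

def embed_edges(signatures):
    """2-D t-SNE embedding of individual edge signatures (Fig. 4.3(b))."""
    return TSNE(n_components=2, init="pca", random_state=0).fit_transform(signatures)
```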
In contrast, embedding documents using their topic representations does not reveal clear clus- tering structures such that network links can be inferred. This is shown in Fig. 4.3(c) where each dot corresponds to a document and the low-dimensional coordinates are computed using t-SNE (symmetrized KL divergence between topics is used as a distance measure). We observe that while topics themselves do not reveal intrinsic (network) structures, latent components are able to achieve so by applying highly-specialized metrics to measure local similarities and yield char- acteristic signatures. We also study whether or not the lack of an edge between a pair of dissimilar documents from different sections, can give rise to characteristic signatures from the latent components. In summary, we do not observe those telltale signatures for those pairs. Detailed results are below. We have observed characteristic signatures from the latent components that result from edges between similar documents. One natural question to ask is whether or not the lack of edges between dissimilar documents (in this case, those from different sections) can give rise to such signatures, too. In Fig. 4.4, we show the average component-wise dissimilarity values of edges between different sections (how much each latent component believes a pair of articles from different sections should be dissimilar) for ToW feature type with K = 9. Not surprisingly, we do not observe those telltale signatures - for those pairs of data, almost all latent components vote them as being dissimilar strongly. This suggests that those latent components have both high sensitivity and specificity. Subset of features picked by sparse metrics In Fig. 4.5, we show the diagonal entries of the metrics in the case of ToP and ToW features, both for K = 9. We can think of the diagonal entries as the weights put on such features. Features picked by these metrics seem to be sparse and disjoint — signified by the non-overlapping spiky structure. This validates our assumption that each latent component evaluates its similarity value using a different subset of features. We show corresponding features according to these diagonal weights in Table 4.6 and Table 4.7. We observe that, for each metric, its top features (words or topics) tend to appear rarely in the 37 Section Section Metric 1 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 2 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 3 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 4 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 5 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 6 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 7 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 8 2 4 6 8 1 2 3 4 5 6 7 8 9 Section Section Metric 9 2 4 6 8 1 2 3 4 5 6 7 8 9 Figure 4.4: Average component-wise dissimilarity values of edges between different sections. Darker indicates higher dissimilarity values. 
0 10 20 30 40 50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Normalized Weights Topic ID 1 2 3 4 5 6 7 8 9 (a) ToP 0 200 400 600 800 1000 0 0.02 0.04 0.06 0.08 0.1 0.12 Normalized Weights Word ID 1 2 3 4 5 6 7 8 9 (b) ToW Figure 4.5: The normalized diagonal values of metrics forK = 9 38 Table 4.6: Top five ToW features for each metric Metric Top 5 Features 1 radio financial trains costs achieve 2 curve representations image attractor kalman 3 trained speech signals statistics class 4 retina robot gate regression size 5 learning model data state activity 6 stress margin evidence actor barto 7 dot views conclusion moody perturbation 8 implementation hmm kalman rbf vlsi 9 speech chip period vision regression sections that the metric can predict similarity well. In other words, two documents are similar in one aspect often because they both lack a certain kind of features. One explanation is that each component tries to reduce sensitivity when predicting similarity by picking features that are less varied in values. 4.4 Related Work Our model learns multiple metrics, one for each latent component. However, the similarity (or associated dissimilarity) from our model is definitely non-metric due to the complex combination. This stands in stark contrast to most metric learning algorithms [50, 82, 87, 96, 108, 187, 249, 254, 262]. [148] gives an information-theoretic definition of (non-metric) similarity as long as there is a probabilistic model for the data. Our approach of SCA focuses on the relationship between data but not data themselves. [243] proposes visualization techniques for non-metric similarity data. Our work is reminiscent of probabilistic modeling of overlapping communities in social net- works, such as the mixed membership stochastic blockmodels [3]. The key difference is that those works model vertices with a mixture of latent components (communities) where we model the interactions between vertices with a mixture of latent components. [1, 2] studies a social network whose edge set is the union of multiple edge sets in hidden similarity spaces. Our work explicitly models the probabilistic process of combining latent components with a (noisy-)OR gate. 4.5 Conclusion We propose Similarity Component Analysis (SCA) for probabilistic modeling of similarity re- lationship for pairwise data instances. The key ingredient of SCA is to model similarity as a complex combination of multiple latent components, each giving rise to a local similarity value. SCA attains significantly better accuracies than existing methods on both classification and link prediction tasks. 
39 Table 4.7: Top three ToP features for each metric Metric Top 3 Features 1 1: human similarity subjects generalization performance similar 2: image images object recognition face feature features objects 3: model motor position control eye movement forward trajectory 2 29: motion direction figure velocity optical flow retina time 18: action state reinforcement policy actions learning reward 35: visual target system attention location information search 3 36: time eeg activity attractor data response brain signal single 1: human similarity subjects generalization performance similar 50: representation sequence representations information level 4 27: words user context word text information documents query 29: motion direction figure velocity optical flow retina time 2: image images object recognition face feature features objects 5 16: spike firing information rate time spikes neuron model input 2: image images object recognition face feature features objects 23: phase figure adaptation contour segment oscillators segments 6 31: kernel vector set support function data regression training 49: time series prediction signal filter neural gamma kalman 13: block blocks data time algorithm search program parallel 7 8: function functions bound theorem bounds loss error proof 24: learning error noise training weight generalization teacher 7: energy correlation binary function correlations population 8 13: block blocks data time algorithm search program parallel 41: recognition character characters digit neural segmentation 25: language connectionist symbol symbols set rules languages 9 22: time call path rl channel problem traffic routing rate paths 29: motion direction figure velocity optical flow retina time 30: distribution probability variables approximation distributions 40 Part III Learning and Leveraging Semantic Similarity for Zero-Shot Visual Recognition 41 Chapter 5 Algorithms We describe our methods for addressing (conventional) zero-shot learning, where the task is to classify images from unseen classes into the label space of unseen classes. We first describe, SYNC, a manifold-learning-based method for synthesizing classifiers of unseen classes. We then describe, EXEM, an approach that automatically improves semantic representations through vi- sual exemplar synthesis. EXEM can generally be combined with any zero-shot learning algo- rithms, and can by itself operate as a zero-shot learning algorithm. Notations We denote byD =f(x n 2 R D ;y n )g N n=1 the training data with the labels coming from the label space of seen classesS =f1; 2; ;Sg. Denote byU =fS + 1; ;S +Ug the label space of unseen classes. LetT =S[U. For each classc2T , we assume that we have access to its semantic representationa c . 5.1 Synthesized Classifiers for Zero-Shot Learning In Sect. 3.1, we describe the two main challenges for zero-shot learning: (1) how to relate unseen classes to seen ones and (2) how to attain optimal discriminative performance on the unseen classes even though we do not have their labeled data. In this section, we propose to tackle the second challenge with ideas from manifold learning [14, 98], converging to a two-pronged approach. We view object classes in a semantic space as a weighted graph where the nodes correspond to object class names and the weights of the edges represent how they are related. Various information sources can be used to infer the weights — human-defined attributes or word vectors learnt from language corpora. 
On the other end, we view models for recognizing visual images of those classes as if they live in a space of models. In particular, the parameters for each object model are nothing but coordinates in this model space whose geometric configuration also reflects the relatedness among objects. Fig. 5.1 illustrates this idea conceptually. But how do we align the semantic space and the model space? The semantic space coordi- nates of objects are designated or derived based on external information (such as textual data) that do not directly examine visual appearances at the lowest level, while the model space concerns itself largely for recognizing low-level visual features. To align them, we view the coordinates in the model space as the projection of the vertices on the graph from the semantic space — there is a wealth of literature on manifold learning for computing (low-dimensional) Euclidean space embeddings from the weighted graph, for example, the well-known algorithm of Laplacian eigenmaps [14]. To adapt the embeddings (or the coordinates in the model space) to data, we introduce a set of phantom object classes — the coordinates of these classes in both the semantic space and the 42 Semantic space Model space Scarlet Tanager Cardinal a 1 a 2 a 3 b 1 b 2 w 1 w 2 w 3 v 1 v 2 v 3 b 3 Bobolink Figure 5.1: Illustration of SYNC for zero-shot learning. Object classes live in two spaces. They are characterized in the semantic space with semantic representations (a s ) such as attributes or word vectors of their names. They are also represented as models for visual recognition (w s ) in the model space. In both spaces, those classes form weighted graphs. The main idea behind our approach is that these two spaces should be aligned. In particular, the coordinates in the model space should be the projection of the graph vertices from the semantic space to the model space — preserving class relatedness encoded in the graph. We introduce adaptable phantom classes (b andv) to connect seen and unseen classes — classifiers for the phantom classes are bases for synthesizing classifiers for real classes. In particular, the synthesis takes the form of convex combination. model space are adjustable and optimized such that the resulting model for the real object classes achieve the best performance in discriminative tasks. However, as their names imply, those phan- tom classes do not correspond to and are not optimized to recognize any real classes directly. For mathematical convenience, we parameterize the weighted graph in the semantic space with the phantom classes in such a way that the model for any real class is a convex combinations of the coordinates of those phantom classes. In other words, the “models” for the phantom classes can also be interpreted as bases (classifiers) in a dictionary from which a large number of classifiers for real classes can be synthesized via convex combinations. In particular, when we need to con- struct a classifier for an unseen class, we will compute the convex combination coefficients from this class’s semantic space coordinates and use them to form the corresponding classifier. To summarize, our main contribution is a novel idea to cast the challenging problem of recog- nizing unseen classes as learning manifold embeddings from graphs composed of object classes. 
As a concrete realization of this idea, we show how to parameterize the graph with the loca- tions of the phantom classes, and how to derive embeddings (i.e., recognition models) as convex combinations of base classifiers. Our empirical studies extensively test our synthesized classi- fiers on four benchmark datasets for zero-shot learning, including the full ImageNet Fall 2011 release [52] with 20,345 unseen classes. The experimental results are very encouraging; the synthesized classifiers outperform several state-of-the-art methods, including attaining better or matching performance of Google’s ConSE algorithm [183] in the large-scale setting. 5.1.1 Main Idea: Manifold Learning We propose a zero-shot learning method of synthesized classifiers, called SYNC. We focus on linear classifiers in the visual feature spaceR D that assign a label ^ y to a data pointx by ^ y = arg max c w T c x; (5.1) 43 wherew c 2 R D , although our approach can be readily extended to nonlinear settings by the kernel trick [218]. The main idea behind our approach is shown by the conceptual diagram in Fig. 5.1. Each classc has a coordinatea c and they live on a manifold in the semantic representation space. We use attributes in this text to illustrate the idea but in the experiments we test our approach on multiple types of semantic representations. Additionally, we introduce a set of phantom classes associated with semantic representations b r ;r = 1; 2;:::;R. We stress that they are phantom as they themselves do not correspond to any real objects — they are introduced to increase the modeling flexibility, as shown below. The real and phantom classes form a weighted bipartite graph, with the weights defined as s cr = expfd(a c ;b r )g P R r=1 expfd(a c ;b r )g (5.2) to correlate a real classc and a phantom classr, where d(a c ;b r ) = (a c b r ) T 1 (a c b r ); (5.3) and 1 is a parameter that can be learned from data, modeling the correlation among attributes. For simplicity, we set = 2 I and tune the scalar free hyper-parameter by cross-validation (Sect. A.7). The more general Mahalanobis metric can be used and we propose one way of learning such metric in Sect. 5.1.3. The specific form of defining the weights is motivated by several manifold learning methods such as SNE [98]. In particular,s cr can be interpreted as the conditional probability of observing class r in the neighborhood of class c. However, other forms can be explored and are left for future work. In the model space, each real class is associated with a classifierw c and the phantom classr is associated with a virtual classifierv r . We align the semantic and the model spaces by viewing w c (orv r ) as the embedding of the weighted graph. In particular, we appeal to the idea behind Laplacian eigenmaps [14], which seeks the embedding that maintains the graph structure as much as possible; equally, the distortion error min wc;vr kw c R X r=1 s cr v r k 2 2 is minimized. This objective has an analytical solution w c = R X r=1 s cr v r ; 8c2T =f1; 2; ;S +Ug (5.4) In other words, the solution gives rise to the idea of synthesizing classifiers from those virtual classifiersv r . For conceptual clarity, from now on we refer tov r as base classifiers in a dictionary from which new classifiers can be synthesized. We identify several advantages. First, we could construct an infinite number of classifiers as long as we know how to computes cr . 
Second, by making R S, the formulation can significantly reduce the learning cost as we only need to learnR base classifiers. 44 5.1.2 Learning Phantom Classes Learning base classifiers We learn the base classifiersfv r g R r=1 from the training data (of the seen classes only). We experiment with two settings. To learn one-versus-other classifiers, we optimize, min v 1 ;;v R S X c=1 N X n=1 `(x n ;I yn;c ;w c ) + 2 S X c=1 kw c k 2 2 ; (5.5) s:t: w c = R X r=1 s cr v r ; 8c2T =f1; ;Sg where`(x;y;w) = max(0; 1yw T x) 2 is the squared hinge loss. The indicatorI yn;c 2f1; 1g denotes whether or noty n = c. Alternatively, we apply the Crammer-Singer multi-class SVM loss [47], given by ` cs (x n ; y n ;fw c g S c=1 ) = max(0; max c2Sfyng (c;y n ) +w c T x n w yn T x n ); We have the standard Crammer-Singer loss when the structured loss (c;y n ) = 1 if c6= y n , which, however, ignores the semantic relatedness between classes. We additionally use the ` 2 distance for the structured loss (c;y n ) =ka c a yn k 2 2 to exploit the class relatedness in our experiments. These two learning settings have separate strengths and weaknesses in our empirical studies. Learning semantic representations The weighted graph Eq. (5.2) is also parameterized by adaptable embeddings of the phantom classesb r . For this work, however, for simplicity, we assume that each of them is a sparse linear combination of the seen classes’ attribute vectors: b r = S X c=1 rc a c ;8r2f1; ;Rg; Thus, to optimize those embeddings, we solve the following optimization problem min fvrg R r=1 ;frcg R;S r;c=1 S X c=1 N X n=1 `(x n ;I yn;c ;w c ) (5.6) + 2 S X c=1 kw c k 2 2 + R;S X r;c=1 j rc j + 2 R X r=1 (kb r k 2 2 h 2 ) 2 ; s:t: w c = R X r=1 s cr v r ; 8c2T =f1; ;Sg; whereh is a predefined scalar equal to the norm of real attribute vectors (i.e., 1 in our experiments since we perform` 2 normalization). Note that in addition to learningfv r g R r=1 , we learn com- bination weightsf rc g R;S r;c=1 : Clearly, the constraint together with the third term in the objective 45 encourages the sparse linear combination of the seen classes’ attribute vectors. The last term in the objective demands that the norm ofb r is not too far from the norm ofa c . We perform alternating optimization for minimizing the objective function with respect to fv r g R r=1 andf rc g R;S r;c=1 . While this process is non-convex, there are useful heuristics to initialize the optimization routine. For example, if R = S, then the simplest setting is to letb r =a r for r = 1;:::;R. If R S, we can let them be (randomly) selected from the seen classes’ attribute vectorsfb 1 ;b 2 ; ;b R gfa 1 ;a 2 ; ;a S g, or first perform clustering onfa 1 ;a 2 ; ;a S g and then let eachb r be a combination of the seen classes’ attribute vectors in clusterr. IfR> S, we could use a combination of the above two strategies. There are four hyper-parameters;;; and to be tuned. To reduce the search space during cross-validation, we first fixb r =a r for r = 1;:::;R and tune;. Then we fix and and tune and . 5.1.3 Extension: Learning Metrics for Computing Similarities Between Semantic Representations Recall that the weights in the bipartite graph are defined based on the distance d(a c ;b r ) = (a c b r ) T 1 (a c b r ). In this section, we describe an objective for learning a more gen- eral Mahalanobis metric than 1 = 2 I. We focus on the case when R = S and on learning a diagonal metric 1 =M T M, whereM is also diagonal. We solve the following optimization problem. 
min M;v 1 ;;v R S X c=1 N X n=1 `(x n ;I yn;c ;w c ) (5.7) + 2 R X r=1 kv r k 2 2 + 2 kMIk 2 F ; (5.8) s:t: w c = R X r=1 s cr v r ; 8c2T =f1; ;Sga where`(x;y;w) = max(0; 1yw T x) 2 is the squared hinge loss. The indicatorI yn;c 2f1; 1g denotes whether or noty n =c. Again, we perform alternating optimization for minimizing the above objective function. At first, we fixM = I and optimizefv 1 ; ;v R g to obtain a reasonable initialization. Then we perform alternating optimization. To further prevent over-fitting, we alternately optimizeM and fv 1 ; ;v R g on different but overlapping subsets of training data. In particular, we split data into 5 folds and optimizefv 1 ; ;v R g on the first 4 folds andM on the last 4 folds. We report results in Sect. 7.3.1. 5.1.4 Classification With Synthesized Classifiers Given a data samplex fromU unseen classes and their corresponding attribute vectors (or coor- dinates in other semantic spaces), we classify it in the label spaceU by ^ y = arg max c2U w c T x (5.9) with the classifiers being synthesized according to Eq. (5.4). 46 Semantic space Visual feature space Cardinal a 1 a 2 a 3 Bobolink Scarlet Tanager PCA a 4 a 5 Semantic embedding space z 1 z 2 z 3 z 5 z 4 Figure 5.2: Illustration of our method EXEM for improving semantic representations as well as for zero- shot learning. Given semantic information and visual features of the seen classes, we learn a kernel-based regressor () such that the semantic representationa c of class c can predict well its visual exemplar (center)z c that characterizes the clustering structure. The learned () can be used to predict the visual feature vectors of unseen classes for nearest-neighbor (NN) classification, or to improve the semantic representations for existing ZSL approaches. 5.2 Predicting Visual Exemplars of Unseen Classes for Zero-Shot Learning The previous section describes, SYNC, an approach for synthesizing classifiers for the unseen classes in zero-shot learning. SYNC preserves graph structures in the semantic representation space. This suggests that SYNC assumes that such graph structures are reliable sources of class relationships. However, as we mentioned earlier in Sect. 1, semantic representations given as input to zero-shot learning algorithms are hard to get right. While they may capture high-level semantic relationships between classes, they are not well-informed about visual relationships. In this section, we propose a simple yet very effective ZSL algorithm that addresses the above- mentioned problem. The main idea is to exploit the intuition that the semantic representation can predict well the location of the cluster characterizing all visual feature vectors from the corresponding class. This idea is illustrated in Fig. 5.2. Recall the two challenges in Sect. 3.1. Our proposed method tackles the two challenges for ZSL simultaneously. First, unlike most of the existing ZSL methods, we acknowledge that se- mantic representations may not necessarily contain visually discriminating properties of objects classes. As a result, we demand that the predictive constraint be imposed explicitly. In our case, we assume that the cluster centers of visual feature vectors are our target semantic representa- tions. Second, we leverage structural relations on the clusters to further regularize the model, strengthening the usefulness of the clustering structure assumption for model selection. 
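Before turning to the details of EXEM, the classifier-synthesis and prediction steps of SYNC (eqs. (5.2), (5.4), and (5.9)) can be sketched compactly. The sketch assumes the scaled-Euclidean special case of the metric and takes the base classifiers V and the bandwidth sigma2 as given, i.e., already learned and tuned as described in Sect. 5.1.2; the function names are illustrative.

```python
import numpy as np

def graph_weights(A, B, sigma2):
    """s_cr of eq. (5.2): softmax over negative scaled squared Euclidean distances.

    A: (C, d) semantic representations of real classes.
    B: (R, d) semantic representations of the phantom classes.
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1) / sigma2
    d2 -= d2.min(axis=1, keepdims=True)       # stabilization; cancels in the ratio
    w = np.exp(-d2)
    return w / w.sum(axis=1, keepdims=True)

def synthesize_classifiers(S_cr, V):
    """w_c = sum_r s_cr v_r (eq. (5.4)); V: (R, D) base (phantom) classifiers."""
    return S_cr @ V

def predict_unseen(X, A_unseen, B, V, sigma2):
    """Eq. (5.9): score each test point against the synthesized unseen classifiers."""
    W_u = synthesize_classifiers(graph_weights(A_unseen, B, sigma2), V)
    return np.argmax(X @ W_u.T, axis=1)       # indices into the unseen label set
```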
47 5.2.1 Main Idea: The Auxiliary Task of Predicting Visual Exemplars The main idea is to transform the (original) semantic representations into semantic embeddings in another space to which visual information is injected. We call these target semantic embeddings visual exemplars. In this work, they are cluster centers characterized by the average of visual fea- ture vectors. More specifically, the main computation step of our approach is reduced to learning (from the seen classes) a predictive function from semantic representations to their corresponding centers (i.e., exemplars) of visual feature vectors. This function is used to predict the locations of visual exemplars of the unseen classes that are then used to construct nearest-neighbor style clas- sifiers, or to improve the semantic information demanded by existing ZSL approaches. Fig. 5.2 shows the conceptual diagram of our approach. Our two-stage approach for zero-shot learning consists of learning a function to to predict visual exemplars from semantic representations (Sect. 5.2.2) and then apply this function to per- form zero-shot learning given novel semantic representations (Sect. 5.2.3). 5.2.2 Learning A Function To Predict Visual Exemplars From Semantic Representations For each class c, we would like to find a transformation function () such that (a c ) z c , wherez c 2R d is the visual exemplar for the class. In this paper, we create the visual exemplar of a class by averaging the PCA projections of data belonging to that class. That is, we consider z c = 1 jIcj P n2Ic Mx n , whereI c =fi : y i = cg andM2R dD is the PCA projection matrix computed over training data of the seen classes. We note thatM is fixed for all data points (i.e., not class-specific) and is used in Eq. (5.11). Given training visual exemplars and semantic representations, we learn d support vector re- gressors (SVR) with the RBF kernel — each of them predicts each dimension of visual exemplars from their corresponding semantic representations. Specifically, for each dimensiond = 1;:::;d, we use the-SVR formulation [219]. min q;; 0 ; 1 2 q T q +( + 1 S S X c=1 ( c + 0 c )) s:t:q T rbf (a c )z c + c (5.10) z c q T rbf (a c ) + 0 c c 0; 0 c 0; where rbf is an implicit nonlinear mapping based on our kernel. We have dropped the subscript d for aesthetic reasons but readers are reminded that each regressor is trained independently with its own parameters. and2 (0; 1] (along with hyper-parameters of the kernel) are the hyper- parameters to be tuned. The resulting () = [q T 1 rbf (); ;q T d rbf ()] T , whereq d is from the d-th regressor. Note that the PCA step is introduced for both computational and statistical benefits. In ad- dition to reducing dimensionality for faster computation, PCA decorrelates the dimensions of visual features such that we can predict these dimensions independently rather than jointly. 48 5.2.3 Zero-Shot Learning Based On Predicted Visual Exemplars Now that we learn the transformation function (), how do we use it to perform zero-shot classification? We first apply () to all semantic representationsa u of the unseen classes. We consider two main approaches that depend on how we interpret these predicted exemplars (a u ). Predicted exemplars as training data An obvious approach is to use (a u ) as data directly. Since there is only one data point per class, a natural choice is to use a nearest neighbor classifier. 
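To make this learning step concrete, the sketch below builds the class exemplars and fits one ν-SVR per exemplar dimension with scikit-learn; the PCA dimensionality d and the ν-SVR hyper-parameters shown are placeholders to be set by cross-validation as described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import NuSVR

def class_exemplars(X_seen, y_seen, d=500):
    """Project seen-class features with PCA, then average them per class."""
    pca = PCA(n_components=d).fit(X_seen)
    Z = pca.transform(X_seen)
    classes = np.unique(y_seen)
    exemplars = np.stack([Z[y_seen == c].mean(axis=0) for c in classes])
    return pca, classes, exemplars            # exemplars: one row z_c per seen class

def fit_exemplar_regressors(A_seen, exemplars, nu=0.5, C=1.0, gamma="scale"):
    """One RBF-kernel nu-SVR per output dimension, mapping a_c to z_c."""
    return [NuSVR(kernel="rbf", nu=nu, C=C, gamma=gamma).fit(A_seen, exemplars[:, j])
            for j in range(exemplars.shape[1])]

def predict_exemplars(regressors, A_unseen):
    """psi(a_u): predicted visual exemplars of the unseen classes."""
    return np.stack([r.predict(A_unseen) for r in regressors], axis=1)
```

The predicted exemplars returned by `predict_exemplars` are exactly what the nearest-neighbor rule discussed next operates on.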
Then, the classifier outputs the label of the closest exemplar for each novel data pointx that we would like to classify: ^ y = arg min u dis NN (Mx; (a u )); (5.11) where we adopt the Euclidean distance or the standardized Euclidean distance as dis NN in the experiments. Predicted exemplars as improved semantic representations The other approach is to use (a u ) as the improved semantic representations (“improved” in the sense that they have knowl- edge about visual features) and plug them into any existing zero-shot learning framework. We provide two examples. In the method of convex combination of semantic embeddings (CONSE) [183], their original class semantic embeddings are replaced with the corresponding predicted exemplars, while the combining coefficients remain the same. In SYNC described in the previous section, the predicted exemplars are used to define the similarity values between the unseen classes and the bases, which in turn are used to compute the combination weights for constructing classifiers. In particular, their similarity measure is of the form expfdis(ac;br )g P R r=1 expfdis(ac;br )g , where dis is the (scaled) Euclidean distance andb r ’s are the semantic representations of the base classes. In this case, we simply need to change this similarity measure to expfdis( (ac); (br ))g P R r=1 expfdis( (ac); (br ))g . In the experiments, we empirically show that existing semantic representations for ZSL are far from the optimal. Our approach can thus be considered as a way to improve semantic repre- sentations for zero-shot learning. 5.3 Related Work We conclude the approach section by discussing the most relevant approaches to both SYNC and EXEM, putting both methods in the context of vast literature on zero-shot learning and related areas. SYNC COSTA [165] combines pre-trained classifiers of seen classes to construct new classi- fiers. To estimate the semantic embedding (e.g., word vector) of a test image, CONSE [183] uses the decision values of pre-trained classifiers of seen objects to weighted average the correspond- ing semantic embeddings. Neither of them has the notion of base classifiers as in SYNC, which we introduce for constructing the classifiers and nothing else. We thus expect using bases to be more effective in transferring knowledge between seen and unseen classes than overloading the pre-trained and fixed classifiers of the seen classes for dual duties. We note that ALE [4] and SJE [5] can be considered as special cases of SYNC. In ALE and SJE, each attribute corresponds to a base and each “real” classifier corresponding to an actual object is represented as a linear combination of those bases, where the weights are the real objects’ “descriptions” in the form of 49 attributes. This modeling choice is inflexible as the number of bases is fundamentally constrained by the number of attributes. Moreover, the model is strictly a subset of SYNC. 1 SSE and JLSE [279, 280] propose similar ideas of aligning the visual and semantic spaces but take different approaches from ours. Our convex combination of base classifiers for synthesizing real classifiers can also be moti- vated from multi-task learning with shared representations [9]. While labeled examples of each task are required in [9], our method has no access to data of the unseen classes. EXEM DEVISE [68] and CONSE [183] predict an image’s semantic embedding from its visual features and compare to unseen classes’ semantic embeddings. 
We perform an “inverse predic- tion”: given an unseen class’s semantic representation, we predict the visual feature exemplar of that class. One appealing property of EXEM is its scalability: we learn and predict at the exemplar (class) level so the runtime and memory footprint of our approach depend only on the number of seen classes rather the number of training data points. This is much more efficient than other ZSL algorithms that learn at the level of each individual training instance [4, 5, 59, 68, 109, 136, 154, 165, 183, 185, 209, 227, 272, 279, 280]. Several methods propose to learn visual exemplars by preserving structures obtained in the semantic space, where exemplars are used loosely here and do not necessarily mean class-specific feature averages. Examples of such methods include SYNC, BIDILEL [250] and UVDS [152]. However, EXEM predicts them with a regressor such that they may or may not strictly follow the structure in the semantic space, and thus they are more flexible and could even better reflect similarities between classes in the visual feature space. Similar in spirit to our work, [167] proposes using nearest class mean classifiers for ZSL. The Mahalanobis metric learning in this work could be thought of as learning a linear transfor- mation of semantic representations (their “zero-shot prior” means, which are in the visual feature space). Our approach learns a highly non-linear transformation. Moreover, our EXEM (1NNS) (cf. Sect. 7.1) learns a (simpler, i.e., diagonal) metric over the learned exemplars. Finally, the main focus of [167] is on incremental, not zero-shot, learning settings (see also [200, 205]). DEM [277] uses a deep feature space as the semantic embedding space for ZSL. Though similar to EXEM, this approach does not compute the average of visual features (exemplars) but train neural networks to predict all visual features from their semantic representations. Their model learning takes significantly longer time than ours. There has been a recent surge of interests in applying deep learning models to generate im- ages (see, e.g., [162, 201, 264]). Most of these methods are based on probabilistic models (in order to incorporate the statistics of natural images). Unlike them, our method is to purely de- terministically predict visual features. Note that, generating features directly is likely easier and more effective than generating realistic images first and then extracting visual features. Recently, researchers became interested in generating visual features of unseen classes using conditional generative models such as variants of generative adversarial networks (GANs) [259, 283] and variational autoencoders (V AEs) [133] for ZSL. 1 For interested readers, if we set the number of attributes as the number of phantom classes (eachbr is the one-hot representation of an attribute), and use the Gaussian kernel with an isotropically diagonal covariance matrix in Eq. (5.3) with properly set bandwidths (either very small or very large) for each attribute, we will recover the formulation in [4, 5] when the bandwidths tend to zero or infinity. 50 Chapter 6 Generalized Zero-Shot Learning 6.1 Introduction The setup for ZSL is that once models for unseen classes are learned, they are judged based on their ability to discriminate among unseen classes, assuming the absence of seen objects during the test phase. Originally proposed in the seminal work of Lampert et al. 
[136], this setting has almost always been adopted for evaluating ZSL methods [4, 5, 6, 68, 72, 74, 109, 115, 121, 146, 165, 183, 185, 207, 209, 272, 273, 279, 280]. But, does this problem setting truly reflect what recognition in the wild entails? While the ability to learn novel concepts is by all means a trait that any zero-shot learning systems should possess, it is merely one side of the coin. The other important — yet so far under-studied — trait is the ability to remember past experiences, i.e., the seen classes. Why is this trait desirable? Consider how data are distributed in the real world. The seen classes are often more common than the unseen ones; it is therefore unrealistic to assume that we will never encounter them during the test stage. For models generated by ZSL to be truly useful, they should not only accurately discriminate among either seen or unseen classes themselves but also accurately discriminate between the seen and unseen ones. Thus, to understand better how existing ZSL approaches will perform in the real world, we advocate evaluating them in the setting of generalized zero-shot learning (GZSL), where test data are from both seen and unseen classes and we need to classify them into the joint labeling space of both types of classes. Previous work in this direction is scarce. See related work for more details. Our contributions include an extensive empirical study of several existing ZSL approaches in this new setting. We show that a straightforward application of classifiers constructed by those approaches performs poorly. In particular, test data from unseen classes are almost always clas- sified as a class from the seen ones. We propose a surprisingly simple yet very effective method called calibrated stacking to address this problem. This method is mindful of the two conflicting forces: recognizing data from seen classes and recognizing data from unseen ones. We intro- duce a new performance metric called Area Under Seen-Unseen accuracy Curve (AUSUC) that can evaluate ZSL approaches on how well they can trade off between the two. We demonstrate the utility of this metric by evaluating several representative ZSL approaches under this metric on three benchmark datasets, including the full ImageNet Fall 2011 release dataset [52] that contains approximately 21,000 unseen categories. We complement our comparative studies in learning methods by further establishing an upper bound on the performance limit of ZSL. In particular, our idea is to use class-representative visual 51 features as the idealized semantic representations to construct ZSL classifiers. We show that there is a large gap between existing approaches and this ideal performance limit, suggesting that improving class semantic representations is vital to achieve GZSL. 6.2 Generalized Zero-Shot Learning In this section, we describe formally the setting of generalized zero-shot learning. We then present empirical evidence to illustrate the difficulty of this problem. Suppose we are given the training dataD =f(x n 2 R D ;y n )g N n=1 with the labelsy n from the label space of seen classesS =f1; 2; ;Sg. Denote byU =fS + 1; ;S +Ug the label space of unseen classes. We useT =S[U to represent the union of the two sets of classes. In the (conventional) zero-shot learning (ZSL) setting, the main goal is to classify test data into the unseen classes, assuming the absence of the seen classes in the test phase. 
In other words, each test data point is assumed to come from and will be assigned to one of the labels inU. Existing research on ZSL has been almost entirely focusing on this setting [4, 5, 6, 68, 72, 74, 109, 115, 121, 136, 146, 165, 183, 185, 207, 209, 272, 273, 279, 280]. However, in real applications, the assumption of encountering data only from the unseen classes is hardly realistic. The seen classes are often the most common objects we see in the real world. Thus, the objective in the conventional ZSL does not truly reflect how the classifiers will perform recognition in the wild. Motivated by this shortcoming of the conventional ZSL, we advocate studying the more gen- eral setting of generalized zero-shot learning (GZSL), where we no longer limit the possible class memberships of test data — each of them belongs to one of the classes inT . ZSL classifiers in the GZSL setting Without the loss of generality, we assume that for each class c2T , we have a discriminant scoring function f c (x), from which we would be able to derive the label forx. For instance, for an unseen class u, our method of synthesized classi- fiers (SYNC) defines f u (x) = w T u x, wherew u is the model parameter vector for the class u, constructed from its semantic representationa u (such as its attribute vector or the word vector associated with the name of the class). In CONSE [183], f u (x) = cos(s(x);a u ), where s(x) is the predicted embedding of the data samplex. In DAP/IAP [137], f u (x) is a probabilistic model of attribute vectors. We assume that similar discriminant functions for seen classes can be constructed in the same manner given their corresponding semantic representations. How to assess an algorithm for GZSL? We define and differentiate the following performance metrics:A U!U the accuracy of classifying test data fromU intoU,A S!S the accuracy of classi- fying test data fromS intoS, and finallyA S!T andA U!T the accuracies of classifying test data from either seen or unseen classes into the joint labeling space. Note thatA U!U is the standard performance metric used for conventional ZSL andA S!S is the standard metric for multi-class classification. Furthermore, note that we do not reportA T!T as simply averagingA S!T and A U!S to computeA T!T might be misleading when the two metrics are not balanced, as shown below. Generalized ZSL is hard To demonstrate the difficulty of GZSL, we report the empirical re- sults of using a simple but intuitive algorithm for GZSL. Given the discriminant functions, we adopt the following classification rule ^ y = arg max c2T f c (x) (6.1) 52 Table 6.1: Classification accuracies (%) on conventional ZSL (A U!U ), multi-class classification for seen classes (A S!S ), and GZSL (A S!T andA U!T ), on AwA and CUB. Significant drops are observed from A U!U toA U!T . AwA CUB Method A U!U A S!S A U!T A S!T A U!U A S!S A U!T A S!T DAP [137] 51.1 78.5 2.4 77.9 38.8 56.0 4.0 55.1 IAP [137] 56.3 77.3 1.7 76.8 36.5 69.6 1.0 69.4 CONSE [183] 63.7 76.9 9.5 75.9 35.8 70.5 1.8 69.9 SYNC O-VS-O 70.1 67.3 0.3 67.3 53.0 67.2 8.4 66.5 SYNC STRUCT 73.4 81.0 0.4 81.0 54.4 73.0 13.2 72.0 which we refer to as direct stacking. We use the rule on “stacking” classifiers from the following zero-shot learning approaches: DAP and IAP [137], CONSE [183], and SYNC. We test GZSL on two datasets AwA [137] and CUB [247] — details about those datasets can be found in Sect. 7. Table 6.1 reports experimental results based on the 4 performance metrics we have described previously. 
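As a point of reference for Table 6.1, direct stacking and the per-class accuracies can be computed as in the sketch below, where the score matrices hold the values f_c(x) for each test point and labels are encoded as column indices with the seen classes first; the helper names are illustrative, not the evaluation code used for the reported numbers.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Mean of per-class accuracies over `classes` (the metric used throughout)."""
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

def direct_stacking(scores_seen, scores_unseen):
    """Eq. (6.1): argmax over the joint label space T = S u U.

    scores_seen: (N, S) array of f_c(x) for seen classes; scores_unseen: (N, U).
    Returns labels in 0..S+U-1, with the seen classes in the first S columns.
    """
    return np.hstack([scores_seen, scores_unseen]).argmax(axis=1)
```

With these helpers, A_{S->T} and A_{U->T} in Table 6.1 correspond to `per_class_accuracy` evaluated on the test points drawn from S and from U, respectively, with predictions produced by `direct_stacking`.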
Our goal here is not to compare between methods. Instead, we examine the impact of relaxing the assumption of prior knowledge of whether data come from seen or unseen classes. We observe that, in this GZSL setting, the classification performance on unseen classes (A_{U→T}) drops significantly from the performance in conventional ZSL (A_{U→U}), while that of the seen ones (A_{S→T}) remains roughly the same as in the multi-class task (A_{S→S}). That is, nearly all test data from unseen classes are misclassified into the seen classes. This unusual degradation in performance highlights the challenge of GZSL: since we only see labeled data from seen classes during training, the scoring functions of the seen classes tend to dominate those of the unseen classes, leading to predictions biased toward the label space of S — the classifiers for the seen classes are never trained on "negative" examples from the unseen classes.

The previous example shows that classifiers for unseen classes constructed by conventional ZSL methods should not be naively combined with models for seen classes to expand the labeling space required by GZSL.

In what follows, we propose a simple variant of the naive direct stacking approach to curb this problem. We also develop a metric that measures the performance of GZSL by acknowledging that there is an inherent trade-off between recognizing seen classes and recognizing unseen classes. This metric, referred to as the Area Under Seen-Unseen accuracy Curve (AUSUC), balances the two conflicting forces. We conclude this section by describing two related approaches; despite their sophistication, they do not perform well empirically (see Sect. 7.3.3).

6.3 Approach

6.3.1 Calibrated Stacking

Our approach stems from the observation that the scores of the discriminant functions for the seen classes are often greater than the scores for the unseen classes. Thus, intuitively, we would like to reduce the scores for the seen classes. This leads to the following classification rule:

    \hat{y} = \arg\max_{c \in \mathcal{T}} \; f_c(x) - \gamma \, \mathbb{I}[c \in \mathcal{S}],        (6.2)

where the indicator \mathbb{I}[\cdot] \in \{0, 1\} indicates whether or not c is a seen class and \gamma is a calibration factor. We term this adjustable rule calibrated stacking. Another way to interpret \gamma is to regard it as the prior likelihood of a data point coming from the unseen classes. When \gamma = 0, the calibrated stacking rule reverts to the direct stacking rule described previously.

[Figure 6.1: The Seen-Unseen accuracy Curve (SUC), plotting A_{U→T} against A_{S→T}, obtained by varying \gamma in the calibrated stacking classification rule eq. (6.2). The AUSUC summarizes the curve by computing the area under it. We use the method SYNC O-VS-O on the AwA dataset (AUSUC = 0.398) and tune hyper-parameters as in Table 6.1. The red cross denotes the accuracies obtained by direct stacking.]

It is also instructive to consider the two extreme cases of \gamma. When \gamma → +∞, the classification rule ignores all seen classes and classifies every data point into one of the unseen classes; when no test data come from the seen classes, this rule essentially implements conventional ZSL. On the other hand, when \gamma → −∞, the classification rule considers only the label space of the seen classes, as in standard multi-way classification.
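Calibrated stacking, the sweep over \gamma that traces out the Seen-Unseen accuracy Curve of Fig. 6.1, and the resulting AUSUC can be sketched as follows (reusing the `per_class_accuracy` helper from the previous sketch; again a minimal illustration rather than the exact evaluation code).

```python
import numpy as np

def calibrated_stacking(scores_seen, scores_unseen, gamma):
    """Eq. (6.2): subtract gamma from every seen-class score before the argmax."""
    return np.hstack([scores_seen - gamma, scores_unseen]).argmax(axis=1)

def seen_unseen_curve(scores_seen, scores_unseen, y_true, seen, unseen, gammas):
    """(A_{U->T}, A_{S->T}) pairs obtained by varying the calibration factor."""
    from_seen = np.isin(y_true, seen)
    curve = []
    for g in gammas:
        pred = calibrated_stacking(scores_seen, scores_unseen, g)
        a_u = per_class_accuracy(y_true[~from_seen], pred[~from_seen], unseen)
        a_s = per_class_accuracy(y_true[from_seen], pred[from_seen], seen)
        curve.append((a_u, a_s))
    return np.array(curve)

def ausuc(curve):
    """Area under the Seen-Unseen accuracy Curve (trapezoidal rule)."""
    order = np.argsort(curve[:, 0])
    x, y = curve[order, 0], curve[order, 1]
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))
```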
The calibrated stacking rule thus represents a middle ground between aggressively classifying every data point into seen classes and conservatively classifying every data point into unseen classes. Adjusting this hyperparameter thus gives a trade-off, which we exploit to define a new performance metric. 6.3.2 Area Under Seen-Unseen Accuracy Curve (AUSUC) Varying the calibration factor , we can compute a series of classification accuracies (A U!T , A S!T ). Fig. 6.1 plots those points for the dataset AwA using the classifiers generated by SYNC based on class-wise cross validation. We call such a plot the Seen-Unseen accuracy Curve (SUC). On the curve, = 0 corresponds to direct stacking, denoted by a cross. The curve is similar to many familiar curves for representing conflicting goals, such as the Precision-Recall (PR) curve and the Receiving Operator Characteristic (ROC) curve, with two ends for the extreme cases ( !1 and ! +1). A convenient way to summarize the plot with one number is to use the Area Under SUC (AUSUC) 1 . The higher the area is, the better an algorithm is able to balanceA U!T andA S!T . An immediate and important use of the metric AUSUC is for model selection. Many ZSL learning methods require tuning hyperparameters — previous work tune them based on the ac- curacy A U!U . The selected model, however, does not necessarily balance optimally between 1 If a single is desired, the “F-score” that balancesAU!T andAS!T can be used. 54 A U!T andA S!T . Instead, we advocate using AUSUC for model selection and hyperparamter tuning. Models with higher values of AUSUC are likely to perform in balance for the task of GZSL. 6.3.3 Alternative Approaches Socher et al. [227] propose a two-stage zero-shot learning approach that first predicts whether an image is of seen or unseen classes and then accordingly applies the corresponding classifiers. The first stage is based on the idea of novelty detection and assigns a high novelty score if it is unlikely for the data point to come from seen classes. They experiment with two novelty detection strategies: Gaussian and LoOP models [125]. We briefly describe and contrast them to our approach below. 6.3.3.1 Novelty Detection Algorithms The main idea is to assign a novelty scoreN(x) to each samplex. With this novelty score, the final prediction rule becomes ^ y = arg max c2S f c (x); ifN(x) : arg max c2U f c (x); ifN(x)> : (6.3) where is the novelty threshold. The scores above this threshold indicate belonging to unseen classes. Details on how to estimateN(x) are as follows: Gaussian Training examples of seen classes are first mapped into the semantic space and mod- eled by a Gaussian mixture model — each class is parameterized by a mean vector and an isomet- ric covariance matrix. The mean is set to be the class’ semantic representation and the covariance matrix is set to be the covariance of all mapped training examples of that class. The novelty score of a test data point is then its negative log probability value under this mixture model. LoOP The idea of the Local Outlier Probabilities (LoOP) model [125] is to compute the dis- tances ofx to its nearest seen classes. Such distances are then converted to an outlier probability, interpreted as the likelihood ofx being from unseen classes. In particular, LetX S be the set of all the mapped training examples from seen classes. For a test samplex (also mapped into the se- mantic space), a context setC(x)X S ofk nearest neighbors is first defined. 
The probabilistic set distancepdist fromx to all the points inC(x) is then computed as follows pdist (x;C(x)) = s P x 0 2C(x) d(x;x 0 ) 2 jC(x)j ; (6.4) whered(x;x 0 ) is chosen to be the Euclidean distance function. Such a distance is then used to define the local outlier factor lof (x) = pdist (x;C(x)) E x 0 2C(x) [pdist (x 0 ;C(x 0 ))] 1: (6.5) 55 Finally, the Local Outlier Probability (LoOP), which can be viewed as the novelty score, is com- puted as LoOP (x) = max 0; erf lof (x) Z (X S ) ; (6.6) where erf is the Gauss error function andZ (X S ) = p E x 0 2X S [(lof (x 0 )) 2 ] is the normaliza- tion constant. 6.3.3.2 Relation to Calibrated Stacking If we define a new form of novelty scoreN(x) = max u2U f u (x) max s2S f s (x) in eq. (6.3), we recover the prediction rule in eq. (6.2). However, this relation holds only if we are interested in predicting one label ^ y. When we are interested in predicting a set of labels (for example, hoping that the correct labels are in the topK predicted labels, (i.e., the Flat hit@K metric, cf. Sect. 7), the two prediction rules will give different results. 6.4 Related Work There has been very little work on generalized zero-shot learning. [68, 166, 183, 237] allow the label space of their classifiers to include seen classes but they only test on the data from the unseen classes. [227] proposes a two-stage approach that first determines whether a test data point is from a seen or unseen class, and then apply the corresponding classifiers. However, their experiments are limited to only 2 or 6 unseen classes. We describe and compare to their methods in Sect. 7.3.3. In the domain of action recognition, [77] investigates the generalized setting with only up to 3 seen classes. [55] and [142] focus on training a zero-shot binary classifier for each unseen class (against seen ones) — it is not clear how to distinguish multiple unseen classes from the seen ones. Finally, open set recognition [107, 214, 215] considers testing on both types of classes, but treating the unseen ones as a single outlier class. 56 Chapter 7 Zero-Shot Learning Experiments We evaluate our methods and compare to existing state-of-the-art models on several benchmark datasets. While there is a large degree of variations in the current empirical studies in terms of datasets, evaluation protocols, experimental settings, and implementation details, we strive to provide a comprehensive comparison to as many methods as possible, not only citing the pub- lished results but also reimplementing some of those methods to exploit several crucial insights we have discovered in studying our methods. 7.1 General Setup 7.1.1 Datasets We use four benchmark datasets in our experiments. Table 7.1 summarizes their key characteris- tics. • The Animals with Attributes (AwA) dataset [137] consists of 30,475 images of 50 animal classes. Along with the dataset, a standard data split is released for zero-shot learning: 40 seen classes (for training) and 10 unseen classes. • The CUB-200-2011 Birds (CUB) [247] has 200 bird classes and 11,788 images. We ran- domly split the 200 classes into 4 disjoint sets (each with 50 classes) and treat each of them as the unseen classes in turn. • The SUN Attribute (SUN) dataset [192] contains 14,340 images of 717 scene categories (20 images from each category). The dataset is drawn from the the SUN database [261]. Following [137], we randomly split the 717 classes into 10 disjoint sets (each with 71 or 72 classes) in a similar manner to the class splitting on CUB. 
We note that some previous published results [109, 209, 279, 280] are based on a simpler setting with 707 seen and 10 unseen classes. • The ImageNet dataset [52] consists of two disjoint subsets. (i) the ILSVRC 2012 1K dataset [212] contains 1,281,167 training and 50,000 validation images from 1,000 cate- gories and is treated as the seen-class data. (ii) Images of unseen classes come from the rest of the ImageNet Fall 2011 release dataset [52] that do not overlap with any of the 1,000 categories. We will call this release the ImageNet 2011 21K dataset (as in [68, 183]). 57 Table 7.1: Key characteristics of studied datasets Dataset # of seen # of unseen Total # name classes classes of images AwA y 40 10 30,475 CUB z 150 50 11,788 SUN z 645/646 72/71 14,340 ImageNet x 1,000 20,842 14,197,122 y : Following the prescribed split in [137]. z : 4 (or 10, respectively) random splits, reporting average. x : Seen and unseen classes from ImageNet ILSVRC 2012 1K [212] and Fall 2011 release [52, 68, 183]. Overall, this dataset contains 14,197,122 images from 21,841 classes, and we conduct our experiment on 20,842 unseen classes 1 . 7.1.2 Semantic Spaces For the classes in AwA, we use 85-dimensional binary or continuous attributes [137], as well as the 100 and 1,000 dimensional word vectors [168], derived from their class names and extracted by Fu et al. [71, 72]. For CUB and SUN, we use 312 and 102 dimensional continuous-valued attributes, respectively. We also threshold them at the global means to obtain binary-valued at- tributes, as suggested in [137]. Neither datasets have word vectors for their class names. For the SUN dataset, each image is annotated with 102 continuous-valued attributes. For each class, we average attribute vectors over all images belonging to that class to obtain a class-level attribute vector. For ImageNet, we train a skip-gram language model [168, 169] on the latest Wikipedia dump corpus 2 (with more than 3 billion words) to extract a 500-dimensional word vector for each class. We ignore classes without word vectors in the experiments, resulting in 20,345 (out of 20,842) unseen classes. More details can be found in Sect. A.7. For both the continuous attribute vectors and the word vector embeddings of the class names, we normalize them to have unit` 2 norms unless stated otherwise. 7.1.3 Visual Features Due to variations in features being used in literature, it is impractical to try all possible combina- tions of features and methods. Thus, we make a major distinction in using shallow features (such as color histograms, SIFT, PHOG, Fisher vectors) [4, 5, 109, 137, 208, 251] and deep learning features in several recent studies of zero-shot learning. Whenever possible, we use (shallow) features provided by those datasets or prior studies. For comparative studies, we also extract the following deep features: AlexNet [128] for AwA and CUB and GoogLeNet [235] for all datasets (all extracted with the Caffe package [112]). For AlexNet, we use the 4,096-dimensional activa- tions of the penultimate layer (fc7) as features. For GoogLeNet, we take the 1,024-dimensional 1 There is one class in the ILSVRC 2012 1K dataset that does not appear in the ImageNet 2011 21K dataset. Thus, we have a total of 20,842 unseen classes to evaluate. 2 http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml. bz2 on September 1, 2015 58 activations of the pooling units, as in [5]. We denote features that are not extracted by deep learning as shallow features. 
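Before turning to the individual feature types, the semantic-space preprocessing of Sect. 7.1.2 (averaging SUN's per-image attribute annotations into class-level vectors and normalizing class representations to unit L2 norm) is simple enough to sketch. The snippet below is a minimal NumPy illustration of that preprocessing, not the exact code used in our experiments; the array names and the random stand-in data are hypothetical.

```python
import numpy as np

def class_level_attributes(image_attrs, image_labels, num_classes):
    """Average per-image attribute vectors into one vector per class
    (how SUN's class-level attributes are obtained)."""
    class_attrs = np.zeros((num_classes, image_attrs.shape[1]))
    for c in range(num_classes):
        class_attrs[c] = image_attrs[image_labels == c].mean(axis=0)
    return class_attrs

def l2_normalize(vectors, eps=1e-12):
    """Normalize each class representation to unit L2 norm (cf. Sect. 7.1.2)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

# Random stand-ins for the real SUN annotations (14,340 images, 102 attributes,
# 717 scene categories); only the shapes are meant to be illustrative.
rng = np.random.RandomState(0)
image_attrs = rng.rand(14340, 102)
image_labels = rng.randint(0, 717, size=14340)
a_classes = l2_normalize(class_level_attributes(image_attrs, image_labels, 717))
print(a_classes.shape)  # (717, 102)
```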
Shallow features On AwA, many existing approaches take traditional features such as color histograms, SIFT, and PHOG that come with the dataset [137, 208, 251], while others use the Fisher vectors [4, 5]. The SUN dataset also comes with several traditional shallow features, which are used in [109, 137, 209]. In our experiments, we use the shallow features provided by [137] , [109], and [192] for AwA, CUB, and SUN, respectively, unless stated otherwise. Deep features Given the recent impressive success of deep Convolutional Neural Networks (CNNs) [128] on image classification, we conduct experiments with deep features on all datasets. We use the Caffe package [112] to extract AlexNet [128] and GoogLeNet [235] features for images from AwA and CUB. Observing that GoogLeNet give superior results over AlexNet on AwA and CUB, we focus on GoogLeNet features on large datasets: SUN and ImageNet. These networks are pre-trained on the ILSVRC 2012 1K dataset [52, 212] for all the four datasets. For AlexNet, we use the 4,096-dimensional activations of the penultimate layer (fc7) as features, and for GoogLeNet we extract features by the 1,024-dimensional activations of the pooling units following the suggestion by [5]. For CUB, we crop all images with the provided bounding boxes. For ImageNet, we center- crop all images and do not perform any data augmentation or other preprocessing. 7.1.4 Evaluation Protocols Conventional zero-shot learning Denote byA O!Y the accuracy of classifying test data whose labels come fromO into the label spaceY. Note that the accuracy denotes the “per-class” multi- way classification accuracy (defined below). Following previous work, we use the per-class multi-way classification accuracy (averaged over all classes, and averaged over all test images in each class) for AwA, CUB, and SUN. A U!U :=A pc U!U = 1 jUj X c2U # correct predictions in c # test images in c (7.1) Evaluating zero-shot learning on the large-scale ImageNet requires substantially different components from evaluating on the other three datasets. First, two evaluation metrics are used, as in [68]: Flat hit@K (F@K) and Hierarchical precision@K (HP@K). F@K is defined as the percentage of test images for which the model returns the true label in its top K predictions. Note that, F@1 is the per-sample multi-way classification accuracy (averaged over all test images): A ps U!U = P c2U # correct predictions in c P c2U # test images in c (7.2) HP@K takes into account the hierarchical organization of object categories. For each true label, we generate a ground-truth list of K closest categories in the hierarchy and compute the de- gree of overlapping (i.e., precision) between the ground-truth and the model’s top K predictions. Secondly, following the procedure in [68, 183], we evaluate on three scenarios of increasing difficulty: 59 • 2-hop contains 1,509 unseen classes that are within two tree hops of the seen 1K classes according to the ImageNet label hierarchy 3 . • 3-hop contains 7,678 unseen classes that are within three tree hops of seen classes. • All contains all 20,345 unseen classes in the ImageNet 2011 21K dataset that are not in the ILSVRC 2012 1K dataset. The numbers of unseen classes are slightly different from what are used in [68, 183] due to the missing semantic representations (i.e., word vectors) for certain class names. When computing Hierarchical precision@K (HP@K), we use the algorithm in the Appendix of [68] to compute a set of at leastK classes that are considered to be correct. 
This set is called hCorrectSet and it is computed for eachK and classc. See Algorithm 1 for more details. The main idea is to expand the radius around the true classc until the set has at leastK classes. Algorithm 1 Algorithm for computinghCorrectSet forH@K [68] 1: Input:K, classc, ImageNet hierarchy 2: hCorrectSet ; 3: R 0 4: while NumberElements(hCorrectSet)<K do 5: radiusSet all nodes in the hierarchy which areR hops fromc 6: validRadiusSet ValidLabelNodes(radiusSet) 7: hCorrectSet hCorrectSet[validRadiusSet 8: R R + 1 9: end while 10: returnhCorrectSet Note thatvalidRadiusSet depends on which classes are in the label space to be predicted (i.e., depending on whether we consider 2-hop, 3-hop, or All. We obtain the label sets for 2-hop and 3-hop from the authors of [68, 183]. We implement Algorithm 1 to derive hCorrectSet ourselves. Generalized zero-shot learning There are no previously established benchmark tasks for GZSL. We thus define a set of tasks that reflects more closely how data are distributed in real-world ap- plications. We construct the GZSL tasks by composing test data as a combination of images from both seen and unseen classes. We follow existing splits of the datasets for the conventional ZSL to separate seen and unseen classes. Moreover, for the datasets AwA and CUB, we hold out 20% of the data points from the seen classes (previously, all of them are used for training in the conventional zero-shot setting) and merge them with the data from the unseen classes to form the test set; for ImageNet, we combine its validation set (having the same classes as its training set) and the 21K classes that are not in the ILSVRC 2012 1K dataset. 3 http://www.image-net.org/api/xml/structure_released.xml 60 We will primarily report the performance of ZSL approaches under the metric Area Under Seen-Unseen accuracy Curve (AUSUC) developed in Sect. 6.3.1. We explain how its two accu- racy componentsA S!T andA U!T are computed below. For AwA and CUB, seen and unseen accuracies correspond to (normalized-by-class-size) multi-way classification accuracy, where the seen accuracy is computed on the 20% images from the seen classes and the unseen accuracy is computed on images from unseen classes. For Ima- geNet, seen and unseen accuracies correspond to Flat hit@K (F@K), defined as the percentage of test images for which the model returns the true label in its top K predictions. Please refer to Sect. 6.2 and Sect. 6.3 on the description of procedures to measure ZSL algo- rithms’ performance under the GZSL setting. 7.1.5 Summary of Variants of Our Methods We consider the following variants of SYNC that are different in the type of loss used in the objective function (cf. Sect. 5.1.2). • SYNC O-VS-O : one-versus-other with the squared hinge loss • SYNC CS : Crammer-Singer multi-class SVM loss [47] with (c;y n ) = 1 ifc6= y n and 0 otherwise. • SYNC STRUCT : Crammer-Singer multi-class SVM loss [47] with (c;y n ) =ka c a yn k 2 2 Unless stated otherwise, we adopt the version of SYNC that sets the number of base classifiers to be the number of seen classes S, and setsb r =a c forr = c (i.e., without learning semantic representations). The results with learned representations are in Sect. 7.3.1. Furthermore, we consider the following variants of EXEM (cf. Sect 5.2.3). • EXEM (ZSL method): A ZSL method with predicted exemplars as semantic representations, where ZSL method = CONSE [183], ESZSL [209], or the variants of SYNC. 
• EXEM (1NN): 1-nearest neighbor classifier with the Euclidean distance to the exemplars. • EXEM (1NNS): 1-nearest neighbor classifier with the standardized Euclidean distance to the exemplars, where the standard deviation is obtained by averaging the intra-class stan- dard deviations of all seen classes. EXEM (ZSL method) regards the predicted exemplars as the ideal semantic representations. On the other hand, EXEM (1NN) treats predicted exemplars as data prototypes. The standardized Euclidean distance in EXEM (1NNS) is introduced as a way to scale the variance of different dimensions of visual features. In other words, it helps reduce the effect of collapsing data that is caused by our usage of the average of each class’ data as cluster centers. Baselines We consider strong zero-shot learning baselines. These baselines are diverse in their approaches to zero-shot learning. In most experiments, we use code provided by the correspond- ing authors (if available). We also re-implement both traditional methods such as DAP and IAP as well as more recent ones such as ESZSL. On the large-scale ImageNet, we focus on strongest baseline on the benchmark: CONSE (whose implementation details can be found in in Sect. A.7). Please refer to specific experiments for further details. For further discussion of these methods, see Sect. 3.1.2 as well as [73, 258]. 61 Table 7.2: Comparison between existing ZSL approaches in multi-way classification accuracies (in %) on four benchmark datasets. For each dataset, we mark the best in red and the second best in blue. Italic numbers denote per-sample accuracy instead of per-class accuracy. On ImageNet, we report results for both types of semantic representations: Word vectors (wv) and MDS embeddings derived from WordNet (hie). All the results are based on GoogLeNet features [234]. . Approach AwA CUB SUN ImageNet wv hie CONSE [183] 63.3 36.2 51.9 1.3 - BIDILEL [250] 72.4 49.7 x - - - LATEM z [257] 72.1 48.0 64.5 - - CCA [154] - - - - 1.8 SYNC o-vs-o 69.7 53.4 62.8 1.4 2.0 SYNC struct 72.9 54.5 62.7 1.5 - EXEM (CONSE) 70.5 46.2 60.0 - - EXEM (LATEM) z 72.9 56.2 67.4 - - EXEM (SYNC O-VS-O ) 73.8 56.2 66.5 1.6 2.0 EXEM (SYNC STRUCT ) 77.2 59.8 66.1 - - EXEM (1NN) 76.2 56.3 69.6 1.7 2.0 EXEM (1NNS) 76.5 58.5 67.3 1.8 2.0 x : on a particular split of seen/unseen classes. z : based on the code of [257], averaged over 5 different initializations. 7.2 Performances of Our Proposed Algorithms 7.2.1 Main Results Table 7.2 summarizes our results in the form of multi-way classification accuracies on all datasets. Generally, EXEM outperforms SYNC. However, we note that both methods are complementary. We also find that we significantly outperform recent state-of-the-art baselines in general. In Sect. A, we provide additional quantitative and qualitative results. In particular, additional exper- imental results and analysis are provided for individual ZSL methods in Sect. A.3 and Sect. A.4. We also extend our analysis to other settings such as few-shot-related settings in Sect. A.5 and previously described GZSL in Sect. A.6. We note that, on AwA, several recent methods obtain higher accuracies due to using a more optimistic evaluation metric (per-sample accuracy) and new types of deep features [277, 279]. This has been shown to be unsuccessfully replicated (cf. Table 2 in [256]). See Sect. A.1 for results of these and other less competitive baselines. 
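To make EXEM (1NN) and EXEM (1NNS) concrete, the sketch below classifies a test feature by its nearest predicted class exemplar, using either the Euclidean distance or the standardized Euclidean distance whose per-dimension scale is the average intra-class standard deviation of the seen classes; it also includes the per-class accuracy of Eq. (7.1). This is a simplified NumPy illustration under those assumptions, not the released implementation.

```python
import numpy as np

def average_intra_class_std(train_feats, train_labels):
    """Average per-dimension standard deviation across seen classes
    (the scaling used by EXEM (1NNS))."""
    stds = [train_feats[train_labels == c].std(axis=0)
            for c in np.unique(train_labels)]
    return np.mean(stds, axis=0)

def exem_nn_predict(test_feats, exemplars, exemplar_labels, std=None):
    """Assign each test point the label of its nearest predicted exemplar.
    If `std` is given, rescale every dimension by 1/std (standardized Euclidean)."""
    if std is not None:
        scale = np.maximum(std, 1e-12)
        test_feats, exemplars = test_feats / scale, exemplars / scale
    # Pairwise squared Euclidean distances: shape (num_test, num_unseen_classes).
    d2 = ((test_feats[:, None, :] - exemplars[None, :, :]) ** 2).sum(axis=-1)
    return np.asarray(exemplar_labels)[d2.argmin(axis=1)]

def per_class_accuracy(y_true, y_pred):
    """Per-class multi-way classification accuracy, as in Eq. (7.1)."""
    classes = np.unique(y_true)
    return np.mean([(y_pred[y_true == c] == c).mean() for c in classes])
```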
Our alternative approach of treating predicted visual exemplars as the ideal semantic repre- sentations significantly outperforms taking semantic representations as given. EXEM (SYNC), EXEM (CONSE), EXEM (LATEM) outperform their corresponding base ZSL methods relatively by 5.9-6.8%, 11.4-27.6%, and 1.1-17.1%, respectively. This again suggests improved quality of semantic representations (on the predicted exemplar space). Furthermore, we find that there is no clear winner between using predicted exemplars as ideal semantic representations or as data prototypes. The former seems to perform better on datasets with fewer seen classes. Nonetheless, we remind that using 1-nearest-neighbor classifiers clearly 62 Table 7.3: Comparison between existing ZSL approaches on ImageNet using word vectors of the class names as semantic representations. For both metrics (in %), the higher the better. The best is in red. The numbers of unseen classes are listed in parentheses. Test data Approach Flat Hit@K Hierarchical precision@K K= 1 2 5 10 20 2 5 10 20 CONSE [183] 8.3 12.9 21.8 30.9 41.7 21.5 23.8 27.5 31.3 SYNC o-vs-o 10.5 16.7 28.6 40.1 52.0 25.1 27.7 30.3 32.1 2-hop (1,509) EXEM (SYNC O-VS-O ) 11.8 18.9 31.8 43.2 54.8 25.6 28.1 30.2 31.6 EXEM (1NN) 11.7 18.3 30.9 42.7 54.8 25.9 28.5 31.2 33.3 EXEM (1NNS) 12.5 19.5 32.3 43.7 55.2 26.9 29.1 31.1 32.0 CONSE [183] 2.6 4.1 7.3 11.1 16.4 6.7 21.4 23.8 26.3 SYNC o-vs-o 2.9 4.9 9.2 14.2 20.9 7.4 23.7 26.4 28.6 3-hop (7,678) EXEM (SYNC O-VS-O ) 3.4 5.6 10.3 15.7 22.8 7.5 24.7 27.3 29.5 EXEM (1NN) 3.4 5.7 10.3 15.6 22.7 8.1 25.3 27.8 30.1 EXEM (1NNS) 3.6 5.9 10.7 16.1 23.1 8.2 25.2 27.7 29.9 CONSE [183] 1.3 2.1 3.8 5.8 8.7 3.2 9.2 10.7 12.0 SYNC o-vs-o 1.4 2.4 4.5 7.1 10.9 3.1 9.0 10.9 12.5 All (20,345) EXEM (SYNC O-VS-O ) 1.6 2.7 5.0 7.8 11.8 3.2 9.3 11.0 12.5 EXEM (1NN) 1.7 2.8 5.2 8.1 12.1 3.7 10.4 12.1 13.5 EXEM (1NNS) 1.8 2.9 5.3 8.2 12.2 3.6 10.2 11.8 13.2 Table 7.4: Comparison between existing ZSL approaches on ImageNet (with 20,842 unseen classes) using MDS embeddings derived from WordNet [154] as semantic representations. The higher, the better (in %). The best is in red. Test data Approach Flat Hit@K K= 1 2 5 10 20 CCA [154] 1.8 3.0 5.2 7.3 9.7 All SYNC o-vs-o 2.0 3.4 6.0 8.8 12.5 (20,842) EXEM (SYNC O-VS-O ) 2.0 3.3 6.1 9.0 12.9 EXEM (1NN) 2.0 3.4 6.3 9.2 13.1 EXEM (1NNS) 2.0 3.4 6.2 9.2 13.2 scales much better than zero-shot learning methods; EXEM (1NN) and EXEM (1NNS) are more efficient than EXEM (SYNC), EXEM (CONSE), and EXEM (LATEM). Finally, we find that in general using the standardized Euclidean distance instead of the Eu- clidean distance for nearest neighbor classifiers helps improve the accuracy, especially on CUB, suggesting there is a certain effect of collapsing actual data during training. The only exception is on SUN. We suspect that the standard deviation values computed on the seen classes on this dataset may not be robust enough as each class has only 20 images. 7.2.2 Large-Scale Zero-Shot Classification Results We then provide expanded results for ImageNet, following evaluation protocols in the literature. More results can be found in Sect. A.2. Again, EXEM generally outperforms SYNC, but both our methods outperform CONSE. Further analysis also shows that SYNC is capable of perform much better when we have more labeled data, for instance, in zero-to-few-shot settings in Sect. A.5 In Table 7.3 and 7.4, we provide results based on the exemplars predicted by word vec- tors and MDS features derived from WordNet, respectively. 
We consider SYNC o-v-o , rather than SYNC struct , as the former shows better performance on ImageNet. Regardless of the types of 63 Table 7.5: Generalized ZSL results in Area Under Seen-Unseen accuracy Curve (AUSUC) on AwA, CUB, and SUN. For each dataset, we mark the best in red and the second best in blue. All approaches use GoogLeNet as the visual features and calibrated stacking to combine the scores for seen and unseen classes. Approach AwA CUB SUN DAP [137] 0.366 0.194 0.096 IAP [137] 0.394 0.199 0.145 CONSE [183] 0.428 0.212 0.200 ESZSL [209] 0.449 0.243 0.026 SYNC O-VS-O 0.568 0.336 0.242 SYNC STRUCT 0.583 0.356 0.260 EXEM (SYNC O-VS-O ) 0.553 0.365 0.265 EXEM (SYNC STRUCT ) 0.587 0.397 0.288 EXEM (1NN) 0.570 0.318 0.284 EXEM (1NNS) 0.584 0.373 0.287 metrics used, our approach outperforms the baselines significantly when using word vectors as semantic representations. For example, on 2-hop, we are able to improve the F@1 accuracy by 2% over the state-of-the-art. However, we note that this improvement is not as significant when using MDS-WordNet features as semantic representations. We observe that the 1-nearest-neighbor classifiers perform better than using predicted exem- plars as more powerful semantic representations. We suspect that, when the number of classes is very high, zero-shot learning methods (CONSE or SYNC) do not fully take advantage of the meaning provided by each dimension of the exemplars. 7.2.3 Generalized Zero-Shot Learning Results Conventional zero-shot learning setting unrealistically assumes that test data always come from the unseen classes. In GZSL, instances from both seen and unseen classes are present at test time, and the label space is the union of both types of classes. We refer the reader for more discussions regarding GZSL and related settings in Chapter 6 and in [260]. We evaluate our methods and baselines using the Area Under Seen-Unseen accuracy Curve (AUSUC) and report the results in Table 7.5. Following the same evaluation procedure as before, our approach again outperforms the baselines on all datasets. Recently, Xian et al. [256] proposes to unify the evaluation protocol in terms of image fea- tures, class semantic representations, data splits, and evaluation criteria for conventional and generalized zero-shot learning. In their protocol, GZSL is evaluated by the harmonic mean of seen and unseen classes’ accuracies. Technically, AUSUC provides a more complete picture of zero-shot learning method’s performance, but it is less simpler than the harmonic mean. Further investigation under this newly proposed evaluation protocol (both in conventional and generalized zero-shot learning) is left for future work. 64 Table 7.6: Detailed analysis of various methods: the effect of feature and attribute types on multi-way classification accuracies (in %). Within each column, the best is in red and the 2nd best is in blue. We cite both previously published results (numbers in bold italics) and results from our implementations of those competing methods (numbers in normal font) to enhance comparability and to ease analysis (see texts for details). We use the shallow features provided by [137], [109], [192] for AwA, CUB, SUN, respectively. 
Methods Attribute Shallow features Deep features type AwA CUB SUN AwA CUB SUN DAP [137] binary 41.4 28.3 22.2 60.5 (50.0) 39.1 (34.8) 44.5 IAP [137] binary 42.2 24.4 18.0 57.2 (53.2) 36.7 (32.7) 40.8 BN [251] binary 43.4 - - - - - ALE [4] z binary 37.4 18.0 y - - - - ALE binary 34.8 27.8 - 53.8 (48.8) 40.8 (35.3) 53.8 SJE [5] continuous 42.3 z 19.0 yz - 66.7(61.9) 50.1(40.3) y - SJE continuous 36.2 34.6 - 66.3 (63.3) 46.5 (42.8) 56.1 ESZSL [209] x continuous 49.3 37.0 - 59.6 (53.2) 44.0 (37.2) 8.7 ESZSL continuous 44.1 38.3 - 64.5 (59.4) 34.5 (28.0) 18.7 CONSE [183] continuous 36.5 23.7 - 63.3 (56.5) 36.2 (32.6) 51.9 COSTA [165] ] continuous 38.9 28.3 - 61.8 (55.2) 40.8 (36.9) 47.9 SYNC O-VS-O continuous 42.6 35.0 - 69.7 (64.0) 53.4 (46.6) 62.8 SYNC CS continuous 42.1 34.7 - 68.4 (64.8) 51.6 (45.7) 52.9 SYNC STRUCT continuous 41.5 36.4 - 72.9 (62.8) 54.5 (47.1) 62.7 y : Results reported by the authors on a particular seen-unseen split. z : Based on Fisher vectors as shallow features, different from those provided in [109, 137, 192]. x : On the attribute vectors without` 2 normalization, while our own implementation shows that normalization helps in some cases. ] : As co-occurrence statistics are not available, we combine pre-trained classifiers with the weights defined in Eq. (5.2). 65 Table 7.7: Effect of types of semantic representations on AwA. Semantic representations Dimensions Accuracy (%) word vectors 100 42.2 word vectors 1000 57.5 attributes 85 69.7 attributes + word vectors 185 73.2 attributes + word vectors 1085 76.3 7.3 Detailed Results and Analysis 7.3.1 Detailed Results and Analysis on SYNC We experiment extensively to understand the benefits of many factors in our and other algorithms. While trying all possible combinations is prohibitively expensive, we have provided a compre- hensive set of results for comparison and drawing conclusions. Advantage of deep features It is clear from Table 7.6 that, across all methods, deep features significantly boost the performance based on shallow features. We use GoogLeNet and AlexNet (numbers in parentheses) and GoogLeNet generally outperforms AlexNet. It is worthwhile to point out that the reported results under deep features columns are obtained using linear clas- sifiers, which outperform several nonlinear classifiers that use shallow features. This seems to suggest that deep features, often thought to be specifically adapted to seen training images, still work well when transferred to unseen images [68]. One explanation for this observation is that deep features are learned hierarchically and ex- pected to be more abstract and semantically meaningful. Arguably, similarities between them (measured in inner products between classifiers) might be more congruent with similarities com- puted in the semantic space for combining classifiers. Additionally, shallow features have higher dimensions (around 10,000) than deep features (e.g., 1,024 for GoogLeNet) so they might require more phantom classes to synthesize classifiers. The effect of semantic types It is also clear from Table 7.6 that, in general, continuous at- tributes as semantic representations for classes attain better performance than binary attributes. This is especially true when deep learning features are used to construct classifiers. It is some- what expected that continuous attributes provide a more accurate real-valued similarity measure among classes. This presumably is exploited further by more powerful classifiers. 
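For concreteness, the binary attributes compared in Table 7.6 are obtained from the continuous ones by thresholding at the global means (Sect. 7.1.2), and the combined representations of Table 7.7 concatenate attributes and word vectors per class. The sketch below illustrates both steps; normalizing each space to unit L2 norm before concatenation follows the preprocessing of Sect. 7.1.2 and is our reading of the setup rather than something spelled out here.

```python
import numpy as np

def binarize_at_global_mean(attrs):
    """Threshold continuous class-attribute values at each dimension's
    global mean to obtain binary attributes (cf. Sect. 7.1.2)."""
    return (attrs >= attrs.mean(axis=0, keepdims=True)).astype(np.float32)

def combine_semantic_spaces(attrs, word_vecs):
    """Concatenate attributes and word vectors per class, as in Table 7.7
    (85 + 100 = 185 or 85 + 1000 = 1085 dimensions on AwA)."""
    def l2(v):
        return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)
    return np.concatenate([l2(attrs), l2(word_vecs)], axis=1)
```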
We also experiment with word vectors provided by Fu et al. [71, 72], which are of 100 and 1000 dimensions per class, respectively. In Table 7.7, we show how effective SYNC O-VS-O exploits the two types of semantic spaces: (continuous) attributes and word-vector embeddings on AwA (the only dataset with both embedding types). We find that attributes yield better performance than word-vector embeddings and higher-dimensional word vectors often give rise to better per- formance. However, combining the two gives the best result, suggesting that these two semantic spaces could be complementary and further investigation is ensured. Table 7.8 takes a different view on identifying the best semantic space. We study whether we can learn optimally the semantic representations for the phantom classes that correspond to base classifiers. These preliminary studies seem to suggest that learning attributes could have a positive effect, though it is difficult to improve over word-vector embeddings. We plan to study this issue more thoroughly in the future. 66 Table 7.8: Effect of learning semantic representations Datasets Types of embeddings w/o learning w/ learning AwA attributes 69.7% 71.1% 100-d word vectors 42.2% 42.5% 1000-d word vectors 57.6% 56.6% CUB attributes 53.4% 54.2% SUN attributes 62.8% 63.3% 20 40 60 80 100 120 140 160 70 80 90 100 110 Ratio to the number of seen classes (%) Relative accuracy (%) AwA CUB Figure 7.1: We vary the number of phantom classes R as a percentage of the number of seen classes S and investigate how much that will affect classification accuracy (the vertical axis corresponds to the ratio with respect to the accuracy whenR = S). The base classifiers are learned with SYNC O-VS-O . The number of base classifiers How many base classifiers are necessary? In Fig. 7.1, we investigate how many base classifiers are needed — so far, we have set that number to be the number of seen classes out of convenience. The plot shows that in fact, a smaller number (about 60% -70%) is enough for our algorithm to reach the plateau of the performance curve. Moreover, increasing the number of base classifiers does not seem to have an overwhelming effect. Initialization The semantic representationb r of the phantom classes are set equal toa r ;8r2 f1; ;Rg at 100% (i.e.,R = S). For percentages smaller than 100%, we performK-means and setb r to be the cluster centroids after` 2 normalization (in this case, R = K). For percentages larger than 100%, we set the firstSb r to bea r , and the remainingb r as the random combinations ofa r (also with` 2 normalization onb r ). We have shown that even by using fewer base (phantom) classifiers than the number of seen classes (e.g., around 60 %), we get comparable or even better results, especially for CUB. We sur- mise that this is because CUB is a fine-grained recognition benchmark and has higher correlations among classes, and provide analysis in Fig. 7.2 to justify this. We train one-versus-other classifiers for each value of the regularization parameter on both AwA and CUB, and then perform PCA on the resulting classifier matrices. We then plot the required number (in percentage) of PCA components to capture 95% of variance in the classifiers. Clearly, AwA requires more. This explains why we see the drop in accuracy for AwA but not CUB when using even fewer base classifiers. Particularly, the low percentage for CUB in Fig. 7.2 implies that fewer base classifiers are possible. 
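The initialization of the phantom classes' semantic representations described above can be sketched as follows, using scikit-learn's K-means for the case with fewer phantom classes than seen classes. This is an illustrative sketch rather than the thesis code; in particular, the text does not specify how the random combination weights are drawn when R > S, so uniform random weights are used here as one plausible choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def l2norm(v):
    return v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)

def init_phantom_representations(a_seen, R, seed=0):
    """Initialize semantic representations b_r for R phantom classes from the
    seen classes' representations a_seen (an S x d array)."""
    S = a_seen.shape[0]
    if R == S:
        return a_seen.copy()                       # b_r = a_r
    if R < S:
        # K-means cluster centroids, L2-normalized.
        centers = KMeans(n_clusters=R, random_state=seed).fit(a_seen).cluster_centers_
        return l2norm(centers)
    # R > S: keep the S originals and add random combinations of them.
    rng = np.random.RandomState(seed)
    weights = rng.rand(R - S, S)                   # assumed uniform random weights
    extras = l2norm(weights.dot(a_seen))
    return np.vstack([a_seen, extras])
```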
The numbers of seen and unseen classes In this section, we analyze the results under different numbers of seen/unseen classes in performing zero-shot learning using the CUB dataset. 67 0 0.2 0.4 0.6 0.8 1 x 10 −3 0.2 0.4 0.6 0.8 1 Regularization parameter (λ) Percentage of components AwA CUB Figure 7.2: Percentages of basis components required to capture 95% of variance in classifier matrices for AwA and CUB. Table 7.9: Performance of our method under different numberS of seen classes on CUB. The number of unseen classesU is fixed to be 50. U andS S = 50 S = 100 S = 150 U = 50 38.4 49.9 53.8 Varying the number of seen classes We first examine the performance of zero-shot learn- ing under different numbers of seen classes (e.g., 50, 100, and 150) while fixing the number of unseen classes to be 50. We perform 20 random selections of seen/unseen classes. Unsurpris- ingly, Table 7.9 shows that increasing the number of seen classes in training leads to improved performance on zero-shot learning. Varying the number of unseen classes We then examine the performance of our approach to zero-shot learning under different numbers of unseen classes (e.g., within [0, 150]), with the number of seen classes fixed to be 50 during training. We again conduct experiments on CUB, and perform 20 random selections of seen/unseen classes. The results are presented in Fig. 7.3. We see that the accuracy drops as the number of unseen classes increases. The effect of learning metrics We improve our method by also learning metrics for comput- ing semantic similarity. Please see Sect. 5.1.3 for more details. Preliminary results on AwA in Table 7.10 suggest that learning metrics can further improve upon our current one-vs-other formulation. 7.3.2 Detailed Results and Analysis on EXEM The quality of predicted visual exemplars We first show that predicted visual exemplars bet- ter reflect visual similarities between classes than semantic representations. Let D au be the pairwise Euclidean distance matrix between unseen classes computed from semantic representa- tions (i.e., U by U),D (au) the distance matrix computed from predicted exemplars, andD vu the distance matrix computed from real exemplars (which we do not have access to). Table 7.11 Table 7.10: Effect of learning metrics for computing semantic similarity on AwA. Dataset Type of embeddings w/o learning w/ learning AwA attributes 69.7% 73.4% 68 50 100 150 0 20 40 60 80 100 number of unseen classes accuracy (in %) Figure 7.3: Performance of our method under different numbers of unseen classes on CUB. The number of seen classes is fixed to be 50. Table 7.11: We compute the Euclidean distance matrix between the unseen classes based on semantic representations (D au ), predicted exemplars (D (au) ), and real exemplars (D vu ). Our method leads toD (au) that is better correlated withD vu thanD au is. See text for more details. Dataset Correlation toD vu name Semantic distances Predicted exemplar distances D au D (au) AwA 0.862 0.897 CUB 0.777 0.021 0.904 0.026 SUN 0.784 0.022 0.893 0.019 69 AwA CUB SUN ImageNet Figure 7.4: t-SNE [242] visualization of randomly selected real images (crosses) and predicted visual ex- emplars (circles) for the unseen classes on (from left to right) AwA, CUB, SUN, and ImageNet. Different colors of symbols denote different unseen classes. Perfect predictions of visual features would result in well-aligned crosses and circles of the same color. Plots for CUB and SUN are based on their first splits. 
Plots for ImageNet are based on randomly selected 48 unseen classes from 2-hop and word vectors as semantic representations. Best viewed in color. shows that the correlation betweenD (au) andD vu is much higher than that betweenD au and D vu . Importantly, we improve this correlation without access to any data of the unseen classes. See also similar results using another metric in Sect. A.4. We then show some t-SNE [242] visualization of predicted visual exemplars of the unseen classes. Ideally, we would like them to be as close to their corresponding real images as possible. In Fig. 7.4, we demonstrate that this is indeed the case for many of the unseen classes; for those unseen classes (each of which denoted by a color), their real images (crosses) and our predicted visual exemplars (circles) are well-aligned. The quality of predicted exemplars (in this case based on the distance to the real images) depends on two main factors: the predictive capability of semantic representations and the number of semantic representation-visual exemplar pairs available for training, which in this case is equal to the number of seen classes S. On AwA where we have only 40 training pairs, the predicted exemplars are surprisingly accurate, mostly either placed in their corresponding clusters or at least closer to their clusters than predicted exemplars of the other unseen classes. Thus, we expect them to be useful for discriminating among the unseen classes. On ImageNet, the predicted exemplars are not as accurate as we would have hoped, but this is expected since the word vectors are purely learned from text. We also observe relatively well-separated clusters in the semantic embedding space (in our case, also the visual feature space since we only apply PCA projections to the visual features), confirming our assumption about the existence of clustering structures. On CUB, we observe that these clusters are more mixed than on other datasets. This is not surprising given that it is a fine-grained classification dataset of bird species. Note that it is the relative distance that is important. Even when the predicted exemplars are not well aligned with their corresponding images, they are in many cases closer to those images than the predicted exemplars of other classes are. For example, on AwA, we would be able to predict test images from “orange” class correctly as the closest exemplar is orange (but the images and the exemplar are not exactly aligned). The effect of semantic types In Table 7.12, we show that we can improve the quality of word vectors on AwA as well. We use the 1,000-dimensional word vectors in [72] and follow the same evaluation protocol as before. For other specific details, please refer to Sect. 7.1.2. 70 Table 7.12: ZSL results in the per-class multi-way classification accuracies (in %) on AwA using word vectors as semantic representations. We use the 1,000-dimensional word vectors in [72]. All approaches use GoogLeNet as the visual features. Approach AwA SYNC O-VS-O 57.5 EXEM (SYNC O-VS-O ) 61.7 EXEM (1NN) 63.5 Table 7.13: Accuracy of EXEM (1NN) on AwA, CUB, and SUN when predicted exemplars are from original visual features (No PCA) and PCA-projected features (PCA withd = 1024 andd = 500). Dataset No PCA PCA PCA name d = 1024 d = 1024 d = 500 AwA 77.8 76.2 76.2 CUB 55.1 56.3 56.3 SUN 69.2 69.6 69.6 PCA or not? Table 7.13 investigates the effect of PCA. In general, EXEM (1NN) performs comparably with and without PCA. 
Moreover, decreasing PCA projected dimensiond from 1024 to 500 does not hurt the ZSL performance. Clearly, a smaller PCA dimension leads to faster computation due to fewer regressors to be trained. Kernel regression vs. Multi-layer perceptron We compare two approaches for predicting visual exemplars: kernel-based support vector regressors (SVR) and 2-layer multi-layer percep- tron (MLP) with ReLU nonlinearity, MLP weights are` 2 regularized, and we cross-validate the regularization constant. Table 7.14 shows that SVR performs more robustly than MLP. One explanation is that MLP is prone to overfitting due to the small training set size (the number of seen classes) as well as model selection challenge imposed by ZSL scenarios. SVR also comes with other benefits; it is more efficient and less susceptible to initialization. Table 7.14: Comparison between EXEM (1NN) with support vector regressors (SVR) and with 2-layer multi-layer perceptron (MLP) for predicting visual exemplars. Results on CUB are for the first split. Each number for MLP is an average over 3 random initialization. Dataset How to predict No PCA PCA PCA name exemplars d = 1024 d = 1024 d = 500 AwA SVR 77.8 76.2 76.2 MLP 76.1 0.5 76.4 0.1 75.5 1.7 CUB SVR 57.1 59.4 59.4 MLP 53.8 0.3 54.2 0.3 53.8 0.5 71 Table 7.15: Performances measured in AUSUC of several methods for Generalized Zero-Shot Learning on AwA and CUB. The higher the better (the upper bound is 1). AwA CUB Method Novelty detection [227] Calibrated Novelty detection [227] Calibrated Gaussian LoOP Stacking Gaussian LoOP Stacking DAP 0.302 0.272 0.366 0.122 0.137 0.194 IAP 0.307 0.287 0.394 0.129 0.145 0.199 CONSE 0.342 0.300 0.428 0.130 0.136 0.212 SYNC O-VS-O 0.420 0.378 0.568 0.191 0.209 0.336 SYNC STRUCT 0.424 0.373 0.583 0.199 0.224 0.356 Table 7.16: Performance measured in AUSUC for novelty detection (Gaussian and LoOP) and calibrated stacking on AwA and CUB. Hyper-parameters are cross-validated to maximize accu- racies. Calibrated stacking outperforms Gaussian and LoOP in all cases. AwA CUB Method Novelty detection [227] Calibrated Novelty detection [227] Calibrated Gaussian LoOP Stacking Gaussian LoOP Stacking DAP 0.280 0.250 0.341 0.126 0.142 0.202 IAP 0.319 0.289 0.366 0.132 0.149 0.194 CONSE 0.364 0.331 0.443 0.131 0.141 0.190 SYNC O-VS-O 0.411 0.387 0.539 0.195 0.219 0.324 SYNC STRUCT 0.424 0.380 0.551 0.199 0.225 0.356 7.3.3 Detailed Results and Analysis on GZSL Which method to use to perform GZSL? Table 7.15 provides an experimental comparison between several methods utilizing seen and unseen classifiers for generalized ZSL with hyper- parameters cross-validated to maximize AUSUC. See Sect. A.7 for implementation details on both novelty detection algorithms and hyper-parameter tuning. The results show that, irrespective of which ZSL methods are used to generate models for seen and unseen classes, our method of calibrated stacking for generalized ZSL outperforms other methods. In particular, despite their probabilistic justification, the two novelty detection methods do not perform well. We believe that this is because most existing zero-shot learning methods are discriminative and optimized to take full advantage of class labels and semantic information. In contrast, either Gaussian or LoOP approach models all the seen classes as a whole, possibly at the cost of modeling inter-class differences. In Table 7.16, we show that calibrated stacking outperforms Gaussian and LoOP as well when hyper-parameters are cross-validated to maximize accuracies. 
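Since all of the GZSL comparisons in this subsection rely on calibrated stacking and AUSUC, a compact sketch of both may be useful. The rule below subtracts the calibration factor gamma from every seen class's score before taking the argmax over the joint label space, which is equivalent to Eq. (6.3) with the novelty score of Sect. 6.3.3.2; AUSUC is then the area under the (A_{U->T}, A_{S->T}) curve traced out as gamma varies. This is a simplified NumPy illustration that uses per-class accuracies (Eq. (7.1)) on both axes, not the exact evaluation code.

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Calibrated stacking: subtract gamma from seen-class scores, then take the
    argmax over the joint label space of seen and unseen classes."""
    adjusted = scores - gamma * seen_mask[None, :].astype(float)
    return adjusted.argmax(axis=1)

def per_class_acc(y_true, y_pred, classes):
    accs = [(y_pred[y_true == c] == c).mean() for c in classes if (y_true == c).any()]
    return float(np.mean(accs)) if accs else 0.0

def ausuc(scores, y_true, seen_classes, unseen_classes, gammas):
    """Sweep gamma, collect (A_{U->T}, A_{S->T}) pairs, and integrate the curve."""
    seen_mask = np.zeros(scores.shape[1], dtype=bool)
    seen_mask[list(seen_classes)] = True
    a_u, a_s = [], []
    for g in gammas:
        pred = calibrated_predict(scores, seen_mask, g)
        a_u.append(per_class_acc(y_true, pred, unseen_classes))
        a_s.append(per_class_acc(y_true, pred, seen_classes))
    order = np.argsort(a_u)                # integrate A_{S->T} over A_{U->T}
    return float(np.trapz(np.asarray(a_s)[order], np.asarray(a_u)[order]))
```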
We further show the SUCs of Gaussian, LoOP, and calibrated stacking on AwA in Fig. 7.5. Again, we observe the superior performance of calibrated stacking over Gaussian and LoOP across all zero-shot learning approaches, regardless of cross-validation strategies. Moreover, interestingly, we see that the curves for Gaussian and LoOP cross each other in a way that implies that Gaussian has a tendency to classify more data into "unseen" categories (consistent with the observations reported by [227]).

Figure 7.5: Seen-Unseen accuracy Curves (SUC) for Gaussian (Gauss), LoOP, and calibrated stacking (Cal Stack) for all zero-shot learning approaches on AwA (panels: DAP, IAP, ConSE, SynC o-v-o, SynC struct; axes: A_{S->T} vs. A_{U->T}). Hyper-parameters are cross-validated based on accuracies (top) and AUSUC (bottom). Calibrated stacking outperforms both Gaussian and LoOP in all cases.

Figure 7.6: Comparison between several ZSL approaches on the task of GZSL for AwA and CUB (panels: AwA and CUB Split 1).

Figure 7.7: Comparison between CONSE and SYNC of their performances on the task of GZSL for ImageNet, where the unseen classes are within 2 tree-hops from seen classes (panels: Flat hit@1, @5, @10, and @20).

Which zero-shot learning approach is more robust to GZSL? Fig. 7.6 contrasts in detail several ZSL approaches when tested on the task of GZSL, using the method of calibrated stacking. Clearly, SYNC dominates all other methods over the whole range. The crosses on the plots mark the results of direct stacking. Fig. 7.7 contrasts in detail CONSE and SYNC. When accuracies are measured in Flat hit@1 (i.e., multi-class classification accuracy), neither method dominates the other, suggesting different trade-offs made by the two methods. However, when we measure hit rates in the top K > 1, SYNC dominates CONSE.
Table 7.17 gives a summarized comparison in AUSUC between the two methods on the ImageNet dataset. We observe that SYNC in general outperforms CONSE except when Flat hit@1 is used, in which case the two methods' performances are nearly indistinguishable. Fig. 7.8 provides SUCs for all splits of CUB and Fig. 7.9 provides SUCs for ImageNet 3-hop and All. As before, we observe the superior performance of the method of SYNC over other approaches in most cases.

Table 7.17: Performances measured in AUSUC by different zero-shot learning approaches on GZSL on ImageNet, using our method of calibrated stacking.

Unseen classes   Method          Flat hit@1   @5      @10     @20
2-hop            CONSE           0.042        0.168   0.247   0.347
                 SYNC O-VS-O     0.044        0.218   0.338   0.466
                 SYNC STRUCT     0.043        0.199   0.308   0.433
3-hop            CONSE           0.013        0.057   0.090   0.135
                 SYNC O-VS-O     0.012        0.070   0.119   0.186
                 SYNC STRUCT     0.013        0.066   0.110   0.170
All              CONSE           0.007        0.030   0.048   0.073
                 SYNC O-VS-O     0.006        0.034   0.059   0.097
                 SYNC STRUCT     0.007        0.033   0.056   0.090

Figure 7.8: Comparison of performance measured in AUSUC between different zero-shot learning approaches on the four splits of CUB (panels: Split 1 to Split 4; methods: DAP, IAP, ConSE, SynC o-v-o, SynC struct).

Figure 7.9: Comparison of performance measured in AUSUC between different zero-shot learning approaches on ImageNet-3hop (left) and ImageNet-All (right), under Flat hit@1, @5, @10, and @20.

Part IV
Learning and Leveraging Task Similarity for Sequence Tagging

Chapter 8 Multi-Task Learning for Sequence Tagging

8.1 Introduction

Multi-task learning (MTL) has long been studied in the machine learning literature, cf. [28]. The technique has also been popular in NLP, for example, in [44, 45, 156]. The main thesis underpinning MTL is that solving many tasks together provides a shared inductive bias that leads to more robust and generalizable systems.
This is especially appealing for NLP as data for many tasks are scarce — shared learning thus reduces the amount of training data needed. MTL has been validated in recent work, mostly where auxiliary tasks are used to improve the performance on a target task, for example, in sequence tagging [8, 20, 22, 196, 228]. Despite those successful applications, several key issues about the effectiveness of MTL re- main open. Firstly, with only a few exceptions, much existing work focuses on “pairwise” MTL where there is a target task and one or several (carefully) selected auxiliary tasks. However, can jointly learning many tasks benefit all of them together? A positive answer will significantly raise the utility of MTL. Secondly, how are tasks related such that one could benefit another? For instance, one plausible intuition is that syntactic and semantic tasks might benefit among their two separate groups though cross-group assistance is weak or unlikely. However, such notions have not been put to test thoroughly on a significant number of tasks. In this chapter, we address such questions. We investigate learning jointly multiple sequence tagging tasks. Besides using independent single-task learning as a baseline and a popular shared- encoder MTL framework for sequence tagging [45], we propose two variants of MTL, where both the encoder and the decoder could be shared by all tasks. We conduct extensive empirical studies on 11 sequence tagging tasks — we defer the dis- cussion on why we select such tasks to a later section. We demonstrate that there is a benefit to moving beyond “pairwise” MTL. We also obtain interesting pairwise relationships that reveal which tasks are beneficial or harmful to others, and which tasks are likely to be benefited or harmed. We find such information correlated with the results of MTL using more than two tasks. We also study selecting only benefiting tasks for joint training, showing that such a “greedy” approach in general improves the MTL performance, highlighting the need of identifying with whom to jointly learn. The rest of the chapter is organized as follows. We describe different approaches for learning from multiple tasks in Sect. 8.2. We describe our experimental setup and results in Sect. 8.3 and Sect. 8.4, respectively. We discuss related work in Sect. 8.5. Finally, we conclude with discussion and future work in Sect. 8.6. 78 8.2 Multi-Task Learning for Sequence Tagging In this section, we describe general approaches to multi-task learning (MTL) for sequence tag- ging. We select sequence tagging tasks for several reasons. Firstly, we want to concentrate on comparing the tasks themselves without being confounded by designing specialized MTL meth- ods for solving complicated tasks. Sequence tagging tasks are done at the word level, allowing us to focus on simpler models while still enabling varying degrees of sharing among tasks. Sec- ondly, those tasks are often the first steps in NLP pipelines that come with extremely diverse resources. Understanding the nature of the relationships between them is likely to have a broad impact on many downstream applications. Let T be the number of tasks andD t be training data of taskt2f1;:::;Tg. A dataset for each task consists of input-output pairs. In sequence tagging, each pair corresponds to a sequence of words w 1:L and their corresponding ground-truth tags y 1:L , where L is the sequence length. 
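Concretely, one can keep one list of (word sequence, tag sequence) pairs per task and interleave tasks when forming mini-batches, which is how the balanced mini-batches described later in Sect. 8.3.3 can be realized. The sketch below is a hypothetical, simplified illustration of such a data organization, not the actual training pipeline.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[List[str], List[str]]  # (w_1..w_L, y_1..y_L) for one sentence

def balanced_batches(data: Dict[str, List[Example]], batch_size: int, seed: int = 0):
    """Yield mini-batches of (task_id, example) pairs; while every task still has
    data left, counts from any two tasks in a batch differ by at most one."""
    rng = random.Random(seed)
    pools = {t: rng.sample(exs, len(exs)) for t, exs in data.items()}
    tasks = list(pools)
    batch, i = [], 0
    while any(pools.values()):
        t = tasks[i % len(tasks)]          # round-robin over tasks
        i += 1
        if pools[t]:
            batch.append((t, pools[t].pop()))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```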
We note that our definition of "task" is not the same as "domain" or "dataset." In particular, we differentiate between tasks based on whether or not they share the label space of tags. For instance, part-of-speech tagging on the weblog domain and that on the email domain are considered the same task in this paper.

Given the training data $\{D_1, \ldots, D_T\}$, we describe how to learn one or more models to perform all the $T$ tasks. In general, our models follow the design of state-of-the-art sequence taggers [203]. They have an encoder $e$ with parameters $\theta$ that encodes a sequence of word tokens into a sequence of vectors and a decoder $d$ with parameters $\phi$ that decodes the sequence of vectors into a sequence of predicted tags $\hat{y}_{1:L}$. That is, $c_{1:L} = e(w_{1:L}; \theta)$ and $\hat{y}_{1:L} = d(c_{1:L}; \phi)$. The model parameters are learned by minimizing some loss function $L(\hat{y}_{1:L}, y_{1:L})$ over $\theta$ and $\phi$. In what follows, we will use superscripts to differentiate instances from different tasks.

In single-task learning (STL), we learn $T$ models independently. For each task $t$, we have an encoder $e_t(\cdot; \theta_t)$ and a decoder $d_t(\cdot; \phi_t)$. Clearly, information is not shared between tasks in this case. In multi-task learning (MTL), we consider two or more tasks and train an MTL model jointly over a combined loss $\sum_t L(\hat{y}^t_{1:L}, y^t_{1:L})$. In this paper, we consider the following general frameworks, which differ in the nature of how the parameters of those tasks are shared.

Multi-task learning with multiple decoders (MULTI-DEC) We learn a shared encoder $e(\cdot; \theta)$ and $T$ decoders $\{d_t(\cdot; \phi_t)\}_{t=1}^T$. This setting has been explored for sequence tagging in [44, 45]. In the context of sequence-to-sequence learning [233], this is similar to the "one-to-many" MTL setting in [156].

Multi-task learning with task embeddings (TE) We learn a shared encoder $e(\cdot; \theta)$ for the input sentence as well as a shared decoder $d(\cdot; \phi)$. To equip our model with the ability to perform one-to-many mapping (i.e., multiple tasks), we augment the model with "task embeddings." Specifically, we additionally learn a function $f(t)$ that maps a task ID $t$ to a vector. We explore two ways of injecting task embeddings into models. In both cases, our $f$ is simply an embedding layer that maps the task ID to a dense vector.

One approach, denoted by TEDEC, is to incorporate task embeddings into the decoder. We concatenate the task embeddings with the encoder's outputs $c^t_{1:L}$ and then feed the result to the decoder. The other approach, denoted by TEENC, is to combine the task embeddings with the input sequence of words at the encoder. We implement this by prepending the task token (<<upos>>, <<chunk>>, <<mwe>>, etc.) to the input sequence and treating the task token as a word token [113]. While the encoder in TEDEC must learn to encode a general-purpose representation of the input sentence, the encoder in TEENC knows from the start which task it is going to perform.

Fig. 8.1 illustrates the different settings described above. Clearly, the number of model parameters is reduced significantly when we move from STL to MTL. Which MTL model is more economical depends on several factors, including the number of tasks, the dimension of the encoder output, the general architecture of the decoder, the dimension of task embeddings, how to augment the system with task embeddings, and the degree of tagset overlap.

Figure 8.1: Different settings for learning from multiple tasks considered in our experiments. (Panels: single-task learning with per-task encoder RNNs and decoder CRFs; MTL (MULTI-DEC) with a shared encoder RNN and per-task decoder CRFs; MTL (TEDEC) and MTL (TEENC) with a shared encoder RNN, a shared decoder CRF, and task embeddings or task tokens.)

Table 8.1: Datasets used in our experiments, as well as their key characteristics and their corresponding tasks. / is used to separate statistics for training data only and those for all subsets of data.

Dataset                       # sentences    Token/type   Task        # labels   Label entropy
Universal Dependencies v1.4   12543/16622    12.3/13.2    UPOS        17         2.5
                                                          XPOS        50         3.1
CoNLL-2000                    8936/10948     12.3/13.3    CHUNK       42         2.3
CoNLL-2003                    14041/20744    9.7/11.2     NER         17         0.9
Streusle 4.0                  2723/3812      8.6/9.3      MWE         3          0.5
                                                          SUPSENSE    212        2.2
SemCor                        13851/20092    13.2/16.2    SEM         75         2.2
                                                          SEMTR       11         1.3
Broadcast News 1              880/1370       5.2/6.1      COM         2          0.6
FrameNet 1.5                  3711/5711      8.6/9.1      FRAME       2          0.5
Hyper-Text Corpus             2000/3974      6.7/9.0      HYP         2          0.4

8.3 Experimental Setup

8.3.1 Datasets and Tasks

Table 8.1 summarizes the datasets used in our experiments, along with their corresponding tasks and important statistics. Table 8.2 shows an example of each task's input-output pairs. We describe details below. For all tasks, we use the standard splits unless specified otherwise.

We perform universal and English-specific POS tagging (UPOS and XPOS) on sentences from the English Web Treebank [19], annotated by the Universal Dependencies project [182]. We perform syntactic chunking (CHUNK) on sentences from the WSJ portion of the Penn Treebank [163], annotated by the CoNLL-2000 shared task [240]. We use sections 15-18 for training. The shared task uses section 20 for testing and does not designate a development set, so we use the first 1001 sentences for development and the remaining 1011 for testing. We perform named entity recognition (NER) on sentences from the Reuters Corpus [145], consisting of news stories between August 1996 and 1997, annotated by the CoNLL-2003 shared task [241]. For both CHUNK and NER, we use the IOBES tagging scheme. We perform multi-word expression identification (MWE) and supersense tagging (SUPSENSE) on sentences from the reviews section of the English Web Treebank, annotated under the Streusle project [217] (https://github.com/nert-gu/streusle). We perform supersense (SEM) and semantic trait (SEMTR) tagging on SemCor's sentences [139], taken from a subset of the Brown Corpus [67], using the splits provided by [8]
We use the “select” subset that correspond to marked, complete sentences. 2 https://github.com/bplank/multitasksemantics 3 We consider SUPSENSE and SEM as different tasks as they use different sets of supersense tags. 4 http://jamesclarke.net/research/resources/ 5 https://nlp.stanford.edu/valentin/pubs/markup-data.tar.bz2 82 Table 8.2: Examples of input-output pairs of the tasks in consideration Task Input/Output UPOS once again , thank you all for an outstanding accomplishment . ADV ADV PUNCT VERB PRON DET ADP DET ADJ NOUN PUNCT XPOS once again , thank you all for an outstanding accomplishment . RB RB , VBP PRP DT IN DT JJ NN . CHUNK the carrier also seemed eager to place blame on its american counterparts . B-NP E-NP S-ADVP S-VP S-ADJP B-VP E-VP S-NP S-PP B-NP I-NP E-NP O NER 6. pier francesco chili ( italy ) ducati 17541 O B-PER I-PER E-PER O S-LOC O S-ORG O MWE had to keep in mind that the a / c broke , i feel bad it was their opening ! B I B I I O O B I I O O O O O O O O O O SUPSENSE this place may have been something sometime ; but it way past it " sell by date " . O n.GROUP O O v.stative O O O O O O p.Time p.Gestalt O v.possession p.Time n.TIME O O SEM a hypothetical example will illustrate this point . O adj.all noun.cognition O verb.communication O noun.communication O SEMTR he wondered if the audience would let him finish . O Mental O O Object O Agentive O BoundedEvent O COM he made the decisions in 1995 , in early 1996 , to spend at a very high rate . KEEP KEEP DEL KEEP DEL DEL DEL DEL DEL DEL DEL KEEP KEEP KEEP KEEP DEL KEEP KEEP KEEP FRAME please continue our important partnership . O B-TARGET O B-TARGET O O HYP will this incident lead to a further separation of civilizations ? O O O O O O O B-HTML B-HTML B-HTML O 83 8.3.2 Metrics and Score Comparison We use the span-based micro-averaged F1 score (without the O tag) for all tasks. We run each configuration three times with different initializations and compute mean and standard deviation of the scores. To compare two scores, we use the following strategy. Let 1 , 1 and 2 , 2 be two sets of scores (mean and std, respectively). We say that 1 is “higher” than 2 if 1 k 1 > 2 +k 2 , wherek is a parameter that controls how strict we want the definition to be. “lower” is defined in the same manner with> changed to< and switched with +. k is set to 1.5 in all of our experiments. 8.3.3 Models General architectures We use bidirectional recurrent neural networks (biRNNs) as our en- coders for both words and characters [103, 105, 138, 158]. Our word/character sequence encoders and decoder classifiers are common in literature and most similar to [138], but we use two-layer biRNNs (instead of one) with Gated Recurrent Unit (GRU) [40] (instead of with LSTM [99]). Each word is represented by a 100-dimensional vector that is the concatenation of a 50- dimensional embedding vector and the 50-dimensional output of a character biRNN (whose hidden representation dimension is 25 in each direction). We feed a sequence of those 100- dimensional representations to a word biRNN, whose hidden representation dimension is 300 in each direction, resulting in a sequence of 600-dimensional vectors. In TEDEC, the token en- coder is also used to encode a task token (which is then concatenated to the encoder’s output), where each task is represented as a 25-dimensional vector. For decoder/classifiers, we predict a sequence of tags using a linear projection layer (to the tagset size) followed by a conditional random field (CRF) [135]. 
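The architecture above can be summarized in a compact PyTorch sketch, with dimensions following the text: 50-dimensional word embeddings, a character biGRU with 25 hidden units per direction, a two-layer word biGRU with 300 hidden units per direction, and one linear head per task as in MULTI-DEC. The character-embedding size is an assumption on our part, and the CRF layer is replaced here by raw per-token scores for brevity, so this is an illustrative skeleton rather than the released code.

```python
import torch
import torch.nn as nn

class TaggerEncoder(nn.Module):
    """Character-biGRU plus word-embedding encoder, followed by a 2-layer word biGRU."""
    def __init__(self, vocab_size, char_vocab_size):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, 50)
        self.char_emb = nn.Embedding(char_vocab_size, 25)   # char emb size assumed
        self.char_rnn = nn.GRU(25, 25, bidirectional=True, batch_first=True)
        self.word_rnn = nn.GRU(100, 300, num_layers=2,
                               bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, L); chars: (batch, L, max_word_len)
        B, L, W = chars.shape
        _, h = self.char_rnn(self.char_emb(chars.view(B * L, W)))
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(B, L, 50)  # 25 + 25
        x = torch.cat([self.word_emb(words), char_repr], dim=-1)    # (B, L, 100)
        out, _ = self.word_rnn(x)                                   # (B, L, 600)
        return out

class MultiDecTagger(nn.Module):
    """Shared encoder with one linear head per task (the CRF on top is omitted)."""
    def __init__(self, encoder, tagset_sizes):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {task: nn.Linear(600, n_tags) for task, n_tags in tagset_sizes.items()})

    def forward(self, task, words, chars):
        return self.heads[task](self.encoder(words, chars))  # per-token tag scores
```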
Implementation and training details We implement our models in PyTorch [191] on top of the AllenNLP library [78]. Code is to be available at https://github.com/schangpi/. Words are lower-cased, but characters are not. Word embeddings are initialized with GloVe [195] trained on Wikipedia 2014 and Gigaword 5. We use the strategies suggested by [158] for initializing the other parameters in our networks. Character embeddings are initialized uniformly in $[-\sqrt{3/d}, \sqrt{3/d}]$, where $d$ is the dimension of the embeddings. Weight matrices are initialized with Xavier Uniform [83], i.e., uniformly in $[-\sqrt{6/(r+c)}, \sqrt{6/(r+c)}]$, where $r$ and $c$ are the number of rows and columns in the structure. Bias vectors are initialized with zeros. We use Adam [117] with default hyperparameters and a mini-batch size of 32. The dropout rate is 0.25 for the character encoder and 0.5 for the word encoder. We use gradient normalization [190] with a threshold of 5. We halve the learning rate if the validation performance does not improve for two epochs, and stop training if the validation performance does not improve for 10 epochs. We use L2 regularization with parameter 0.01 for the transition matrix of the CRF.

For the training of MTL models, we make sure that each mini-batch is balanced; the difference in the numbers of examples from any pair of tasks is no more than 1. As a result, each epoch may not go through all examples of some tasks whose training set sizes are large. In a similar manner, during validation, the average F1 score is over all tasks rather than over all validation examples.

8.3.4 Various Settings for Learning from Multiple Tasks

We consider the following settings: (i) "STL" where we train each model on one task alone; (ii) "Pairwise MTL" where we train on two tasks jointly; (iii) "All MTL" where we train on all tasks jointly; (iv) "Oracle MTL" where we train on the Oracle set of the testing task jointly with the testing task; (v) "All-but-one MTL" where we train on all tasks jointly except for one (as part of Sect. 8.4.3).

Constructing the Oracle Set of a testing task The Oracle set of a task $t$ is constructed from the pairwise performances: let $\mu(A, t)$ and $\sigma(A, t)$ be the F1 score and the standard deviation of a model that is jointly trained on the tasks in the set $A$ and tested on task $t$. Task $s$ is considered "beneficial" to another (testing) task $t$ if $\mu(\{s, t\}, t)$ is "higher" than $\mu(\{t\}, t)$ (cf. Sect. 8.3.2). Then, the "Oracle" set for a task $t$ is the set of all its beneficial (single) tasks. Throughout our experiments, we compute $\mu$ and $\sigma$ by averaging over three rounds (cf. Sect. 8.3.2; standard deviations can be found in Sect. B.3).
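The "higher" test from Sect. 8.3.2 and the Oracle-set construction above are simple enough to express directly. The sketch below is our own illustration (the thesis provides no code); the dictionaries of per-task means and standard deviations are hypothetical placeholders.

```python
K = 1.5  # strictness parameter k from Sect. 8.3.2


def higher(mu1, sd1, mu2, sd2, k=K):
    """Score 1 is 'higher' than score 2 if mu1 - k*sd1 > mu2 + k*sd2."""
    return mu1 - k * sd1 > mu2 + k * sd2


def oracle_set(target, tasks, mu, sd):
    """Tasks s whose pairwise run {s, target} is 'higher' than STL on target.

    mu[(A, t)] / sd[(A, t)] hold the mean / std F1 (over three runs) of a
    model trained on the task set A (a frozenset) and tested on task t."""
    stl = frozenset([target])
    return [s for s in tasks if s != target and
            higher(mu[(frozenset([s, target]), target)],
                   sd[(frozenset([s, target]), target)],
                   mu[(stl, target)], sd[(stl, target)])]
```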
8.4 Results and Analysis

Fig. 8.2 summarizes our main findings. We compare the relative improvement over single-task learning (STL) between the various settings with different types of sharing in Sect. 8.3.4. Scores from the pairwise setting ("+One Task") are represented as a vertical bar, delineating the maximum and minimum improvement over STL by jointly learning a task with one of the remaining 10 tasks. The "All" setting (red triangles) indicates joint learning of all 11 tasks. The "Oracle" setting (blue rectangles) indicates joint learning using a subset of the 11 tasks that are deemed beneficial, based on the corresponding performances in pairwise MTL, as defined in Sect. 8.3.4.

We observe that (1) [STL vs. Pairwise/All] Neither pairwise MTL nor All always improves upon STL; (2) [STL vs. Oracle] Oracle in general outperforms or at least does not worsen STL; (3) [All/Oracle vs. Pairwise] All does better than Pairwise in about half of the cases, while Oracle almost always does better than Pairwise; (4) [All vs. Oracle] Consider when both All and Oracle improve upon STL. For MULTI-DEC and TEENC, Oracle generally dominates All, except on the task HYP. For TEDEC, their magnitudes of improvement are mostly comparable, except on SEMTR (Oracle is better) and on HYP (All is better). In addition, All is better than Oracle on the task COM, in which case Oracle is STL.

In Sect. B.1, we compare the different MTL approaches: MULTI-DEC, TEDEC, and TEENC. There is no significant difference among them.

8.4.1 Pairwise MTL Results

Figure 8.2: Summary of our results for MTL methods MULTI-DEC (top), TEDEC (middle), and TEENC (bottom) on various settings with different types of sharing. The vertical axis is the relative improvement over STL. See text for details. Best viewed in color.

Figure 8.3: Pairwise MTL relationships (benefit vs. harm) using MULTI-DEC (left), TEDEC (middle), and TEENC (right). A solid green (red) directed edge from s to t denotes s benefiting (harming) t. Dashed green (red) edges between s and t denote that they benefit (harm) each other. Dotted edges denote an asymmetric relationship: benefit in one direction but harm in the reverse direction. Absence of an edge denotes a neutral relationship. Best viewed in color and with a zoom-in.

Summary The summary plot in Fig. 8.3 gives a bird's-eye view of the patterns in which a task might benefit or harm another one. For example, MWE always benefits from jointly learning with any of the other 10 tasks, as its incoming edges are green; so does SEMTR in most cases. On the other end, COM seems to harm any of the other 10, as its outgoing edges are almost always red. CHUNK and U/XPOS generally benefit others (or at least do no harm), as most of their outgoing edges are green.

In Tables 8.3-8.5, we report F1 scores for MULTI-DEC, TEDEC, and TEENC, respectively. In each table, rows denote the settings in which we train our models, and columns correspond to the tasks we test them on. We also include the "Average" of all pairwise scores, as well as the number of positive (↑) and negative (↓) relationships in each row or each column.

Which tasks are benefited or harmed by others in pairwise MTL? MWE, SUPSENSE, SEMTR, and HYP generally benefit from other tasks. The improvement is more significant for MWE and HYP. UPOS, XPOS, NER, COM, and FRAME (MULTI-DEC and TEDEC) are often hurt by other tasks. Finally, the results are mixed for CHUNK and SEM.

Which tasks are beneficial or harmful? UPOS, XPOS, and CHUNK are universal helpers, beneficial in 16, 17, and 14 cases, while harmful in only 1, 3, and 0 cases, respectively. Interestingly, CHUNK never hurts any task, while both UPOS and XPOS can be harmful to NER. While CHUNK is considered more of a syntactic task, the fact that it informs other tasks about the boundaries of phrases may aid the learning of other, semantic tasks (task embeddings in Sect. 8.4.3 seem to support this hypothesis). On the other hand, COM, FRAME, and HYP are generally harmful: useful in 0 cases and causing performance drops in 22, 10, and 12 cases, respectively. One factor that may play a role is the training set sizes of these tasks. However, we note that both MWE and SUPSENSE (Streusle dataset) have smaller training set sizes than FRAME does, but those tasks can still benefit some tasks.
(On the other hand, NER has the largest training set, but infrequently benefits other tasks, less frequently than SUPSENSE does.) Another potential cause is the fact that all those harmful tasks have the smallest label size of 2. This combined with small dataset sizes leads to a higher chance of overfitting. Finally, it may be possible that harmful tasks are simply unrelated; for example, the nature of COM, FRAME, or HYP may be very different from other tasks — an entirely different kind of reasoning is required. Finally, NER, MWE, SEM, SEMTR, and SUPSENSE can be beneficial or harmful, depending on which other tasks they are trained with. 88 Table 8.3: F1 scores for MULTI-DEC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. UPOS XPOS CHUNK NER MWE SEM SEMTR SUPSENSE COM FRAME HYP #" ## +UPOS 95.4 95.01 94.18" 87.68 59.99" 73.23" 74.93" 68.25" 72.46 62.14 48.02 5 0 +XPOS 95.38 95.04 93.97" 87.61# 58.87" 73.34" 74.91" 67.78" 72.83 60.77 48.81" 6 1 +CHUNK 95.43 95.1 93.49 87.96 59.18" 73.16" 74.79" 67.39" 72.44 62.67 47.85" 5 0 +NER 95.38 94.98 93.47 88.24 55.4" 72.88 74.34" 68.06" 70.93 62.39 47.9 3 0 +MWE 95.15# 94.7# 93.54 88.15 53.07 72.75 74.51" 66.88 71.31 61.75 47.32 1 2 +SEM 95.23 94.77# 93.63" 87.35# 60.16" 72.77 74.73" 68.29" 72.72 61.74 48.15" 5 2 +SEMTR 95.17 94.86# 93.61 87.34# 58.84" 72.5# 74.02 68.6" 71.96 62.03 47.74 2 3 +SUPSENSE 95.08# 94.75 93.2 87.9 58.81" 72.81 74.61" 66.81 72.24 61.94 49.23" 3 1 +COM 93.04# 93.19# 91.94# 86.62# 53.89 70.39# 72.6 65.57# 72.71 56.52# 47.41 0 7 +FRAME 94.98# 94.64# 93.22# 88.15 53.88 72.76 74.18 66.59 72.47 62.04 47.5 0 3 +HYP 94.84# 94.46# 92.96# 87.98 53.08 72.47# 74.23 66.47 71.82 61.02 46.73 0 4 #" 0 0 3 0 7 3 7 6 0 0 4 ## 5 6 3 4 0 3 0 1 0 1 0 Average 94.97 94.65 93.37 87.67 57.21 72.63 74.38 67.39 72.12 61.3 47.99 All 95.04# 94.31# 93.44 86.38# 61.43" 71.53# 74.26" 68.1" 74.54 59.71 51.41" 4 4 Oracle 95.4 95.04 94.01" 88.24 62.76" 73.32" 75.23" 68.53" 72.71 62.04 50.0" 6 0 89 Table 8.4: F1 scores for TEDEC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. 
UPOS XPOS CHUNK NER MWE SEM SEMTR SUPSENSE COM FRAME HYP #" ## +UPOS 95.4 94.99 94.02" 87.99 60.28" 73.17" 74.87" 67.8" 72.86 61.54 49.36" 6 0 +XPOS 95.4 95.04 94.18" 87.65# 60.32" 73.21" 74.84" 68.3" 72.87 61.44 49.23" 6 1 +CHUNK 95.57" 95.21" 93.49 88.11 57.61" 73.02" 74.73" 67.29 73.3 61.39 48.43" 6 0 +NER 95.32 95.09 93.64" 88.24 55.17" 72.77 74.01 67.25 71.08# 59.25# 48.24 2 2 +MWE 95.11# 94.8# 93.59 87.99 53.07 72.66 74.63" 66.88 70.93# 56.77 45.83 1 3 +SEM 95.2# 94.82 93.45 87.27# 58.21" 72.77 74.72" 68.46" 73.14 60.09# 47.95 3 3 +SEMTR 95.21# 94.8# 93.47 87.75 58.55" 72.5# 74.02 68.18" 71.74 59.77 46.96 2 3 +SUPSENSE 95.05# 94.81# 93.25 87.94 58.75" 72.71 74.52" 66.81 69.13# 55.68# 47.29 2 4 +COM 94.03# 93.94# 92.29# 86.59# 51.72 70.37# 71.76# 64.98# 72.71 55.25# 45.24 0 8 +FRAME 94.79# 94.66# 93.23# 88.02 53.05 72.26# 74.21 66.2# 72.89 62.04 46.0 0 5 +HYP 94.35# 94.56# 92.86# 87.91 52.98 72.15# 74.19 66.52 70.47 55.35# 46.73 0 5 #" 1 1 3 0 7 3 6 4 0 0 3 ## 7 6 3 3 0 4 1 2 3 5 0 Average 95.0 94.77 93.4 87.72 56.67 72.48 74.25 67.19 71.84 58.65 47.45 All 94.95# 94.42# 93.64 86.8# 61.97" 71.72# 74.36" 67.98" 74.61" 58.14# 51.31" 5 5 Oracle 95.57" 95.21" 94.07" 88.24 61.74" 73.1" 75.24" 68.22" 72.71 62.04 50.15" 8 0 90 Table 8.5: F1 scores for TEENC. We compare STL setting (blue), with pairwise MTL (+htaski), All, and Oracle. We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. The last two columns indicate how many tasks are helped or harmed by the task at that row. UPOS XPOS CHUNK NER MWE SEM SEMTR SUPSENSE COM FRAME HYP #" ## +UPOS 95.4 94.94 94.0" 87.43# 57.61" 73.11" 74.85" 67.76" 72.09 62.27 48.27 5 1 +XPOS 95.42 95.04 93.98" 87.71# 58.26" 73.04 74.66" 67.77" 72.41 61.62 48.06" 5 1 +CHUNK 95.4 95.1 93.49 88.07 58.06" 73.13" 74.77" 67.36 72.88 62.98 47.13 3 0 +NER 95.29 95.05 93.54 88.24 53.4 72.91 74.04 67.57" 70.78# 63.02 48.64 1 1 +MWE 95.05# 94.66# 93.33 88.02 53.07 72.83 74.66" 66.26 71.36 60.61 46.71 1 2 +SEM 95.27 94.93 93.52 87.49# 58.62" 72.77 74.41" 68.1" 72.25 62.17 47.12 3 1 +SEMTR 95.23 94.97 93.45 87.29# 58.31" 72.17# 74.02 67.64 72.15 62.79 46.1 1 2 +SUPSENSE 95.27 95.0 93.13# 87.92 58.05" 73.09" 74.94" 66.81 72.12 61.96 47.24 3 1 +COM 93.6# 93.12# 91.86# 86.75# 51.71 70.18# 71.35# 65.55# 72.71 57.65 47.81 0 7 +FRAME 95.0# 94.55# 93.29 87.99 53.3 72.49 74.63" 66.75 72.1 62.04 46.66 1 2 +HYP 94.43# 94.26# 93.13# 87.82 52.59 71.95 74.14 66.16 72.79 61.14 46.73 0 3 #" 0 0 2 0 6 3 7 4 0 0 1 ## 4 4 3 5 0 2 1 1 1 0 0 Average 95.0 94.66 93.32 87.65 55.99 72.49 74.24 67.09 72.09 61.62 47.37 All 94.94# 94.3# 93.7" 86.01# 59.57" 71.58# 74.35 68.02" 74.61" 61.83 49.5" 5 4 Oracle 95.4 95.04 93.93" 88.24 61.92" 73.14" 75.09" 69.04" 72.71 62.04 48.06" 6 0 91 Table 8.6: F1 scores for MULTI-DEC. We compare All with All-but-one settings (All -hTASKi). We test on each task in the columns. Beneficial settings are in green. Harmful setting are in red. 
UPOS XPOS CHUNK NER MWE SEM SEMTR SUPSENSE COM FRAME HYP #" ## All 95.04 94.31 93.44 86.38 61.43 71.53 74.26 68.1 74.54 59.71 51.41 All - UPOS 94.03 93.59 86.03 61.28 70.87 73.54 68.27 74.42 58.47 51.13 0 0 All - XPOS 94.57# 93.57 86.04 61.91 71.12 74.03 67.99 74.36 60.16 51.65 0 1 All - CHUNK 94.84# 94.46 86.05 61.01 71.07 73.97 68.26 74.2 60.01 50.27 0 1 All - NER 94.81# 94.3 93.59 62.69 70.82 73.51# 68.16 74.08 59.17 50.86 0 2 All - MWE 94.93# 94.45 93.71 86.21 71.01 73.61# 68.18 74.7 59.23 50.83 0 2 All - SEM 94.82 94.34 93.63 85.81 61.17 71.97# 67.36 74.31 58.73 50.93 0 1 All - SEMTR 94.83 94.35 93.58 86.11 63.04 69.72# 68.17 74.2 59.49 51.27 0 1 All - SUPSENSE 94.97 94.54 93.67 86.43 60.51 71.22 73.86# 74.24 59.23 50.86 0 1 All - COM 95.19" 94.69" 93.67 86.6 61.95 72.38" 74.75" 68.67 62.37" 50.28 5 0 All - FRAME 95.15 94.57 93.7 85.9 62.62 71.48 74.24 68.47 75.03 50.89 0 0 All - HYP 94.93 94.53 93.78" 86.31 62.04 71.22 74.02 68.46 74.62 59.69 1 0 #" 1 1 1 0 0 1 1 0 0 1 0 ## 4 0 0 0 0 1 4 0 0 0 0 92 8.4.2 All MTL Results In addition to pairwise MTL results, we report the performances in the All and Oracle MTL settings in the last two rows of Table 8.3-8.5. We find that their performances depend largely on the trend in their corresponding pairwise MTL. We provide examples and discussion of such observations below. How much is STL vs. Pairwise MTL predictive of STL vs. All MTL? We find that the performance of pairwise MTL is predictive of the performance of All MTL to some degree. Below we discuss the results in more detail. Note that we would like to be predictive in both the performance direction and magnitude (whether and how much the scores will improve or degrade over the baseline). When pairwise MTL improves upon STL even slightly, All improves upon STL in all cases (MWE, SEMTR, SUPSENSE, and HYP). This is despite the fact that jointly learning some pairs of tasks lead to performance degradation (COM and FRAME in the case of SUPSENSE and COM in the case of SEMTR). Furthermore, when pairwise MTL leads to improvement in all cases (all pairwise rows in MWE and HYP), All MTL will achieve even better performance, suggesting that tasks are beneficial in a complementary manner and there is an advantage of MTL beyond two tasks. The opposite is almost true. When pairwise MTL does not improve upon STL, most of the time All MTL will not improve upon STL, either — with one exception: COM. Specifically, the pairwise MTL performances of UPOS, XPOS, NER and FRAME (TEDEC) are mostly negative and so are their All MTL performances. Furthermore, tasks can also be harmful in a complemen- tary manner. For instance, in the case of NER, All MTL achieves the lowest or the second lowest score when compared to any row of the pairwise MTL settings. In addition, SEM’s pairwise MTL performances are mixed, making the average score about the same or slightly worse than STL. However, the performance of All MTL when tested on SEM almost achieves the lowest. In other words, SEM is harmed more than it is benefited but pairwise MTL performances cannot tell. This suggests that harmful tasks are complementary while beneficial tasks are not. Our results when tested on COM are the most surprising. While none of pairwise MTL settings help (with some hurting), the performance of All MTL goes in the opposite direction, outperforming that of STL. Further characterization of task interaction is needed to reveal why this happens. One hypothesis is that instances in COM that are benefited by one task may be harmed by another. 
The joint training of all tasks thus works because tasks regularize each other. We believe that our results open the door to other interesting research questions. While the pairwise MTL performance is somewhat predictive of the performance direction of All MTL (except for COM), the magnitude in that direction is difficult to predict. It is clear that additional factors beyond pairwise performance contribute to the success or failure of the All MTL setting. It would be useful to automatically identify these factors or to design a metric that captures them. There have been initial attempts along this research direction in [8, 20, 21], in which manually-defined task characteristics are found to be predictive of pairwise MTL's failure or success.

Oracle MTL Recall that a task has an "Oracle" set when the task benefits from some other tasks according to its pairwise results (cf. Sect. 8.3.4). In general, our simple heuristic works well. Out of 20 cases where Oracle MTL performances exist, 16 are better than the performance of All MTL. On SEM, UPOS, and XPOS (TEDEC), Oracle MTL is able to reverse the negative results obtained by All MTL into positive ones, leading to improved scores over STL in all cases. This suggests that pairwise MTL performances are valuable knowledge if we want to go beyond two tasks. But, as mentioned previously, pairwise performance information fails in the case of COM; All MTL leads to improvement, but we do not have an Oracle set in this case.

Out of the 4 cases where Oracle MTL does not improve upon All MTL, 3 are when we test on HYP and one is when we test on MWE. These two tasks are not harmed by any tasks. This result seems to suggest that sometimes "neutral" tasks can help in MTL (but not always, for example, in MULTI-DEC and TEENC of MWE). This also raises the question of whether there is a more effective way to construct an oracle set.

8.4.3 Analysis

Task Contribution in All MTL How much does one particular task contribute to the performance of All MTL? To investigate this, we remove one task at a time and train the rest jointly. Results are shown in Table 8.6 for the method MULTI-DEC; results for the other two methods are in Sect. B.2 as they are qualitatively similar to MULTI-DEC. We find that UPOS, SEM, and SEMTR are in general sensitive to a task being removed from All MTL. Moreover, at least one task significantly contributes to the success of All MTL at some point; if we remove it, the performance will drop. On the other hand, COM generally negatively affects the performance of All MTL, as removing it often leads to performance improvement.

Figure 8.4: t-SNE visualization of the embeddings of the 11 tasks that are learned from TEDEC

Task Embeddings Fig. 8.4 shows a t-SNE visualization [242] of the task embeddings learned from TEDEC in the All MTL setting (we observed that task embeddings learned from TEENC are not consistent across multiple runs). The learned task embeddings reflect our knowledge about similarities between tasks, where there are clusters of syntactic and semantic tasks. We also learn that sentence compression (COM) is more syntactic, whereas multi-word expression identification (MWE) and hyper-text detection (HYP) are more semantic. Interestingly, CHUNK seems to be in between, which may explain why it never harms any tasks in any setting (cf. Sect. 8.4.1).
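A visualization like Fig. 8.4 can be produced with an off-the-shelf t-SNE implementation. The sketch below is only illustrative: the 25-dimensional task-embedding matrix is a random placeholder, not the values actually learned by TEDEC.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

tasks = ["UPOS", "XPOS", "CHUNK", "NER", "MWE", "SEM",
         "SEMTR", "SUPSENSE", "COM", "FRAME", "HYP"]
# Placeholder for the 11 x 25 task-embedding matrix learned by TEDEC.
task_embeddings = np.random.randn(len(tasks), 25)

# Perplexity must be smaller than the number of points (11 tasks here).
coords = TSNE(n_components=2, perplexity=5,
              random_state=0).fit_transform(task_embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, tasks):
    plt.annotate(name, (x, y))
plt.show()
```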
In general, it is not obvious how to translate task similarities derived from task embeddings into something indicative of MTL performance. While our task embeddings could be considered "task characteristics" vectors, they are not guaranteed to be interpretable. We thus leave a thorough investigation of the information captured by task embeddings to future work.

Nevertheless, we observe that task embeddings disentangle "sentences/tags" and "actual task" to some degree. For instance, if we consider the locations of each pair of tasks that use the same set of sentences for training in Fig. 8.4, we see that SEM and SEMTR (or MWE and SUPSENSE) are not neighbors, while XPOS and UPOS are. On the other hand, MWE and NER are neighbors, even though their label set sizes and entropies are not the closest. These observations suggest that the hand-designed task features used in [20] may not be the most informative characterization for predicting MTL performance.

8.5 Related Work

For a comprehensive overview of MTL in NLP, see Chapter 20 of [85] and [211]. Here we highlight the work that is most relevant. MTL for NLP has been popular since a unified architecture was proposed by [44, 45]. As for sequence-to-sequence learning [233], general multi-task learning frameworks are explored by [156].

Our work is different from existing work in several aspects. First, the majority of the work focuses on two tasks, often with one being the main task and the other being the auxiliary one [8, 20, 22, 196, 228]. For example, POS is the auxiliary task in [228] while CHUNK, CCG supertagging (CCG) [42], NER, SEM, or MWE+SUPSENSE is the main one. They find that POS benefits CHUNK and CCG. Another line of work considers language modeling as the auxiliary objective [84, 151, 202]. Besides sequence tagging, some work considers two high-level tasks or one high-level task with another lower-level one. Examples are dependency parsing (DEP) with POS [278], with MWE [46], or with semantic role labeling (SRL) [223]; machine translation (TRANSLATE) with POS or DEP [56, 181]; and sentence extraction and COM [7, 18, 164]. Exceptions to this include the work of [45], which considers four tasks: POS, CHUNK, NER, and SRL; [199], which considers three: word sense disambiguation with POS and coarse-grained semantic tagging based on WordNet lexicographer files; [94], which considers five: POS, CHUNK, DEP, semantic relatedness, and textual entailment; and [118, 181], which both consider three: TRANSLATE with POS and NER, and TRANSLATE with POS and DEP, respectively. We consider as many as 11 tasks jointly.

Second, we choose to focus on model architectures that are generic enough to be shared by many tasks. Our structure is similar to [45], but we also explore frameworks related to task embeddings and propose two variants. In contrast, recent work considers stacked architectures (mostly for sequence tagging) in which tasks can supervise at different layers of a network [8, 20, 94, 120, 196, 228]. More complicated structures require more sophisticated MTL methods when the number of tasks grows and thus prevent us from concentrating on analyzing relationships among tasks. For this reason, we leave MTL for complicated models to future work.

The purpose of our study is relevant to but different from transfer learning, where the setting designates one or more target tasks and focuses on whether the target tasks can be learned more effectively from the source tasks; see, e.g., [179, 267].

8.6 Discussion and Future Work

We conduct an empirical study on MTL for sequence tagging, which so far has been mostly studied with two or a few tasks. We also propose two alternative frameworks that augment taggers with task embeddings.
Our results provide insights regarding task relatedness and show the benefits of the MTL approaches. Nevertheless, we believe that our work only scratches the surface of MTL. The characterization of task relationships seems to go beyond the performances of pairwise MTL training or the similarities of task embeddings. We are also interested in further exploring other techniques for MTL, especially when tasks become more complicated. For example, it is not clear how to best represent task specifications or how to incorporate them into NLP systems. Finally, the definition of tasks can be relaxed to include domains or languages. Combining all of these will move us toward the goal of having a single robust, generalizable NLP agent that is equipped with a diverse set of skills.

Part V Concluding Remarks and Future Work

Chapter 9 Conclusion and Future Directions

In this chapter, we provide concluding remarks, interesting questions, and future directions motivated by the work in this thesis. We also make these directions more concrete by providing exemplar case studies.

9.1 Rethinking Similarity Modeling and Learning

Our SCA deals with one drawback of similarity. However, we would like to remind the reader of two broader issues. The first one is the definition of similarity. Concretely, what are necessary or sufficient conditions for two objects to be similar? For example, if we know that transitivity is undesirable, we can limit the scope of methods to be considered. Furthermore, what if supervision (whether or not two things are similar) is unavailable or only partially available? This question motivates research on both unsupervised and semi-supervised similarity learning. We provide a case study below.

The second and even broader issue is the fact that relationships among data go far beyond similarity (I can eat an apple, but an apple cannot eat me). While developing a system to perform a specific task, we should ask how our assumptions or definitions of the data relationships would affect the task outcome. If our goal is to build a general-purpose system, we probably want to make as few assumptions about the data as possible.

9.1.1 Case Study: Modeling non-metric similarity in unsupervised learning with applications to word sense discovery

SCA models and understands similarity from similarity-annotated data. Can we understand similarity in unlabeled data as well? Toward this goal, we propose to understand semantic similarity between words (in the context of natural language processing tasks) as our case study.

We consider methods for learning word representations that are based on the distributional hypothesis of Harris [93], which states that words in similar contexts have similar meanings. Under this assumption, words are not merely discrete units, allowing their similarity and dissimilarity to be measured. Relevant work on word vectors is summarized in Sect. 3.1.1.2. Our main focus will be on the skip-gram model [168, 169]. We consider two main extensions to the skip-gram model. Firstly, as in SCA, we modify how similarity is measured in the skip-gram model with a multiplicative combination of multiple types of similarities. Secondly, we extend the skip-gram model by introducing a latent variable that governs the component (i.e., sense) of each word.

In what follows, we use $W$ and $C$ to denote the vocabularies of words and contexts, respectively. In this chapter, $C$ is derived from the neighboring words around the word of interest; thus, $C = W$.
We differentiate two types of embeddings: $v_w$ for the embedding of the word $w$ in $W$ and $u_c$ for the word $c$ in $C$. We refer to them as word vectors and context vectors, respectively. We use $u \cdot v$ to represent an inner product between two vectors.

Skip-gram model (SG) For each word-context pair $(w, c)$, the objective is to maximize the probability $\Pr(c|w)$, which is parametrized by the softmax function:

$\Pr(c|w) \propto \exp(v_w \cdot u_c)$   (9.1)

To reduce the computational burden, Mikolov et al. [169] introduce the technique of negative sampling, where logistic regression is used to approximate the softmax objective:

$\log \Pr(c|w) \approx \log \Pr(D = 1|w, c) + k \, \mathbb{E}_{c' \sim \Pr_D(c)} \log(1 - \Pr(D = 1|w, c'))$   (9.2)

where $\Pr(D = 1|w, c) = \sigma(v_w \cdot u_c)$ is the sigmoid function predicting the co-occurrence of the word and context pair. $k$ is the number of "negative" samples of such pairs, where $c'$ is drawn from a "background" distribution of context words $\Pr_D(c)$; in practice, a smoothed empirical unigram distribution is found to be effective. While our objective is still based on (9.2), we model $\Pr(D = 1|w, c)$ differently in both models below.

Skip-gram with non-metric similarity Let $K$ be the number of latent components we desire, and $v_{kw}$ be the $k$th embedding for the word $w$. As a consequence of having $K$ latent components deciding non-metric similarities, we have

$\Pr(D = 1|w, c) = 1 - \prod_{k=1}^{K} \left(1 - \sigma(v_{kw} \cdot u_c)\right)$   (9.3)

Skip-gram with multiple senses: Probabilistic Latent Multi-Sense Skip-gram (PLMSG) Let $T$ be the number of embeddings we desire, and $v_{tw}$ be the $t$th embedding for the word $w$. Note that, for context words, the number of embeddings remains at 1. As a consequence of having $T$ embeddings, we have

$\Pr(D = 1|w, c) = \sum_{t=1}^{T} \Pr(D = 1, t|w, c)$   (9.4)
$= \sum_{t=1}^{T} \Pr(D = 1|w, c, t) \Pr(t|w, c)$   (9.5)
$= \sum_{t=1}^{T} \sigma(v_{tw} \cdot u_c) \frac{\exp(v_{tw} \cdot u_c)}{\sum_{t'=1}^{T} \exp(v_{t'w} \cdot u_c)}$   (9.6)

Obviously, the model simplifies to the skip-gram when $T = 1$. Our model assumes that, whenever we see an occurrence of a word-context pair, there is an underlying sense associated with it. Intuitively, our model is trained to jointly predict the context (similar to the skip-gram) and the sense in which such word and context co-occur. Without the knowledge about which sense they belong to, the prediction is performed by marginalizing out over all the senses.

Our model can also be seen as a factor model. Levy and Goldberg [143] show that the skip-gram model with negative sampling implicitly factorizes a word-context PMI matrix. Analogously, our model assumes that the word-context matrix has multiple underlying co-occurrence patterns, each of which depends on the latent senses, and $\Pr(t|w, c)$ is the weight.
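To make eqs. (9.3)-(9.6) concrete, here is a small numpy sketch of the two co-occurrence probabilities. It is illustrative only: array shapes and names are ours, and the training loop that optimizes the negative-sampling objective of eq. (9.2) is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_cooccur_nonmetric(V_w, u_c):
    """Eq. (9.3): noisy-OR combination over K component embeddings.

    V_w: (K, d) component embeddings of word w; u_c: (d,) context vector."""
    s = sigmoid(V_w @ u_c)              # sigma(v_kw . u_c) for each component
    return 1.0 - np.prod(1.0 - s)

def p_cooccur_plmsg(V_w, u_c):
    """Eqs. (9.4)-(9.6): marginalize over T latent senses.

    V_w: (T, d) sense embeddings of word w; u_c: (d,) context vector."""
    scores = V_w @ u_c                  # v_tw . u_c for each sense
    p_sense = np.exp(scores - scores.max())
    p_sense /= p_sense.sum()            # Pr(t | w, c), a softmax over senses
    return float(np.sum(sigmoid(scores) * p_sense))

# Toy example: T = K = 3 senses/components, 50-d vectors.
rng = np.random.default_rng(0)
V_w, u_c = rng.normal(size=(3, 50)), rng.normal(size=50)
print(p_cooccur_nonmetric(V_w, u_c), p_cooccur_plmsg(V_w, u_c))
```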
Preliminary results for PLMSG We evaluate the quality of the word vectors from our approach (PLMSG), both qualitatively and quantitatively, and compare to two baselines: the skip-gram model (SG) and the Multiple-Sense Skip-Gram (MSSG) [180]. MSSG alternates between inferring the senses (by clustering contexts) and learning embeddings. Our model combines those two steps with a unified objective function.

Setup We closely follow the experimental setting described for MSSG. We train on the April 2010 snapshot of Wikipedia [222], which contains about 2 million documents and 990 million tokens. We remove words that appear less than 20 times. We use multiple embeddings for the 6K words provided in [180] and learn single embeddings for the rest of the vocabulary. The embedding dimension is 50 and the number of aspects is T = 3. We use negative sampling with a window size of 10 centered at the target word. We initialize the vectors by training a skip-gram model with an initial learning rate of 0.025 and then train our model with an initial learning rate of 0.00625.

Qualitative results In Table 9.1, for each word, we show the top three nearest words under cosine similarity for each method. Our model captures well different senses for various types of words. As expected, for the skip-gram model, words from different senses mix. While MSSG can capture multiple senses, our model can be more semantically coherent in some cases. For example, our model clearly captures the flower sense of ROSE while MSSG does not. MSSG's senses for HARRY are all about names, while ours include one for the Harry Potter series sense of HARRY.

Table 9.1: Nearest words to the target words under different senses

Word      SG            MSSG                                        PLMSG
                        sense#1        sense#2       sense#3        sense#1      sense#2      sense#3
apple     blackberry    mattel         amigaos       pecan          strawberry   ibm          macintosh
          apricot       merchandise    beos          persimmon      blueberry    microsoft    iigs
          peach         dreamworks     os/2          pear           dandelion    desktop      trs-80
rose      rising        mcclung        raised        brandy         grace        risen        peach
          berry         knight         yielding      touch          vincent      soared       garden
          soared        montgomery     falling       rod            charlotte    rocketed     poinsettia
interest  interests     demand         experience    focus          interested   liabilities  desire
          considerable  externalities  artistic      emphasis       interests    equities     enthusiam
          attention     substantial    difficulties  significance   attention    assets       understanding
harry     harold        frank          oliver        jack           jack         ron          harold
          frank         herbert        fred          robin          sam          potter       frank
          jack          ralph          richard       sam            robin        half-blood   fred

Word similarity We also evaluate our model on the popular word similarity task. In particular, the goal is to assign each pair of words a similarity score. Concretely, the quality of the assignment is evaluated by computing the Spearman rank correlation between the cosine similarity scores between embeddings and the scores based on human judgements.

Datasets We evaluate on the following benchmark datasets: WS-353-ALL [66], MTurk-771 [92], MTurk-287 [198], RG-65 [210], MC-30 [171], SIMLEX-999 [97], MEN-TR-3k [24], RW-STANFORD [157], VERB-143 [12], and YP-130 [265].

We consider two options for measuring similarity between two sets of multiple embeddings, one for each word in a pair:

$\mathrm{max}(w, w') = \max_{t, t'} \mathrm{sim}(v_{wt}, v_{w't'})$   (9.7)

$\mathrm{avg}(w, w') = \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} \mathrm{sim}(v_{wt}, v_{w't'})$   (9.8)

where sim is the cosine similarity measure.

Table 9.2: Evaluation of multiple embeddings on the task of word similarity. ρ is the Spearman rank correlation with human judgements (values are ρ × 100). The higher, the better.

Test set/Method   SG      MSSG             PLMSG
                          max     avg      max     avg
WS-353-ALL        66.9    62.3    63.1     63.0    64.3
MTurk-771         61.4    52.4    54.0     59.5    55.8
MTurk-287         66.3    62.2    61.3     66.0    64.7
RG-65             72.6    63.22   62.7     68.8    73.7
MC-30             74.0    77.7    72.1     74.2    64.6
SIMLEX-999        27.6    20.1    32.0     26.7    30.6
MEN-TR-3k         69.8    58.8    60.0     68.5    64.9
RW-STANFORD       44.5    33.5    35.07    41.5    45.3
VERB-143          38.8    26.0    30.2     34.0    36.7
YP-130            38.6    33.4    29.5     48.9    36.0

Table 9.2 illustrates our results. Whether avg or max performs better generally depends on the dataset. Our multiple embeddings outperform MSSG in most cases under either similarity measure. However, neither our method nor the standard skip-gram dominates the other.
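The two measures in eqs. (9.7)-(9.8) and the Spearman evaluation are straightforward to compute. The sketch below is illustrative only: the sense-embedding arrays and gold similarity scores are placeholders, not the data used in Table 9.2.

```python
import numpy as np
from scipy.stats import spearmanr

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_sim(V, Vp):
    """Eq. (9.7): V, Vp are (T, d) sense embeddings of the two words."""
    return max(cos(v, vp) for v in V for vp in Vp)

def avg_sim(V, Vp):
    """Eq. (9.8): average over all T^2 sense pairs."""
    return sum(cos(v, vp) for v in V for vp in Vp) / (len(V) * len(Vp))

# Toy evaluation on a hypothetical word-similarity set of 10 pairs.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=(3, 50)), rng.normal(size=(3, 50))) for _ in range(10)]
gold = rng.uniform(0, 10, size=10)          # placeholder human judgements
pred = [avg_sim(V, Vp) for V, Vp in pairs]
rho, _ = spearmanr(pred, gold)
print(100 * rho)                            # rho x 100, as reported in Table 9.2
```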
Word similarity with context Similarity scores between pairs of words in isolation can be less informative, as the senses of words depend highly on their contexts. To address this issue, Huang et al. [101] introduce the Stanford Contextual Word Similarities (SCWS) dataset, where the context for each word is additionally provided. Following [101, 180, 239], we study two similarity measures that take the contexts into consideration. The key step is to predict the sense using the contexts:

$\mathrm{maxC}((w, C), (w', C')) = \mathrm{sim}(v_{wt}, v_{w't'})$   (9.9)

where $t = \arg\max_{t \in \{1, \ldots, T\}} \Pr(t|w, C)$, and

$\mathrm{avgC}((w, C), (w', C')) = \sum_{t=1}^{T} \sum_{t'=1}^{T} \Pr(t|w, C) \Pr(t'|w', C') \, \mathrm{sim}(v_{wt}, v_{w't'})$   (9.10)

where $C$ and $C'$ are the contexts for $w$ and $w'$, respectively. We experiment with two ways of predicting the sense from all the words in the context $C = \{c_1, c_2, \ldots, c_K\}$:

M: $\Pr(t|w, c_1, \ldots, c_K) \propto \prod_{i=1}^{K} \Pr(t|w, c_i)$   (9.11)

A: $\Pr(t|w, c_1, \ldots, c_K) \propto \Pr(t|w, \bar{c})$   (9.12)

where $\Pr(t|w, c)$ is defined as in eq. (9.6). Moreover, $u_{\bar{c}} = \frac{1}{K} \sum_{i=1}^{K} u_{c_i}$, i.e., the average of the context word vectors. We refer to our method with these two choices as PLMSG-M and PLMSG-A, respectively. Similar to [101], we also train models without stop words (designated with *). Results are shown in Table 9.3.

Table 9.3: Evaluation of multiple embeddings on the contextual word similarity task on the SCWS dataset. ρ is the Spearman rank correlation with human judgements (values are ρ × 100). The higher, the better.

Method      max     avg     maxC    avgC
SG          62.1    62.1    -       -
MSSG        60.3    64.4    49.2    66.9
PLMSG-M     62.6    64.6    59.7    66.0
PLMSG-A                     59.7    66.0
PLMSG-M*    62.5    65.1    60.3    66.6
PLMSG-A*                    60.2    66.6

First, we observe almost no difference between our two methods for predicting the sense (PLMSG-M vs. PLMSG-A). Secondly, both MSSG and our method can utilize the contexts to improve the rank correlation, significantly outperforming the skip-gram, if avgC is used to compute similarity. MSSG slightly outperforms our method. This could be due to the fact that MSSG's method for inferring senses uses the average of the context vectors during training, while our method is purely bag-of-words (i.e., it infers a sense for each word-context pair). Both methods, however, perform unfavorably when maxC is used, implying that committing to a single sense, even if it is the most likely one, could lead to inaccurate similarity scores. Note that our method does not degrade much, proving to be more robust than MSSG under this setting.

9.2 Correcting and Reliably Leveraging Similarity Graphs for Effective Learning

We have shown that similarity graphs can be useful in many scenarios. However, they are often incomplete and noisy, and sometimes even unknown. One main question that arises from this is: how do we correct such similarity graphs? In this section, we consider possible directions related to the applications we have considered in this thesis.

9.2.1 Case Study: Improving the quality of semantic representations

Obtaining semantic representations is a non-trivial task. One often faces the challenge of balancing the quality of the semantics (depth) against the coverage the semantics provide (breadth). In zero-shot learning, the quality of semantic representations is governed by how well they align with discriminative visual features of object categories, while the coverage over classes is usually governed by how automatic the discovery process is. Attributes annotated by human experts are shown to be better than word vectors in terms of quality but are apparently much harder to scale. Can we take current semantic representations and improve upon them? Moreover, can we develop discovery methods that achieve the best of both worlds?
Below we propose several potential approaches for improving the quality of semantic representations and, in some case, still maintaining automaticity. Two-stage approach One strategy to improve the quality of semantic representations is to in- ject discriminative visual knowledge into the existing representations. The method of predicting visual exemplars [32], where we train a non-linear mapping from semantic representations to visual feature exemplars, is a step toward this direction. Currently, we derive visual feature exemplars from the last layer before the softmax layer of a convolutional neural network. We would like to explore activations in intermediate layers as semantic representations. These acti- vations contain potentially useful visual knowledge, but are more mid-level than activations from the last layer. Therefore, they arguably could be better at knowledge transfer. Previous work has shown that attributes can emerge from intermediate layers of convolutional neural networks [57, 244, 281], but none has explored their potential benefits in the context of zero-shot learning. Unified approach Learning a mapping to adapt existing semantic representations will work best if initial semantic representations are predictive of neural activations. However, current methods for discovering attributes and word vectors do not take this into account. For instance, attribute annotators could come up a set of attributes that are highly correlated or hardly machine detectable (visually); thus, these attributes are not useful for distinguishing between classes. As a consequence, to build semantic representations from scratch, we want to design a principled, possibly interactive, approach that translates a new concept into neural activations directly. Concretely, the potential method could be as follows. First, we learn and establish connec- tions between neural activations and human semantics. One way to achieve this goal is to exploit current techniques for visualizing convolutional neural networks [275, 276]. Second, we learn and establish connections between new concepts (such as unseen classes) and semantics discov- ered in the first step. • Interactive approach For both steps, human experts can help refine the process. In partic- ular, human experts may provide additional semantics after visualizing activated neurons. These semantics are then translated into neural network parameters (see, e.g., [127] for approaches in the context of deformable part models [63, 81]). Human experts may also provide associations between novel concepts and uncovered semantics visualized by acti- vated neurons. • Purely automatic approach Using word vectors to scale up the semantic representation discovery is sensible because in the end the object class names are represented in text. However, textual similarities barely imply visual similarities [116]. In order to learn and 103 establish the two connections described above, we can use noisy image-text data from the web. It has been shown that such webly supervised data can produce high-quality visual images features for classification [114]. It is interesting to see if similar approaches can be applied to learn the two connections. 9.2.2 Case Study: Generalized few-shot learning The key assumption in zero-shot learning is that knowledge about seen categories is useful for recognition of unseen categories. However, even with perfect semantic representations, it is naive to assume that this assumption is always realistic. 
For example, algorithms that have seen only animals would have a hard time understanding what a car should look like. Two main questions arise by stepping away from this assumption. First, can we determine from data whether or not zero-shot learning is even possible? Second, if not, given resource constraints, can we select object categories whose labeled data will be the most useful? To answer these two questions, we can analyze zero-shot learning algorithms in the gener- alized few-shot learning setting. Unlike in the zero-shot learning setting, the learning algorithm is allowed to have access to randomly selected images from some unseen classes. Unlike in the few-shot learning setting [62], the learning algorithm does not know for which subset of unseen classes we have a few of their examples. Determining which additional knowledge is required for zero-shot transfer We expect two main criteria for selecting a subset of unseen categories for generalized few-shot learning. • Coverage How much are those selected categories similar to the rest of the unseen cate- gories? In this case, labeled examples provide additional core knowledge to transfer from. This criterion can be translated into a diversity-promoting objective such as submodular functions [124] and determinantal point processes [130]. • Discrimination How much do those selected categories resolve confusion between cate- gories? In this case, labeled examples provide additional specialized knowledge for under- standing similarities and differences between seen and unseen classes and/or among unseen classes themselves. This criterion encourages selecting categories in the high-density part of the semantic representation space. Generalizing from few examples The long-tail nature of object frequencies makes it difficult to collect training labeled data for rare classes. Thus, it is important that the learning algorithm is able to generalize from only few examples provided by a selected subset of unseen classes. Recent advances in conditional generative modeling has shown great promise to benefit this task. Unlike in unconditional models, the generation process must also condition on descriptions of objects to be generated. We note that descriptions here are defined in the broadest sense; they can be anything from category indicator, attribute vectors, word vectors, to even text description [173, 197, 201, 264]. • Methods for generalization Given a collection of images, the key challenge in generative modeling is to learn to disentangle factors of variations in those images [16]. In our case, the task is even more difficult as (i) we want to model object-specific factors of variations (poses of rigid versus deformable objects, lighting of different surface materials, etc.) and (ii) we have only a few exemplars for those rare objects to generalize from. One approach 104 to alleviate the issue (ii) is to assume that variations of semantically similar objects are similar (cat images are more useful for modeling variations of tiger images than car images are.). One-shot generalization in the context of alphabet generation can be found in [204]. • Zero-shot learning as an evaluation method ofconditional generative models The abil- ity of generative models to understand the world may be characterized roughly by samples they generate. If the data distribution is modeled explicitly, then we can assign numbers specifying how likely the samples are. However, such a strategy is not believed to reflect the true likelihood of the samples. 
For example, generative models for high-dimensional natural images often end up assigning high likelihood to some interpolation of training data, which usually look unrealistic and blurred. Motivated by this, can we come up with a reliable quantitative evaluation for generative models? See [238] for comprehensive dis- cussions about flaws in current evaluation methods for generative models. One possible solution is to use zero-shot learning to evaluate conditional generative models. We expect that it is a useful metric because the task requires the models to extrapolate beyond what they have seen. Specifically, models that understand and are able to disentangle styles and contents of images are likely to perform well in zero-shot learning. 9.3 Applications This thesis considers visual object recognition in the wild and natural language sequence tagging. However, long-tail phenomena are ubiquitous, and thus applications of learning with limited data can be broader than what we have considered. In this section, we discuss tasks and domains where our approaches can potentially be applicable and useful. Furthermore, while we leverage similarity in the context of learning from limited data, we should not rule out the possibility of exploiting similarity even when data is abundant. We leave this for future work. 9.3.1 Case Study: Beyond image classification in computer vision What does it mean to understand an image? While there is probably no single correct answer to this question, we can probably agree that this goes beyond naming objects in the images, the only task that we have considered so far. The problem with the long-tail phenomena is the shortage of training data for rare categories. In the domain of image classification, categories refer to object categories. For complex tasks, however, categories are often defined in terms of a composition of simpler building blocks. Such compositionality often means that the number of categories will exponentially explode, making it even harder to collect data for all of them. Under this theme, we consider the task of visual relationship detection [10, 153] as well as the task of situation recognition [269]. The task of visual relationship detection is as follows. Given an image, we would like to predict all the relationships between any two objects that occur in the image. Clearly, zero- shot scenarios occur. In fact, they occur much more frequently than in the image classification task. Besides unseen objects (as in image classification) and unseen relationships, likely there are many unseen combinations of object-relationship-object. This forces a zero-shot learner to truly understand the nature of the relationships. For example, it cannot simply take the presence of “man” and “bicycle” as the only signal for “riding”; otherwise, it would not be able to detect the unseen triple (“man” “riding” “lion”). Our zero-shot learning algorithms can be extended for this task. A naive extension of existing zero-shot learning algorithms requires that there exist semantic representations for all triples. One 105 way to achieve this is to learn a semantic composition function that, given semantic representa- tions of any two objects and their relationship as input, predicts the semantic representation of the whole object-relationship-object triples. Semantic representations of individual components can be learned using previous strategies Sect. 3.1.1. 
In terms of training data, we can use recently released the Visual Genome dataset [126], which contains dense annotations of objects and their relationships. Finally, another example is the task of situation recognition or visual semantic role labeling [269]. The task is to recognize the main activity, the participating actors, objects, substances, and locations, as well as the roles these participants play in the activity (e.g., the man is clipping, the shears are his tool, the wool is being clipped from the sheep, and the clipping is in a field). A strategy described above applies to this task and the dataset is provided by [269]. Compared to visual relationship detection, situation recognition asks for a more detailed description of an image. 9.3.2 Case Study: Natural language processing and understanding In domain of natural language processing, examples of limited-data scenarios are: • Out-of-vocabulary words (e.g, named entities, misspelling). • Unseen or unfamiliar sentences and multi-word expressions (e.g., kick the bucket). • Low-resource languages (e.g., Tagalog, Swahili). • Specialized domains in which data are not yet abundantly available (e.g., medicine vs. news). • Commonsense information (e.g., what constitutes an event such as going to a movie, an obvious fact such as humans are smaller than the house they live in). • Tasks in which labeled data is very expensive to obtain (e.g., identifying abstract meaning representation of sentences). For example, to address zero-shot learning of word representations, we may assume that se- mantic representations of words are the word vectors that are learned in an unsupervised manner based on their contexts. To operate in a specific task where labeled data are limited, we can ex- pand the vocabulary by adapting semantic representations (i.e., task-agnostic word embeddings). Adaptation techniques include, but not limited to, [13, 60, 88, 119, 134, 159, 160, 194, 270]. In computer vision, adaptation is similar to fine-tuning [80, 184] or learning without forgetting [147]. Additionally, similar to objects in images, words are part of a sentence, a paragraph, a doc- ument, and a corpus. Generally, like what we described above, representation learning beyond word embeddings requires an understanding of how a combination of words constructs the mean- ing of bigger entities that they are part of — modeling composition functions. 106 Bibliography [1] I. Abraham, S. Chechik, D. Kempe, and A. Slivkins. Low-distortion inference of latent similarities from a multiplex social network. In SODA, 2013. [2] I. Abraham, S. Chechik, D. Kempe, and A. Slivkins. Low-distortion inference of latent similarities from a multiplex social network. SIAM Journal on Computing, 44(3):617– 668, 2015. [3] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. JMLR, 9:1981–2014, June 2008. [4] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In CVPR, 2013. [5] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015. [6] Z. Al-Halah and R. Stiefelhagen. How to transfer? zero-shot object recognition via hier- archical transfer of semantic attributes. In WACV, 2015. [7] M. B. Almeida and A. F. T. Martins. Fast and robust compressive summarization with dual decomposition and multi-task learning. In ACL, 2013. [8] H. M. Alonso and B. Plank. 
Multitask learning for semantic sequence prediction under varying data conditions. In EACL, 2017. [9] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73:243–272, 2008. [10] Y . Atzmon, J. Berant, V . Kezami, A. Globerson, and G. Chechik. Learning to generalize to new compositions in image understanding. arXiv preprint arXiv:1608.07639, 2016. [11] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet Project. In COLING- ACL, 1998. [12] S. Baker, R. Reichart, and A. Korhonen. An unsupervised model for instance level subcat- egorization acquisition. In EMNLP, 2014. [13] M. Bansal, K. Gimpel, and K. Livescu. Tailoring continuous word representations for dependency parsing. In ACL, 2014. [14] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396, 2003. [15] A. Bellet, A. Habrard, and M. Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1–151, 2015. [16] Y . Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013. [17] Y . Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. JMLR, 3:1137–1155, 2003. [18] T. Berg-Kirkpatrick, D. Gillick, and D. Klein. Jointly learning to extract and compress. In ACL, 2011. 107 [19] A. Bies, J. Mott, C. Warner, and S. Kulick. English web treebank. Technical Report LDC2012T13, Linguistic Data Consortium, Philadelphia, PA, 2012. [20] J. Bingel and A. Søgaard. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL, 2017. [21] J. Bjerva. Will my auxiliary tagging task help? Estimating Auxiliary Tasks Effectivity in Multi-Task Learning. In NoDaLiDa, 2017. [22] J. Bjerva, B. Plank, and J. Bos. Semantic tagging with deep residual networks. In COLING, 2016. [23] D. M. Blei, A. Y . Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003. [24] E. Bruni, G. Boleda, M. Baroni, and N.-K. Tran. Distributional semantics in technicolor. In ACL, 2012. [25] M. Bucher, S. Herbin, and F. Jurie. Zero-shot classification by generating artificial visual features. In RFIAP, 2018. [26] J. A. Bullinaria and J. P. Levy. Extracting semantic representations from word co- occurrence statistics: A computational study. Behavior research methods, 39(3):510–526, 2007. [27] R. Caruana. Multitask learning: A knowledge-based source of inductive bias. In ICML, 1993. [28] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997. [29] V . Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. PNAS, 110(13):E1181–E1190, 2013. [30] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2:27:1–27:27, 2011. [31] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016. [32] S. Changpinyo, W.-L. Chao, and F. Sha. Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV, 2017. [33] S. Changpinyo, H. Hu, and F. Sha. Multi-task learning for sequence tagging: An empirical study. In COLING, 2018. [34] S. Changpinyo, K. Liu, and F. Sha. Similarity component analysis. In NIPS, 2013. [35] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016. [36] C.-Y . Chen and K. Grauman. Inferring analogous attributes. 
In CVPR, 2014. [37] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to answer open-domain questions. In ACL, 2017. [38] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. In ECCV, 2012. [39] H. Chen, A. C. Gallagher, and B. Girod. What’s in a name? first names as facial attributes. In CVPR, 2013. [40] K. Cho, B. van Merrienboer, aglar G¨ ulehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014. [41] M. Ciaramita and Y . Altun. Broad-coverage sense disambiguation and information extrac- tion with a supersense sequence tagger. In EMNLP, 2006. [42] S. Clark. Supertagging for combinatory categorial grammar. In Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+ 6), 108 pages 19–24, 2002. [43] J. Clarke and M. Lapata. Constraint-based sentence compression: An integer programming approach. In ACL, 2006. [44] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008. [45] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12(Aug):2493–2537, 2011. [46] M. Constant and J. Nivre. A transition-based system for joint lexical and syntactic analysis. In ACL, 2016. [47] K. Crammer and Y . Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265–292, 2002. [48] D. Das, D. Chen, A. F. T. Martins, N. Schneider, and N. A. Smith. Frame-semantic parsing. Computational Linguistics, 40:9–56, 2014. [49] J. V . Davis and I. S. Dhillon. Differential entropic clustering of multivariate gaussians. In NIPS, 2006. [50] J. V . Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learn- ing. In ICML, 2007. [51] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391, 1990. [52] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. [53] M. Douze, A. Ramisa, and C. Schmid. Combining attributes and fisher vectors for efficient image retrieval. In CVPR, 2011. [54] K. Duan, D. Parikh, D. Crandall, and K. Grauman. Discovering localized attributes for fine-grained recognition. In CVPR, 2012. [55] M. Elhoseiny, B. Saleh, and A. Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013. [56] A. Eriguchi, Y . Tsuruoka, and K. Cho. Learning to parse and translate improves neural machine translation. In ACL, 2017. [57] V . Escorcia, J. C. Niebles, and B. Ghanem. On the relationship between visual attributes and convolutional networks. In CVPR, 2015. [58] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008. [59] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009. [60] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. Retrofitting word vectors to semantic lexicons. In NAACL, 2015. [61] T. Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27:861–874, 2006. [62] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006. [63] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. 
Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010. [64] V . Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007. [65] S. E. Fienberg, M. M. Meyer, and S. S. Wasserman. Statistical analysis of multiple socio- metric relations. JASA, 80(389):51–67, March 1985. 109 [66] L. Finkelstein, E. Gabrilovich, Y . Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. In WWW, 2001. [67] W. N. Francis and H. Kuˇ cera. Frequency analysis of english usage: Lexicon and grammar. Journal of English Linguistics, 18(1):64–70, 1982. [68] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In NIPS, 2013. [69] A. Frome, Y . Singer, and J. Malik. Image retrieval and classification using local distance functions. In NIPS, 2006. [70] A. Frome, Y . Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In ICCV, 2007. [71] Y . Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embed- ding for zero-shot recognition and annotation. In ECCV, 2014. [72] Y . Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learn- ing. TPAMI, 2015. [73] Y . Fu, T. Xiang, Y .-G. Jiang, X. Xue, L. Sigal, and S. Gong. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Process- ing Magazine, 35:112–125, 2018. [74] Z. Fu, T. Xiang, E. Kodirov, and S. Gong. Zero-shot object recognition by semantic mani- fold distance. In CVPR, 2015. [75] C. Gan, M. Lin, Y . Yang, Y . Zhuang, and A. G. Hauptmann. Exploring semantic interclass relationships (sir) for zero-shot action recognition. In AAAI, 2015. [76] C. Gan, T. Yang, and B. Gong. Learning attributes equals multi-source domain generaliza- tion. In CVPR, 2016. [77] C. Gan, Y . Yang, L. Zhu, D. Zhao, and Y . Zhuang. Recognizing an action using its name: A knowledge-based approach. IJCV, pages 1–17, 2016. [78] M. A. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer. AllenNLP: A deep semantic natural language pro- cessing platform. arXiv preprint arXiv:1803.07640, 2018. [79] E. Gavves, T. Mensink, T. Tommasi, C. G. Snoek, and T. Tuytelaars. Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In ICCV, 2015. [80] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. [81] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015. [82] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS, 2005. [83] X. Glorot and Y . Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010. [84] J. Godwin, P. Stenetorp, and S. Riedel. Deep semi-supervised learning with linguistically motivated sequence labeling task hierarchies. arXiv preprint arXiv:1612.09113, 2016. [85] Y . Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017. [86] Y . Goldberg and O. Levy. word2vec explained: deriving mikolov et al.’s negative-sampling word-embedding method. In arXiv preprint arXiv:1402.3722, 2014. [87] J. Goldberger, S. Roweis, G. Hinton, and R. 
Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004. [88] S. Gouws and A. Søgaard. Simple task-specific bilingual word embeddings. In NAACL, 2015. 110 [89] D. Graff. The 1996 broadcast news speech and language-model corpus. In Proceedings of the 1997 DARPA Speech Recognition Workshop, 1997. [90] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. In NAACL, 2018. [91] M. Gutmann and A. Hyv¨ arinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, 2010. [92] G. Halawi, G. Dror, E. Gabrilovich, and Y . Koren. Large-scale learning of word relatedness with constraints. In KDD, 2012. [93] Z. S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954. [94] K. Hashimoto, C. Xiong, Y . Tsuruoka, and R. Socher. A Joint Many-Task Model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017. [95] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. TPAMI, 18(6):607–616, Jun 1996. [96] S. Hauberg, O. Freifeld, and M. Black. A geometric take on metric learning. In NIPS, 2012. [97] F. Hill, R. Reichart, and A. Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 2015. [98] G. E. Hinton and S. T. Roweis. Stochastic neighbor embedding. In NIPS, 2002. [99] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9 8:1735–80, 1997. [100] M. Hu, Y . Peng, and X. Qiu. Reinforced mnemonic reader for machine comprehension. CoRR, abs/1705.02798, 2017. [101] E. H. Huang, R. Socher, C. D. Manning, and A. Y . Ng. Improving word representations via global context and multiple word prototypes. In ACL, 2012. [102] H.-Y . Huang, C. Zhu, Y . Shen, and W. Chen. FusionNet: Fusing via fully-aware attention with application to machine comprehension. In ICLR, 2018. [103] Z. Huang, W. Xu, and K. Yu. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015. [104] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. [105] O. Irsoy and C. Cardie. Opinion mining with deep recurrent neural networks. In EMNLP, 2014. [106] T. S. Jaakkola and M. I. Jordan. Variational probabilistic inference and the QMR-DT network. JAIR, 10(1):291–322, May 1999. [107] L. P. Jain, W. J. Scheirer, and T. E. Boult. Multi-class open set recognition using probability of inclusion. In ECCV, 2014. [108] P. Jain, B. Kulis, I. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In NIPS, 2008. [109] D. Jayaraman and K. Grauman. Zero-shot recognition with unreliable attributes. In NIPS, 2014. [110] D. Jayaraman, F. Sha, and K. Grauman. Decorrelating semantic visual attributes by resist- ing the urge to share. In CVPR, 2014. [111] R. Jia and P. Liang. Adversarial examples for evaluating reading comprehension systems. In EMNLP, 2017. [112] Y . Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multi- media, 2014. 111 [113] M. Johnson, M. Schuster, Q. V . Le, M. Krikun, Y . Wu, Z. Chen, N. Thorat, F. B. Vi´ egas, M. Wattenberg, G. S. Corrado, M. Hughes, and J. Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351, 2017. [114] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache. 
Learning visual features from large weakly supervised data. In ECCV, 2016. [115] P. Kankuekul, A. Kawewong, S. Tangruamsub, and O. Hasegawa. Online incremental attribute-based zero-shot learning. In CVPR, 2012. [116] D. Kiela and L. Bottou. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, 2014. [117] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015. [118] E. Kiperwasser and M. Ballesteros. Scheduled multi-task learning: From syntax to trans- lation. TACL, 6:225–240, 2018. [119] R. Kiros, Y . Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, 2015. [120] S. Klerke, Y . Goldberg, and A. Søgaard. Improving sentence compression by learning to predict gaze. In HLT-NAACL, 2016. [121] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, 2015. [122] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009. [123] A. Kovashka and K. Grauman. Attribute adaptation for personalized image search. In ICCV, 2013. [124] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in gaussian pro- cesses: Theory, efficient algorithms and empirical studies. JMLR, 9(Feb):235–284, 2008. [125] H.-P. Kriegel, P. Kr¨ oger, E. Schubert, and A. Zimek. LoOP: local outlier probabilities. In CIKM, 2009. [126] R. Krishna, Y . Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y . Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2016. [127] V . R. Krishnan. Tinkering under the hood: Interactive zero-shot learning with pictorial classifiers. Master’s thesis, Carnegie Mellon University, 2016. [128] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolu- tional neural networks. In NIPS, 2012. [129] A. Kulesza and B. Taskar. k-dpps: Fixed-size determinantal point processes. In ICML, 2011. [130] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Founda- tions and Trends in Machine Learning, 5(2–3), 2012. [131] B. Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5:287– 364, 2013. [132] N. Kumar, A. Berg, P. N. Belhumeur, and S. Nayar. Describable visual attributes for face verification and image search. TPAMI, 33(10):1962–1977, 2011. [133] V . Kumar Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. In CVPR, 2018. [134] I. Labutov and H. Lipson. Re-embedding words. In ACL, 2013. [135] J. D. Lafferty, A. D. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001. 112 [136] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009. [137] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2014. [138] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architec- tures for named entity recognition. In NAACL, 2016. [139] S. Landes, C. Leacock, and R. I. Tengi. Building semantic concordances. WordNet: An electronic lexical database, 199(216):199–216, 1998. [140] Q. V . Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014. [141] R. Lebret and R. Collobert. 
Word emdeddings through hellinger pca. In EACL, 2013. [142] J. Lei Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convo- lutional neural networks using textual descriptions. In ICCV, 2015. [143] O. Levy and Y . Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014. [144] O. Levy, Y . Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015. [145] D. D. Lewis, Y . Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004. [146] X. Li, Y . Guo, and D. Schuurmans. Semi-supervised zero-shot classification with label representation learning. In ICCV, 2015. [147] Z. Li and D. Hoiem. Learning without forgetting. In ECCV, 2016. [148] D. Lin. An information-theoretic definition of similarity. In ICML, 1998. [149] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011. [150] K. Liu, A. Bellet, and F. Sha. Similarity learning for high-dimensional sparse data. In AISTATS, 2015. [151] L. Liu, J. Shang, F. F. Xu, X. Ren, H. Gui, J. Peng, and J. Han. Empower sequence labeling with task-aware neural language model. In AAAI, 2018. [152] Y . Long, L. Liu, L. Shao, F. Shen, G. Ding, and J. Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In CVPR, 2017. [153] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with lan- guage priors. In ECCV, 2016. [154] Y . Lu. Unsupervised learning of neural network outputs. In IJCAI, 2016. [155] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co- occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208, 1996. [156] M.-T. Luong, Q. V . Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence learning. In ICLR, 2016. [157] M.-T. Luong, R. Socher, and C. D. Manning. Better word representations with recursive neural networks for morphology. In CoNLL, 2013. [158] X. Ma and E. H. Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs- CRF. In ACL, 2016. [159] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y . Ng, and C. Potts. Learning word vectors for sentiment analysis. In ACL, 2011. [160] P. S. Madhyastha, M. Bansal, K. Gimpel, and K. Livescu. Mapping unseen words to task- trained embedding spaces. In Proceedings of the 1st Workshop on Representation Learning 113 for NLP, 2016. [161] T. Malisiewicz and A. A. Efros. Recognition by association via learning per-exemplar distances. In CVPR, 2008. [162] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from cap- tions with attention. In ICLR, 2016. [163] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993. [164] A. F. T. Martins and N. A. Smith. Summarization with a joint model for sentence extrac- tion and compression. In Proceedings of the NAACL-HLT Workshop on Integer Linear Programming for NLP, 2009. [165] T. Mensink, E. Gavves, and C. G. Snoek. COSTA: Co-occurrence statistics for zero-shot classification. In CVPR, 2014. [166] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012. [167] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. 
Distance-based image classification: Generalizing to new classes at near-zero cost. TPAMI, 35(11):2624–2637, 2013. [168] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean. Efficient estimation of word representa- tions in vector space. In ICLR Workshops, 2013. [169] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013. [170] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In NAACL, 2013. [171] G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991. [172] G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. A semantic concordance. In Pro- ceedings of the workshop on Human Language Technology, 1993. [173] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014. [174] M. Miwa and M. Bansal. End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL, 2016. [175] A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In ICML, 2007. [176] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise- contrastive estimation. In NIPS, 2013. [177] A. Mnih and Y . W. Teh. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012. [178] P. Morgado and N. Vasconcelos. Semantically consistent regularization for zero-shot recognition. In CVPR, 2017. [179] L. Mou, Z. Meng, R. Yan, G. Li, Y . Xu, L. Zhang, and Z. Jin. How transferable are neural networks in nlp applications? In EMNLP, 2016. [180] A. Neelakantan, J. Shankar, A. Passos, and A. McCallum. Efficient non-parametric esti- mation of multiple embeddings per word in vector space. In EMNLP, 2014. [181] J. Niehues and E. Cho. Exploiting linguistic resources for neural machine translation using multi-task learning. In WMT, 2017. [182] J. Nivre, M.-C. de Marneffe, F. Ginter, Y . Goldberg, J. Hajic, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. Universal dependencies v1: A multilingual treebank collection. In LREC, 2016. 114 [183] M. Norouzi, T. Mikolov, S. Bengio, Y . Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR Workshops, 2014. [184] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014. [185] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009. [186] S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22:1345–1359, 2010. [187] S. Parameswaran and K. Weinberger. Large margin multi-task metric learning. In NIPS, 2010. [188] D. Parikh and K. Grauman. Interactively building a discriminative vocabulary of nameable attributes. In CVPR, 2011. [189] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011. [190] R. Pascanu, T. Mikolov, and Y . Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013. [191] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In Proceedings of the NIPS Workshop on the future of gradient-based machine learning software and techniques, 2017. [192] G. Patterson, C. Xu, H. Su, and J. Hays. 
The SUN Attribute Database: Beyond categories for deeper scene understanding. IJCV, 108(1-2):59–81, 2014. [193] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., 1988. [194] H. Peng, L. Mou, G. Li, Y . Chen, Y . Lu, and Z. Jin. A comparative study on regularization strategies for embedding-based neural networks. In EMNLP, 2015. [195] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, 2014. [196] B. Plank, A. Søgaard, and Y . Goldberg. Multilingual part-of-speech tagging with bidirec- tional long short-term memory models and auxiliary loss. In ACL, 2016. [197] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016. [198] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch. A word at a time: comput- ing word relatedness using temporal semantic analysis. In WWW, 2011. [199] A. Raganato, C. D. Bovi, and R. Navigli. Neural sequence learning models for word sense disambiguation. In EMNLP, 2017. [200] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In CVPR, 2017. [201] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016. [202] M. Rei. Semi-supervised multitask learning for sequence labeling. In ACL, 2017. [203] N. Reimers and I. Gurevych. Reporting score distributions makes a difference: Perfor- mance study of LSTM-networks for sequence tagging. In EMNLP, 2017. [204] D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot general- ization in deep generative models. In ICML, 2016. [205] M. Ristin, M. Guillaumin, J. Gall, and L. Van Gool. Incremental learning of random forests for large-scale image classification. TPAMI, 38(3):490–503, 2016. [206] D. L. T. Rohde, L. M. Gonnerman, and D. C. Plaut. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM, 8:627–633, 2006. 115 [207] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011. [208] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where–and why? semantic relatedness for knowledge transfer. In CVPR, 2010. [209] B. Romera-Paredes and P. H. S. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015. [210] H. Rubenstein and J. B. Goodenough. Contextual correlates of synonymy. Communica- tions of ACM, 8(10):627–633, 1965. [211] S. Ruder. An overview of multi-task learning in deep neural networks. CoRR, abs/1706.05098, 2017. [212] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recogni- tion challenge. IJCV, 2015. [213] R. Salakhutdinov, A. Torralba, and J. Tenenbaum. Learning to share visual appearance for multiclass object detection. In CVPR, 2011. [214] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult. Toward open set recognition. TPAMI, 35(7):1757–1772, 2013. [215] W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. TPAMI, 36(11):2317–2324, 2014. [216] W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult. Multi-attribute spaces: Cali- bration for attribute fusion and similarity search. In CVPR, 2012. [217] N. 
Schneider and N. A. Smith. A corpus and model integrating multiword expressions and supersenses. In HLT-NAACL, 2015. [218] B. Sch¨ olkopf and A. J. Smola. Learning with kernels: support vector machines, regular- ization, optimization, and beyond. MIT press, 2002. [219] B. Sch¨ olkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural computation, 12(5):1207–1245, 2000. [220] R. Sennrich and B. Haddow. Linguistic input features improve neural machine translation. In WMT, 2016. [221] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, Massachusetts Institute of Technology, 2005. [222] C. Shaoul. The westbury lab wikipedia corpus. Edmonton, AB: University of Alberta, 2010. [223] P. Shi, Z. Teng, and Y . Zhang. Exploiting mutual benefits between syntax and semantic roles using neural network. In EMNLP, 2016. [224] B. Siddiquie, R. S. Feris, and L. S. Davis. Image ranking and retrieval based on multi- attribute queries. In CVPR, 2011. [225] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. [226] A. J. Smola and B. Sch¨ olkopf. A tutorial on support vector regression. Statistics and Computing, 14:199–222, 2004. [227] R. Socher, M. Ganjoo, C. D. Manning, and A. Y . Ng. Zero-shot learning through cross- modal transfer. In NIPS, 2013. [228] A. Søgaard and Y . Goldberg. Deep multi-task learning with low level tasks supervised at lower layers. In ACL, 2016. [229] Y . Souri, E. Noury, and E. Adeli-Mosabbeb. Deep relative attributes. In ACCV, 2016. 116 [230] V . I. Spitkovsky, D. Jurafsky, and H. Alshawi. Profiting from mark-up: Hyper-text anno- tations for guided parsing. In ACL, 2010. [231] K. Stratos, M. Collins, and D. Hsu. Model-based word embeddings from decompositions of count matrices. In ACL, 2015. [232] E. B. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent pitman-yor processes. In NIPS, 2008. [233] I. Sutskever, O. Vinyals, and Q. V . Le. Sequence to sequence learning with neural net- works. In NIPS, 2014. [234] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. [235] C. Szegedy, W. Liu, Y . Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V . Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015. [236] M. Szell, R. Lambiotte, and S. Thurner. Multirelational organization of large-scale social networks in an online world. PNAS, 107(31):13636–13641, 2010. [237] K. D. Tang, M. F. Tappen, R. Sukthankar, and C. H. Lampert. Optimizing one-shot recog- nition with micro-set learning. In CVPR, 2010. [238] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016. [239] F. Tian, H. Dai, J. Bian, B. Gao, R. Zhang, E. Chen, and T.-Y . Liu. A probabilistic model for learning multi-prototype word embeddings. In COLING, 2014. [240] E. F. Tjong Kim Sang and S. Buchholz. Introduction to the CoNLL-2000 shared task: Chunking. In CoNLL, 2000. [241] E. F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In CoNLL, 2003. [242] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(2579-2605):85, 2008. [243] L. van der Maaten and G. Hinton. Visualizing non-metric similarities in multiple maps. Machine Learning, 33:33–55, 2012. [244] S. Vittayakorn, T. Umeda, K. Murasaki, K. Sudo, T. Okatani, and K. Yamaguchi. 
Auto- matic attribute discovery with neural activations. In ECCV, 2016. [245] P. V ossen, L. Bloksma, H. Rodriguez, S. Climent, N. Calzolari, A. Roventini, F. Bertagna, A. Alonge, and W. Peters. The EuroWordNet Base Concepts and Top Ontology. Technical Report LE2-4003, University of Amsterdam, The Netherlands, 1998. [246] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localiza- tion with humans in the loop. In ICCV, 2011. [247] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds- 200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technol- ogy, 2011. [248] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1:1–305, 2008. [249] J. Wang, A. Woznica, and A. Kalousis. Parametric local metric learning for nearest neigh- bor classification. In NIPS, 2012. [250] Q. Wang and K. Chen. Zero-shot visual recognition via bidirectional latent embedding. IJCV, 124:356–383, 2017. [251] X. Wang and Q. Ji. A unified probabilistic approach modeling relationships between at- tributes and objects. In ICCV, 2013. 117 [252] X. Wang, K. M. Kitani, and M. Hebert. Contextual visual similarity. arXiv preprint arXiv:1612.02534, 2016. [253] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009. [254] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neigh- bor classification. JMLR, 10:207–244, 2009. [255] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011. [256] Y . Xian, Z. Akata, and B. Schiele. Zero-shot learning – the Good, the Bad and the Ugly. In CVPR, 2017. [257] Y . Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016. [258] Y . Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning - a comprehensive evaluation of the Good, the Bad and the Ugly. TPAMI, 2018. [259] Y . Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In CVPR, 2018. [260] Y . Xian, B. Schiele, and Z. Akata. Zero-shot learning - the Good, the Bad and the Ugly. In CVPR, 2017. [261] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. [262] E. P. Xing, A. Y . Ng, M. I. Jordan, and S. Russell. Distance metric learning, with applica- tion to clustering with side-information. In NIPS, 2002. [263] X. Xu, T. Hospedales, and S. Gong. Semantic embedding space for zero-shot action recog- nition. In ICIP, 2015. [264] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional image generation from visual attributes. In ECCV, 2016. [265] D. Yang and D. M. W. Powers. Verb similarity on the taxonomy of wordnet. In the 3rd International WordNet Conference (GWC-06), 2006. [266] Y . Yang and T. M. Hospedales. A unified perspective on multi-domain and multi-task learning. In ICLR, 2015. [267] Z. Yang, R. Salakhutdinov, and W. W. Cohen. Transfer learning for sequence tagging with hierarchical recurrent networks. In ICLR, 2017. [268] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recogni- tion by learning bases of action attributes and parts. In ICCV, 2011. [269] M. Yatskar, L. Zettlemoyer, and A. Farhadi. 
Situation recognition: Visual semantic role labeling for image understanding. In CVPR, 2016. [270] W. Yin and H. Schütze. Learning word meta-embedding. In ACL, 2016. [271] T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018. [272] F. X. Yu, L. Cao, R. S. Feris, J. R. Smith, and S.-F. Chang. Designing category-level attributes for discriminative visual recognition. In CVPR, 2013. [273] X. Yu and Y. Aloimonos. Attribute-based transfer learning for object categorization with zero/one training example. In ECCV, 2010. [274] A. Zamir, A. F. Sax, W. W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018. [275] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014. [276] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down neural attention by excitation backprop. In ECCV, 2016. [277] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017. [278] Y. Zhang and D. Weiss. Stack-propagation: Improved representation learning for syntax. In ACL, 2016. [279] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015. [280] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016. [281] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In ICLR, 2015. [282] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014. [283] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.

Appendices

Appendix A  Zero-shot learning experiments

A.1 Expanded Main Results

Table A.1 expands Table 7.2 to include additional baselines. First, we include results of additional baselines [5, 165, 209]. Second, we report results of very recently proposed methods that use the more optimistic per-sample accuracy metric as well as different types of deep visual features. Per-sample accuracy is computed by averaging over the accuracy of each test sample, whereas per-class accuracy is computed by averaging over the accuracy of each unseen class. Per-sample accuracy is likely the more optimistic metric of the two, as [256] reports being unable to reproduce the results of SSE [279], which uses per-sample accuracy, under per-class accuracy.

We also note that visual features can affect performance greatly. For example, the VGG features [225] of AwA used in [250, 279, 280] are likely more discriminative than GoogLeNet features. In particular, BIDILEL [250] reports results on both features, with VGG outperforming GoogLeNet by an absolute 5.8%. This could explain the strong results on AwA reported in [250, 279, 280]. It would also be interesting to investigate how GoogLeNet V2 [104] (in addition to the per-sample evaluation metric) used by DEM [277] contributes to its superior performance on AwA.

Finally, despite the variations in experimental settings, our method still outperforms all baselines on CUB.
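To make the distinction between the two evaluation metrics concrete, the following is a minimal sketch (not the thesis code); it assumes integer arrays y_true and y_pred holding the ground-truth and predicted labels of the unseen-class test instances:

    import numpy as np

    def per_sample_accuracy(y_true, y_pred):
        # Average over test instances: classes with many test images dominate.
        return np.mean(y_true == y_pred)

    def per_class_accuracy(y_true, y_pred):
        # Average the per-class accuracies: every unseen class counts equally,
        # regardless of how many test images it has.
        classes = np.unique(y_true)
        return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

On long-tailed test sets the two quantities can differ substantially, which is why the tables below state which metric each reported number uses.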
Table A.1: Expanded comparison (cf. Table 7.2) to existing ZSL approaches in the multi-way classification accuracies (in %) on AwA, CUB, and SUN. For each dataset, we mark the best in red and the second best in blue. We include results of recent ZSL methods with other types of deep features (VGG by [225] and GoogLeNet V2 by [104]) and/or different evaluation metrics. See text for details on how to interpret these results.

Approach               Visual features   Evaluation metric   AwA    CUB      SUN
SSE [279]              VGG               per-sample          76.3   30.4^x   -
JLSE [280]             VGG               per-sample          80.5   42.1^x   -
BIDILEL [250]          VGG               per-sample          79.1   47.6^x   -
DEM [277]              GoogLeNet V2      per-sample          86.7   58.3^x   -
SJE [5]                GoogLeNet         per-class           66.3   46.5     56.1
ESZSL [209]            GoogLeNet         per-class           64.5   34.5     18.7
COSTA [165]            GoogLeNet         per-class           61.8   40.8     47.9
CONSE [183]            GoogLeNet         per-class           63.3   36.2     51.9
BIDILEL [250]          GoogLeNet         per-class           72.4   49.7^x   -
LATEM^z [257]          GoogLeNet         per-class           72.1   48.0     64.5
SYNC o-vs-o            GoogLeNet         per-class           69.7   53.4     62.8
SYNC cs                GoogLeNet         per-class           68.4   51.6     52.9
SYNC struct            GoogLeNet         per-class           72.9   54.5     62.7
EXEM (CONSE)           GoogLeNet         per-class           70.5   46.2     60.0
EXEM (LATEM)^z         GoogLeNet         per-class           72.9   56.2     67.4
EXEM (SYNC o-vs-o)     GoogLeNet         per-class           73.8   56.2     66.5
EXEM (SYNC struct)     GoogLeNet         per-class           77.2   59.8     66.1
EXEM (1NN)             GoogLeNet         per-class           76.2   56.3     69.6
EXEM (1NNS)            GoogLeNet         per-class           76.5   58.5     67.3

x: on a particular split of seen/unseen classes.
z: based on the code of [257], averaged over 5 different initializations.

Table A.2: Expanded comparison (cf. Table 7.3) to existing ZSL approaches on ImageNet using word vectors of the class names as semantic representations. For both types of metrics (in %), the higher the better. The best is in red. AlexNet is by [128]. The number of actual unseen classes are given in parentheses.

Columns: Approach / Visual features / Flat Hit@K for K = 1, 2, 5, 10, 20 / Hierarchical precision@K for K = 2, 5, 10, 20.

2-hop (1,549):
  DEVISE [68]            AlexNet      6.0  10.1  18.1  26.4  36.4  |  15.2  19.2  21.7  23.3
  CONSE [183]            AlexNet      9.4  15.1  24.7  32.7  41.8  |  21.4  24.7  26.9  28.4
2-hop (1,509):
  CONSE [183]            GoogLeNet    8.3  12.9  21.8  30.9  41.7  |  21.5  23.8  27.5  31.3
  SYNC o-vs-o            GoogLeNet   10.5  16.7  28.6  40.1  52.0  |  25.1  27.7  30.3  32.1
  SYNC struct            GoogLeNet    9.8  15.3  25.8  35.8  46.5  |  23.8  25.8  28.2  29.6
  EXEM (SYNC o-vs-o)     GoogLeNet   11.8  18.9  31.8  43.2  54.8  |  25.6  28.1  30.2  31.6
  EXEM (1NN)             GoogLeNet   11.7  18.3  30.9  42.7  54.8  |  25.9  28.5  31.2  33.3
  EXEM (1NNS)            GoogLeNet   12.5  19.5  32.3  43.7  55.2  |  26.9  29.1  31.1  32.0
3-hop (7,860):
  DEVISE [68]            AlexNet      1.7   2.9   5.3   8.2  12.5  |   3.7  19.1  21.4  23.6
  CONSE [183]            AlexNet      2.7   4.4   7.8  11.5  16.1  |   5.3  20.2  22.4  24.7
3-hop (7,678):
  CONSE [183]            GoogLeNet    2.6   4.1   7.3  11.1  16.4  |   6.7  21.4  23.8  26.3
  SYNC o-vs-o            GoogLeNet    2.9   4.9   9.2  14.2  20.9  |   7.4  23.7  26.4  28.6
  SYNC struct            GoogLeNet    2.9   4.7   8.7  13.0  18.6  |   8.0  22.8  25.0  26.7
  EXEM (SYNC o-vs-o)     GoogLeNet    3.4   5.6  10.3  15.7  22.8  |   7.5  24.7  27.3  29.5
  EXEM (1NN)             GoogLeNet    3.4   5.7  10.3  15.6  22.7  |   8.1  25.3  27.8  30.1
  EXEM (1NNS)            GoogLeNet    3.6   5.9  10.7  16.1  23.1  |   8.2  25.2  27.7  29.9
All (20,842):
  DEVISE [68]            AlexNet      0.8   1.4   2.5   3.9   6.0  |   1.7   7.2   8.5   9.6
  CONSE [183]            AlexNet      1.4   2.2   3.9   5.8   8.3  |   2.5   7.8   9.2  10.4
All (20,345):
  CONSE [183]            GoogLeNet    1.3   2.1   3.8   5.8   8.7  |   3.2   9.2  10.7  12.0
  SYNC o-vs-o            GoogLeNet    1.4   2.4   4.5   7.1  10.9  |   3.1   9.0  10.9  12.5
  SYNC struct            GoogLeNet    1.5   2.4   4.4   6.7  10.0  |   3.6   9.6  11.0  12.2
  EXEM (SYNC o-vs-o)     GoogLeNet    1.6   2.7   5.0   7.8  11.8  |   3.2   9.3  11.0  12.5
  EXEM (1NN)             GoogLeNet    1.7   2.8   5.2   8.1  12.1  |   3.7  10.4  12.1  13.5
  EXEM (1NNS)            GoogLeNet    1.8   2.9   5.3   8.2  12.2  |   3.6  10.2  11.8  13.2

Table A.3: Overlap of k-nearest classes (in %) on AwA, CUB, SUN. We measure the overlap between those searched by real exemplars and those searched by semantic representations (i.e., attributes) or predicted exemplars. We set k to be 40% of the number of unseen classes. See text for more details.

Distances for kNN using      AwA (k=4)   CUB (k=20)   SUN (k=29)
Semantic representations     57.5        68.9         75.2
Predicted exemplars          67.5        80.0         82.1
A.2 Expanded Large-Scale Zero-Shot Classification Results

We then provide expanded results for ImageNet, following the evaluation protocols in the literature. Table A.2 expands the results of Table 7.3 in the main text to include other previously published results that use AlexNet features [128] and evaluate on all unseen classes. In all cases, our method outperforms the baseline approaches.

A.3 Supplementary Material on SYNC

Qualitative results. We present qualitative results of our method. We first illustrate what visual information the models (classifiers) for unseen classes capture when provided with only semantic representations (no example images). In Fig. A.1, we list (on top) the 10 unseen class labels of AwA and show (in the middle) the top-5 images classified into each class c, according to the decision values w_c^T x. Misclassified images are marked with red boundaries. At the bottom, we show the first (highest-scoring) misclassified image for each class and its ground-truth class label. According to the top images, our method reasonably captures discriminative visual properties of each unseen class based solely on its semantic representation. We can also see that the misclassified images have appearances so similar to the predicted class that even humans cannot easily distinguish between the two. For example, the pig image at the bottom of the second column looks very similar to images of hippos. Fig. A.2 and Fig. A.3 present the results in the same format on CUB and SUN, respectively (both on a subset of unseen class labels).

We further analyze the success and failure cases, i.e., why an image from an unseen class is misclassified. The illustrations are in Fig. A.4, A.5, and A.6 for AwA, CUB, and SUN, respectively. In each figure, we consider (Left) one unseen class and show its convex combination weights s_c = {s_c1, ..., s_cR} as a histogram. We then present (Middle-Left) the top-3 semantically similar (in terms of s_c) seen classes and their most representative images. As our model exploits phantom classes to connect seen and unseen classes in both the semantic and model spaces, we expect the model (classifier) for such an unseen class to capture similar visual information as those for its semantically similar seen classes. (Middle-Right) We examine two images of the unseen class, where the top one is correctly classified and the bottom one is misclassified. We also list (Right) the top-3 predicted labels (within the pool of unseen classes) and their most representative images. Green corresponds to correct labels. We see that, in the misclassified cases, the test images are visually dissimilar to those of the semantically similar seen classes. The synthesized unseen classifiers therefore cannot correctly recognize them, leading to incorrect predictions.
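The top-5 lists in Fig. A.1, A.2, and A.3 are simply the highest-scoring test images among those assigned to each unseen class. A minimal sketch of that ranking step (not the thesis code; it assumes a test feature matrix X and a matrix W whose rows are the synthesized unseen-class classifiers):

    import numpy as np

    def top_images_per_class(X, W, k=5):
        # Decision values w_c^T x for every test image (rows) and unseen class (columns).
        scores = X @ W.T
        predictions = scores.argmax(axis=1)   # each image is assigned to its top class
        top = {}
        for c in range(W.shape[0]):
            assigned = np.where(predictions == c)[0]
            # Keep the k assigned images with the largest decision values for class c.
            top[c] = assigned[np.argsort(-scores[assigned, c])][:k]
        return top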
Figure A.1: Qualitative results of our method (SYNC struct) on AwA. (Top) We list the 10 unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) for each class and its ground-truth class label.

Figure A.2: Qualitative results of our method (SYNC struct) on CUB. (Top) We list a subset of unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) for each class and its ground-truth class label.

Figure A.3: Qualitative results of our method (SYNC o-vs-o) on SUN. (Top) We list a subset of unseen class labels. (Middle) We show the top-5 images classified into each class, according to the decision values. Misclassified images are marked with red boundaries. (Bottom) We show the first misclassified image (according to the decision value) for each class and its ground-truth class label.

Figure A.4: Success/failure case analysis of our method (SYNC struct) on AwA: (Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of the unseen class, and (Right) the top-3 predicted unseen classes. The green text corresponds to the correct label.

Figure A.5: Success/failure case analysis of our method (SYNC struct) on CUB: (Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of the unseen class, and (Right) the top-3 predicted unseen classes. The green text corresponds to the correct label.

Figure A.6: Success/failure case analysis of our method (SYNC o-vs-o) on SUN: (Left) an unseen class label, (Middle-Left) the top-3 semantically similar seen classes to that unseen class, (Middle-Right) two test images of the unseen class, and (Right) the top-3 predicted unseen classes. The green text corresponds to the correct label.

A.4 Supplementary Material on EXEM

Another metric for evaluating the quality of predicted visual exemplars. Besides the Pearson correlation coefficient used in Table 2 of the main text (computed by treating the rows of each distance matrix as data points and correlating the two matrices), we provide further evidence that predicted exemplars better reflect visual similarities (as defined by real exemplars) than semantic representations do. Let %kNNoverlap(D) be the percentage of k-nearest neighbors (neighboring classes) under distances D that overlap with the k-nearest neighbors under real exemplar distances. In Table A.3, we report %kNNoverlap(semantic representation distances) and %kNNoverlap(predicted exemplar distances). We set k to be 40% of the number of unseen classes, but we note that the trends are consistent for other values of k. As in the main text, we observe a clear improvement in all cases.
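A minimal sketch of the %kNNoverlap computation described above (not the thesis code); it assumes square class-by-class distance matrices D and D_real, where D_real is computed from real exemplars:

    import numpy as np

    def knn_overlap_percent(D, D_real, k):
        # For each class, compare its k nearest classes under D with its
        # k nearest classes under the real-exemplar distances D_real.
        n = D.shape[0]
        overlaps = []
        for i in range(n):
            mask = np.where(np.arange(n) == i, np.inf, 0.0)   # exclude the class itself
            nn = set(np.argsort(D[i] + mask)[:k])
            nn_real = set(np.argsort(D_real[i] + mask)[:k])
            overlaps.append(len(nn & nn_real) / k)
        return 100.0 * np.mean(overlaps)

Table A.3 corresponds to calling this with k = 4, 20, and 29 for AwA, CUB, and SUN, respectively.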
Qualitative results. Finally, we provide qualitative results on the zero-shot learning task on AwA and SUN in Fig. A.7. For each row, we provide a class name, the three attributes with the highest strength, and the nearest image to the predicted exemplar (projected back to the original visual feature space). We stress that each class shown here is an unseen class, and the images are from unseen classes as well. Generally, the results are reasonable: class names, attributes, and images correspond well. Even when the image is from the wrong class, the appearance of the nearest image is reasonable. For example, we predict a hippopotamus exemplar from the pig attributes, but the image does not look too far from pigs. This could also be due to the fact that many of these attributes are not visual and thus our regressors are prone to learning the wrong thing [110].

Figure A.7: Qualitative zero-shot learning results on AwA (top) and SUN (bottom). For each row, we provide a class name, three attributes with the highest strength, and the nearest image to the predicted exemplar (projected back to the original visual feature space).

Figure A.8: Data split for zero-to-few-shot learning on ImageNet.

A.5 From Zero-Shot to Few-Shot Learning

In this section, we investigate what happens when we allow ZSL algorithms to peek into some labeled data from part of the unseen classes. Our focus will be on the All categories of ImageNet, two ZSL methods (SYNC o-vs-o and EXEM (1NN)), and two evaluation metrics (F@1 and F@20). For brevity, we will denote SYNC o-vs-o and EXEM (1NN) by SYNC and EXEM, respectively.

Setup. We divide the images from each unseen class into two sets. The first 20% are reserved as training examples that may or may not be revealed; this corresponds to 127 images per class on average. If revealed, those peeked unseen classes are marked as seen, and their labeled data can be used for training. The other 80% are for testing.
The test set is always fixed such that we have to do few-shot learning for peeked unseen classes and zero-shot learning on the rest of the unseen classes. Fig. A.8 summarizes this protocol.

We then vary the number of peeked unseen classes B. Also, for each of these numbers, we explore the following subset selection strategies (more details below):

(i) Uniform random: randomly selected B unseen classes from the uniform distribution.
(ii) Heavy-toward-seen random: randomly selected B classes that are semantically similar to seen classes according to the WordNet hierarchy.
(iii) Light-toward-seen random: randomly selected B classes that are semantically far away from seen classes.
(iv) K-means clustering for coverage: classes whose semantic representations are nearest to each cluster's center, where semantic representations of the unseen classes are grouped by k-means clustering with k = B (see the sketch after this list).
(v) DPP for diversity: sequentially selected classes by a greedy algorithm for fixed-size determinantal point processes (k-DPPs) [129] with the RBF kernel computed on semantic representations.
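As an illustration of strategy (iv), the following is a minimal sketch (not the thesis code); it assumes S is a matrix with one semantic representation (e.g., a word vector) per unseen class, and it uses scikit-learn's k-means as a stand-in for whatever clustering implementation was actually used:

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_coverage_selection(S, B, seed=0):
        # Group the unseen classes into B clusters in semantic space, then pick,
        # for each cluster, the class closest to the cluster center.
        km = KMeans(n_clusters=B, random_state=seed).fit(S)
        picked = []
        for b in range(B):
            members = np.where(km.labels_ == b)[0]
            dists = np.linalg.norm(S[members] - km.cluster_centers_[b], axis=1)
            picked.append(int(members[np.argmin(dists)]))
        return picked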
For zero-shot to few-shot learning experiments, we constructL with the RBF kernel com- puted on semantic representations (e.g, word vectors) of all the seen and unseen classes (i.e., S +U classes). We then compute theU-by-U kernel matrixL U conditional on that all theS seen classes are already included in the subset. Please refer to [130] for details on conditioning in DPPs. WithL U , we would like to select additionalB classes that are diverse from each other and from the seen classes to be the peeked unseen classes. Since finding the most diverse subset (either fixed-size or not) is an NP-hard problem [129, 130], we apply a simple greedy algorithm to sequentially select classes. Denote Q t as the set of peeked unseen classes with sizet andU t as the remaining unseen classes, we enumerate all possible subset of sizet + 1 (i.e.,Q t [fc2 U t g). We then includec that leads to the largest probability intoQ t+1 (i.e.,Q t+1 = Q t [fc g andU t+1 = U t fc g). We iteratively perform the update untilt =B. A.5.1 Main Results For each of the ZSL methods (EXEM and SYNC), we first compare different subset selection methods when the number of peeked unseen classes is small (up to 2,000) in Fig. A.9. We see that the performances of different subset selection methods are consistent across ZSL methods. Moreover, heavy-toward-seen classes are preferred for strict metrics (Flat Hit@1) but clustering is preferred for flexible metrics (Flat Hit@20). This suggests that, for a strict metric, it is better to pick the classes that are semantically similar to what we have seen. On the other hand, if the metric is flexible, we should focus on providing coverage for all the classes so each of them has knowledge they can transfer from. Next, using the best performing heavy-toward-seen selection, we focus on comparing EXEM and SYNC with larger numbers of peeked unseen classes in Fig. A.10. When the number of 130 0 500 1000 1500 2000 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 # peeked unseen classes accuracy EXEM: F@1 uniform heavy−seen clustering light−seen DPP 0 500 1000 1500 2000 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 # peeked unseen classes accuracy SynC: F@1 uniform heavy−seen clustering light−seen DPP 0 500 1000 1500 2000 0.1 0.12 0.14 0.16 0.18 0.2 # peeked unseen classes accuracy EXEM: F@20 uniform heavy−seen clustering light−seen DPP 0 500 1000 1500 2000 0.1 0.12 0.14 0.16 0.18 0.2 # peeked unseen classes accuracy SynC: F@20 uniform heavy−seen clustering light−seen DPP Figure A.9: Accuracy vs. the number of peeked unseen classes for EXEM (top) and SYNC (bottom) across different subset selection methods. Evaluation metrics are F@1 (left) and F@20 (right). peeked unseen classes is small, EXEM outperforms SYNC. (In fact, EXEM outperforms SYNC for each subset selection method in Fig. A.9.) However, we observe that SYNC will finally catch up and surpass EXEM. This is not surprising; as we observe more labeled data (due to the increase in peeked unseen set size), the setting will become more similar to supervised learning (few-shot learning), where linear classifiers used in SYNC should outperform nearest center classifiers used by EXEM. Nonetheless, we note that EXEM is more computationally advantageous than SYNC. In particular, when training on 1K classes of ImageNet with over 1M images, EXEM takes 3 mins while SYNC 1 hour. A.5.2 Detailed Results In this section, we analyze experimental results for EXEM (1NN) in detail. We refer the reader to the setup described in Sect. 
3.3.3 of the main text, as well as additional setup below. Additional setup We will consider several fine-grained evaluation metrics. We denote by A K X!Y the Flat Hit@K of classifying test instances fromX to the label space ofY. Since 131 0 5000 10000 15000 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 # peeked unseen classes accuracy F@1 EXEM (heavy−seen) SynC (heavy−seen) 0 5000 10000 15000 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 # peeked unseen classes accuracy F@20 EXEM (heavy−seen) SynC (heavy−seen) Figure A.10: Accuracy vs. the number of peeked unseen classes for EXEM and SYNC for heavy-toward- seen class selection strategy. Evaluation metrics are F@1 (left) and F@20 (right). there are two types of test data, we will have two types of accuracy: the peeked unseen accu- racyA K P!U and the remaining unseen accuracyA K R!U , whereP is the peeked unseen set,R is the remaining unseen set, andU =P[R is the unseen set. Then, the combined accuracy A K U!U =w P A K P!U +w R A K R!U , wherew P (w R ) is the proportion of test instances in peeked unseen (remaining unseen) classes to the total number of test instances. Note that the combined accuracy is the one we use in the main text. Full curves for EXEM (1NN) Fig. A.11 showsA K U!U when the number of peeked unseen classes keeps increasing. We observe this leads to improved overall accuracy, although the gain eventually is flat. We also show the upperbound: EXEM (1NN) with real exemplars instead of predicted ones for all the unseen classes. Though small, the gap to the upperbound could poten- tially be improved with a more accurate prediction method of visual exemplars, in comparison to SVR. Detailed analysis of the effect of labeled data from peeked unseen classes Fig. A.12 expands the results in Fig. A.11 by providing the weighed peeked unseen accuracy w P A K P!U and the weighted remaining unseen accuracyw R A K R!U . We note that, as the number of peeked unseen classes increases, w P goes up while w R goes down, roughly linearly in both cases. Thus, the curves go up for the top row and go down for the bottom row. As we observe additional labeled data from more peeked unseen classes, the weighed peeked unseen accuracy improves roughly linearly as well. On the other hand, the weighed remaining unseen accuracy degrades very quickly for F@1 but slower for F@20. This suggests that the im- provement we see (over ZSL performance) in Fig. 4 of the main text and Fig. A.11 is contributed largely by the fact that peeked unseen classes benefit themselves. But how do peeked unseen classes exactly affect the remaining unseen classes? The above question is tricky to answer. There are two main factors that contribute to the per- formance on remaining unseen test instances. The first factor is the confusion among remaining classes themselves, and the second one is the confusion with peeked unseen classes. We perform more analysis to understand the effect of each factor when classifying test instances from the remaining unseen setR. 132 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 x 10 4 0 0.1 0.2 0.3 0.4 0.5 # peeked unseen classes accuracy EXEM (F@1) EXEM (F@5) EXEM (F@20) UB (F@1) UB (F@5) UB (F@20) Figure A.11: Combined accuracyA K U!U vs. the number of peeked unseen classes for EXEM (1NN). The “squares” correspond to the upperbound (UB) obtained by EXEM (1NN) on real exemplars. F@K=1, 5, and 20. To remove the confusion with peeked unseen classes, we first restrict the label space to only the remaining unseen classesR. 
In particular, we considerA K R!R and compare the method in two settings: ZSL and PZSL (ZSL with peeked unseen classes). ZSL uses only the training data from seen classes while PZSL uses the training data from both seen and peeked unseen classes. In Fig. A.13, we see that adding labeled data from peeked unseen classesdoeshelp by resolving confusion among remaining unseen classes themselves, suggesting that peeked unseen classes inform other remaining unseen classes about visual information. In Fig. A.14, we add the confusion introduced by peeked unseen classes back by letting the label space consist of bothP andR. That is, we considerA K R!U . For Flat Hit@1, the accuracy is hurt so badly that it goes down below ZSL baselines. However, for Flat Hit@20, the accuracy drops but still higher than ZSL baselines. Thus, to summarize, adding peeked unseen classes has two benefits: it improves the accura- cies on the peeked unseen classesP (Fig. A.12 (Top)), as well as reduces the confusion among remaining unseen classesR themselves (Fig. A.13). It biases the resulting classifiers towards the peeked unseen classes, hence causing confusion betweenP andR (Fig. A.14). When the pros outweigh the cons, we observe overall improvement (Fig. A.11). Additionally, when we use less strict metrics, peeked unseen classes always help (Fig. A.14). Results on additional metric, additional method, and additional rounds We further provide experimental results on additional metric: per-class accuracy and on multiple values of K in Flat Hit @K (i.e., K2f1; 2; 5; 10; 20g); additional method: EXEM (1NNS). We also provide results for EXEM (1NN) averaged over multiple rounds using heavy-toward-seen random, light-toward- seen random, and uniform random to select peeked unseen classes to illustrate the stability of these methods. Fig. A.15 and A.16 summarize the results for per-image and per-class accuracy, respectively. For both figures, each row corresponds to a ZSL method and each column corresponds to a specific value of K in Flat Hit@K. In particular, from top to bottom, ZSL methods are EXEM (1NN), EXEM (1NNS), and SYNC O-VS-O . From left to right, Flat Hit@K = 1, 2, 5, 10, and 20. 133 0 500 1000 1500 2000 2500 3000 0 0.01 0.02 0.03 0.04 0.05 0.06 # peeked unseen classes weighted peeked unseen accuracy hit@1 uniform heavy−seen clustering 0 500 1000 1500 2000 2500 3000 0 0.02 0.04 0.06 0.08 0.1 0.12 # peeked unseen classes weighted peeked unseen accuracy hit@20 uniform heavy−seen clustering 0 500 1000 1500 2000 2500 3000 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 # peeked unseen classes weighted remaining unseen accuracy hit@1 uniform heavy−seen clustering 0 500 1000 1500 2000 2500 3000 0.1 0.105 0.11 0.115 0.12 0.125 # peeked unseen classes weighted remaining unseen accuracy hit@20 uniform heavy−seen clustering Figure A.12: (Top) Weighted peeked unseen accuracy w P A K P!U vs. the number of peeked unseen classes. (Bottom) Weighted remaining unseen w R A K R!U accuracy vs. the number of peeked unseen classes. The weight w P (w R ) is the number of test instances belonging toP (R) divided by the total number of test instances. The evaluation metrics are F@1 (left) and F@20 (right). 
Figure A.13: Accuracy on test instances from the remaining unseen classes when classifying into the label space of the remaining unseen classes only, A^K_{R→R}, vs. the number of peeked unseen classes. ZSL trains on labeled data from the seen classes only, while PZSL (ZSL with peeked unseen classes) trains on the labeled data from both seen and peeked unseen classes. The selection strategies shown are heavy-seen and clustering. The evaluation metrics are F@1 (left) and F@20 (right).

Figure A.14: Accuracy on test instances from the remaining unseen classes when classifying into the label space of unseen classes, A^K_{R→U}, vs. the number of peeked unseen classes. ZSL trains on labeled data from the seen classes only, while PZSL (ZSL with peeked unseen classes) trains on the labeled data from both seen and peeked unseen classes. Note that these plots are the unweighted version of those in the bottom row of Fig. A.12. The evaluation metrics are F@1 (left) and F@20 (right).

Figure A.15: Combined per-image accuracy vs. the number of peeked unseen classes for EXEM (1NN), EXEM (1NNS), and SYNC. The evaluation metrics are, from left to right, Flat Hit@1, 2, 5, 10, and 20. Five subset selection approaches are considered: uniform, heavy-seen, clustering, light-seen, and DPP.
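The five selection strategies in Figs. A.15 and A.16 differ only in how the set of peeked unseen classes is drawn. As a rough illustration of the two simplest random strategies (the exact weighting used for heavy-toward-seen selection is specified in the main text; the similarity-proportional weighting below is only an assumption made for this sketch):

    import numpy as np

    def select_peeked_classes(unseen_classes, num_peeked, seen_similarity=None, rng=np.random):
        # unseen_classes: (num_unseen,) array of candidate unseen class indices
        # seen_similarity: optional per-class score of how similar each unseen class is to the seen ones
        if seen_similarity is None:
            probs = np.full(len(unseen_classes), 1.0 / len(unseen_classes))   # "uniform" strategy
        else:
            probs = seen_similarity / seen_similarity.sum()                   # assumed heavy-toward-seen-style bias
        return rng.choice(unseen_classes, size=num_peeked, replace=False, p=probs)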
Figure A.16: Combined per-class accuracy vs. the number of peeked unseen classes for EXEM (1NN), EXEM (1NNS), and SYNC. The evaluation metrics are, from left to right, Flat Hit@1, 2, 5, 10, and 20. Five subset selection approaches are considered: uniform, heavy-seen, clustering, light-seen, and DPP.

Per-class accuracy is generally lower than per-image accuracy. This can be attributed to two factors. First, the average number of instances per class in 2-hop is larger than that in Pure 3-hop and Rest (footnote 2). Second, the per-class accuracy in 2-hop is higher than that in Pure 3-hop and Rest (footnote 3). That is, when we compute the per-image accuracy, we emphasize the accuracy on 2-hop. The first factor indicates the long-tail phenomenon in ImageNet, and the second factor reflects the nature of zero-shot learning — unseen classes that are semantically more similar to the seen ones perform better than those that are less similar.

Footnote 2: On average, 2-hop has 696 instances/class, Pure 3-hop has 584 instances/class, and Rest has 452 instances/class.

Footnote 3: For example, the per-class accuracy of EXEM (1NN) on 2-hop/Pure 3-hop/Rest is 12.1/3.0/0.8 (%) at Flat Hit@1 under 1,000 peeked unseen classes selected by heavy-toward-seen random.

A.6 Supplementary Material on GZSL

Zero-shot learning, either in the conventional setting or the generalized setting, is a challenging problem as there is no labeled data for the unseen classes. The performance of ZSL methods depends on at least two factors: (1) how seen and unseen classes are related; (2) how effectively the relation can be exploited by learning algorithms to generate models for the unseen classes. For generalized zero-shot learning, the performance further depends on how classifiers for seen and unseen classes are combined to classify new data into the joint label space.

Despite extensive study in ZSL, several questions remain understudied. For example, given a dataset and a split of seen and unseen classes, what is the best possible performance of any ZSL method? How far are we from there? What is the most crucial component we can improve in order to reduce the gap between the state-of-the-art and the ideal performance? In this section, we empirically analyze ZSL methods in detail and shed light on some of those questions.
Setup As ZSL methods do not use labeled data from unseen classes for training classifiers, one reasonable estimate of their best possible performance is to measure the performance on a multi-class classification task where annotated data for the unseen classes are provided.

Concretely, to construct the multi-class classification task, on AwA and CUB, we randomly select 80% of the data along with their labels from all classes (seen and unseen) to train classifiers. The remaining 20% is used to assess both the multi-class classifiers and the classifiers from ZSL. Note that, for ZSL, only the seen classes from the 80% are used for training — the portion belonging to the unseen classes is not used.

On ImageNet, to reduce the computational cost (of constructing multi-class classifiers, which would involve 20,345-way classification), we subsample 1,000 unseen classes from the original 20,345 unseen classes. We call this new dataset ImageNet-2K (including the 1K seen classes from ImageNet). The subsampling procedure is as follows. We pick 74 classes from 2-hop, 303 from "pure" 3-hop (that is, the set of 3-hop classes that are not in the set of 2-hop classes), and 623 from the rest of the classes. These numbers are picked to maintain the proportions of the three types of unseen classes in the original ImageNet. Each of these classes has between 1,050 and 1,550 examples. Out of those 1,000 unseen classes, we randomly select 50 samples per class and reserve them for testing, and use the remaining examples (along with their labels) to train 2,000-way classifiers.

For ZSL methods, we use either attribute vectors or word vectors (WORD2VEC) as semantic representations. Since SYNC o-vs-o performs well on a range of datasets and settings, we focus on this method. For multi-class classification, we train one-versus-others SVMs. Once we obtain the classifiers for both seen and unseen classes, we use the calibrated stacking decision rule to combine them (as in generalized ZSL) and vary the calibration factor to obtain the Seen-Unseen accuracy Curve, exemplified in Fig. 6.1.
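For reference, a simplified sketch of the calibrated stacking rule and of the sweep that produces the Seen-Unseen accuracy Curve (illustrative code written for this appendix; gamma is the calibration factor, the accuracy shown is top-1, and the Flat Hit@K variant simply replaces the argmax with a top-K check):

    import numpy as np

    def calibrated_stacking_predict(scores, seen_mask, gamma):
        # scores: (num_instances, num_classes) decision values over the joint seen+unseen label space
        # seen_mask: (num_classes,) boolean, True for seen classes
        # gamma: calibration factor; larger values push predictions toward the unseen classes
        adjusted = scores - gamma * seen_mask.astype(float)
        return adjusted.argmax(axis=1)

    def seen_unseen_curve(scores_S, labels_S, scores_U, labels_U, seen_mask, gammas):
        # Sweep gamma to obtain one (A_{U->T}, A_{S->T}) point per value; gamma = 0 is direct stacking.
        curve = []
        for gamma in gammas:
            acc_U = (calibrated_stacking_predict(scores_U, seen_mask, gamma) == labels_U).mean()
            acc_S = (calibrated_stacking_predict(scores_S, seen_mask, gamma) == labels_S).mean()
            curve.append((acc_U, acc_S))
        return np.array(curve)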
How far are we from the ideal performance? Fig. A.17 displays the Seen-Unseen accuracy Curves for ImageNet-2K. Clearly, there is a large gap between the performance of GZSL using the default WORD2VEC semantic representations and the ideal performance indicated by the multi-class classifiers. Note that the cross marks indicate the results of direct stacking. The multi-class classifiers not only dominate GZSL over the whole range (thus, with very high AUSUC) but are also capable of learning classifiers that are well balanced (such that direct stacking works well).

Figure A.17: We contrast the performance of GZSL to multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL uses WORD2VEC (in red) and the idealized visual features G-attr (in black) as semantic representations. AUSUC values: Flat hit@1 — Multi-class 0.352, GZSL (word2vec) 0.038, GZSL (G-attr) 0.251; Flat hit@5 — Multi-class 0.657, GZSL (word2vec) 0.170, GZSL (G-attr) 0.578.

How much can idealized semantic representations help? We hypothesize that a large portion of the gap between GZSL and multi-class classification can be attributed to the weak semantic representations used by the GZSL approach. We investigate this by using a form of idealized semantic representations. As the success of zero-shot learning relies heavily on how well the semantic representations capture visual similarity among classes, we examine the idea of using visual features themselves as semantic representations. Concretely, for each class, semantic representations can be obtained by averaging the visual features of images belonging to that class. We call them G-attr as we derive the visual features from GoogLeNet. Note that, for unseen classes, we only use the reserved training examples to derive the semantic representations; we do not use their labels to train classifiers.

Fig. A.17 shows the performance of GZSL using G-attr — the gaps to the multi-class classification performance are significantly reduced from those of GZSL using WORD2VEC. In some cases, GZSL can almost match the performance of multi-class classifiers without using any labels from the unseen classes!

How much labeled data do we need to improve GZSL's performance? Imagine we are given a budget to label data from unseen classes; how much can those labels improve GZSL's performance? Table A.4 contrasts the AUSUCs obtained by GZSL to those from multi-class classification on ImageNet-2K, where GZSL is allowed to use visual features as embeddings — those features can be computed from a few labeled images from the unseen classes, a scenario we refer to as "few-shot" learning. Using about 100 (randomly sampled) labeled images per class, GZSL quickly approaches the performance of multi-class classifiers, which use about 1,000 labeled images per class. Moreover, those G-attr visual features as semantic representations improve upon WORD2VEC more significantly under Flat hit@K = 1 than when K > 1.

Table A.4: Comparison of performance measured in AUSUC between GZSL (using WORD2VEC and G-attr) and multi-class classification on ImageNet-2K. Few-shot results are averaged over 100 rounds. GZSL with G-attr improves upon GZSL with WORD2VEC significantly and quickly approaches multi-class classification performance.

Method | Flat hit@1 | Flat hit@5 | Flat hit@10 | Flat hit@20
GZSL, WORD2VEC | 0.04 | 0.17 | 0.27 | 0.38
GZSL, G-attr from 1 image | 0.08±0.003 | 0.25±0.005 | 0.33±0.005 | 0.42±0.005
GZSL, G-attr from 10 images | 0.20±0.002 | 0.50±0.002 | 0.62±0.002 | 0.72±0.002
GZSL, G-attr from 100 images | 0.25±0.001 | 0.57±0.001 | 0.69±0.001 | 0.78±0.001
GZSL, G-attr from all images | 0.25 | 0.58 | 0.69 | 0.79
Multi-class classification | 0.35 | 0.66 | 0.75 | 0.82

We further examine the whole ImageNet with 20,345 unseen classes in Table A.5, where we keep 80% of the unseen classes' examples to derive G-attr and test on the rest, and observe similar trends. Specifically, on Flat hit@1, the performance of G-attr from merely 1 image is roughly three times that of WORD2VEC, while G-attr from 100 images achieves over ten times.

Table A.5: Comparison of performance measured in AUSUC between GZSL with WORD2VEC and GZSL with G-attr on the full ImageNet with 21,000 unseen classes. Few-shot results are averaged over 20 rounds.

Method | Flat hit@1 | Flat hit@5 | Flat hit@10 | Flat hit@20
WORD2VEC | 0.006 | 0.034 | 0.059 | 0.096
G-attr from 1 image | 0.018±0.0002 | 0.071±0.0007 | 0.106±0.0009 | 0.150±0.0011
G-attr from 10 images | 0.050±0.0002 | 0.184±0.0003 | 0.263±0.0004 | 0.352±0.0005
G-attr from 100 images | 0.065±0.0001 | 0.230±0.0002 | 0.322±0.0002 | 0.421±0.0002
G-attr from all images | 0.067 | 0.236 | 0.329 | 0.429

We evaluate our methods and compare to existing state-of-the-art models on four benchmark datasets with varying scales and difficulty. Despite variations in datasets, evaluation protocols, and implementation details, we aim to provide a comprehensive and fair comparison to existing methods.
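As a concrete illustration, the G-attr representations used above reduce to class-wise averaging of (precomputed) GoogLeNet features; the few-shot variants in Tables A.4 and A.5 simply restrict the average to a few randomly sampled images per class. A minimal sketch (illustrative code, not our actual pipeline):

    import numpy as np

    def build_g_attr(features, labels, classes, num_shots=None, rng=np.random):
        # features: (num_images, feature_dim) precomputed GoogLeNet features
        # labels:   (num_images,) class index of each image
        # num_shots: if given, average only this many randomly sampled images per class
        g_attr = np.zeros((len(classes), features.shape[1]))
        for i, c in enumerate(classes):
            idx = np.where(labels == c)[0]
            if num_shots is not None:
                idx = rng.choice(idx, size=min(num_shots, len(idx)), replace=False)
            g_attr[i] = features[idx].mean(axis=0)       # per-class mean feature = semantic representation
        return g_attr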
Additional plots and analysis We provide additional SUC plots on the performance of GZSL with the idealized semantic representations G-attr in comparison to the performance of multi-class classification: Fig. A.18 for AwA and CUB and Fig. A.19 for ImageNet-2K. As observed previously, the gaps between GZSL with visual attributes/WORD2VEC and multi-class classifiers are reduced significantly. The effect of G-attr is particularly immense on the CUB dataset, where GZSL almost matches the performance of multi-class classifiers without using labels from the unseen classes (0.54 vs. 0.59).

Figure A.18: Comparison between GZSL and multi-class classifiers trained with labeled data from both seen and unseen classes on the datasets AwA and CUB. GZSL uses visual attributes (in red) or G-attr (in black) as semantic representations. AUSUC values: AwA — Multi-class 0.813, GZSL (attributes) 0.567, GZSL (G-attr) 0.712; CUB (Split 1) — Multi-class 0.590, GZSL (attributes) 0.342, GZSL (G-attr) 0.542.

Figure A.19: Comparison between GZSL and multi-class classifiers trained with labeled data from both seen and unseen classes on the dataset ImageNet-2K. GZSL uses WORD2VEC (in red) or G-attr (in black) as semantic representations. AUSUC values: Flat hit@1 — 0.352/0.038/0.251; Flat hit@5 — 0.657/0.170/0.578; Flat hit@10 — 0.746/0.264/0.693; Flat hit@20 — 0.816/0.377/0.787 (Multi-class/GZSL (word2vec)/GZSL (G-attr)).

Table A.6: Comparison of performance measured in AUSUC between GZSL (using (human-defined) visual attributes and G-attr) and multi-class classification on AwA and CUB. Few-shot results are averaged over 1,000 rounds. GZSL with G-attr improves upon GZSL with visual attributes significantly. On CUB, the performance of GZSL with G-attr almost matches that of multi-class classification.

Method | AwA | CUB
GZSL, visual attributes | 0.57 | 0.34
GZSL, G-attr (1-shot) | 0.55±0.04 | 0.26±0.02
GZSL, G-attr (2-shot) | 0.61±0.03 | 0.34±0.02
GZSL, G-attr (5-shot) | 0.66±0.02 | 0.44±0.01
GZSL, G-attr (10-shot) | 0.69±0.02 | 0.49±0.01
GZSL, G-attr (100-shot) | 0.71±0.003 | –†
GZSL, G-attr (all images) | 0.71 | 0.54
Multi-class classification | 0.81 | 0.59
†: We omit this setting as no class in CUB has more than 100 labeled images.

Finally, we provide additional results on the analysis of how much labeled data is needed to improve GZSL's performance on AwA and CUB. In Table A.6, we see the same trend observed previously: GZSL with G-attr quickly approaches the performance of multi-class classifiers, and large improvements over GZSL with visual attributes are observed — even though these attributes are defined and annotated by human experts in this case.
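The AUSUC numbers in Tables A.4-A.6 are areas under the Seen-Unseen accuracy curves plotted in the corresponding figures. Assuming the curve has already been traced out by sweeping the calibration factor (as sketched earlier), the final step is a simple numerical integration (illustrative code):

    import numpy as np

    def ausuc(curve):
        # curve: (num_points, 2) array of (A_{U->T}, A_{S->T}) pairs from the calibration sweep
        order = np.argsort(curve[:, 0])                  # sort by unseen accuracy (x-axis)
        x, y = curve[order, 0], curve[order, 1]
        return np.trapz(y, x)                            # area under the Seen-Unseen accuracy curve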
A.7 Implementation Details

Hyper-parameter tuning The standard approach to cross-validation (CV) in a classification task splits the training data into several folds such that they share the same set of class labels with one another. This strategy is less sensible in zero-shot learning as it does not imitate what actually happens at the test stage. We thus adopt the strategy in [5, 55, 209, 279]. In this scheme, we split the training data into several folds such that the class labels of these folds are disjoint; we hold out data from a subset of seen classes as pseudo-unseen classes, train our models on the remaining folds (which belong to the remaining classes), and tune hyper-parameters based on a certain performance metric on the held-out fold. For clarity, we denote the standard CV as sample-wise CV and the zero-shot CV scheme as class-wise CV. Fig. A.20 illustrates the two scenarios.

Figure A.20: Data splitting for different cross-validation (CV) strategies: (a) the seen-unseen class splitting for zero-shot learning, (b) the sample-wise CV folds, (c) the class-wise CV folds.
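A minimal sketch of constructing class-wise CV folds as described above (illustrative code; details such as balancing the number of classes per fold may differ in our actual experiments):

    import numpy as np

    def class_wise_cv_folds(labels, num_folds, rng=np.random):
        # labels: (num_instances,) seen-class index of each training instance
        classes = np.unique(labels)
        class_folds = np.array_split(rng.permutation(classes), num_folds)    # disjoint class subsets
        folds = []
        for held_out_classes in class_folds:
            held_out = np.isin(labels, held_out_classes)                     # pseudo-unseen instances
            folds.append((np.where(~held_out)[0], np.where(held_out)[0]))    # (train indices, validation indices)
        return folds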
We use this strategy to tune hyper-parameters in both our approaches (SYNC and EXEM) and the baselines. In SYNC, the main hyper-parameters are the regularization parameter in Eq. (5.5) and the scaling parameter in Eq. (5.3). When learning semantic representations (Eq. (5.6)), we also tune two additional hyper-parameters. To reduce the search space during CV, we first fix b_r = a_r for r = 1, ..., R and tune the first pair of hyper-parameters; we then fix those and tune the remaining pair.

In EXEM, we tune (a) the projected dimensionality d for PCA and (b) the SVR hyper-parameters, including the RBF-kernel bandwidth; we find that fixing one of the SVR hyper-parameters to 1 works robustly on all datasets. For (a), we find that the ZSL performance is not sensitive to d and thus set d = 500 for all experiments. For (b), we perform class-wise CV with two exceptions. Since EXEM is a two-stage approach, we consider the following two performance metrics. The first one minimizes the distance between the predicted exemplars and the ground truth (the average of the projected validation data, using the PCA matrix, for each class) in R^d; we use the Euclidean distance in this case. We term this measure "CV-distance." This approach does not assume the downstream task at training time and aims to measure the quality of the predicted exemplars by their faithfulness. The other approach, "CV-accuracy," maximizes the per-class classification accuracy on the validation set. This measure can easily be obtained for EXEM (1NN) and EXEM (1NNs), which use simple decision rules that have no further hyper-parameters to tune. Empirically, we find that CV-accuracy generally leads to slightly better performance; the results reported in the main text for these two approaches are thus based on this measure. On the other hand, EXEM (ZSL method) (where ZSL method = SYNC, CONSE, ESZSL) requires further hyper-parameter tuning. For computational purposes, we use CV-distance for tuning the hyper-parameters of the regressors, followed by hyper-parameter tuning for the ZSL method using the predicted exemplars. As SYNC and CONSE construct their classifiers based on the distance values between class semantic representations, we do not expect a significant performance drop in this case. In the generalized zero-shot learning setting, we maximize AUSUC unless stated otherwise.

Details on how to obtain word vectors on ImageNet We use the word2vec package (https://code.google.com/p/word2vec/). We preprocess the input corpus with the word2phrase function so that we can directly obtain word vectors for both single-word and multi-word terms, including the terms in the ImageNet synsets; each class of ImageNet is a synset: a set of synonymous terms, where each term is a word or a phrase. We impose no restriction on the vocabulary size. Following [68], we use a window size of 20, apply the hierarchical softmax for predicting adjacent terms, and train the model for a single epoch. As one class may correspond to multiple word vectors by the nature of synsets, we simply average them to form a single word vector for each class. We ignore classes without word vectors in the experiments.

CONSE [183] In addition to reporting published results, we have also reimplemented the method CONSE [183], introducing a few improvements. Instead of using the CNN 1K-class classifiers directly, we train (regularized) logistic regression classifiers using the recently released multi-core version of LIBLINEAR [58]. Furthermore, in [183], the authors use averaged word vectors for seen classes but keep, for each unseen class, the word vectors of all synonyms. In other words, each unseen class can be represented by multiple word vectors. In our implementation, we use averaged word vectors for both seen and unseen classes for a fair comparison.
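A sketch of the per-class word-vector construction just described (illustrative code; it assumes the trained word2vec model is exposed as a dictionary-like lookup from terms, after word2phrase preprocessing, to vectors):

    import numpy as np

    def class_word_vector(synset_terms, word_vectors):
        # synset_terms: the words/phrases of one ImageNet synset (class)
        # word_vectors: mapping from a term to its learned vector
        vecs = [word_vectors[t] for t in synset_terms if t in word_vectors]
        if not vecs:
            return None                                  # classes without any word vector are ignored
        return np.mean(vecs, axis=0)                     # average over the synset's synonyms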
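Our CONSE reimplementation then follows the convex-combination rule of [183]: an image is embedded as a probability-weighted average of the top-T seen-class word vectors and matched to the most similar unseen-class vector. A compressed sketch (illustrative code; the seen-class probabilities come from the logistic regression classifiers above, and T is a hyper-parameter):

    import numpy as np

    def conse_predict(probs, seen_vectors, unseen_vectors, top_t=10):
        # probs: (num_instances, num_seen) seen-class probabilities per image
        # seen_vectors / unseen_vectors: rows are per-class word vectors
        unseen_norms = np.linalg.norm(unseen_vectors, axis=1)
        preds = []
        for p in probs:
            idx = np.argsort(-p)[:top_t]                 # top-T most probable seen classes
            weights = p[idx] / p[idx].sum()              # convex-combination weights
            embedding = weights @ seen_vectors[idx]      # embed the image in the semantic space
            sims = unseen_vectors @ embedding / (unseen_norms * np.linalg.norm(embedding) + 1e-12)
            preds.append(int(sims.argmax()))             # nearest unseen class by cosine similarity
        return np.array(preds)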
Novelty detection algorithms [227] We use the code provided by Socher et al. [227] and follow their settings. In particular, we train a two-layer neural network with the same loss function as in [227] to learn a mapping from the visual feature space to the semantic representation space. We tune the hyper-parameter of LoOP (a multiplier on the standard deviation) jointly with the other hyper-parameters of the zero-shot learning approaches — although we empirically observe that it does not significantly affect the novelty detection rankings, consistent with the observations made by [125]. Following [227], we set the number of neighbors (from the seen classes' examples) k in LoOP to 20.

Appendix B

Multi-Task Learning Experiments

B.1 Comparison Between Different MTL Approaches

In Table B.1, we summarize the results of the different MTL approaches. We observe no significant differences between those methods.

B.2 Additional Results on All-but-one Settings

Table B.2 and Table B.3 compare the All and All-but-one settings for TEDEC and TEENC, respectively. We show similar results for MULTI-DEC in the main text.

B.3 Detailed Results Separated by the Tasks Being Tested On

In Tables B.4-B.14, we provide F1 scores with standard deviations in all settings. Each table corresponds to a task we test our models on. Rows denote training settings and columns denote MTL approaches.

Table B.1: Comparison between MTL approaches

Settings | Method | UPOS | XPOS | CHUNK | NER | MWE | SEM | SEMTR | SUPSENSE | COM | FRAME | HYP | Average
STL | | 95.4 | 95.04 | 93.49 | 88.24 | 53.07 | 72.77 | 74.02 | 66.81 | 72.71 | 62.04 | 46.73 | 74.58
Pairwise (Average) | MULTI-DEC | 94.97 | 94.65 | 93.37 | 87.67 | 57.21 | 72.63 | 74.38 | 67.39 | 72.12 | 61.3 | 47.99 | 74.88
Pairwise (Average) | TEDEC | 95.0 | 94.77 | 93.4 | 87.72 | 56.67 | 72.48 | 74.25 | 67.19 | 71.84 | 58.65 | 47.45 | 74.49
Pairwise (Average) | TEENC | 95.0 | 94.66 | 93.32 | 87.65 | 55.99 | 72.49 | 74.24 | 67.09 | 72.09 | 61.62 | 47.37 | 74.68
All | MULTI-DEC | 95.04 | 94.31 | 93.44 | 86.38 | 61.43 | 71.53 | 74.26 | 68.1 | 74.54 | 59.71 | 51.41 | 75.47
All | TEDEC | 94.95 | 94.42 | 93.64 | 86.8 | 61.97 | 71.72 | 74.36 | 67.98 | 74.61 | 58.14 | 51.31 | 75.44
All | TEENC | 94.94 | 94.3 | 93.7 | 86.01 | 59.57 | 71.58 | 74.35 | 68.02 | 74.61 | 61.83 | 49.5 | 75.31
All-but-one (Average) | MULTI-DEC | 94.91 | 94.43 | 93.65 | 86.15 | 61.82 | 71.09 | 73.75 | 68.2 | 74.42 | 59.66 | 50.9 | 75.36
All-but-one (Average) | TEDEC | 94.83 | 94.4 | 93.64 | 86.39 | 60.55 | 70.95 | 73.74 | 67.81 | 74.47 | 58.66 | 50.86 | 75.12
All-but-one (Average) | TEENC | 94.77 | 94.35 | 93.53 | 85.96 | 60.23 | 70.83 | 73.64 | 68.15 | 74.05 | 61.23 | 50.15 | 75.17

Table B.2: F1 scores for TEDEC. We compare All with All-but-one settings (All − ⟨TASK⟩). We test on each task in the columns. Beneficial settings are in green (marked ↑ here); harmful settings are in red (marked ↓ here).

Setting | UPOS | XPOS | CHUNK | NER | MWE | SEM | SEMTR | SUPSENSE | COM | FRAME | HYP | #↑ | #↓
All | 94.95 | 94.42 | 93.64 | 86.8 | 61.97 | 71.72 | 74.36 | 67.98 | 74.61 | 58.14 | 51.31 | | 
All − UPOS | – | 94.06↓ | 93.44 | 86.47 | 60.48 | 71.08↓ | 73.79 | 68.1 | 74.69 | 58.32 | 50.83 | 0 | 2
All − XPOS | 94.38↓ | – | 93.6 | 86.68 | 60.09 | 70.98↓ | 73.78↓ | 67.9 | 74.26 | 58.31 | 50.6 | 0 | 3
All − CHUNK | 94.6↓ | 94.29 | – | 86.08 | 60.6 | 70.39↓ | 73.36↓ | 68.07 | 74.47 | 58.73 | 51.1 | 0 | 3
All − NER | 94.69↓ | 94.31 | 93.69 | – | 60.48↓ | 70.64↓ | 73.59↓ | 67.51 | 74.49 | 58.19 | 50.44 | 0 | 4
All − MWE | 94.93 | 94.46 | 93.72 | 86.21↓ | – | 71.11↓ | 74.04 | 67.38 | 74.49 | 57.6 | 50.5 | 0 | 2
All − SEM | 94.86 | 94.41 | 93.6 | 85.97↓ | 59.94↓ | – | 72.26↓ | 67.35 | 74.34 | 59.08 | 50.48 | 0 | 3
All − SEMTR | 94.8 | 94.28 | 93.56 | 86.23↓ | 61.23 | 69.62↓ | – | 68.16 | 74.36 | 58.85 | 51.5 | 0 | 2
All − SUPSENSE | 94.82 | 94.4 | 93.67 | 86.49 | 59.11 | 71.02↓ | 73.76↓ | – | 74.69 | 58.28 | 51.96 | 0 | 2
All − COM | 95.19↑ | 94.76↑ | 93.79 | 86.25↓ | 62.02 | 72.32 | 74.92↑ | 67.62 | – | 60.72↑ | 50.0↓ | 4 | 2
All − FRAME | 95.03 | 94.6 | 93.64 | 86.68 | 60.52↓ | 71.11↓ | 73.9 | 67.69 | 74.49 | – | 51.23 | 0 | 2
All − HYP | 94.94 | 94.45 | 93.69 | 86.86 | 61.07 | 71.22 | 74.04↓ | 68.32 | 74.4 | 58.55 | – | 0 | 1
#↑ | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | | 
#↓ | 3 | 1 | 0 | 4 | 3 | 8 | 6 | 0 | 0 | 0 | 1 | | 

Table B.3: F1 scores for TEENC. We compare All with All-but-one settings (All − ⟨TASK⟩). We test on each task in the columns. Beneficial settings are in green. Harmful settings are in red.
UPOS XPOS CHUNK NER MWE SEM SEMTR SUPSENSE COM FRAME HYP #" ## All 94.94 94.3 93.7 86.01 59.57 71.58 74.35 68.02 74.61 61.83 49.5 All - UPOS 94.0 93.36# 85.98 59.58 70.68 73.66 68.19 74.07 60.51 50.23 0 1 All - XPOS 94.24# 93.29# 85.8 59.81 70.57# 73.64# 68.47 73.94 60.13 50.39 0 4 All - CHUNK 94.66 94.3 85.73 61.58 70.78 73.65 67.87 73.67# 61.73 50.18 0 1 All - NER 94.71 94.25 93.5 59.05 70.58# 73.4# 67.95 74.16 59.96 49.95 0 2 All - MWE 94.94 94.5 93.63 86.1 71.12 73.75 69.0 74.28 61.51 49.81 0 0 All - SEM 94.76 94.32 93.45 85.58 59.47 72.21# 67.77 74.2 61.76 50.15" 1 1 All - SEMTR 94.68 94.25 93.54 86.02 60.59 69.86# 67.96 73.81# 61.31 51.72" 1 2 All - SUPSENSE 94.8 94.27 93.56 86.04 59.25 70.53# 73.27# 74.3 59.98 50.01 0 2 All - COM 95.25" 94.72" 93.82 86.23 60.63 72.38" 75.06" 67.94 63.55 48.77 4 0 All - FRAME 94.84 94.39 93.51# 85.99 61.21 70.78 73.69 68.13 74.3 50.35 0 1 All - HYP 94.86 94.45 93.59 86.1 61.09 71.03# 74.09 68.17 73.78# 61.91 0 2 #" 1 1 0 0 0 1 1 0 0 0 2 ## 1 0 3 0 0 5 4 0 3 0 0 146 Table B.4: F1 score tested on the task UPOS in different training scenarios Trained with Tested on UPOS MULTI-DEC TEDEC TEENC UPOS only 95.4 0.08 Pairwise +XPOS 95.38 0.03 95.4 0.04 95.42 0.07 +CHUNK 95.43 0.11 95.57 0.02" 95.4 0.0 +NER 95.38 0.1 95.32 0.03 95.29 0.04 +MWE 95.15 0.05# 95.11 0.07# 95.05 0.05# +SEM 95.23 0.14 95.2 0.05# 95.27 0.08 +SEMTR 95.17 0.15 95.21 0.03# 95.23 0.13 +SUPSENSE 95.08 0.08# 95.05 0.04# 95.27 0.08 +COM 93.04 0.77# 94.03 0.42# 93.6 0.15# +FRAME 94.98 0.13# 94.79 0.09# 95.0 0.07# +HYP 94.84 0.07# 94.35 0.21# 94.43 0.15# Average 94.97 95.0 95.0 All-but-one All - XPOS 94.57 0.12# 94.38 0.05# 94.24 0.24# All - CHUNK 94.84 0.01# 94.6 0.1# 94.66 0.15# All - NER 94.81 0.07# 94.69 0.05# 94.71 0.07# All - MWE 94.93 0.01# 94.93 0.08# 94.94 0.04# All - SEM 94.82 0.17# 94.86 0.08# 94.76 0.15# All - SEMTR 94.83 0.12# 94.8 0.03# 94.68 0.17# All - SUPSENSE 94.97 0.07# 94.82 0.03# 94.8 0.07# All - COM 95.19 0.05# 95.19 0.04# 95.25 0.02# All - FRAME 95.15 0.07# 95.03 0.17 94.84 0.1# All - HYP 94.93 0.18# 94.94 0.11# 94.86 0.04# All 95.04 0.03# 94.95 0.08# 94.94 0.1# Oracle 95.4 0.08 95.57 0.02 95.4 0.08 Table B.5: F1 score tested on the task XPOS in different training scenarios Trained with Tested on XPOS MULTI-DEC TEDEC TEENC XPOS only 95.04 0.06 Pairwise +UPOS 95.01 0.04 94.99 0.03 94.94 0.05 +CHUNK 95.1 0.02 95.21 0.02" 95.1 0.04 +NER 94.98 0.12 95.09 0.07 95.05 0.13 +MWE 94.7 0.16# 94.8 0.08# 94.66 0.07# +SEM 94.77 0.08# 94.82 0.15 94.93 0.08 +SEMTR 94.86 0.02# 94.8 0.09# 94.97 0.09 +SUPSENSE 94.75 0.15 94.81 0.06# 95.0 0.12 +COM 93.19 0.75# 93.94 0.21# 93.12 0.44# +FRAME 94.64 0.06# 94.66 0.05# 94.55 0.06# +HYP 94.46 0.3# 94.56 0.09# 94.26 0.18# Average 94.65 94.77 94.66 All-but-one All - UPOS 94.03 0.13# 94.06 0.09# 94.0 0.26# All - CHUNK 94.46 0.09# 94.29 0.07# 94.3 0.12# All - NER 94.3 0.03# 94.31 0.02# 94.25 0.07# All - MWE 94.45 0.05# 94.46 0.12# 94.5 0.09# All - SEM 94.34 0.09# 94.41 0.09# 94.32 0.17# All - SEMTR 94.35 0.08# 94.28 0.07# 94.25 0.12# All - SUPSENSE 94.54 0.02# 94.4 0.08# 94.27 0.03# All - COM 94.69 0.1# 94.76 0.08# 94.72 0.06# All - FRAME 94.57 0.12# 94.6 0.19# 94.39 0.08# All - HYP 94.53 0.07# 94.45 0.1# 94.45 0.07# All 94.31 0.15# 94.42 0.07# 94.3 0.2# Oracle 95.04 0.06 95.21 0.02 95.04 0.06 147 Table B.6: F1 score tested on the task CHUNK in different training scenarios Trained with Tested on CHUNK MULTI-DEC TEDEC TEENC CHUNK only 93.49 0.01 Pairwise +UPOS 94.18 0.02" 94.02 0.08" 94.0 0.15" +XPOS 93.97 0.16" 94.18 0.01" 93.98 0.13" +NER 93.47 0.1 
93.64 0.03" 93.54 0.1 +MWE 93.54 0.13 93.59 0.2 93.33 0.2 +SEM 93.63 0.02" 93.45 0.07 93.52 0.13 +SEMTR 93.61 0.07 93.47 0.03 93.45 0.07 +SUPSENSE 93.2 0.21 93.25 0.15 93.13 0.13# +COM 91.94 0.4# 92.29 0.27# 91.86 0.09# +FRAME 93.22 0.16# 93.23 0.04# 93.29 0.13 +HYP 92.96 0.08# 92.86 0.08# 93.13 0.04# Average 93.37 93.4 93.32 All-but-one All - UPOS 93.59 0.13 93.44 0.17 93.36 0.17 All - XPOS 93.57 0.19 93.6 0.05" 93.29 0.21 All - NER 93.59 0.09 93.69 0.14 93.5 0.23 All - MWE 93.71 0.11" 93.72 0.13" 93.63 0.04" All - SEM 93.63 0.08 93.6 0.11 93.45 0.13 All - SEMTR 93.58 0.08 93.56 0.14 93.54 0.06 All - SUPSENSE 93.67 0.08" 93.67 0.12 93.56 0.12 All - COM 93.67 0.12 93.79 0.14" 93.82 0.05" All - FRAME 93.7 0.09" 93.64 0.11 93.51 0.06 All - HYP 93.78 0.12" 93.69 0.05" 93.59 0.07 All 93.44 0.09 93.64 0.21 93.7 0.06" Oracle 94.01 0.13 94.07 0.25 93.93 0.16 Table B.7: F1 score tested on the task NER in different training scenarios Trained with Tested on NER MULTI-DEC TEDEC TEENC NER only 88.24 0.09 Pairwise +UPOS 87.68 0.41 87.99 0.21 87.43 0.11# +XPOS 87.61 0.27# 87.65 0.14# 87.71 0.08# +CHUNK 87.96 0.19 88.11 0.21 88.07 0.16 +MWE 88.15 0.23 87.99 0.15 88.02 0.36 +SEM 87.35 0.16# 87.27 0.36# 87.49 0.25# +SEMTR 87.34 0.27# 87.75 0.38 87.29 0.17# +SUPSENSE 87.9 0.24 87.94 0.33 87.92 0.16 +COM 86.62 0.72# 86.59 0.31# 86.75 0.45# +FRAME 88.15 0.35 88.02 0.17 87.99 0.32 +HYP 87.98 0.21 87.91 0.4 87.82 0.31 Average 87.67 87.72 87.65 All-but-one All - UPOS 86.03 0.53# 86.47 0.14# 85.98 0.29# All - XPOS 86.04 0.15# 86.68 0.27# 85.8 0.27# All - CHUNK 86.05 0.1# 86.08 0.49# 85.73 0.2# All - MWE 86.21 0.27# 86.21 0.19# 86.1 0.37# All - SEM 85.81 0.32# 85.97 0.14# 85.58 0.04# All - SEMTR 86.11 0.28# 86.23 0.23# 86.02 0.39# All - SUPSENSE 86.43 0.12# 86.49 0.17# 86.04 0.14# All - COM 86.6 0.79# 86.25 0.06# 86.23 0.33# All - FRAME 85.9 0.29# 86.68 0.15# 85.99 0.3# All - HYP 86.31 0.18# 86.86 0.25# 86.1 0.56# All 86.38 0.12# 86.8 0.08# 86.01 0.4# Oracle 88.24 0.09 88.24 0.09 88.24 0.09 148 Table B.8: F1 score tested on the task MWE in different training scenarios Trained with Tested on MWE MULTI-DEC TEDEC TEENC MWE only 53.07 0.12 Pairwise +UPOS 59.99 0.36" 60.28 0.24" 57.61 0.2" +XPOS 58.87 0.78" 60.32 0.3" 58.26 0.25" +CHUNK 59.18 0.03" 57.61 1.53" 58.06 0.88" +NER 55.4 0.52" 55.17 0.44" 53.4 0.98 +SEM 60.16 1.23" 58.21 0.09" 58.62 0.61" +SEMTR 58.84 1.45" 58.55 0.28" 58.31 2.24" +SUPSENSE 58.81 1.01" 58.75 0.33" 58.05 0.72" +COM 53.89 1.41 51.72 1.01 51.71 1.05 +FRAME 53.88 0.76 53.05 1.32 53.3 1.15 +HYP 53.08 1.72 52.98 1.66 52.59 1.98 Average 57.21 56.67 55.99 All-but-one All - UPOS 61.28 0.78" 60.48 0.93" 59.58 1.14" All - XPOS 61.91 1.56" 60.09 0.9" 59.81 0.83" All - CHUNK 61.01 1.61" 60.6 1.52" 61.58 1.05" All - NER 62.69 0.26" 60.48 0.15" 59.05 0.4" All - SEM 61.17 0.86" 59.94 0.85" 59.47 0.04" All - SEMTR 63.04 0.85" 61.23 2.05" 60.59 0.59" All - SUPSENSE 60.51 0.25" 59.11 2.02" 59.25 0.74" All - COM 61.95 0.97" 62.02 1.73" 60.63 0.73" All - FRAME 62.62 0.85" 60.52 0.47" 61.21 0.99" All - HYP 62.04 0.6" 61.07 0.51" 61.09 1.06" All 61.43 1.94" 61.97 0.5" 59.57 0.64" Oracle 62.76 0.63 61.74 1.49 61.92 0.66 Table B.9: F1 score tested on the task SEM in different training scenarios Trained with Tested on SEM MULTI-DEC TEDEC TEENC SEM only 72.77 0.04 Pairwise +UPOS 73.23 0.06" 73.17 0.08" 73.11 0.01" +XPOS 73.34 0.12" 73.21 0.04" 73.04 0.21 +CHUNK 73.16 0.05" 73.02 0.05" 73.13 0.07" +NER 72.88 0.08 72.77 0.19 72.91 0.08 +MWE 72.75 0.09 72.66 0.18 72.83 0.07 +SEMTR 72.5 0.07# 72.5 0.05# 72.17 0.06# 
+SUPSENSE 72.81 0.04 72.71 0.03 73.09 0.08" +COM 70.39 0.46# 70.37 0.28# 70.18 0.54# +FRAME 72.76 0.16 72.26 0.21# 72.49 0.23 +HYP 72.47 0.02# 72.15 0.1# 71.95 1.22 Average 72.63 72.48 72.49 All-but-one All - UPOS 70.87 0.19# 71.08 0.19# 70.68 0.76# All - XPOS 71.12 0.1# 70.98 0.24# 70.57 0.13# All - CHUNK 71.07 0.27# 70.39 0.39# 70.78 0.35# All - NER 70.82 0.41# 70.64 0.15# 70.58 0.03# All - MWE 71.01 0.14# 71.11 0.17# 71.12 0.29# All - SEMTR 69.72 0.27# 69.62 0.37# 69.86 0.36# All - SUPSENSE 71.22 0.29# 71.02 0.16# 70.53 0.19# All - COM 72.38 0.08# 72.32 0.23# 72.38 0.17# All - FRAME 71.48 0.51# 71.11 0.16# 70.78 0.44# All - HYP 71.22 0.25# 71.22 0.33# 71.03 0.07# All 71.53 0.28# 71.72 0.21# 71.58 0.24# Oracle 73.32 0.04 73.1 0.03 73.14 0.06 149 Table B.10: F1 score tested on the task SEMTR in different training scenarios Trained with Tested on SEMTR MULTI-DEC TEDEC TEENC SEMTR only 74.02 0.04 Pairwise +UPOS 74.93 0.09" 74.87 0.1" 74.85 0.05" +XPOS 74.91 0.06" 74.84 0.21" 74.66 0.2" +CHUNK 74.79 0.13" 74.73 0.12" 74.77 0.13" +NER 74.34 0.08" 74.01 0.05 74.04 0.07 +MWE 74.51 0.18" 74.63 0.28" 74.66 0.21" +SEM 74.73 0.1" 74.72 0.14" 74.41 0.01" +SUPSENSE 74.61 0.24" 74.52 0.05" 74.94 0.22" +COM 72.6 0.95 71.76 0.88# 71.35 0.95# +FRAME 74.18 0.19 74.21 0.37 74.63 0.11" +HYP 74.23 0.27 74.19 0.45 74.14 0.23 Average 74.38 74.25 74.24 All-but-one All - UPOS 73.54 0.54 73.79 0.46 73.66 0.97 All - XPOS 74.03 0.11 73.78 0.28 73.64 0.07# All - CHUNK 73.97 0.22 73.36 0.05# 73.65 0.39 All - NER 73.51 0.35 73.59 0.19# 73.4 0.19# All - MWE 73.61 0.2# 74.04 0.18 73.75 0.24 All - SEM 71.97 0.3# 72.26 0.28# 72.21 0.48# All - SUPSENSE 73.86 0.09 73.76 0.19 73.27 0.2# All - COM 74.75 0.22" 74.92 0.1" 75.06 0.12" All - FRAME 74.24 0.37 73.9 0.29 73.69 0.32 All - HYP 74.02 0.12 74.04 0.17 74.09 0.21 All 74.26 0.1" 74.36 0.03" 74.35 0.29 Oracle 75.23 0.06 75.24 0.13 75.09 0.02 Table B.11: F1 score tested on the task SUPSENSE in different training sce- narios Trained with Tested on SUPSENSE MULTI-DEC TEDEC TEENC SUPSENSE only 66.81 0.22 Pairwise +UPOS 68.25 0.42" 67.8 0.29" 67.76 0.14" +XPOS 67.78 0.4" 68.3 0.71" 67.77 0.15" +CHUNK 67.39 0.15" 67.29 0.33 67.36 0.29 +NER 68.06 0.16" 67.25 0.21 67.57 0.27" +MWE 66.88 0.14 66.88 0.24 66.26 0.9 +SEM 68.29 0.21" 68.46 0.38" 68.1 0.59" +SEMTR 68.6 0.81" 68.18 0.39" 67.64 0.92 +COM 65.57 0.17# 64.98 0.34# 65.55 0.18# +FRAME 66.59 0.07 66.2 0.16# 66.75 0.22 +HYP 66.47 0.24 66.52 0.59 66.16 0.43 Average 67.39 67.19 67.09 All-but-one All - UPOS 68.27 0.33" 68.1 0.28" 68.19 0.55" All - XPOS 67.99 0.5" 67.9 0.54 68.47 0.18" All - CHUNK 68.26 0.48" 68.07 0.28" 67.87 0.32" All - NER 68.16 0.26" 67.51 0.4 67.95 0.24" All - MWE 68.18 0.62" 67.38 0.22 69.0 0.45" All - SEM 67.36 0.42 67.35 0.18 67.77 0.28" All - SEMTR 68.17 0.15" 68.16 0.47" 67.96 0.73 All - COM 68.67 0.37" 67.62 0.6 67.94 0.22" All - FRAME 68.47 0.72" 67.69 0.95 68.13 0.39" All - HYP 68.46 0.37" 68.32 0.18" 68.17 0.36" All 68.1 0.54" 67.98 0.29" 68.02 0.21" Oracle 68.53 0.09 68.22 0.61 69.04 0.44 150 Table B.12: F1 score tested on the task COM in different training scenarios Trained with Tested on COM MULTI-DEC TEDEC TEENC COM only 72.71 0.75 Pairwise +UPOS 72.46 0.34 72.86 0.12 72.09 0.36 +XPOS 72.83 0.16 72.87 0.56 72.41 0.51 +CHUNK 72.44 0.11 73.3 0.15 72.88 0.26 +NER 70.93 0.73 71.08 0.31# 70.78 0.27# +MWE 71.31 0.31 70.93 0.43# 71.36 0.42 +SEM 72.72 0.22 73.14 0.08 72.25 0.07 +SEMTR 71.96 0.16 71.74 0.46 72.15 0.5 +SUPSENSE 72.24 0.27 69.13 0.19# 72.12 0.66 +FRAME 72.47 0.08 72.89 0.22 72.1 0.93 +HYP 
71.82 0.97 70.47 0.81 72.79 0.97 Average 72.12 71.84 72.09 All-but-one All - UPOS 74.42 0.24" 74.69 0.26" 74.07 0.19 All - XPOS 74.36 0.14" 74.26 0.64 73.94 0.3 All - CHUNK 74.2 0.13" 74.47 0.26" 73.67 0.23 All - NER 74.08 0.07" 74.49 0.38" 74.16 0.48 All - MWE 74.7 0.14" 74.49 0.13" 74.28 0.16" All - SEM 74.31 0.1" 74.34 0.42 74.2 0.28 All - SEMTR 74.2 0.24" 74.36 0.36 73.81 0.16 All - SUPSENSE 74.24 0.44 74.69 0.52" 74.3 0.13" All - FRAME 75.03 0.24" 74.49 0.2" 74.3 0.19" All - HYP 74.62 0.14" 74.4 0.06" 73.78 0.05 All 74.54 0.53 74.61 0.24" 74.61 0.32" Oracle 72.71 0.75 72.71 0.75 72.71 0.75 Table B.13: F1 score tested on the task FRAME in different training scenarios Trained with Tested on FRAME MULTI-DEC TEDEC TEENC FRAME only 62.04 0.74 Pairwise +UPOS 62.14 0.35 61.54 0.53 62.27 0.33 +XPOS 60.77 0.39 61.44 0.06 61.62 1.01 +CHUNK 62.67 0.47 61.39 0.78 62.98 0.5 +NER 62.39 0.37 59.25 0.52# 63.02 0.39 +MWE 61.75 0.21 56.77 2.79 60.61 0.91 +SEM 61.74 0.27 60.09 0.48# 62.17 0.36 +SEMTR 62.03 0.41 59.77 0.81 62.79 0.19 +SUPSENSE 61.94 0.43 55.68 0.61# 61.96 0.18 +COM 56.52 0.27# 55.25 2.29# 57.65 2.42 +HYP 61.02 0.62 55.35 0.5# 61.14 1.77 Average 61.3 58.65 61.62 All-but-one All - UPOS 58.47 1.0# 58.32 0.35# 60.51 0.1# All - XPOS 60.16 0.42# 58.31 0.8# 60.13 1.38 All - CHUNK 60.01 0.65 58.73 0.68# 61.73 0.48 All - NER 59.17 0.27# 58.19 0.89# 59.96 0.52# All - MWE 59.23 0.33# 57.6 0.82# 61.51 0.43 All - SEM 58.73 0.67# 59.08 0.84# 61.76 0.52 All - SEMTR 59.49 0.79# 58.85 0.51# 61.31 1.16 All - SUPSENSE 59.23 0.64# 58.28 0.19# 59.98 1.23 All - COM 62.37 0.37 60.72 0.73 63.55 0.31 All - HYP 59.69 0.41# 58.55 0.29# 61.91 0.59 All 59.71 0.85 58.14 0.23# 61.83 0.98 Oracle 62.04 0.74 62.04 0.74 62.04 0.74 151 Table B.14: F1 score tested on the task HYP in different training scenarios Trained with Tested on HYP MULTI-DEC TEDEC TEENC HYP only 46.73 0.55 Pairwise +UPOS 48.02 0.31 49.36 0.36" 48.27 0.68 +XPOS 48.81 0.36" 49.23 0.55" 48.06 0.02" +CHUNK 47.85 0.2" 48.43 0.3" 47.13 0.35 +NER 47.9 0.67 48.24 0.65 48.64 1.17 +MWE 47.32 0.29 45.83 0.46 46.71 0.64 +SEM 48.15 0.21" 47.95 0.75 47.12 0.43 +SEMTR 47.74 0.57 46.96 0.85 46.1 0.11 +SUPSENSE 49.23 0.13" 47.29 0.41 47.24 0.43 +COM 47.41 1.18 45.24 0.46 47.81 0.8 +FRAME 47.5 0.46 46.0 0.53 46.66 0.54 Average 47.99 47.45 47.37 All-but-one All - UPOS 51.13 0.94" 50.83 0.65" 50.23 0.73" All - XPOS 51.65 0.63" 50.6 0.44" 50.39 1.17" All - CHUNK 50.27 0.76" 51.1 0.28" 50.18 0.81" All - NER 50.86 0.87" 50.44 0.39" 49.95 0.38" All - MWE 50.83 0.61" 50.5 0.9" 49.81 0.44" All - SEM 50.93 0.27" 50.48 0.53" 50.15 0.11" All - SEMTR 51.27 0.5" 51.5 0.46" 51.72 0.15" All - SUPSENSE 50.86 1.85" 51.96 0.29" 50.01 1.13" All - COM 50.28 1.02" 50.0 0.11" 48.77 0.54" All - FRAME 50.89 0.64" 51.23 1.01" 50.35 0.68" All 51.41 0.25" 51.31 0.55" 49.5 0.05" Oracle 50.0 0.42 50.15 0.25 48.06 0.02 152
Abstract
Measuring similarity between any two entities is an essential component in most machine learning tasks. This thesis describes a set of techniques revolving around the notion of similarity. The first part involves modeling and learning similarity. We introduce Similarity Component Analysis (SCA), a Bayesian network for modeling instance-level similarity that does not observe the triangle inequality. Such a modeling choice avoids the transitivity bias in most existing similarity models, making SCA intuitively more aligned with the human perception of similarity. The second part involves learning and leveraging similarity for effective learning with limited data, with applications in computer vision and natural language processing. We first leverage incomplete and noisy similarity graphs in different modalities to aid the learning of object recognition models. In particular, we propose two novel zero-shot learning algorithms that utilize class-level semantic similarities as a building block, establishing state-of-the-art performance on the large-scale benchmark with more than 20,000 categories. As for natural language processing, we employ multi-task learning (MTL) to leverage unknown similarities between sequence tagging tasks. This study leads to insights regarding the benefit of joint MTL of more than two tasks, task selection strategies, as well as the nature of task relationships.