Bridging the Visual Reasoning Gaps in Multi-modal Models

by

Woojeong Jin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2024

Copyright 2024 Woojeong Jin

Acknowledgements

I am grateful to my advisor, Xiang Ren, for his support and guidance throughout my graduate studies. He taught me not only necessary research skills but also mindsets required for being a good researcher and a person. He always encouraged me to do impactful research. I would like to thank my defense committee members, Professors Ram Nevatia, Yan Liu, and Toby Mintz, for their critical feedback and constructive suggestions. I would like to extend my appreciation to all of my friends at the USC INK lab, including Bill Yuchen Lin, Wenxuan Zhou, Aaron Chan, Shushan Arakelyan, Jun Yan, Qinyuan Ye, Pei Zhou, Xisen Jin, Soumya Sanyal, Sahana Ramnath, and Brihi Joshi, for their support, friendship, and fruitful collaborations. I am indebted to my mentors during internships, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Ahmed Awadallah, Weizhu Chen, Hamed Firooz, Maziar Sanjabi, and Liang Tan, for their valuable guidance, support, and the opportunity to work on exciting research problems. Last but certainly not least, I want to express my deep appreciation to my wife for her unwavering support, encouragement, and understanding throughout my academic journey.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Contributions
Chapter 2: Low-resource Prompt-based Learning for Vision-Language Models
  2.1 Introduction
  2.2 Related Work
  2.3 Analysis Setup
    2.3.1 Problem Formulation
    2.3.2 Analysis Questions
    2.3.3 Downstream Tasks and Datasets
    2.3.4 Evaluation Metrics
    2.3.5 Baselines
  2.4 Method
    2.4.1 Encoder-decoder Vision-language Model
    2.4.2 Pre-training Objectives
  2.5 Low-resource Adaptation
    2.5.1 Prompt Design
      2.5.1.1 Visual Question Answering
      2.5.1.2 Captioning
      2.5.1.3 MiniImageNet
  2.6 Results and Discussion
    2.6.1 Experiment Details
    2.6.2 Performance on Zero-shot Learning
    2.6.3 Performance on Few-shot Learning
    2.6.4 MiniImageNet
    2.6.5 Study of Prompt Design
      2.6.5.1 Zero-shot Predictions
      2.6.5.2 Few-shot Predictions
    2.6.6 Pre-training Objectives
  2.7 Conclusion
Chapter 3: Grounded Vision-language Pre-training
  3.1 Introduction
  3.2 Generalization to Diverse Vision-language Tasks
    3.2.1 Background: Visual Grounding
    3.2.2 Problem Formulation
  3.3 Pre-training for Better Task Generalization
    3.3.1 Overview
    3.3.2 Model Architecture
    3.3.3 Pre-training Objectives
    3.3.4 Pre-training Data
      3.3.4.1 Object-word Alignments
  3.4 Experiments
    3.4.1 Experiment Details
    3.4.2 Evaluation Setup
    3.4.3 Baselines
    3.4.4 Downstream Tasks and Datasets
    3.4.5 Results
    3.4.6 Ablations
  3.5 Related Work
  3.6 Conclusion
Chapter 4: Probing Visual Properties of Objects Under Different States
  4.1 Introduction
  4.2 The WinoViz Task
  4.3 The WinoViz Data
    4.3.1 Data Collection
    4.3.2 Versions of WinoViz
  4.4 Experiments
    4.4.1 Analysis Questions
    4.4.2 Zero-shot Results
    4.4.3 Few-shot Results
    4.4.4 Results of Encoder-only Models
    4.4.5 Pragmatic and Visual Knowledge Reasoning
    4.4.6 Using Image Generation for WinoViz Task
  4.5 Related Work
  4.6 Conclusion
  4.7 Limitations
Chapter 5: Learning Visual Knowledge in Language Tasks
  5.1 Introduction
  5.2 Analysis Setup
    5.2.1 Problem Formulation
    5.2.2 Analysis Questions
    5.2.3 Pre-training Data
    5.2.4 Downstream Tasks and Datasets
    5.2.5 Evaluation Protocol
  5.3 Method
    5.3.1 Text Knowledge Transfer
    5.3.2 Cross-modal Knowledge Transfer
  5.4 Experimental Settings
  5.5 Results and Analysis
    5.5.1 Ablation Study
  5.6 Related Work
  5.7 Conclusion
Chapter 6: Saliency-aware Knowledge Distillation for Multimodal Understanding
  6.1 Introduction
  6.2 Preliminaries
    6.2.1 Problem Definition
    6.2.2 Conventional Knowledge Distillation
  6.3 Analysis Setup
    6.3.1 Modality-specific Distillation
    6.3.2 Experimental Setup
    6.3.3 Datasets and Evaluation Metrics
  6.4 Modality Weighting Methods
    6.4.1 Population-based Weighting
    6.4.2 Saliency-based Weighting
    6.4.3 Weight Learning
  6.5 Empirical Analysis
    6.5.1 Hyperparameters
    6.5.2 Analysis
    6.5.3 Learning Curve
    6.5.4 Observation of Teacher's Predictions
    6.5.5 Case Study
  6.6 Related Work
  6.7 Conclusion
Chapter 7: Conclusions
  7.1 Summary
  7.2 Future Directions
Bibliography

List of Tables

2.1 Hand-crafted prompts. We study hand-crafted prompts on zero-shot and few-shot tasks. [Q] and [A] refer to question text and answer text, respectively. <text_1> is a sentinel token. We append image features to input text. Target prompts are "[A]" and "<text_1> [A]" in VQA. We use caption text as a target prompt in captioning.
2.2 Zero-shot VQA results. We test models without any training examples. VL-T5no-vqa is pre-trained without VQA datasets. Compared to larger models, Frozen and PICa-Full, our models outperform them or show comparable results.
2.3 Few-shot VQA results. We report average performance over 5 different splits. The sizes of the training and validation sets are 16 for our FewVLM and VL-T5no-vqa, and Frozen and PICa use 4 and 16 in-context training examples, respectively. For the fair comparison to Frozen, we include FewVLM∗base with 4 training and validation examples.
2.4 Zero-shot captioning results. We use the CIDEr and SPICE metrics for evaluation.
2.5 Few-shot captioning results. We report average performance over 5 different splits. We use the CIDEr and SPICE metrics for evaluation.
2.6 Prompt templates. We test different input prompts on VQAv2. [Q] refers to input question text. We use <text_1> [A] as target text. We append image features to input text.
2.7 5-way miniImageNet results. We evaluate FewVLM in a generative manner. The shot represents the number of training examples per class.
2.8 Zero-shot results of hand-crafted prompts. We test different input prompts in zero-shot predictions. We use the CIDEr metric for Flickr30k. Note that the zero-shot setting does not require target prompts.
2.9 Results on different pre-training objectives. We test our pre-training objectives to investigate how they affect zero-shot and few-shot performance. We train FewVLMbase with 16 training and validation examples.
3.1 Zero-shot results. We report performance on downstream tasks without any training data. Our model surpasses all baselines on classification tasks.
3.2 Few-shot results. We report performance on downstream tasks with 32 labeled examples for fine-tuning.
3.3 Results on RefCOCOg and Flickr30k-entities with 0 and 32 examples. We report recall@1 for Flickr30k-entities. †This model used the RefCOCOg dataset in the pre-training. ‡These models used the Flickr30k-entities dataset in the pre-training while ours did not.
3.4 VQA results with 0 and 32 examples. We report zero-/32-shot performance on the VQAv2 dataset. Flamingo has 3B or 80B parameters and uses in-context examples for inference, while our model has 310M parameters and uses the examples for fine-tuning.
3.5 Ablations on the pre-training objectives and hybrid sequences in pre-training. We report Q → AR for VCR, and R@1 for Flickr30k-entities.
4.1 A list of models used in the experiments: BERT [29], CLIP [98], VL-BERT [116], Oscar [66], FLAN-T5 [23], InstructBLIP [26], LLaMA2 [123], LLaVA [72], GPT-3.5 [13, 88], and GPT-4 [87]. We use 'gpt-3.5-turbo-0125' for GPT-3.5 and 'gpt-4-0613' for GPT-4.
4.2 Results on WinoViz in a zero-shot manner. We evaluate large models using 0 examples on both our single-hop and multi-hop datasets. We observe that these models performed well on the single-hop data; however, their performance significantly degrades on the multi-hop data.
4.3 Results on WinoViz with 4-shot in-context learning. We use FLAN-T5-XXL, GPT-3.5, and GPT-4 in this analysis. Standard prompting marginally improves their performance, while chain-of-thought prompting (CoT) is beneficial for GPT-3.5 in the multi-hop task. Interestingly, GPT-4 degrades with chain-of-thought prompting. We found that 16.9% of single-hop questions and 10.5% of multi-hop questions are unpredictable by GPT-4. GPT-4's performance on individual questions without these cases is 93.51% and 86.32% on single-hop and multi-hop questions, respectively.
4.4 Results on WinoViz after NLI training. We train encoder-only models on NLI datasets and choose an option by the highest probability of the 'entailment' class.
4.5 Results on pragmatic reasoning, visual knowledge reasoning, and our original data (combined). We study different types of reasoning in our data. We report individual accuracy.
4.6 Results on WinoViz with generated images. We use Stable Diffusion [105] to generate 5 images per premise sentence. We adopt majority voting at inference time to choose an option. FLAN-T5-Base (No imgs) refers to a model without any generated images, with a size comparable to CLIP-Large. FLAN-T5-XXL (No imgs) refers to a model without any generated images, while FLAN-T5-XXL (Captions) refers to a model with captions generated by BLIP2 on the generated images. Instead of directly inputting images into FLAN-T5, we extract captions from the generated images and use them as additional context. InstructBLIP uses generated images.
5.1 Downstream task data statistics. We create an in-house test set for PIQA and CSQA, and an in-house dev set for VP by splitting the train set.
5.2 Performance (accuracy) in the low-resource setting. We test models on diverse datasets with low-resource learning (64 and 128 training samples). We use captions in the MS COCO dataset for text knowledge transfer methods, and images and captions for cross-modal knowledge transfer methods. We report average performance on 64 and 128 training samples. Bold and underlined numbers refer to the best and second-best performance, respectively.
5.3 Performance (accuracy) in the fully supervised setting. Bold and underlined numbers refer to the best and second-best performance, respectively.
5.4 Performance (accuracy) on the GLUE benchmark. Bold and underlined numbers refer to the best and second-best performance, respectively.
5.5 Results of text knowledge transfer methods with different corpora. We pre-train the text knowledge transfer methods, MLM and TCL, with different corpora. CP is MS COCO captions, GK is GenericsKB, BC is BooksCorpus, and WT is WikiText. Bold and underlined numbers refer to the best and second-best performance, respectively.
6.1 Dataset statistics.
6.2 Main results. Mean results (±std) over five repetitions are reported. MSD outperforms all the KD approaches. Here, we use MSD on top of conventional KD [45]. Also, our weight learning for weights shows the best performance.
6.3 Improvement over KD approaches with MSD. MSD improves existing KD approaches.

List of Figures

2.1 Examples of VQA and Captioning tasks. In our setup, we convert the tasks into generative tasks in which models need to generate target text given input text and an image.
2.2 Illustration of FewVLM. This shows inference of FewVLM with prompt-based learning. Given a prompt template, we convert the question text into input text. The prompt helps the model generate correct answers.
2.3 Pre-training objectives. We pre-train FewVLM with masked language modeling (MaskedLM) and prefix language modeling (PrefixLM).
2.4 VQAv2 results on noisy prompts. We investigate different prompts on various training sizes. FewVLM is trained with our best hand-crafted prompt (P3), irrelevant prompts, noisy tokens, and random sentences. We list the prompt templates in Table 2.6. We use "<text_1> [A]" as our target prompt.
2.5 Flickr30k results on hand-crafted prompts. We investigate different hand-crafted prompts (Q1, Q2, and Q3) on various training sizes.
2.6 VQAv2 results on different target prompts. We investigate different target prompts with hand-crafted input prompts on various training sizes.
3.1 Examples of vision-language tasks. Vision-language tasks have different task formats, which makes it challenging to generalize in a zero-/few-shot way. In this work, we study the generalization of few-shot methods and propose GRILL, which can generalize to diverse VL tasks without introducing task-specific special representations or pre-trained object detectors.
3.2 Illustration of GRILL. Our model is a sequence-to-sequence transformer that uses a vision transformer (ViT) [31, 75] to process images with patch embeddings, where each patch represents a fixed-size region of the image. We replace the referring words with the corresponding visual patches.
3.3 Pre-training objectives. We illustrate our pre-training objectives. We include masked language modeling, prefix language modeling, and the discriminative objective as our pre-training objectives. Given an image-caption pair, we create proper inputs for each objective. Text in green is the target text of each objective.
3.4 Object-word alignments. To create hybrid sequences, we first get object-word alignments by object detection, object tag-word matching, and object-word alignments.
3.5 Performance with different input formats for inference on the zero-shot setup. We report Q → AR for VCR, and R@1 for Flickr30k-entities.
4.1 The WinoViz task. We investigate the divergent properties of an object and explore the reasoning abilities of language models pertaining to object attributes. The premise sentence depicts a scene involving a banana, and two hypothesis sentences describe the visual properties of a banana. The task is to choose the more plausible hypothesis given the premise. For the multi-hop version, we replace the visual attribute word with another object word which has a similar visual attribute.
4.2 Dataset Collection with Human Annotators. We collect our data through crowdsourcing efforts. The first step is to identify properties and visual attributes for an object, and the second step is to write natural sentences for each property and attribute. Sentences with properties are used as premise sentences and sentences with visual attributes are used as hypothesis sentences.
4.3 Examples of generated images. We generate images using Stable Diffusion [105]. In the second example, the bananas in both images are yellow, leading the model to select the incorrect option. The generated image examples do not assist in selecting a more plausible hypothesis option.
5.1 Reporting Bias. People tend to report what interests them rather than typical and general facts.
5.2 Illustration of different methods for transferring visual knowledge into a transformer-based language model. In this example, we assume an image-caption pair as input. (a) Masked language modeling [28] on image captions. (b) Text contrastive learning obtains a positive example by dropout representation to learn better sentence representations, while negative augmentation is optional. (c) Voken classification employs token-level text-to-image retrieval to transfer visual knowledge. (d) Cross-modal contrastive learning aims to train correct pairing of images and captions. (e) Cross-modal knowledge distillation transfers knowledge from the teacher model, which is trained by cross-modal contrastive learning, into the student model.
5.3 LM perturbation. We create adversarial negatives using language models.
5.4 Results on varying training sizes. We test methods with different training sizes.
6.1 Density of model outputs on Hateful-Memes: given multimodal samples as input (Multi), given only image modality as input (Image), and given only text modality as input (Text). KD denotes a student model with knowledge distillation, and the small model is a student model without distillation. We observe that there is still a prediction gap between the teacher and the student trained by KD. In this paper, we study saliency explanations for each modality and propose modality-specific distillation (MSD) to minimize the gap.
6.2 Saliency scores in the Hateful-Memes and MM-IMDB test sets. Saliency scores of the text modality are mostly higher than those of the image modality in MM-IMDB, while Hateful-Memes does not show such a global pattern.
6.3 Saliency scores in the SNLI-VE dev set. We observe that saliency scores for the text modality are correlated with labels. For the "Entailment" label, scores for the text modality are relatively lower, while they are higher for the "Contradiction" label.
6.4 Density of model outputs on samples of label 0 (not hateful) on the test set of Hateful-Memes: given multimodal samples as input (Multi), given only image modality as input (Image), and given only text modality as input (Text). MSD with the weight-learning approach minimizes the gap between the teacher and the student trained by KD.
6.5 Kullback-Leibler divergence on the MM-IMDB test set between the teacher's outputs and other models' outputs. This is a measure of how different the teacher's probability distribution is from the other models'. The lower the divergence, the closer a model is to the teacher.
6.6 Teacher-student consistency ratio. We investigate the student model's sensitivity to changes in modalities. A higher ratio indicates that its sensitivity is closer to the teacher's.
6.7 Test accuracy of a student on SNLI-VE during training, comparing knowledge distillation (KD) and modality-specific distillation (MSD) with population-based weighting, instance-wise weighting, and weight learning for weights.
6.8 Prediction probabilities of test samples for different modalities. Black points correspond to the predictions of samples with both modalities (original input), red points to the image modality only, and blue points to the text modality only. The samples are ordered based on their multimodal output probabilities. There is a strong correlation between multimodal predictions and predictions from the text modality in MM-IMDB, while there is no such global pattern in Hateful-Memes.
6.9 A multimodal violating sample (Left). We further replaced its image modality with a background picture that makes it benign and examined models on both examples (Right).

Abstract

Human intelligence is inherently rooted in the integration of perception and interpretation. When humans interact with the visual world or comprehend language, they simultaneously interpret raw observations through a mental model of their surroundings. This intuitive framework allows us to comprehend a wide array of concepts, from basic objects and actions to complex and abstract scenarios. The goal of Artificial Intelligence (AI) is to replicate this human-like reasoning by developing agents capable of similar cognitive processes.

In recent years, deep learning has seen remarkable progress in natural language processing (NLP) and computer vision (CV). In particular, language models (LMs) and vision-language models (VLMs) have showcased impressive abilities across various tasks, including achieving human-comparable performance in tasks like reading comprehension and question answering. Furthermore, these models can generate elaborate narratives and intricate visuals, showcasing their versatility and creative capabilities. However, language models have struggled with the challenge of developing reasoning abilities and acquiring knowledge from experience, despite these being innate for humans. Obtaining knowledge by observing the visual world is challenging because that knowledge is often not explicitly described in text. Also, cross-task generalization, a key aspect of human intelligence, remains another big challenge for current models. The ability to perform one task given knowledge from another is a fundamental cognitive skill, yet models struggle to replicate this proficiency.

In this thesis, we aim to build a reasoner that can perform complex reasoning about the physical world and generalize across vision-language tasks. We present several lines of work to bridge the visual knowledge gaps in pre-trained models. We first discuss how we can improve zero-/few-shot learning of a smaller model with prompts; we introduce FewVLM, which is relatively smaller than recent few-shot learners. Our model is evaluated on visual question answering and captioning tasks in a zero-/few-shot way. Then, we introduce GRILL, a VL model that generalizes to diverse tasks including visual question answering, captioning, and grounding with no or very few training instances. Next, we present WinoViz, a text-only evaluation dataset consisting of 5,606 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states. In addition, we investigate whether language models can be improved using knowledge transfer.
For this, we explore two types of knowledge transfer: text knowledge transfer and cross-modal knowledge transfer using image-caption datasets. Finally, we perform a large-scale empirical study to investigate the importance and effects of the text and image modalities in knowledge distillation. We introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality.

Chapter 1: Introduction

Human intelligence is inherently rooted in the integration of perception and interpretation. When humans engage with the visual world or comprehend language, they simultaneously interpret raw observations through a mental model of their surroundings. This common-sense mental model enables us to conceptualize a wide range of elements, from simple objects and actions to more intricate and abstract situations. The objective of Artificial Intelligence (AI) is to emulate human-level reasoning by constructing agents capable of similar cognitive processes.

Deep learning has witnessed remarkable advancements in natural language processing (NLP) and computer vision (CV) in recent years. Notably, language models (LMs) and vision-language models (VLMs) have demonstrated remarkable proficiency in various tasks, such as achieving human-level performance in reading comprehension and question answering. Moreover, these models exhibit the ability to generate intricate stories and sophisticated images, showcasing their versatility and creative capabilities.

However, language models have struggled with the challenge of developing reasoning abilities and acquiring knowledge from experience, despite these being innate for humans. Humans effortlessly enhance their knowledge by observing the visual world through their eyes. However, obtaining this type of knowledge presents difficulties because it is often not explicitly described in text format. Overcoming these challenges necessitates visual grounding, which involves establishing connections and associations between language and visual information to facilitate comprehension and interpretation of the visual world. On the other hand, cross-task generalization, a key aspect of human intelligence, remains a formidable hurdle for current models. The ability to perform one task given knowledge from another is a fundamental cognitive skill, yet models struggle to replicate this proficiency. Tasks like Visual Question Answering (VQA) [37], image captioning [144], and phrase grounding [94] showcase the difficulty in seamlessly transitioning between different types of tasks, revealing the need for advancements in cross-task generalization.

In this thesis, we aim to build a reasoner that can perform complex reasoning about the physical world and generalize across vision-language tasks. The central question revolves around assessing whether LMs and VLMs possess human-like reasoning capabilities and, if not, how these capabilities can be enhanced. Areas such as counterfactual reasoning, commonsense reasoning, VL few-shot reasoning, and cross-task generalization are scrutinized, aiming to bridge the gap between the current capabilities of AI models and the nuanced reasoning abilities exhibited by humans.

1.1 Contributions

• We study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). We analyze the effect of diverse prompts for few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning achieves comparable results to a 246× larger model, PICa [142].
• We introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses state-of-the-art few-shot methods.

• We present WinoViz, a text-only evaluation dataset consisting of 5,606 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states. We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task.

• We investigate whether language models can be improved using knowledge transfer. We explore two types of knowledge transfer: (1) text knowledge transfer using image captions that may contain enriched visual knowledge, and (2) cross-modal knowledge transfer using both images and captions with vision-language training objectives. On 5 downstream tasks that may need visual knowledge to solve the problem, we perform extensive empirical comparisons over the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.

• We perform a large-scale empirical study to investigate the importance and effects of the text and image modalities in knowledge distillation. Furthermore, we introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality. The idea aims at mimicking a teacher's modality-specific predictions by introducing auxiliary loss terms for each modality. Because each modality has different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses.

Chapter 2: Low-resource Prompt-based Learning for Vision-Language Models

Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy for real-world applications due to their impractically huge sizes and slow inference speed. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts for few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen [125], which is 31× larger than FewVLM, by 18.2 points and achieves comparable results to a 246× larger model, PICa [142].
In our analysis, we observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as models with hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM

Figure 2.1: Examples of VQA and Captioning tasks. In our setup, we convert the tasks into generative tasks in which models need to generate target text given input text and an image.

2.1 Introduction

Fine-tuning large pre-trained language models (PLMs) has led to strong results in various domains including vision-language tasks [28, 100, 13, 98]. Such large PLMs can learn a new task with a few examples or generalize to a new task without fine-tuning on any training examples, i.e., few-shot and zero-shot learning [13, 98, 125]. Few-shot learning overcomes the challenges of data-hungry supervised learning, where collecting human-labeled data is costly and slow. However, recent few-shot models such as GPT-3 [13], Frozen [125], and PICa [142] are too large to deploy on small or moderate computing machines due to their gigantic model sizes.

In this paper, we study low-resource learning of VL tasks with our proposed method, FewVLM, a moderate-sized vision-language model, in which we fine-tune the model with no or a handful of training examples. For FewVLM, we pre-train a sequence-to-sequence transformer model [22, 100] with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). This setup is more practical in that training and inference can be run economically using standard computing hardware, and it is expensive to obtain a large number of quality training examples in the real world.

Figure 2.2: Illustration of FewVLM. This shows inference of FewVLM with prompt-based learning. Given a prompt template, we convert the question text into input text. The prompt helps the model generate correct answers.

In such a few-shot setting, task-specific prompts or task descriptions are important and have shown effectiveness in few-shot NLP tasks [35, 98, 109, 110, 13]. To extend this success to VL tasks, we aim to answer the following questions for prompt-based low-resource VL learning. Q1) How does prompt design affect zero/few-shot learning on new tasks? Q2) Does prompt design still matter given larger training data? Q3) How do different pre-training objectives affect zero/few-shot learning? To answer these questions, we explore various prompt formats, including hand-crafted and noisy prompts, on zero/few-shot VL learning datasets. In addition, we study pre-training objectives on few-shot tasks: prefix language modeling (PrefixLM), inspired by Raffel et al. [100], and masked language modeling (MaskedLM). To this end, we investigate the model's performance on few-shot VL tasks including visual question answering [37, 79, 47], captioning [2, 144] (Fig. 2.1), and miniImageNet [129].
In our empirical analysis, our FewVLM with prompt-based learning outperforms Frozen [125], which is 31× larger than FewVLM, by 18.2 points on zero-shot VQAv2 and achieves comparable results to a 246× larger model, PICa [142]. Furthermore, we observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance on new tasks (§2.6.2 and §2.6.3), (2) models with noisy prompts learn as quickly as models with hand-crafted prompts given larger training data (§2.6.5), and (3) MaskedLM helps few-shot VQA tasks while PrefixLM boosts captioning performance (§2.6.6).

2.2 Related Work

Vision-language few-shot learning. Recently, several few-shot learners for vision-language tasks were proposed, including GPT [99, 13], Frozen [125], PICa [142], and SimVLM [134]. Frozen [125] is a large language model based on GPT-2 [99] that is transformed into a multimodal few-shot learner by extending soft prompting to incorporate a set of images and text. Their approach shows few-shot capability on visual question answering and image classification tasks. Similarly, PICa [142] uses GPT-3 [13] to solve VQA tasks in a few-shot manner by providing a few in-context VQA examples. It converts images into textual descriptions so that GPT-3 can understand the images. SimVLM [134] is trained with prefix language modeling on weakly supervised datasets. It demonstrates its effectiveness on a zero-shot captioning task. While these models achieve improvements on few-shot tasks, they are impractical to use in real-world applications due to their model sizes.

Language model prompting. Providing prompts or task descriptions plays a vital role in improving pre-trained language models in many tasks [35, 98, 109, 110, 13]. Among them, GPT models [99, 13] achieved great success in prompting or task demonstrations in NLP tasks. In light of this direction, prompt-based approaches improve small pre-trained models in few-shot text classification tasks [35, 109, 110]. CLIP [98] also explores prompt templates for image classification, which affect zero-shot performance. We follow these core ideas and aim to improve zero-shot and few-shot performance using prompts in vision-language tasks.

Figure 2.3: Pre-training objectives. We pre-train FewVLM with masked language modeling (MaskedLM) and prefix language modeling (PrefixLM).

2.3 Analysis Setup

In this work, we study the zero-shot and few-shot performance of vision-language models L. We introduce our analysis setup: problem formulation, analysis questions, downstream tasks and datasets, evaluation metrics, and baselines.

2.3.1 Problem Formulation

For zero-shot tasks, a pre-trained VL model L has no access to the training set Dtrain or the development set Ddev, and directly makes inference on the test instances Dtest. For few-shot tasks, we compose a dev set Ddev from the training data and ensure that |Dtrain| = |Ddev|, following Perez, Kiela, and Cho [91] and Gao, Fisch, and Chen [35], to tune the hyper-parameters and select the model. We limit the sizes of the training and development sets to meet the goal of learning from limited data. The sizes of Dtrain and Ddev are small — i.e., we set both to 16 in our study.
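To make the split protocol concrete, the following is a minimal sketch of constructing one few-shot split with |Dtrain| = |Ddev| = 16. It is an illustration under our reading of the setup, not the released code; the variable and function names are hypothetical.

```python
import random

def make_fewshot_split(examples, k=16, seed=0):
    """Sample a few-shot split with |D_train| = |D_dev| = k, as in Sec. 2.3.1.

    `examples` is any list of task instances (e.g., (image, question, answer)
    triples); the names here are illustrative, not the thesis code.
    """
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    d_train, d_dev = pool[:k], pool[k:2 * k]
    return d_train, d_dev

# Five splits (different seeds) are sampled and results are averaged (Sec. 2.3.4):
# splits = [make_fewshot_split(vqa_train_examples, k=16, seed=s) for s in range(5)]
```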
Table 2.1: Hand-crafted prompts. We study hand-crafted prompts on zero-shot and few-shot tasks. [Q] and [A] refer to question text and answer text, respectively. <text_1> is a sentinel token. We append image features to input text. Target prompts are "[A]" and "<text_1> [A]" in VQA. We use caption text as a target prompt in captioning.

Task | ID | Input prompt | Example
VQA | P1 | [Q] <text_1> | input: What position is this man playing? <text_1> / output: <text_1> pitcher
VQA | P2 | question: [Q] answer: | input: question: What position is this man playing? answer: / output: <text_1> pitcher
VQA | P3 | question: [Q] answer: <text_1> | input: question: What position is this man playing? answer: <text_1> / output: <text_1> pitcher
Captioning | Q1 | a picture of | input: a picture of / output: a small black dog standing over a plate of food.
Captioning | Q2 | a photo of | input: a photo of / output: a small black dog standing over a plate of food.
Captioning | Q3 | an image of | input: an image of / output: a small black dog standing over a plate of food.

2.3.2 Analysis Questions

We aim to answer the following questions in this study through experiments on multiple VL datasets.

Q1) How does prompt design affect zero/few-shot learning on new tasks? Providing a pre-trained language model with task-specific prompts or task descriptions significantly improves zero-shot and few-shot performance in NLP domains [35, 109, 110, 13]. For this question, we test several ad-hoc prompts on vision-language tasks and analyze how much zero-shot and few-shot performance is affected by different prompts, both hand-crafted and noisy, in Sec. 2.6.5.

Q2) Does prompt design still matter given larger training data? As we will see in our experiments, prompts affect zero/few-shot performance. However, prompts may have different effects when models are given different sizes of training data. To answer this question, we train models with different sizes of training data and various prompts, and compare the performance between different prompts.

Q3) How do different pre-training objectives affect zero/few-shot performance? We study the effect of two different pre-training objectives on few-shot performance: prefix language modeling (PrefixLM), inspired by Raffel et al. [100], and masked language modeling (MaskedLM). In this setup, we pre-train our model with different objectives and test the model on zero-shot and few-shot tasks in Sec. 2.6.6.

2.3.3 Downstream Tasks and Datasets

In this work, we mainly focus on three tasks: visual question answering, captioning, and categorical learning. The visual question answering task requires models to answer a question about a given context image. We convert the visual question answering task into a generation task so that the model can generate answers in the zero-shot setting. The captioning task requires a model to generate descriptions for a given context image. Categorical learning requires a model to choose the correct category or class. We evaluate our model in an open-ended fashion to quantify fast learning of categories, in which it must generate correct labels, unlike other classification methods.

We include VQAv2 [37], OK-VQA [79], and GQA [47] for visual question answering, and NoCaps [2] and Flickr30k [144] for image captioning. We use the Karpathy split [53] for Flickr30k, which resplits train and val images into 29,000 / 1,014 / 1,000 for train / validation / test. For categorical learning, we include miniImageNet [129], a meta-learning dataset.
Following [125], we use only the meta-test data to evaluate FewVLM in a few-shot manner and test on a 5-way k-shot setup, where 5 classes and k examples per class are given.∗

2.3.4 Evaluation Metrics

To evaluate few-shot performance, we randomly sample 5 different training and dev splits and measure the average performance over the 5 splits. We fine-tune the vision-language models for 200 epochs in the few-shot setup and choose the best checkpoint on the dev set. NoCaps does not have training data, so we use the training data from COCO captioning in the experiments, following Wang et al. [134]. We evaluate on the VQAv2 validation set, GQA test-dev, the OK-VQA test set, the test set of the Karpathy split for Flickr30k captioning, and the NoCaps validation set. We adopt accuracy for the VQA datasets and miniImageNet, and CIDEr [128] and SPICE [4] as evaluation metrics for captioning.

∗ For VQA and captioning, we include k samples in total, not per class.

2.3.5 Baselines

We evaluate strong zero/few-shot vision-language learners for comparison: Frozen [125] and PICa [142] for the VQA datasets, and SimVLM [134] for the captioning datasets. We include Unified VLP [153] for few-shot VQAv2 and Flickr30k. Also, we compare them with fully fine-tuned models Lfull as upper bounds of few-shot models for each task; these models are fine-tuned on the entire datasets while few-shot models can access only a small amount of data. For fully fine-tuned models Lfull, we borrow numbers from UNITERlarge [20] for VQAv2, Oscar [66] for GQA, SimVLM [134] and VinVL [150] for NoCaps CIDEr and SPICE respectively, and Unified VLP [153] for Flickr30k captioning. We include VL-T5no-vqa, which is pre-trained without visual question answering datasets [22], as a baseline. For miniImageNet, we include Frozen and AFHN [61]. Frozen is designed for few-shot learning, while AFHN is designed for meta learning and is smaller and faster.

2.4 Method

Before diving into the analysis, we introduce our model, FewVLM, which performs zero/few-shot learning on VL tasks and allows us to answer the analysis questions we raised. We introduce the FewVLM architecture and pre-training objectives.

2.4.1 Encoder-decoder Vision-language Model

We adopt an encoder-decoder architecture [22, 127] to encode visual and text inputs and generate target text. We represent an input image with 36 object regions from a Faster R-CNN [104] trained on Visual Genome [57]. The sets of region representations are fed into the encoder by appending them to the text, following Cho et al. [22]. We train the model parameters θ by minimizing the negative log-likelihood of the target text tokens y given input text x and image v:

Lθ = − Σ_{i=1}^{|y|} log Pθ(yi | y<i, x, v).   (2.1)

2.4.2 Pre-training Objectives

We pre-train FewVLM with two objectives, masked language modeling (MaskedLM) and prefix language modeling (PrefixLM) (Fig. 2.3). For MaskedLM, we randomly mask 15% of the input text tokens and replace them with sentinel tokens; the masked text is then fed into the encoder, and the decoder generates the masked spans as target text.

Pre-training data. To pre-train FewVLM, we collect image-caption data from MS COCO [70, 19] and Visual Genome (VG) [57]. The pre-training data contain 6M image-text pairs and 180K distinct images.
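The objective in Eq. (2.1) is a standard sequence-to-sequence negative log-likelihood. Below is a minimal PyTorch sketch of that loss, assuming the decoder logits were produced by an encoder that received the text tokens with the 36 region features appended; the function and tensor names are ours, not from the released implementation.

```python
import torch
import torch.nn.functional as F

def seq2seq_nll(decoder_logits, target_ids, pad_id=0):
    """Eq. (2.1): L = -sum_i log P(y_i | y_<i, x, v), summed over non-pad target tokens.

    decoder_logits: (batch, target_len, vocab_size) scores from the decoder,
        which attends to the encoder output of [text tokens; 36 region features].
    target_ids: (batch, target_len) gold target token ids.
    """
    vocab_size = decoder_logits.size(-1)
    return F.cross_entropy(
        decoder_logits.reshape(-1, vocab_size),
        target_ids.reshape(-1),
        ignore_index=pad_id,  # skip padding positions
        reduction="sum",      # sum over target tokens, as in the equation
    )

# Example with random tensors:
# logits = torch.randn(2, 5, 32000)
# targets = torch.randint(0, 32000, (2, 5))
# loss = seq2seq_nll(logits, targets)
```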
Table 2.4: Zero-shot captioning results. We use the CIDEr and SPICE metrics for evaluation.

Model | Model size | NoCaps CIDEr | NoCaps SPICE | Flickr30k CIDEr | Flickr30k SPICE
Unified VLP | 122M | - | - | 24.9 | 7.2
VL-T5no-vqa | 224M | 4.4 | 5.3 | 2.6 | 2.0
SimVLMhuge | - | 101.4 | - | - | -
FewVLMbase | 224M | 42.2 | 8.5 | 31.0 | 10.0
FewVLMlarge | 740M | 47.7 | 9.1 | 36.5 | 10.7

Table 2.5: Few-shot captioning results. We report average performance over 5 different splits. We use the CIDEr and SPICE metrics for evaluation.

Model | Model size | NoCaps CIDEr | NoCaps SPICE | Flickr30k CIDEr | Flickr30k SPICE
Unified VLP | 122M | - | - | 28.8 | 9.4
VL-T5no-vqa | 224M | 22.0 | 6.8 | 12.8 | 8.3
FewVLMbase | 224M | 48.6 | 10.0 | 32.6 | 12.8
FewVLMlarge | 740M | 53.1 | 10.4 | 37.0 | 13.5
Fine-tuned Lfull | - | 112.2 | 13.1 | 67.4 | 17.0

2.5 Low-resource Adaptation

In downstream tasks, we train our model with few-shot examples. Fig. 2.2 illustrates FewVLM at inference time. Given a prompt template P, we first get the input text and target text using the template, x, y = P(input, label). Then we train the model parameters by minimizing the negative log-likelihood in Eq. (2.1). At inference, we use the same prompt and the model generates the label text. We obtain the final label by removing the target prompt template.

2.5.1 Prompt Design

Prompts affect the performance of vision-language models [22]; we study the effect of different prompts on the zero-shot and few-shot performance on downstream tasks. Tables 2.1 and 2.6 show the prompts we used in our experiments.

2.5.1.1 Visual Question Answering

The visual question answering tasks (VQA, OK-VQA, and GQA) require models to answer a question about a given context image. Recent approaches [20, 119, 116, 63, 66] tackle visual question answering as multi-label classification over a predefined set of answer candidates. Instead, we approach visual question answering as a generation task so that the model can produce the answers without introducing any task-specific heads. In this setup, prompts act as constraints that guide the models to generate answers in the proper format; without prompts, models might generate a full sentence for VQA, which is not the correct format. Therefore, we study several prompts for input and output, as shown in Tables 2.1 and 2.6; we explore hand-crafted prompts (Table 2.1) and noisy prompts for an ablation study (Table 2.6).

Table 2.6: Prompt templates. We test different input prompts on VQAv2. [Q] refers to input question text. We use <text_1> [A] as target text. We append image features to input text.

Input prompt template | Category
Fill in the blank in the below sentence: [Q] | irrelevant prompts
Question: [Q] True or False? | irrelevant prompts
[Q] What color is the floor? | irrelevant prompts
Paraphrase this into a different question? [Q] | irrelevant prompts
[Q] How many are they? | irrelevant prompts
nezg publice passed Dream [Q] | noisy tokens
benefic video starting garbagetap Talent summary [Q] | noisy tokens
gestion Bun dates youngest batteriesfeder organisationoyez [Q] | noisy tokens
[Q] chefernt,iei geekutilisées plantingasta Pest principiiMF saddle véritable | noisy tokens
[Q] composant emergency laissé Klägereiniger swipe concentrateOSS/18 rewardprepaid | noisy tokens
[Q] A black dog is sitting on a couch. | random sentences
[Q] A man working at a kitchen counter in a room illuminated by sunlight. | random sentences
A brown purse is sitting on a green bench. [Q] | random sentences
A television that is sitting next to signs. [Q] | random sentences
[Q] A woman is wearing white pants. | random sentences

Hand-crafted prompts. For input prompts, we explore three different templates: "[Q] <text_1>", "question: [Q] answer:", and "question: [Q] answer: <text_1>" with the <text_1> sentinel token at the end (Table 2.1). Similarly to masked language modeling, we expect the sentinel token to prompt the model to generate the missing words. For target prompts, we explore two different templates: "[A]" (an answer) and "<text_1> [A]" (an answer with a sentinel token). Here, we aim to mimic MaskedLM's target text format, since the similar format helps the model quickly adapt to the new task. We refer to each prompt by its ID, as listed in Table 2.1.
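To illustrate how a template P maps an example to (input text, target text) and how the final label is recovered by stripping the target prompt (Sec. 2.5), here is a small sketch using the P3 input prompt and the "<text_1> [A]" target prompt from Table 2.1; the helper names are hypothetical and this is not the released code.

```python
def apply_p3(question, answer=None):
    """Build (input_text, target_text) with the P3 prompt from Table 2.1."""
    input_text = f"question: {question} answer: <text_1>"
    target_text = None if answer is None else f"<text_1> {answer}"
    return input_text, target_text

def extract_label(generated_text):
    """Recover the final label by removing the target prompt template."""
    return generated_text.replace("<text_1>", "").strip()

x, y = apply_p3("What position is this man playing?", "pitcher")
assert x == "question: What position is this man playing? answer: <text_1>"
assert y == "<text_1> pitcher"
assert extract_label("<text_1> pitcher") == "pitcher"
```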
Noisy prompts. To understand the effect of noisy prompts in zero/few-shot learning, we include irrelevant prompts, noisy tokens, and random sentences, as in Table 2.6. Irrelevant prompts are random questions or instructions that mislead models into answering the wrong question or following an irrelevant instruction. Noisy tokens are randomly selected from T5's vocabulary, so we test how robust our model is to random tokens. Finally, random sentences are captions from MS COCO, which give false information to the models.

2.5.1.2 Captioning

In NoCaps and Flickr30k, we explore three hand-crafted input prompts: "a picture of", "a photo of", and "an image of". We study the effect of different word choices in this captioning task. While the three phrases have similar meanings, they show different performance on zero-shot and few-shot tasks, as we will see in our experiments. For target prompts, we simply train the model with the original caption without any additional prompts.

2.5.1.3 MiniImageNet

In miniImageNet, we train our model with a hand-crafted input prompt, "This is <text_1>," and target prompt, "<text_1> [A]." We compare our model with and without prompts on this dataset to study whether prompts are helpful in categorical learning.

2.6 Results and Discussion

In this section, we first discuss our main results on zero-shot and few-shot tasks and then answer the questions we raised: does prompt design matter in zero/few-shot learning?

2.6.1 Experiment Details

For pre-training, we set the batch size to 1,280 and 800 for FewVLMbase and FewVLMlarge, respectively, and pre-train them for 30 epochs. We use a learning rate of 1e-4 with 5% linear warmup. For few-shot learning, we train models for 200 epochs with a learning rate of 5e-5 and 5% linear warmup, and choose the best checkpoint on the dev set. For FewVLM, we use "question: [Q] answer: <text_1>" (P3) as the input prompt and "<text_1> [A]" as the target prompt for visual question answering, and "an image of" (Q3) as the input prompt for captioning, which show the best performance. We study the effect of different prompts in Sec. 2.6.5. The sizes of Dtrain and Ddev are 16 for the VQA and captioning tasks. For miniImageNet, we use "This is <text_1>," and "<text_1> [A]" as input and target prompts. On this dataset, we test with {1, 3, 5} shots per class.

2.6.2 Performance on Zero-shot Learning

We evaluate the existing models in a zero-shot manner, in which models do not have access to any training data. Tables 2.2 and 2.4 show the results on the VQA and captioning datasets, respectively. First, FewVLM with the hand-crafted prompt (P3) achieves better performance than the other baselines on the VQA datasets. In particular, our FewVLMbase significantly outperforms Frozen, which is about 31× larger than ours. Also, PICa, based on GPT-3 [13], shows the best performance on OK-VQA. It is noticeable that our FewVLMlarge, a 246× smaller model, achieves a comparable result to PICa. Compared to VL-T5no-vqa, which has the same architecture as ours, FewVLMbase improves VQAv2 performance by about 30 points. On NoCaps, SimVLMhuge shows the best performance, while our FewVLMbase significantly improves over VL-T5no-vqa. As we will see in later sections, our pre-training objectives and prompts boost both the VQA and captioning performance.
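For quick reference, the hyperparameters reported in Sec. 2.6.1 can be collected into a single configuration sketch; the dictionary layout below is our own summary, not a file from the repository.

```python
# Hyperparameters copied from Sec. 2.6.1; the structure of this dictionary is
# illustrative and not part of the released code.
FEWVLM_CONFIG = {
    "pretraining": {
        "batch_size": {"base": 1280, "large": 800},
        "epochs": 30,
        "learning_rate": 1e-4,
        "warmup": 0.05,          # 5% linear warmup
    },
    "few_shot_finetuning": {
        "epochs": 200,
        "learning_rate": 5e-5,
        "warmup": 0.05,
        "train_size": 16,        # |D_train| = |D_dev| = 16
        "dev_size": 16,
    },
    "prompts": {
        "vqa_input": "question: [Q] answer: <text_1>",   # P3
        "vqa_target": "<text_1> [A]",
        "caption_input": "an image of",                  # Q3
    },
}
```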
2.6.3 Performance on Few-shot Learning
Tables 2.3 and 2.5 show the few-shot performance on VQA and captioning datasets. The sizes of the training and validation sets are 16 for FewVLM, VL-T5no-vqa, and Unified VLP; Frozen and PICa use 4 and 16 in-context demonstration examples, respectively.

Table 2.7: 5-way miniImageNet results. We evaluate FewVLM in a generative manner. The shot represents the number of training examples per class.
Model | Model size | 1 shot | 3 shots | 5 shots
Frozen | 7B | 14.5 | 34.7 | 33.8
FewVLMbase (no prompt) | 224M | 48.0 | 75.0 | 82.6
FewVLMbase | 224M | 57.0 | 78.0 | 84.2
FewVLMlarge | 740M | 57.1 | 78.3 | 84.4
AFHN | - | 62.3 | - | 78.1

On VQAv2 and OK-VQA, PICa shows the best performance, while our FewVLMlarge achieves a comparable result on VQAv2. OK-VQA requires external knowledge to answer, unlike other VQA datasets, so larger models and large pre-training data (prior knowledge) are necessary to improve. Interestingly, FewVLM∗base, which is trained with 4 training examples, outperforms Frozen. On captioning data, FewVLMbase notably outperforms VL-T5no-vqa by 31.1 points on NoCaps CIDEr. Unified VLP slightly underperforms FewVLM on the Flickr30k captioning task; we conjecture this is because its architecture is based on an encoder-decoder transformer and it is pre-trained with a captioning task [153].

2.6.4 MiniImageNet
Table 2.7 shows results on miniImageNet, where models must choose the correct class for each image. We train and evaluate FewVLM in a generative manner; the model must generate the correct label text to receive credit. FewVLM significantly outperforms Frozen in all shots. Note that we train FewVLM with a few training samples, while Frozen uses them as in-context demonstrations. Interestingly, FewVLM with a hand-crafted prompt improves performance substantially in the 1-shot case, while it improves only marginally in the 5-shot case.

Table 2.8: Zero-shot results of hand-crafted prompts. We test different input prompts in zero-shot predictions. We use the CIDEr metric for Flickr30k. Note that the zero-shot setting does not require target prompts.
VQAv2: no prompt 3.7 | P1 9.9 | P2 19.0 | P3 43.4
Flickr30k: no prompt 9.6 | Q1 15.2 | Q2 25.6 | Q3 31.0

2.6.5 Study of Prompt Design
Here we examine the effect of different prompts on FewVLMbase in Table 2.8 and Figs. 2.4, 2.5, and 2.6. We test the model on the VQAv2 and Flickr30k datasets.

2.6.5.1 Zero-shot Predictions
Table 2.8 shows the zero-shot performance on VQAv2 and Flickr30k. We observe that zero-shot results are remarkably affected by input prompts on both datasets. For input prompts, <text_1> in P1 and P3 helps the zero-shot predictions significantly compared to "no prompt" and P2. We conjecture that <text_1> guides the model to predict masked spans similarly to MaskedLM, which improves performance. On Flickr30k, we examine different word choices of prompts: "a picture of" (Q1), "a photo of" (Q2), and "an image of" (Q3). For instance, using "an image of" outperforms using no prompt by 21.4 points. It is noticeable that different word choices significantly affect the zero-shot results.

2.6.5.2 Few-shot Predictions
We study various input prompts including irrelevant prompts, noisy tokens, and random sentences on VQAv2 (Fig. 2.4). First, noisy prompts and no prompt achieve near 0 accuracy in the zero-shot setting.
In few-shot predictions, FewVLM with noisy prompts learns as quickly as with hand-crafted prompts given larger data.

Figure 2.4: VQAv2 results on noisy prompts. We investigate different prompts on various training sizes. FewVLM is trained with our best hand-crafted prompt (P3), irrelevant prompts, noisy tokens, and random sentences. We list the prompt templates in Table 2.6. We use "<text_1> [A]" as our target prompt.

Figure 2.5: Flickr30k results on hand-crafted prompts. We investigate different hand-crafted prompts (Q1, Q2, and Q3) on various training sizes.

For example, our model with noisy prompts achieves comparable results to the best hand-crafted prompt. Among the different types of noisy prompts, random sentences deteriorate performance the most. This is because the random sentences come from captions in MS COCO, so the model might choose the answer from the wrong captions rather than from the images. Interestingly, no prompt outperforms the other noisy prompts and even performs similarly to or better than the hand-crafted prompt with larger training data. We observe a similar phenomenon on Flickr30k; no prompt performs similarly to hand-crafted prompts in Fig. 2.5.

Figure 2.6: VQAv2 results on different target prompts. We investigate different target prompts with hand-crafted input prompts on various training sizes.

In addition, we explore two different target prompts, "<text_1> [A]" and "[A]." We try to mimic MaskedLM's target text format, so we add "<text_1>" to the target prompt on VQA; this might help the model adapt quickly to a new task since the two share the same target format. In Fig. 2.6, we notice an interesting phenomenon: the target prompt "[A]" shows a larger variance than the other, suggesting that introducing "<text_1>" helps the model quickly adapt to a new task. However, both prompts show similar results given larger training data, e.g., 300 examples.

2.6.6 Pre-training Objectives
We investigate how pre-training objectives affect different tasks. We pre-train FewVLM with different pre-training objectives: masked language modeling (MaskedLM) and prefix language modeling (PrefixLM).

Table 2.9: Results on different pre-training objectives. We test our pre-training objectives to investigate how they affect zero-shot and few-shot performance. We train FewVLMbase with 16 training and validation examples.
Setting | Objective | VQAv2 | GQA | Flickr30k CIDEr
Zero-shot | MaskedLM | 42.4 | 25.1 | 4.6
Zero-shot | PrefixLM | 11.9 | 6.7 | 26.8
Zero-shot | MaskedLM + PrefixLM | 43.4 | 27.0 | 31.0
Few-shot | MaskedLM | 46.0 | 31.4 | 18.5
Few-shot | PrefixLM | 40.8 | 27.6 | 31.8
Few-shot | MaskedLM + PrefixLM | 48.2 | 32.2 | 32.6

In Table 2.9, we observe that MaskedLM helps VQA tasks while PrefixLM helps captioning tasks in both zero-shot and few-shot settings. We conjecture that this is because MaskedLM predicts spans, which is analogous to predicting correct answers to questions, while PrefixLM generates the continuation of a given prefix, which is similar to captioning. In other words, if the pre-training task is similar to the downstream task, it helps performance further. When pre-training with both objectives, they create a synergistic effect and thus improve cross-task generalization.
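As an illustration of the difference between the two objectives, the sketch below shows how MaskedLM and PrefixLM training pairs could be built from a single caption with T5-style numbered sentinel tokens. The masking rate and span sampling here are simplified (single-token masks, a uniformly random prefix split) and the helper names are assumptions, not the exact pre-processing used in this thesis.

```python
# A minimal, simplified sketch of constructing MaskedLM and PrefixLM
# input/target pairs from one caption, assuming T5-style sentinel tokens.
import random

def masked_lm_example(caption: str, mask_ratio: float = 0.15):
    tokens = caption.split()
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = set(random.sample(range(len(tokens)), n_mask))
    source, target, sid = [], [], 0
    for i, tok in enumerate(tokens):
        if i in masked_idx:
            source.append(f"<text_{sid}>")        # sentinel replaces the token
            target.extend([f"<text_{sid}>", tok])  # decoder predicts the span
            sid += 1
        else:
            source.append(tok)
    return " ".join(source), " ".join(target)

def prefix_lm_example(caption: str):
    tokens = caption.split()
    split = random.randint(1, len(tokens) - 1)  # keep both sides non-empty
    return " ".join(tokens[:split]), " ".join(tokens[split:])

caption = "A giraffe standing near some rocks on the grass"
print(masked_lm_example(caption))
print(prefix_lm_example(caption))
```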
2.7 Conclusion
In this work, we present FewVLM, a few-shot prompt-based learner for vision-language tasks. On diverse datasets, FewVLM outperforms baselines and shows comparable results to PICa, which is 246× larger than our model. We observe that prompts are vital in zero-shot and few-shot tasks and that each pre-training objective helps different few-shot tasks. We also find that models trained with larger training data are not significantly affected by noisy prompts. Future work includes exploring automatic prompt generation and diverse formats of few-shot tasks such as multiple-choice VQA; finding optimal prompts currently requires exhaustive engineering to achieve the best performance. We leave the exploration of these directions to future investigations.

Chapter 3
Grounded Vision-language Pre-training

Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks, including grounding and generation tasks, has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images, such as visual commonsense reasoning [147] or NLVR2 [117]. In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.

3.1 Introduction
Generalization to unseen tasks has been investigated on zero-/few-shot NLP tasks by performing multi-task learning with task-specific prompts [108] or by pre-training huge language models on massive datasets and using a few examples as demonstrations [13]. Similarly, few-shot vision-language (VL) learning methods aim to leverage pre-trained language models and their powerful generalization abilities to adapt to VL domains and learn new tasks from zero or a few examples [125, 98, 50, 3]. While few-shot learners can overcome the challenges of supervised learning and avoid the need for task-specific fine-tuning, existing few-shot VL learners suffer from limited generalization to unseen tasks such as grounding tasks, which require not only understanding the image and the language but also locating and identifying relevant regions or objects in images, as in visual commonsense reasoning (VCR) [147] or Flickr30k-entities [94].

Figure 3.1: Examples of vision-language tasks (VCR, Flickr30k-entities, and VQA). Vision-language tasks have different task formats, which makes it challenging to generalize in a zero-/few-shot way. In this work, we study the generalization of few-shot methods and propose GRILL, which can generalize to diverse VL tasks without introducing task-specific special representations or pre-trained object detectors.
Existing few-shot VL methods exhibit great performance on visual question answering and captioning tasks [3, 125, 50], but they lack the skills to generalize to grounding tasks, as they do not explicitly model the spatial and visual information of regions or objects. On the other hand, existing fine-tuning methods rely on special representations for regions or objects, such as special tokens that mark the regions or objects in the captions and the images [22], or object features extracted from a pre-trained object detector [116, 20]. These methods achieve good results with fine-tuning, but they are not compatible with zero-/few-shot generalization, due to the different designs of object representation for each task and the dependence on external object detectors that may not cover all the relevant concepts.

In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a new VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks in a zero-/few-shot fashion. We address the challenge of few-shot generalization to unseen tasks by a) learning object grounding and localization in pre-training, b) representing visual concepts (e.g., regions and images) with versatile image patches, and c) unifying the tasks into text generation. Specifically, our model is a generative sequence-to-sequence transformer model [127] with a vision transformer (ViT) [31, 75] that processes images with patch embeddings, where each patch represents a fixed-size region of the image. We represent a visual concept (object or region) that corresponds to a group of patches by aggregating information across the patches, which enables our model to generate better representations for any kind of region or image. We construct our pre-training dataset from MS-COCO [70, 19] and Visual Genome [57], where each caption may contain images or bounding boxes within it as part of the text, providing rich and diverse information for the model to learn object grounding and localization. Given the dataset, we pre-train our model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM) objectives, which encourage the model to generate natural language from images and to fill in the missing words in captions, respectively, and with a discriminative objective, which encourages the model to distinguish whether paired image-captions are correct or not.

We test GRILL on 7 zero-/few-shot vision-language tasks: Visual Commonsense Reasoning (VCR) [147], RefCOCOg [78], Flickr30k-entities [94], NLVR2 [117], SNLI-VE [137], visual question answering [37], and Flickr30k captioning [144]. We observe that our model demonstrates better zero-/few-shot generalization on diverse tasks compared to baselines. We also find that our pre-training objectives and pre-training datasets are vital for better zero-/few-shot performance.

Figure 3.2: Illustration of GRILL. Our model is a sequence-to-sequence transformer that uses a vision transformer (ViT) [31, 75] to process images with patch embeddings, where each patch represents a fixed-size region of the image. We replace the referring words with the corresponding visual patches.

3.2 Generalization to Diverse Vision-language Tasks
Various VL tasks require phrase and object grounding, and their task formats differ, which makes it challenging for few-shot models to generalize.
In this work, we introduce a model that can generalize to VL tasks, including grounding, with no or a few labeled examples. We first introduce the background, the formal problem definition, and the challenges.

3.2.1 Background: Visual Grounding
Visual grounding refers to the ability to link linguistic concepts (sentences, phrases, or words) to visual concepts (images and regions) [17]. Here we consider two types of visual grounding: image grounding and object grounding.

Image grounding refers to the linking of textual concepts to image concepts [17]. In this work, we consider image grounding as linking any type of text, including sentences, phrases, and words, to an entire image (e.g., image captioning and image retrieval). Given an image and a corresponding caption, object grounding aims to localize the objects in the image mentioned by a noun phrase in the caption (or by the entire caption sentence). Such object grounding occurs at the word, phrase, and sentence levels in the language modality. Many VL tasks require object grounding implicitly or explicitly, and we consider tasks that explicitly require localization as object grounding tasks, such as referring expression comprehension (RefCOCOg [78]), phrase grounding (Flickr30k-entities [94]), and visual commonsense reasoning [147].

3.2.2 Problem Formulation
In this work, we re-formulate the widely used pre-training task for image-caption datasets such that each caption may contain one or more images, bounding boxes, or regions as part of the text, denoted by (T, {V_j}_{j=1}^{N}), in addition to the associated images. Note that some captions may not contain any images (N = 0). We refer to learning on captions that contain images as grounded learning. For pre-training, a VL model is pre-trained on image-caption datasets whose captions include images or bounding boxes. For zero-shot tasks, the pre-trained model L cannot access training data Dtrain or validation data Dval; we directly evaluate the model on the test data Dtest. For few-shot tasks, the model has access to K instances of training data for fine-tuning. For hyper-parameter tuning and model selection, we assume validation data Dval with the same number of instances as Dtrain to simulate a real-world low-resource environment, and we compose the validation data from the training data. The sizes of Dtrain and Dval are 32 in our study.

Challenges. Our goal is to pre-train a VL model that seamlessly transfers to various tasks, not limited to visual question answering and captioning, in a zero-shot or few-shot manner. Different tasks, especially grounding tasks, have different task (input and output) formats, as in Fig. 3.1, and thus the main challenge of this work is to generalize the zero-/few-shot ability to diverse tasks. Existing works on grounding tasks introduce special representations to depict regions, such as special tokens [22] or object representations from an object detector [116, 20]. While these works perform well on grounding tasks via expensive fine-tuning on labeled data, they have to design different object representations for different task formats, which makes it difficult to generalize to new tasks in a zero-shot fashion. For example, object representations from an object detector are difficult to transfer to a task that refers to multiple images, such as NLVR2 [117].
In this work, we tackle these challenges by introducing patch embeddings to represent objects, regions, and images; learning object grounding and localization in pre-training; and unifying all tasks into text generation.

3.3 Pre-training for Better Task Generalization
In this section, we introduce GRILL, a few-shot VL model for jointly learning contextualized representations from vision and language tasks. We first present an overview of GRILL (§3.3.1), then our model architecture (§3.3.2), pre-training objectives (§3.3.3), and pre-training data (§3.3.4).

3.3.1 Overview
We propose GRILL, a VL model that learns object grounding and localization in pre-training and generalizes to a wide range of VL tasks in a zero-/few-shot fashion. Our model is a sequence-to-sequence transformer [127] that takes as input a hybrid sequence, denoted by (I, T, {V_j}_{j=1}^{N}), consisting of text T, an image I, and visual concepts or regions {V_j}_{j=1}^{N}, and outputs a text sequence. We represent an input image with image patches via a vision transformer [31, 75] and represent a region that corresponds to a set of patches by aggregating information among the patches (§3.3.2). We illustrate our model in Fig. 3.2. Given sequences with paired text outputs, we pre-train our model with prefix language modeling, masked language modeling, and a discriminative objective (§3.3.3). We then discuss how we create the hybrid sequences from image-caption datasets (§3.3.4).

Figure 3.3: Pre-training objectives. We illustrate our pre-training objectives: masked language modeling, prefix language modeling, and the discriminative objective. Given an image-caption pair, we create a proper input for each objective. Text in green is the target text of each objective.

3.3.2 Model Architecture
For unified text generation, we adopt a transformer encoder-decoder architecture [127], which takes a text sequence as input and generates another text sequence as output. To encode images and regions for vision-language tasks, we adopt a vision transformer [31, 75] as our image encoder; it splits an input image into a sequence of image patches. Specifically, it first splits an image into non-overlapping patches and linearly embeds all patches, and these patches are passed to the transformer encoder layers, yielding {v_1, ..., v_m}. For an image of resolution 224 × 224 and patch size 32 × 32, we have m = 49. We assume that v_i encodes the information of the corresponding patch p_i. The image patches are versatile in that they can represent any type of image or region; we represent a visual concept (object or region) V_j that corresponds to a set of patches by aggregating information among those patches, and the aggregated representations are additionally passed to the transformer encoder layer. We adopt the Swin transformer (Swin-B) [75] as our vision transformer.
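To make the region representation concrete, the following is a minimal sketch of mapping a bounding box to the grid patches it overlaps and pooling their embeddings into a single region vector, assuming a 224 × 224 image split into a 7 × 7 grid of 32 × 32 patches and mean pooling as the aggregation function. The helper names and the choice of mean pooling are illustrative assumptions, not GRILL's exact implementation.

```python
# A minimal sketch (not GRILL's actual code) of representing a region by
# aggregating the ViT patch embeddings it overlaps, assuming a 224x224 image,
# a 7x7 grid of 32x32 patches (m = 49), and mean pooling as the aggregator.
import torch

PATCH, GRID = 32, 7  # 224 / 32 = 7 patches per side

def patches_for_box(box):
    """Return flat indices of the grid patches a (x1, y1, x2, y2) box overlaps."""
    x1, y1, x2, y2 = box
    c1, r1 = int(x1) // PATCH, int(y1) // PATCH
    c2, r2 = min(int(x2) // PATCH, GRID - 1), min(int(y2) // PATCH, GRID - 1)
    return [r * GRID + c for r in range(r1, r2 + 1) for c in range(c1, c2 + 1)]

def region_embedding(patch_embs: torch.Tensor, box) -> torch.Tensor:
    """patch_embs: (49, hidden). Mean-pool the embeddings of the overlapped patches."""
    idx = torch.tensor(patches_for_box(box))
    return patch_embs[idx].mean(dim=0)

patch_embs = torch.randn(GRID * GRID, 768)        # stand-in for ViT outputs
giraffe_box = (64.0, 32.0, 160.0, 200.0)          # hypothetical box in pixels
print(region_embedding(patch_embs, giraffe_box).shape)  # torch.Size([768])
```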
3.3.3 Pre-training Objectives
We pre-train our model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM) following Jin et al. [50], and with a discriminative objective. Many VL tasks are classification tasks that require choosing one of several options; to handle such tasks, we additionally adopt the discriminative objective, which is to classify whether a given sequence is correct or not. Fig. 3.3 illustrates the pre-training objectives.

Prefix language modeling. We include prefix language modeling (PrefixLM) following [100, 50]. The objective randomly splits the text (with region inputs) into two separate sequences. The first part may contain regions and is used, together with an image, as input to the encoder; the second part does not contain regions and is used as the target text to be generated by the decoder. The target text is not allowed to contain region representations since our model generates text only.

Masked language modeling. Masked language modeling [22, 50] masks out random spans with numbered sentinel tokens, e.g., <text_1>, and the masked sequence is fed into the encoder. The decoder then generates the masked spans as target text. We randomly mask 15% of input text tokens and replace them with sentinel tokens. Note that the input sequence may include region representations in addition to a paired image, and the region representations are not allowed to be masked.

Discriminative objective. The discriminative objective is important so that our model can perform classification tasks in which it has to determine whether a given sequence is correct or not. Thus, we pre-train GRILL with the discriminative objective, and the model generates the target texts "true" for positive pairs and "false" for negative pairs. We consider an image and its captions with associated regions (if any) as positive pairs. With a probability of 50%, we create negative pairs by replacing the referring words with random region representations from the given image or by randomly choosing another training caption. The negative samples let the model learn the correct bindings between referring words and the corresponding regions.

Figure 3.4: Object-word alignments. To create hybrid sequences, we first obtain object-word alignments by object detection, object tag-word matching, and object-word alignment.

3.3.4 Pre-training Data
To pre-train GRILL, we collect image-caption data from MS COCO [70, 19] and Visual Genome (VG) [57]. From the image-caption pairs, we create hybrid sequences, which may contain one or more region representations, for pre-training. We introduce object-word alignments representing the correspondence between words and objects, and use the alignments to create hybrid sequences. We create hybrid sequences on the fly during pre-training: we randomly choose k object-word alignments and replace the words with the corresponding bounding boxes, as sketched below. In addition, we include region descriptions and the aligned regions from Visual Genome as hybrid sequences, and non-hybrid sequences (raw text and images) in the pre-training.
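As an illustration of the on-the-fly construction above, the sketch below builds a hybrid sequence from a caption and its object-word alignments and derives a positive or negative example for the discriminative objective. The `<region_i>` placeholder (standing in for the aggregated patch embedding of box i) and all helper names are hypothetical, not GRILL's actual implementation.

```python
# A minimal, illustrative sketch of building hybrid sequences and discriminative
# positives/negatives from a caption and its object-word alignments.
import random

def make_hybrid_sequence(caption, alignments, k=1):
    """alignments: list of (box_id, word). Replace k aligned words with region slots."""
    chosen = random.sample(alignments, min(k, len(alignments)))
    tokens = caption.split()
    for box_id, word in chosen:
        tokens = [f"<region_{box_id}>" if t == word else t for t in tokens]
    return " ".join(tokens), chosen

def make_discriminative_pair(caption, alignments, all_box_ids, other_captions):
    """Return (input_text, target_text) for the true/false objective."""
    hybrid, chosen = make_hybrid_sequence(caption, alignments)
    if random.random() < 0.5:
        return hybrid, "true"                       # positive: correct binding
    if chosen and random.random() < 0.5:            # negative: wrong region for the word
        box_id, _ = chosen[0]
        wrong = random.choice([b for b in all_box_ids if b != box_id])
        return hybrid.replace(f"<region_{box_id}>", f"<region_{wrong}>"), "false"
    return random.choice(other_captions), "false"   # negative: unrelated caption

caption = "A giraffe standing near some rocks on the grass"
alignments = [(0, "giraffe"), (1, "rocks")]
print(make_discriminative_pair(caption, alignments, [0, 1, 2], ["A man riding a bike"]))
```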
3.3.4.1 Object-word Alignments
Given image-caption pairs, the process of obtaining object-word alignments consists of three steps: (1) object detection on images, (2) object tag-word matching, and (3) object-word alignment. We illustrate the process in Fig. 3.4. Note that we use object detection only in pre-training and do not use it on downstream tasks.

Object detection. The first step is to detect objects and object tags from images. We use a state-of-the-art object detector [150] to obtain object bounding boxes and tags, yielding {(V_1, l_1), ..., (V_m, l_m)}, where V_i is a bounding box and l_i is the tag for the box. Given the set of tags {l_1, ..., l_m}, we find the correspondence between the tags and the words {w_1, ..., w_n} of a caption in the next step.

Object tag-word matching. The second step is to find words {w_1, ..., w_n} that are similar to one of the tags {l_1, ..., l_m}. To find similar words, we introduce a rule-based approach with the following rules:
• Exact token matching
• Plural-singular exact token matching
• Word vector similarity [82]
• WordNet synonyms [83]
If one of the rules is satisfied, we mark the tag and word as aligned, {(l_i, w_j)}. Note that a word can be matched to multiple tags.

Object-word alignments. In the last step, we find alignments between object bounding boxes and words {(V_i, w_j)}, given the alignments between tags and words {(l_i, w_j)} and the object list {(V_1, l_1), ..., (V_m, l_m)}. We can simply derive the object-word alignments since each tag is mapped to a bounding box, yielding {(V_i, l_i, w_j)}. However, some object bounding boxes share the same object tag, so the alignments can include noisy correspondences between object boxes and words. To filter out the noisy alignments, we run CLIP [98] over the aligned words and objects. After this process, we obtain 1.8 object-word alignments per image-caption pair on average.

Table 3.1: Zero-shot results. We report performance on downstream tasks without any training data. Our model surpasses all baselines on classification tasks.
Method | Size | VCR Q→A | VCR QA→R | VCR Q→AR | RefCOCOg Acc | Flickr30k-entities R@1 | R@5 | R@10 | NLVR2 Acc | SNLI-VE Acc | VQAv2 Acc | Flickr30k CIDEr
Random | - | 25.0 | 25.0 | 6.3 | 19.0 | 6.5 | 27.7 | 47.8 | 50.0 | 33.3 | 0.0 | -
UNITERlarge | 303M | 32.6 | 26.1 | 8.7 | 10.0 | - | - | - | 49.1 | 17.9 | 0.0 | -
VL-T5 | 224M | 28.2 | 27.5 | 8.2 | 0.0 | 0.0 | 0.0 | 1.1 | 48.7 | - | 13.5 | 4.4
FewVLMbase | 224M | 25.9 | 25.4 | 6.5 | 0.0 | 0.0 | 0.0 | 0.0 | 50.6 | - | 43.4 | 31.0
FewVLMlarge | 740M | 27.0 | 26.1 | 7.4 | 0.0 | 0.0 | 0.0 | 0.0 | 51.2 | - | 47.7 | 36.5
GRILL | 310M | 40.6 | 39.3 | 16.2 | 47.5 | 18.9 | 53.4 | 70.3 | 56.1 | 46.9 | 42.3 | 25.6

3.4 Experiments
3.4.1 Experiment Details
For pre-training, we use a batch size of 1,280 for GRILL, set the learning rate to 1e-4 with 5% linear warmup, and pre-train it for 30 epochs. For the few-shot setting, we randomly choose 32 examples and sample 5 different training and dev splits, train models for 100 epochs with a learning rate of 5e-5, and choose the best checkpoint using the dev split. GRILL has 310M parameters.

3.4.2 Evaluation Setup
To evaluate few-shot performance, we randomly sample 5 different training and dev splits and measure the average performance over the 5 splits. We fine-tune the vision-language models for 100 epochs in the few-shot setup and choose the best checkpoint on the dev set. We report model performance on the test set for RefCOCOg, NLVR2, Flickr30k-entities, SNLI-VE, and Flickr30k captioning (Karpathy split [53]), and on the validation set for VCR and VQAv2. We adopt accuracy for the VCR, RefCOCOg, SNLI-VE, NLVR2, and VQA datasets; Recall@1, 5, 10 for Flickr30k-entities; and CIDEr [128] for captioning as evaluation metrics.

3.4.3 Baselines
For baselines, we include existing VL models: UNITERlarge [20], VL-T5 [22], GLIP-L [64, 149], and MDETR-ENB3 [52]; and few-shot VL models: FewVLM [50], Flamingo [3], and CPT [143].
For a fair comparison, we exclude VQA datasets for VL-T5 and pre-train the model using their code. The parameter sizes of the models are 303M for UNITERlarge, 224M for VL-T5, 231M for GLIP-L, 152M for MDETR, 224M and 740M for FewVLMbase and FewVLMlarge, 3B and 80B for Flamingo, and 113M for CPT.

Table 3.2: Few-shot results. We report performance on downstream tasks with 32 labeled examples for fine-tuning.
Method | Size | VCR Q→A | VCR QA→R | VCR Q→AR | RefCOCOg Acc | Flickr30k-entities R@1 | R@5 | R@10 | NLVR2 Acc | SNLI-VE Acc | VQAv2 Acc | Flickr30k CIDEr
Random | - | 25.0 | 25.0 | 6.3 | 19.0 | 6.5 | 27.7 | 47.8 | 50.0 | 33.3 | 0.0 | -
UNITERlarge | 303M | 29.1±3.4 | 28.6±2.0 | 8.4±1.0 | 45.4±4.0 | - | - | - | 53.1±9.3 | 40.7±8.4 | 24.2±3.9 | -
VL-T5 | 224M | 29.7±1.3 | 28.0±1.6 | 8.7±0.8 | 56.9±2.0 | 28.1±2.7 | 60.6±2.6 | 73.3±1.8 | 48.7±0.1 | - | 43.7±1.8 | 28.0±1.2
FewVLMbase | 224M | 29.1±0.9 | 28.4±1.1 | 8.5±0.4 | 16.0±3.7 | 4.2±1.2 | 18.7±1.8 | 31.7±2.0 | 50.3±0.7 | - | 47.8±0.2 | 37.5±2.9
FewVLMlarge | 740M | 30.0±2.7 | 30.1±2.5 | 9.3±1.5 | 17.4±1.1 | 5.1±1.1 | 22.7±4.0 | 38.0±5.8 | 51.3±1.2 | - | 52.3±0.8 | 38.4±2.1
GRILL | 310M | 41.1±0.7 | 40.4±1.1 | 16.7±0.6 | 48.1±1.2 | 25.4±1.0 | 61.3±1.8 | 76.0±1.5 | 56.2±0.3 | 48.4±1.0 | 46.8±0.1 | 37.1±1.5

3.4.4 Downstream Tasks and Datasets
In this section, we describe the 7 downstream tasks on which we compare GRILL: Visual Commonsense Reasoning, referring expression comprehension, phrase grounding, NLVR2, SNLI-VE, VQA, and captioning.

Visual Commonsense Reasoning (VCR). Visual Commonsense Reasoning (VCR) [147] is a multiple-choice question-answering task that requires commonsense reasoning about objects in images. The task is decomposed into two sub-tasks: question answering (Q → A) and rationale prediction (QA → R). In the holistic setting (Q → AR), models have to predict both answers and rationales. Following VL-T5 [22], we rank the choices with P(true)/(P(true) + P(false)) and choose the one with the highest score. VCR provides bounding boxes around entities, with explicit groundings between those entities and references in questions.

Referring Expression Comprehension. Referring expression comprehension is to localize an object given a referring expression. We adopt the RefCOCOg dataset [78] for this task. We present a referring phrase and candidate regions from the image to our model; the model finds the most plausible region for the given phrase by ranking the regions with P(true)/(P(true) + P(false)). Following VL-T5 [22], we use Mask R-CNN [5] to obtain region detections as candidates for inference. We consider the selected region correct if its intersection over union (IoU) with the ground-truth region is greater than 0.5. The upper-bound performance on the test set with the Mask R-CNN candidates is 86.09%. We obtain the performance of the random predictor by randomly choosing a bounding box from the object detector.

Phrase Grounding. Given one or more phrases, phrase grounding is to provide a set of bounding boxes for each phrase. We use the Flickr30k-entities dataset [94] for this task. Following BAN [55] and VisualBERT [63], we adopt a Faster R-CNN [104] pre-trained on Visual Genome to detect regions as candidates for inference. A predicted region is correct if its intersection over union (IoU) with the ground-truth region is greater than 0.5. The upper-bound performance on the test set with the Faster R-CNN candidates is 87.45%. Similar to RefCOCOg, we provide a referring phrase and candidate regions from the image to our model, and the model finds the most plausible region for the given phrase by ranking the regions with P(true)/(P(true) + P(false)). We use the any-box protocol from MDETR [52].
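The P(true)/(P(true) + P(false)) ranking used for VCR, RefCOCOg, and Flickr30k-entities above could be sketched as follows. The sketch uses a text-only T5 from Hugging Face transformers merely as a stand-in for the vision-language encoder-decoder, and the scoring and input-building helpers are illustrative assumptions rather than the thesis' evaluation code.

```python
# A minimal sketch of ranking candidates by P(true) / (P(true) + P(false)),
# using a text-only T5 as a stand-in for the VL encoder-decoder.
import math
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def log_prob(input_seq: str, target: str) -> float:
    """Approximate log P(target | input_seq) as the negative total cross-entropy."""
    enc = tok(input_seq, return_tensors="pt")
    labels = tok(target, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss  # mean NLL per target token
    return -loss.item() * labels.shape[1]

def true_false_score(input_seq: str) -> float:
    p_true = math.exp(log_prob(input_seq, "true"))
    p_false = math.exp(log_prob(input_seq, "false"))
    return p_true / (p_true + p_false)

def rank_candidates(make_input, candidates):
    # make_input(c) builds the (hybrid) input sequence for candidate c:
    # an answer choice for VCR, or one candidate region for RefCOCOg /
    # Flickr30k-entities. Returns the index of the highest-scoring candidate.
    scores = [true_false_score(make_input(c)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```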
NLVR2. The task of NLVR2 [117] is to determine whether a text description is true given two images. The task requires understanding the two images and comparing them. To apply our model to this task, we create one image by concatenating the two images, and our model then generates the text labels "true" or "false" for inference.

Visual Entailment. Visual entailment, SNLI-VE [137], is to determine whether the image semantically entails the text, given an image-sentence pair. The task is a 3-way classification with the labels "entailment", "neutral", and "contradiction." We define the label words for classification as "entailment": "true", "neutral": "maybe", and "contradiction": "false." We choose the classification label by measuring the probability of each label word and picking the highest one.

Visual Question Answering. The visual question answering task [37] requires models to answer a question about a given context image. We approach visual question answering as a generation task so that the model can produce answers without introducing any task-specific heads, following Jin et al. [50] and Cho et al. [22]. We adopt the input prompt "question: {question} answer: <text_1>," where <text_1> is a sentinel token, from [50] for the generation.

Captioning. The captioning task is to generate a caption given an image. On Flickr30k [144], we use "an image of" as our input prompt, following Jin et al. [50].

Table 3.3: Results on RefCOCOg and Flickr30k-entities with 0 and 32 examples. We report Recall@1 for Flickr30k-entities. †This model used the RefCOCOg dataset in pre-training. ‡These models used the Flickr30k-entities dataset in pre-training while ours did not.
Method | Size | RefCOCOg 0 | RefCOCOg 32 | Flickr30k-entities 0 | Flickr30k-entities 32
Random | - | 19.0 | 19.0 | 6.5 | 6.5
UNITERlarge [20] | 303M | 10.0 | 45.4 | - | -
VL-T5 [22] | 224M | 0.0 | 56.9 | 0.0 | 28.1
FewVLMlarge [50] | 740M | 0.0 | 17.4 | 0.0 | 5.1
CPT [143] | 113M | 36.5 | - | - | -
MDETR-ENB3 [52] | 152M | 54.0† | - | 84.8‡ | -
GLIP-L [64, 149] | 231M | - | - | 87.1‡ | -
GRILL | 310M | 47.5 | 48.1 | 18.9 | 25.4

Table 3.4: VQA results with 0 and 32 examples. We report zero-/32-shot performance on the VQAv2 dataset. Flamingo has 3B or 80B parameters and uses in-context examples for inference, while our model has 310M parameters and uses the examples for fine-tuning.
Model | Size | 0-shot | 32-shot
Random | - | 0.0 | 0.0
UNITERlarge [20] | 303M | 0.0 | 24.2
VL-T5 [22] | 224M | 13.5 | 43.7
FewVLMlarge [50] | 740M | 47.7 | 52.3
Flamingo-3B [3] | 3B | 49.2 | 57.1
Flamingo-80B | 80B | 56.3 | 67.6
GRILL | 310M | 42.3 | 46.8

3.4.5 Results
Zero-shot performance. We evaluate the existing models in a zero-shot manner, where models do not have access to any training data. Tab. 3.1 shows the performance on each task. First, GRILL shows the best performance on most tasks, while the baselines perform worse than the random predictor on many of the grounding tasks. In Table 3.3, we additionally include baselines, GLIP-L and MDETR-ENB3, that are targeted at grounding tasks. These models include the corresponding task-specific datasets in pre-training, so they demonstrate great performance without additional fine-tuning; note that we do not include task-specific datasets in our pre-training. In addition, our model still performs well on SNLI-VE, visual question answering, and captioning, which do not require explicit grounding. Compared to Flamingo (Tab. 3.4), a 3B- or 80B-parameter vision-language model, our model demonstrates good accuracy considering its size.
This suggests that our model can generalize to unseen tasks, while its competitors have difficulty generalizing in a zero-shot way to grounding tasks that need phrase or region grounding.

Few-shot performance. We evaluate our model and the competitors in the few-shot setting (Tab. 3.2). Our model, GRILL, shows great performance overall, while VL-T5 outperforms our model on the RefCOCOg dataset. We conjecture that this is because that method includes the phrase grounding task in its pre-training, so it achieves great performance on this dataset. However, the model still struggles with other tasks including VCR, which demonstrates its limited generalization. Our model shows consistently good results and thus exhibits great generalization in the few-shot setup.

3.4.6 Ablations
Here, we study ablations of our method. Tab. 3.5 and Fig. 3.5 show the ablations on the hybrid sequences and pre-training objectives, and on different input formats during inference in the zero-shot setup, respectively.

Hybrid sequences and pre-training objectives. We study the ablation of pre-training objectives and hybrid sequences in pre-training. In Tab. 3.5, removing hybrid sequences significantly degrades performance on many tasks. Specifically, results on RefCOCOg and Flickr30k-entities drop significantly, suggesting that hybrid sequences in pre-training play a vital role in improving phrase grounding. Among the pre-training objectives of GRILL, we notice that the discriminative objective is important for many of the tasks, while the others have less impact. We conjecture that this is because the tasks in the table are classification tasks, so the discriminative objective is the most useful for them.

Table 3.5: Ablations on the pre-training objectives and hybrid sequences in pre-training. We report Q → AR for VCR and R@1 for Flickr30k-entities.
Setting | Model | VCR | RefCOCOg | NLVR2 | Flickr30k-entities
Zero-shot | GRILL | 16.2 | 47.5 | 56.1 | 18.9
Zero-shot | No hybrid sequences | 12.9 | 18.9 | 55.7 | 5.7
Zero-shot | No discriminative | 6.8 | 30.5 | 50.4 | 12.7
Zero-shot | No PrefixLM | 14.4 | 48.5 | 55.8 | 18.5
Zero-shot | No MLM | 15.6 | 47.8 | 56.0 | 19.3
32-shot | GRILL | 16.7 | 48.1 | 56.2 | 25.4
32-shot | No hybrid sequences | 14.3 | 16.3 | 55.9 | 18.7
32-shot | No discriminative | 7.2 | 42.0 | 50.5 | 15.3
32-shot | No PrefixLM | 14.7 | 48.7 | 55.9 | 21.9
32-shot | No MLM | 16.3 | 47.9 | 56.1 | 23.5

Input formats in inference. We investigate different input formats (hybrid sequences vs. original sequences) during zero-shot inference in Fig. 3.5. Note that we use hybrid sequences in pre-training. On VCR, we either replace the referring words (e.g., [person1] in Fig. 3.1) with bounding boxes in the text input (hybrid sequences) or leave the original text input unchanged (original sequences). On NLVR2, we either replace the word "left" with the left image and the word "right" with the right image (hybrid sequences) or use the original text input (original). On Flickr30k-entities, we either replace the referring words with the corresponding bounding boxes (hybrid sequences) or keep the referring words and use the referring words and bounding boxes for inference (original). Counter-intuitively, we observe that our model with the original input format during inference shows better performance on all the datasets. We conjecture that using hybrid sequences with bounding boxes may disturb the model's predictions since the model needs to judge whether the grounding information is correct or not. We leave a more sophisticated design to future work.
Figure 3.5: Performance with different input formats (hybrid vs. original) for inference in the zero-shot setup on VCR, NLVR2, and Flickr30k-entities. We report Q → AR for VCR and R@1 for Flickr30k-entities.

3.5 Related Work
Vision-language few-shot learning. There have been several attempts to address the challenge of data-hungry supervised learning in vision-language domains, including FewVLM [50], Frozen [125], Flamingo [3], and GLIP [64, 149]. FewVLM [50] improves the few-shot performance of VQA and captioning by prompting the model, and its performance is on par with large few-shot learners. Frozen [125] adapts a few-shot language model [99] to vision-language tasks with soft prompting for images. Flamingo [3] achieves state-of-the-art results on few-shot VQA and captioning tasks by prompting the model with task-specific examples. While these models achieve improvements on few-shot tasks, they are not applicable to grounding tasks. Lastly, GLIP [64, 149] unifies object detection and phrase grounding and achieves great performance on zero-shot object detection and phrase grounding. Unlike our method, GLIP uses grounding datasets including Flickr30k-entities in pre-training, so it achieves great performance on phrase grounding without fine-tuning. Our method is not applicable to object detection since that task requires bounding box regression; we leave this extension for future work.

Grounded vision-language learning. Grounded vision-language learning has been explored to learn the grounding between objects in images and phrases in sentences [66, 150, 52, 64, 149]. MDETR is a modulated detector that detects objects in an image conditioned on a raw text query [52]. The model exhibits remarkable results on object detection, phrase grounding, and referring expression comprehension by pre-training on object detection data. GLIP follows a similar direction and unifies object detection and phrase grounding [64, 149]. While these methods rely on object detection datasets to improve grounding, our method utilizes grounded sequences constructed from image-caption datasets and an object detector used only in pre-training. Our model works not only on grounding tasks but also on visual question answering and captioning tasks.

3.6 Conclusion
In this work, we proposed GRILL, a new VL model that can generalize to a variety of VL tasks including grounding tasks. Our model learns object grounding and localization by introducing hybrid sequences in pre-training and easily adapts to diverse tasks by using a vision transformer for versatile image processing. To pre-train our model, we constructed a dataset using object-word alignments and pre-trained the model with masked language modeling, prefix language modeling, and the discriminative objective. In our empirical analysis, we observed that our model demonstrates good zero-/few-shot generalization on diverse tasks. We also observed that the discriminative objective and hybrid sequences in pre-training are vital for better zero-/few-shot performance.

Chapter 4
Probing Visual Properties of Objects Under Different States

Humans perceive different visual properties of objects in different contexts; bananas appear green when they are unripe, turn yellow when they ripen, and finally turn brown when they become rotten. Previous studies on probing visual commonsense knowledge have primarily focused on examining language models' understanding of prototypical properties of objects; they did not consider the various properties an object can take on under different states.
We present WinoViz, a text-only evaluation dataset consisting of 5,606 examples that probe the ability of language models to identify the visual properties of objects under different contexts. We also present a multi-hop variant of WinoViz, which requires multiple reasoning steps to solve our task. Our experiments on the WinoViz dataset find that large language models such as GPT-4 demonstrate effective performance but are bottlenecked by their visual knowledge. We further find that language models have low accuracy on the multi-hop variant of WinoViz. Finally, we show that vision-language models outperform their unimodal counterparts, indicating that multimodally trained language models may have better visual commonsense even when deployed unimodally.

4.1 Introduction
Language models (LMs) have struggled with the challenge of developing reasoning abilities and acquiring knowledge from experience, despite these being innate for humans. Humans effortlessly enhance their knowledge by observing the visual world through their eyes. However, obtaining this type of knowledge presents difficulties because it is often not explicitly described in text. Overcoming these challenges necessitates visual grounding, which involves establishing connections and associations between language and visual information to facilitate comprehension and interpretation of the visual world.

Figure 4.1: The WinoViz task. We investigate the divergent properties of an object and explore the reasoning abilities of language models pertaining to object attributes. The premise sentence depicts a scene involving a banana, and two hypothesis sentences describe the visual properties of a banana. The task is to choose the more plausible hypothesis given the premise. For the multi-hop version, we replace the visual attribute word with another object word that has a similar visual attribute.

Previous studies have predominantly aimed at probing language models for prototypical visual properties of objects and at transferring such knowledge from vision-language models [86, 89, 148, 62]. These studies discovered that reporting bias has a negative impact on model performance but that multimodal training can alleviate these effects. They are limited in that they mainly focused on fixed prototypical properties, e.g., bananas are yellow. However, an object may exhibit divergent visual attributes under different states or contexts. For example, bananas appear green when they are unripe, turn yellow when they ripen, and finally turn brown when they become rotten.

In this work, we investigate the reasoning abilities of language models and vision-language models about diverse object attributes under different object states. To accomplish this goal, we compose a unique, text-only evaluation dataset dubbed WinoViz. This dataset was crafted through a combination of crowd-sourcing and data generation using a language model. For crowd-sourcing, we asked an annotator to craft a premise sentence portraying a scene involving a banana, along with two contrasting hypothesis sentences highlighting its visual properties, as depicted in Fig. 4.1.
The premise sentence should be more compatible with one of the hypothesis sentences than with the other. Our task requires pragmatic reasoning [9] and visual knowledge reasoning: pragmatic reasoning involves inferring intended meanings, object states, and conditions, e.g., that a banana is ripe, while visual knowledge reasoning requires a model to reason about the properties of objects under those states, e.g., that a banana is yellow when it is ripe. Additionally, we introduce a more challenging version of the dataset, referred to as the multi-hop data, which requires multi-step reasoning chains to solve the task. In this version, we replace the visual attribute word with another object word that shares a similar visual attribute. The multi-hop dataset was generated by leveraging a language model.

We use the benchmark to assess the zero-/few-shot performance of various language models, encompassing both text-only models and vision-augmented LMs. Among text-only language models, we consider a range of models, including BERT [29], T5 [101, 23], and models from the GPT family [13]. These models vary in scale, with parameter counts ranging from 110 million up to 13 billion, plus proprietary GPT models of undisclosed size. In addition to text-based models, we explore models that incorporate visual information, such as VL-BERT [116] and Oscar [66]; our motivation is the notion that acquiring visual knowledge from images can enhance the capabilities of language models. Furthermore, we leverage machine-generated images [105] to guide the LMs, inspired by imagination-guided text generation [155].

In our experiments with the WinoViz benchmark, we observe the following key findings: a) Large language models, such as GPT-4, demonstrate effective performance; however, on multi-hop data, their performance significantly degrades. b) Large models perform well on pragmatic reasoning, but visual knowledge reasoning is a bottleneck on our task. c) Vision-language models outperform their language-model counterparts. d) A model guided by machine-generated images performs poorly on our task, due to the poor quality of the generated images.

4.2 The WinoViz Task
We present the proposed WinoViz task using precise notation and discuss its inherent challenges. The task requires a model to deduce which visual properties objects exhibit in various scenarios. More precisely, when provided with a natural language sentence describing an object engaged in a particular behavior (the premise sentence), the model must decide between two sentences presenting contrasting visual properties of the object (the hypothesis sentences).

Task Definition. Formally, the input of the WinoViz task is a premise sentence p about an object o and two hypothesis sentences {c_1, c_2} describing contrasting attributes of the object. The premise sentence describes a common scenario involving the object in daily life, and the hypothesis sentences describe visual characteristics of the object. A scenario can depict either a static situation or a short series of actions. The expected output is the hypothesis that is more compatible with the premise sentence. Furthermore, we propose two task variants: single-hop and multi-hop reasoning data. An example instance is sketched below.
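The following is an illustrative, hypothetical encoding of one WinoViz instance and its multi-hop variant, using the banana example from Fig. 4.1; the field names are not the dataset's official schema, and the label assignments are only meant to show how the two variants relate.

```python
# An illustrative (hypothetical) encoding of one WinoViz instance and its
# multi-hop variant; field names are not the dataset's official schema.
from dataclasses import dataclass

@dataclass
class WinoVizInstance:
    premise: str       # p: a scene describing an object in some state
    hypotheses: tuple  # (c1, c2): contrasting visual properties
    label: int         # index of the more plausible hypothesis

single_hop = WinoVizInstance(
    premise="A man went to grab a quick breakfast before leaving, "
            "but saw that the only remaining banana was inedible.",
    hypotheses=("The banana is yellow.", "The banana is brown."),
    label=1,
)

multi_hop = WinoVizInstance(
    premise=single_hop.premise,
    # attribute words replaced by objects that share a similar attribute
    hypotheses=("The banana is the color of an egg yolk.",
                "The banana is the color of a tree log."),
    label=1,
)
```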
Challenges. The WinoViz task assesses a machine's reasoning ability regarding objects in our daily lives, focusing on the varied properties of these objects under different object states. One particular area where models frequently encounter challenges is in grasping visual knowledge related to common objects and their attributes. This difficulty arises because such knowledge is seldom explicitly detailed in training text, primarily due to reporting bias [86, 51]. Furthermore, our task is challenging since it requires both pragmatic reasoning and visual knowledge reasoning. Pragmatic reasoning involves finding the intended meaning, the object states, and the contexts in the text, while visual knowledge reasoning requires a model to reason about the properties of objects under those states and contexts. We also introduce a more challenging version of our data, called the multi-hop data, which necessitates multi-step reasoning chains to solve our task.

4.3 The WinoViz Data
In this section, we describe how we construct our WinoViz dataset.

4.3.1 Data Collection
The data collection consists of the following steps: (1) collecting candidate objects, (2) annotating premise and hypothesis sentences, (3) augmenting the data using a language model, and (4) verifying the quality of the dataset.

Object Collection. To begin with, we gather a collection of objects along with their potential properties or attributes for constructing our data. These objects and attributes are obtained from sources such as Memory Colors (Norlund et al., 2021), Visual Property Norms (Hagström et al., 2022), and the McRae feature norms (McRae et al., 2005). Through this process, we collect a total of 800 unique objects and 302 unique attributes. However, it is necessary to refine our dataset by filtering out attributes that are either too abstract or non-visual in nature. To accomplish this, we manually filter the objects and attributes to ensure the inclusion of only concrete and visually relevant attributes. As a result of this filtering process, we obtain a final set of 775 objects and 156 attributes.

Dataset Annotation. We utilize Amazon Mechanical Turk [24] for data annotation, as depicted in Figure 4.2. The annotation process involves several steps. Initially, annotators are given an object and are instructed to identify two properties for the object and the corresponding visual attributes for those properties. For example, for the object banana, the annotator may come up with the two properties ripe and rotten, which have the corresponding visual attributes yellow and brown, respectively.

Figure 4.2: Dataset collection with human annotators. We collect our data through crowd-sourcing. The first step is to identify properties and visual attributes for an object, and the second step is to write natural sentences for each property and attribute. Sentences with properties are used as premise sentences, and sentences with visual attributes are used as hypothesis sentences.

After identifying a pair of object properties and visual attributes, annotators are tasked with composing natural language sentences for each attribute and property.
The properties are associated with premise sentences, while the attributes are linked to hypothesis sentences. Annotators were selected from a small pool of Mechanical Turkers that the authors had previously worked with, and they additionally had to pass a qualification task that tested their understanding of the annotation task. The authors manually examined the annotations to ensure the quality of the collected data. As a result, we obtain 1,380 annotated examples.

Dataset Augmentation. Relying on human annotators can be costly; therefore, we leverage GPT-4 [87] to augment our dataset. We first generate more examples using few-shot prompting and then manually assess the quality of these generated examples, resulting in a total of 4,226 instances. Notably, our analysis reveals 179 novel visual properties derived from these generations.

To measure the difficulty of the generated data, we compare performance between the annotated and generated subsets. Using FLAN-T5-XXL [23], we achieve an accuracy of 86.24% on the annotated data and 84.44% on the generated data; we do not observe significant performance disparities between the two.

4.3.2 Versions of WinoViz
We present several versions of our WinoViz data: the multi-hop data, a more challenging version of WinoViz, and a dataset for probing visual knowledge. For the multi-hop data, we create new hypothesis options that require more intermediate steps.

Multi-hop Data. To create a more challenging task, we introduce a multi-hop version of our data, which necessitates additional intermediate steps. The basic idea of the multi-hop data is to replace a visual attribute word in the hypotheses with another object word that has a similar visual attribute, which requires one more reasoning step to recover the visual attribute. For example, if one hypothesis option is "The banana is yellow," then "yellow" can be replaced with "the color of an egg yolk," and the new hypothesis option for the multi-hop version becomes "The banana is the color of an egg yolk." The multi-hop version presents a greater challenge because the model must also know the color of an egg yolk. We create the multi-hop data using GPT-4 with few-shot prompting.

Pragmatic Reasoning vs. Visual Knowledge Reasoning. Another important question in this work is whether models genuinely understand and possess visual knowledge. Our task requires pragmatic reasoning and visual knowledge reasoning, and models may fail at either reasoning step. Thus, we decouple the premise into a pragmatic reasoning step and a visual knowledge reasoning step to analyze which step is the bottleneck. Pragmatic reasoning involves finding the intended meaning, key phrases, object states, conditions, or contexts for the next step. For example, given a premise sentence, a model should first infer in the pragmatic reasoning step that the banana is ripe.

Table 4.1: Models used in the experiments: BERT [29], CLIP [98], VL-BERT [116], Oscar [66], FLAN-T5 [23], InstructBLIP [26], LLaMA2 [123], LLaVA [72], GPT-3.5 [13, 88], and GPT-4 [87]. We use 'gpt-3.5-turbo-0125' for GPT-3.5 and 'gpt-4-0613' for GPT-4.
Model | # Params | Public | VL model
BERT-Base | 109M | ✓ | ✗
BERT-Large | 335M | ✓ | ✗
VL-BERT-Large | 335M | ✓ | ✓
Oscar-Large | 335M | ✓ | ✓
CLIP-Large | 427M | ✓ | ✓
FLAN-T5-XXL | 11B | ✓ | ✗
InstructBLIP | 11B | ✓ | ✓
LLaMA2 | 13B | ✓ | ✗
LLaVA | 13B | ✓ | ✓
GPT-3.5 | Unknown | ✗ | ✗
GPT-4 | Unknown | ✗ | ✗
Given that sentence, the model should then choose the better option, "the banana is yellow," in the visual knowledge reasoning step. We obtain 160 samples for this analysis with human annotators (Section 4.4.5).

4.4 Experiments
We first describe the experimental setup used in our analysis and then share the experimental results.

Language Models. We experiment with 6 language models in total (Table 4.1), spanning encoder-only, encoder-decoder, and decoder-only architectures. We include the large LMs GPT-3, GPT-3.5, and GPT-4 [13, 88, 87].

Vision-language Models. We also experiment with 5 vision-language models in total (Table 4.1). Our task requires visual knowledge about objects under different states; such knowledge can be obtained from image-caption datasets, and thus we explore vision-language models and analyze whether they can outperform language models on our task. For model evaluation, we deliberately exclude any image inputs and refrain from utilizing the image encoders of these models; instead, we focus solely on their language components. We use the encoder-only models VL-BERT [116] and Oscar [66], the decoder-only model LLaVA-v1.5 [72], and the bi-encoder model CLIP ('clip-vit-large-patch14') [98]. Additionally, we utilize an image generation approach, Stable Diffusion [105], to generate images, and we use the generated images to guide the LMs, inspired by imagination-guided text generation [155].

Inference. In our analysis, we rely on zero-shot inference and few-shot in-context learning for encoder-decoder and decoder-only models. Our prompt for zero-shot inference is: "You will be given a sentence, and two options. Output either Option 1 or Option 2, depending on which option is more likely to be true given the sentence." For few-shot in-context learning, we use 4 examples. We also adopt chain-of-thought prompting [135] for the few-shot inference. In addition to the encoder-decoder and decoder-only models, we explore encoder-only models. Encoder-only models cannot perform zero-shot inference for multi-choice tasks since they require a task-specific head for unseen tasks. Thus, we fine-tune the encoder-only models on the SNLI [12] and ANLI [85] datasets, using only the 'contradiction' and 'entailment' labels in fine-tuning.

Evaluation Setup. We evaluate models with two metrics: individual accuracy (Ind.) and pair accuracy (Pair). Individual accuracy refers to accuracy on each individual question, while pair accuracy refers to accuracy on each pair of questions. In WinoViz, two premise sentences are paired and share the same set of hypothesis options; we measure the model's ability to accurately predict both premise sentences. If the model's prediction is correct for only one of the premise sentences in the pair, we consider the prediction less robust. We use greedy decoding for generation. The pair metric can be sketched as follows.
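The sketch below illustrates the two metrics described above; the data layout and function names are illustrative, not the evaluation script used in this thesis.

```python
# A minimal sketch of the individual vs. pair accuracy metrics. Each pair holds
# two premises that share the same hypothesis options.

def individual_accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def pair_accuracy(pair_preds, pair_golds):
    """pair_preds / pair_golds: lists of (pred_for_premise1, pred_for_premise2)."""
    return sum(p == g for p, g in zip(pair_preds, pair_golds)) / len(pair_golds)

# Example: two paired premises about the same object, sharing hypotheses.
golds = [(0, 1)]   # premise 1 -> hypothesis 0, premise 2 -> hypothesis 1
preds = [(0, 0)]   # only the first premise is answered correctly
print(individual_accuracy([p for pr in preds for p in pr],
                          [g for gd in golds for g in gd]))  # 0.5
print(pair_accuracy(preds, golds))                            # 0.0
```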
Model         Single-hop Ind.   Single-hop Pair   Multi-hop Ind.   Multi-hop Pair
FLAN-T5-XXL   85.34             71.53             73.93            49.16
LLaMA2        75.24             52.09             64.00            30.57
LLaVA         80.34             62.08             67.57            37.14
GPT-3.5       86.34             73.74             75.26            52.66
GPT-4         84.53             72.39             79.45            61.15
Table 4.2: Results on WinoViz in a zero-shot manner. We evaluate large models using 0 examples on both our single-hop and multi-hop datasets. We observe that these models perform well on the single-hop data; however, their performance significantly degrades on the multi-hop data.

4.4.2 Zero-shot Results
We evaluate language models and vision-language models in a zero-shot way, without utilizing any training data (Table 4.2). Overall, large models perform well on the single-hop data, but their performance is significantly degraded on the multi-hop data. Among them, GPT-3.5 exhibits the best performance on the single-hop data, while GPT-4 shows the best result on the multi-hop data. Surprisingly, FLAN-T5-XXL, the smallest model in the comparison, yields results comparable to larger models, including GPT-4. LLaVA, built upon LLaMA2 and trained with image-caption datasets, shows noteworthy performance: as indicated in the table, LLaVA surpasses LLaMA2 on both single-hop and multi-hop data, suggesting that image-caption datasets enhance reasoning on our task.

Model            Single-hop Ind.   Single-hop Pair   Multi-hop Ind.   Multi-hop Pair
FLAN-T5 (0)      85.34             71.53             73.83            49.16
FLAN-T5 (4)      85.37             71.67             74.01            49.66
FLAN-T5 (4 CoT)  85.02             70.85             72.67            46.88
GPT-3.5 (0)      86.34             73.74             75.26            52.66
GPT-3.5 (4)      86.35             73.53             74.96            51.34
GPT-3.5 (4 CoT)  87.46             75.99             81.22            64.93
GPT-4 (0)        84.53             72.39             79.45            61.15
GPT-4 (4)        87.55             75.81             81.75            64.61
GPT-4 (4 CoT)    77.72             65.50             77.22            61.51
Table 4.3: Results on WinoViz with 4-shot in-context learning. We use FLAN-T5-XXL, GPT-3.5, and GPT-4 in this analysis. Standard prompting marginally improves their performance, while chain-of-thought (CoT) prompting is beneficial for GPT-3.5 in the multi-hop task. Interestingly, GPT-4 degrades with chain-of-thought prompting: 16.9% of single-hop questions and 10.5% of multi-hop questions are left unanswered by GPT-4. Excluding these cases, GPT-4's individual accuracy is 93.51% on single-hop and 86.32% on multi-hop questions.

4.4.3 Few-shot Results
Table 4.3 displays the results with 4 in-context examples for FLAN-T5-XXL, GPT-3.5, and GPT-4. We conduct tests using standard prompting and chain-of-thought prompting [135] in this experiment. Standard prompting with 4 in-context examples does not significantly improve the performance of FLAN-T5 and GPT-3.5 on either the single-hop or the multi-hop task, while it does improve GPT-4. Surprisingly, chain-of-thought prompting has a negative impact on the performance of FLAN-T5. However, it yields considerable improvements for GPT-3.5, particularly in the multi-hop task. We hypothesize that the effectiveness of chain-of-thought prompting becomes more pronounced as task complexity increases. Interestingly, GPT-4's performance degrades with chain-of-thought prompting. We discover that 16.9% of single-hop questions and 10.5% of multi-hop questions remain unanswered by GPT-4 under chain-of-thought prompting, which is not observed in zero-shot inference or 4-shot inference with standard prompting. The individual accuracy of GPT-4 excluding these cases is 93.51% and 86.32% for single-hop and multi-hop questions, respectively. This suggests that, when it does commit to an answer, GPT-4 benefits from chain-of-thought prompting.
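For concreteness, the snippet below sketches how a WinoViz query could be formatted for zero-shot and chain-of-thought inference. The instruction string follows the prompt quoted in Section 4.4; the chain-of-thought demonstration format is an assumption, not the exact template used in our experiments.

```python
# A hedged sketch of zero-shot and chain-of-thought prompt construction for one query.
INSTRUCTION = (
    "You will be given a sentence, and two options. Output either Option 1 or "
    "Option 2, depending on which option is more likely to be true given the sentence."
)

def zero_shot_prompt(premise: str, option1: str, option2: str) -> str:
    return (f"{INSTRUCTION}\nSentence: {premise}\n"
            f"Option 1: {option1}\nOption 2: {option2}\nAnswer:")

def cot_prompt(premise: str, option1: str, option2: str, demos: list) -> str:
    """Prepend a few worked demonstrations whose answers include a short rationale."""
    demo_block = "\n\n".join(demos)  # e.g., 4 demonstrations ending in a rationale plus an answer
    return (f"{INSTRUCTION}\n\n{demo_block}\n\nSentence: {premise}\n"
            f"Option 1: {option1}\nOption 2: {option2}\nLet's think step by step.")

print(zero_shot_prompt("The banana is ripe.",
                       "The banana is yellow.", "The banana is green."))
```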
Method          Single-hop Ind.   Single-hop Pair   Multi-hop Ind.   Multi-hop Pair
BERT-Large      68.56             40.65             59.83            24.07
VL-BERT-Large   70.75             43.75             61.56            27.54
Oscar-Large     73.12             51.54             70.17            41.34
Table 4.4: Results on WinoViz after NLI training. We train encoder-only models on NLI datasets and choose an option by the highest probability of the 'entailment' class.

4.4.4 Results of Encoder-only Models
We compare the performance of the encoder-only models BERT, VL-BERT, and Oscar in Table 4.4. Encoder-only models cannot be applied to our task without fine-tuning. Thus, we fine-tune the encoder-only models on natural language inference (NLI) datasets instead. This frames our task as an NLI problem, and we choose an option by the highest probability of the 'entailment' class (a brief sketch of this scoring scheme is given after Section 4.4.5). We fine-tune the encoder-only models on the SNLI [12] and ANLI [85] datasets and use only the 'contradiction' and 'entailment' labels. VL-BERT and Oscar are BERT-based vision-language models trained on image-caption datasets. In our experiments, we observe that the vision-language encoder models consistently surpass the BERT model on our dataset.

Model         Pragmatic   Visual   Combined
FLAN-T5-XXL   93.04       82.91    79.75
LLaMA2        86.71       70.25    69.62
LLaVA         92.41       74.05    73.25
GPT-3.5       91.14       82.28    79.75
GPT-4         95.57       88.61    85.44
Table 4.5: Results on pragmatic reasoning, visual knowledge reasoning, and our original data (combined). We study different types of reasoning in our data. We report individual accuracy.

4.4.5 Pragmatic and Visual Knowledge Reasoning
We investigate whether models genuinely understand visual knowledge for our task. Our task requires pragmatic reasoning and visual knowledge reasoning. We decouple our task into a pragmatic reasoning step and a visual knowledge reasoning step and analyze which step is the bottleneck. Table 4.5 shows the results on pragmatic reasoning (pragmatic), visual knowledge reasoning (visual), and our original data (combined), utilizing the same subset. First, results on pragmatic reasoning are better than the others, suggesting that large models do well on pragmatic reasoning; for example, GPT-4 achieves 95.57% on pragmatic reasoning. The main bottleneck in our task is visual knowledge reasoning: results on visual knowledge reasoning are lower than those on pragmatic reasoning. When comparing LLaMA2 and LLaVA, LLaVA demonstrates superior abilities in both pragmatic reasoning and visual knowledge reasoning. Interestingly, FLAN-T5-XXL performs comparably to a proprietary model, GPT-3.5, in terms of both pragmatic and visual reasoning.
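Returning to the encoder-only setup of Section 4.4.4, the sketch below shows how an NLI-fine-tuned encoder could score the two hypothesis options by the probability of the 'entailment' class. The checkpoint path and the label index are hypothetical placeholders, not artifacts released with this work.

```python
# A minimal sketch of entailment-based option scoring with an NLI-fine-tuned encoder.
# Assumptions: a BERT checkpoint fine-tuned on SNLI/ANLI with labels
# {0: contradiction, 1: entailment}; the checkpoint name below is illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/bert-finetuned-on-snli-anli"   # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAILMENT = 1   # index of the 'entailment' class in this hypothetical head

@torch.no_grad()
def entailment_score(premise: str, hypothesis: str) -> float:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, ENTAILMENT].item()

def choose_option(premise: str, options: list) -> int:
    scores = [entailment_score(premise, o) for o in options]
    return max(range(len(options)), key=scores.__getitem__)

print(choose_option("The banana is ripe.",
                    ["The banana is yellow.", "The banana is green."]))
```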
4.4.6 Using Image Generation for the WinoViz Task
Another approach for our task is to utilize image generation. We generate images based on premise sentences and employ these generated images for our task. The generated images may contain useful information that assists in identifying the correct hypothesis. Given the generated images, there are three ways to use them. The first method uses CLIP [98] on the images and the hypothesis sentences to identify the better hypothesis option. Specifically, we calculate the cosine similarity between the embedding of a generated image and the embedding of a hypothesis option and select the hypothesis with the higher similarity score. The second approach generates captions for the generated images using a captioning model. Since language models cannot directly process images, we generate captions and utilize them as additional context for the task; BLIP2 [60] is employed for caption generation. The third strategy reframes our task as a visual question-answering task and employs a vision-language model to identify the better option. In this setup, we use InstructBLIP [26]. For image generation, we use Stable Diffusion [105], generating 5 images per premise sentence. The better hypothesis option is determined through majority voting.

Model                    Ind.    Pair
FLAN-T5-Base (No imgs)   69.23   42.10
CLIP-Large               56.63   29.42
FLAN-T5-XXL (No imgs)    85.34   71.53
FLAN-T5-XXL (Captions)   81.19   64.77
InstructBLIP             81.24   64.65
Table 4.6: Results on WinoViz with generated images. We use Stable Diffusion [105] to generate 5 images per premise sentence and adopt majority voting at inference time to choose an option. FLAN-T5-Base (No imgs) refers to a model without any generated images, with a size comparable to CLIP-Large. FLAN-T5-XXL (No imgs) refers to a model without any generated images, while FLAN-T5-XXL (Captions) refers to a model with captions generated by BLIP2 on the generated images: instead of directly inputting images into FLAN-T5, we extract captions from the generated images and use them as additional context. InstructBLIP uses the generated images directly.

Table 4.6 displays the outcomes related to image generation. The first approach, utilizing CLIP, falls short compared to FLAN-T5-Base, which is slightly smaller than CLIP-Large. In the second approach, involving BLIP2 captions, we use FLAN-T5-XXL as the benchmark, comparing one scenario with no additional data and another incorporating captions from generated images. Our experiment reveals a notable decline in performance when captions are employed. The third approach underperforms FLAN-T5-XXL by a large margin. These experiments collectively indicate that generated images offer limited utility for our task. Furthermore, a manual assessment of 100 generated images reveals that 66% of them do not contribute meaningfully to our objectives. Examples of generated images with premise sentences are shown in Figure 4.3; there, the bananas in both images are yellow, so the generated images do not provide any clues for choosing the more plausible option.

Figure 4.3: Examples of generated images. We generate images using Stable Diffusion [105]. In the second example, the bananas in both images are yellow, leading the model to select the incorrect option. The generated image examples do not assist in selecting a more plausible hypothesis option.
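The snippet below is a hedged sketch of the first (CLIP-based) variant of this pipeline: generate several images per premise with Stable Diffusion, score each hypothesis against each image with CLIP, and take a majority vote. The model identifiers and sampling settings are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# A sketch of the Stable Diffusion + CLIP + majority-voting variant (assumed settings).
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
sd = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def vote_with_generated_images(premise: str, options: list, n_images: int = 5) -> int:
    votes = [0] * len(options)
    for _ in range(n_images):
        image = sd(premise).images[0]                      # one generated image per call
        inputs = proc(text=options, images=image,
                      return_tensors="pt", padding=True).to(device)
        sims = clip(**inputs).logits_per_image[0]          # image-text similarity scores
        votes[int(sims.argmax())] += 1                     # this image's preferred option
    return max(range(len(options)), key=votes.__getitem__)

print(vote_with_generated_images("The banana is ripe.",
                                 ["The banana is yellow.", "The banana is green."]))
```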
4.5 Related Work
There are multiple perspectives on how our contributions relate to previous work, and we elaborate on them below.

Visual Knowledge Probing. Several attempts have been made to assess the reasoning ability of language models regarding objects, primarily through natural language benchmarks [86, 43, 89, 148, 115, 97]. Norlund, Hagström, and Johansson [86] introduced a task involving querying a multimodal model for visual commonsense knowledge related to memory colors, the typical colors associated with well-known objects. Hagström and Johansson [43] expanded on this work by proposing visual property norms as a measure of visual commonsense knowledge in both language models and multimodal models. Paik et al. [89] evaluated the color perception of language models using a color dataset called CoDa, revealing that reporting bias negatively affects model performance and that multimodal training can alleviate these effects. Zhang et al. [148] confirmed these findings and extended the evaluation to a wider range of visually salient properties. Similarly, Singh, Qasemi, and Chen [115] evaluated vision-language models on a visually accessible commonsense knowledge dataset. Liu et al. [73] explored spatial commonsense, the knowledge about spatial positions and relationships between objects, finding that image synthesis models are more capable of learning accurate and consistent spatial knowledge than other models. Gu, Mishra, and Clark [38] proposed a probing dataset for physical knowledge about everyday things. In contrast, we present a challenging dataset that probes the reasoning abilities of language models regarding the varying visual properties of objects in different contexts.

Vision-Language Modeling. Recent advances in vision-language (VL) models have led to success on vision-language tasks such as visual question answering, captioning, and grounding [6, 70, 78]. Existing VL models jointly learn image and text representations through cross-modal alignment, including VL-BERT [116], LXMERT [119], and Oscar [66]. Recent approaches have introduced visual instruction tuning, which fine-tunes a VL model on instruction-following data [72]. While these VL models have shown significant improvement on VL tasks, how to transfer visual knowledge from VL modeling to language tasks remains underexplored. Vokenization [120] utilized token-level text-to-image retrieval to transfer visual knowledge to language models. VidLanKD [121] employs contrastive learning to train a teacher model on video datasets and uses distillation approaches to transfer visual knowledge from the teacher to a student model. CMKT [51] investigated two types of knowledge transfer: text knowledge transfer (e.g., captions) and visual knowledge transfer (e.g., images and captions). Their findings demonstrate that such transfer can enhance performance on commonsense reasoning tasks. Our work also reveals that incorporating visual data into language models enhances their ability to reason about object properties across various conditions.

4.6 Conclusion
Examining real-world object properties requires a visual understanding that language models lack. In our study, we introduced WinoViz, a text-only question-answering dataset comprising 5,606 examples that probes language models' reasoning capabilities across various visual properties of objects under diverse contexts. Our findings revealed that large language models perform well overall but struggle with the multi-hop version of our dataset. Performance on the multi-hop data improved with chain-of-thought prompting. Vision-language models surpass their language-only counterparts, although image-generation approaches prove ineffective for our specific task. Future work will investigate how to efficiently transfer visual knowledge from images or captions.

4.7 Limitations
Our work is focused on a specific subset of language models and vision-language models. We adopt vision-language models in which the language backbones are pre-trained using image-caption datasets. Additionally, we employ Stable Diffusion for image generation, although the current output may not directly benefit our task. Utilizing state-of-the-art diffusion models could enhance image quality, yet the challenge of generating images useful for our task persists. Moreover, our observations indicate that large language models excel in our single-hop task, achieving up to 90% accuracy.
This suggests that these large models can effectively reason over visual knowledge even in the absence of explicit visual signals. Nonetheless, how visual signals can be harnessed to enhance language models is under-explored, and we defer it to future research endeavors. 58 Chapter 5 Learning Visual Knowledge in Language Tasks Pre-trained language models are still far from human performance in tasks that need understanding of properties (e.g. appearance, measurable quantity) and affordances of everyday objects in the real world since the text lacks such information due to reporting bias. In this work, we study whether integrating visual knowledge into a language model can fill the gap. We investigate two types of knowledge transfer: (1) text knowledge transfer using image captions that may contain enriched visual knowledge and (2) crossmodal knowledge transfer using both images and captions with vision-language training objectives. On 5 downstream tasks that may need visual knowledge to solve the problem, we perform extensive empirical comparisons over the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings. ∗ 5.1 Introduction Pre-trained language models (PTLMs) such as BERT [28], RoBERTa [74], and T5 [102] have shown impressive results in various conventional natural language understanding (NLU) tasks by capturing syntactic and semantic knowledge from the pre-training tasks of masked language modeling and masked span infilling tasks on massive text corpora. ∗ https://github.com/INK-USC/CMKT 59 Interesting facts about orange ! 1. Orange elevates mood levels. 2. Orange are often grown in the Mediterranean. Human 3. Oranges facing the sunnier tend to be sweeter. Typical facts about orange … 1. Orange is a shape of circle. 2. Orange is a color of orange. Report Already knows... May not report Figure 5.1: Reporting Bias. People tend to report what interests them rather than typical and general facts. Though yielding good performance on various NLU downstream tasks, these pre-training objectives suffer from a lack of out-of-domain knowledge that is not explicitly present in the pre-training corpus [41, 93, 111]. Specifically, one type of knowledge that models often struggle with is the visual knowledge of common objects such as attributes (e.g. appearance, measurable quantity) and affordances. This is because this kind of knowledge is rarely explicitly described in the training text due to reporting bias. For example, as shown in Figure 5.1, people tend to report what interests them rather than general facts such as a shape or color of oranges they already know. Towards better knowledge-enhanced PTLMs, recent works incorporate external knowledge bases (e.g., knowledge graph, dictionary) to inject entity knowledge into PTLMs [152, 92, 132, 146] or retrieve knowledge from external knowledge bases to solve the problem [68, 131]. However, these approaches still suffer from a lack of visual knowledge that is important to understand the real world. In this paper, we conduct systematic experiments to understand whether such visual knowledge can be transferred into LMs, and if so, how to perform effective knowledge transfer. Specifically, we look into a series of analysis question as follows: (1) Can intermediate pre-training [95] on image-caption pairs help transfer the knowledge? (2) What types of knowledge sources are more helpful? 
To answer questions, we explore various intermediate pre-training tasks [95] on two different sources: text-only (text knowledge transfer from visual domains) and image-caption pairs (cross-modal knowledge transfer). 60 For the text knowledge transfer, we utilize text corpus from visual domain, e.g., image captions. We leverage two training objectives for the language model: (1) masked language modeling follows the domain adaptive pre-training scheme [41], assuming the corpus contains enriched visual knowledge or physical commonsense knowledge; (2) text contrastive learning augments the sentence representation with dropout to create positive samples while considering all others in the batch as negative samples for the contrastive learning [36], assuming training better sentence representations leads to better understanding of the corpus. For the cross-modal knowledge transfer, we explore multiple methods to transfer visual-related knowledge to LMs: (1) masked language modeling with visual clues incorporates visual clues to capture dependencies between visual and linguistic contents [116]; (2) voken classification contextually aligns language tokens to their related images (called "vokens") to transfer visual knowledge into LMs [120]; (3)cross-modal contrastive learning aims to improve text representations by maximizing the agreement between correct image-text pairs versus random (in-batch) and adversarial negative pairs by contrastive learning between image and text modalities; and (4) cross-modal knowledge distillation transfers the knowledge from the teacher model, which is trained by cross-modal contrastive learning on image and text modalities, to the student language model using knowledge distillation. We perform comprehensive comparisons on five downstream tasks that may require visual or physical commonsense knowledge, including PIQA [11], Visual Paraphrasing (VP) [71], CSQA [118], OBQA [81], and RiddleSense [69]. Results suggest that: (1) Simple intermediate pre-training on captions can help improving performance on commonsense reasoning that needs physical or visual knowledge. (2) Crossmodal knowledge transfer approaches consistently improve the performance in a large margin when only few train examples are available. (3) Cross-modal contrastive learning shows that it is best for packaging visual knowledge into LMs. 61 (a) Masked Language Modeling (b) Text Contrastive Learning (TCL) (e) Cross-modal Knowledge Distillation (CMKD) A girl puts an apple in her bag. Transformer A girl puts an MASK in her MASK apple bag Transformer A girl puts an apple in her bag. A girl puts an envelope in her bag. pos_emb pos_emb neg_emb MS COCO Teacher LM Student LM Text Corpus Distillation Text Knowledge Transfer (c) Voken Classification … … Transformer A girl puts an MASK in her MASK Cross-modal Knowledge Transfer (d) Cross-modal Contrastive Learning (CMCL) Transformer A girl puts an apple in her bag. A girl puts an envelope in her bag. pos_emb neg_emb Image Encoder img_emb Figure 5.2: Illustration of different methods for transferring visual knowledge into transformerbased language model. In this example, we assume image-caption pair as an input. (a) masked language model [28] on image captions. (b) text contrastive learning obtains positive example by dropout representation to learn better sentence representation while negative augmentation is optional. (c) voken classification employs token-level text-to-image retrieval to transfer visual knowledge. 
(d) cross-modal contrastive learning aims to train correct paring of images and captions. (e) cross-modal knowledge distillation transfers knowledge from the teacher model, which is trained by cross-modal contrastive learning, into student model. 5.2 Analysis Setup In this work, we study how to transfer the visual knowledge into language models. For this study, we introduce our analysis setup: problem formulation, analysis questions, and knowledge corpora. 5.2.1 Problem Formulation We focus on a pre-trained text encoder fL and an image encoder fV if images are available. fL and fV are initialized with pre-trained model and we continue to pre-train the models on different sources and tasks, which we call intermediate pre-training [42, 96]. After the intermediate pre-training, we fine-tune fL on downstream NLU tasks. Existing NLU benchmarks have been trained against standard supervised learning paradigms that typically require a large number of question answering examples which need a large annotation efforts. However, in scenarios where the number of labeled examples is small, the model tends to overfit the training examples and shows poor generalization performance on test set. Here, we 62 evaluate the intermediate pre-training objective’s generalization ability on test set in both fully supervised and low-resource settings. 5.2.2 Analysis Questions In this paper, we provide a comprehensive study for transferring the visual knowledge into LMs. Visual knowledge transfer can be done in two approaches, depending on the source to be trained: (1) Text knowledge transfer using the text corpus in the visual domain, e.g., image captions and (2) cross-modal knowledge transfer which passes visual knowledge about common objects to LMs by training over paired image and captions. By evaluating the model on 5 downstream datasets that require physical and visual commonsense knowledge, we explore following three research questions. Q1: Can intermediate pre-training on external knowledge sources help transfer visual knowledge to augment text encoders? We investigate diverse intermediate pre-training methods with external knowledge sources including caption data to inject visual information from images and captions into LMs. We first analyze the performance of text and cross-modal knowledge transfer methods with a imagecaption dataset, and we additionally study text knowledge transfer methods with other text corpora such as GenericsKB [10], Wiki103 [80] and BookCorpus [156]. Q2: What types of knowledge sources are more helpful for visual knowledge transfer? As mentioned above, we have two categories to exploit visual information: (1) text knowledge transfer and (2) cross-modal knowledge transfer. Here, we explore which type of knowledge transfer is more useful to transfer the visual knowledge into LMs. Q3: What intermediate pre-training objectives are effective for cross-modal knowledge transfer? We present three pre-training objectives for cross-modal knowledge transfer: (1) voken classification, (2) contrastive learning, and (3) knowledge distillation. Here, we want to present which strategy is best suited 63 Dataset # Train # Dev # Test # choices PIQA 14,113 1,838 2,000 2 VP 21,988 2,000 6,057 2 CSQA 8,500 1,221 1,241 5 OBQA 4,957 500 500 4 RiddleSense 3,510 1,021 1,202 5 Table 5.1: Downstream task data statistics. We create in-house test set for PIQA and CSQA, and inhouse dev set for VP by splitting the train set. for cross-modal knowledge transfer. 
Furthermore, we study how to enhance cross-modal contrastive learning with adversarial negative samplings. 5.2.3 Pre-training Data To transfer the visual knowledge, we collect 250K image-caption pairs from MS COCO [70, 19]. MS COCO contains images reflecting the composition of actual everyday scenes and corresponding captions which describe contextual reasoning between objects in the scene. We only use captions for text knowledge transfer while we use both images and captions for cross-modal knowledge transfer. As an ablation study, we explore other text corpora such as GenericsKB [10], Wiki103 [80] and BookCorpus [156]. 5.2.4 Downstream Tasks and Datasets For downstream benchmarks, we find tasks that can benefit from visual knowledge: multiple choice question answering tasks including PIQA [11] which requires physical commonsense reasoning, CSQA [118] for general understanding of commonsense reasoning, OBQA [81] that needs elemenatry-level science knowledge, and RiddleSense (RS) [69] for complex understanding of figurative language, and binary classification task including Visual Paraphrasing (VP) [71] that needs scene understanding. We use in-house test sets made from training sets for PIQA and CSQA since test set is not provided to public. We list the data statics in Table 5.1. Moreover, We additionally test on GLUE [130] to evaluate the general text understanding. 64 5.2.5 Evaluation Protocol We evaluate the models in both fully supervised and low-resource settings. For both settings, we consider accuracy for 5 different classification tasks and get average performance over tasks to check the final performance. In the fully supervised setting, we evaluate models with 3 different random seeds and report the average accuracy. In the low-resource setting, we set the size of the train data to 64 or 128. For each experiment, we run over 5 different sub-samples and show the average accuracy. 5.3 Method In this section, we introduce the following two approaches to integrate visual knowledge into LMs: (1) text knowledge transfer; and (2) cross-modal knowledge transfer. Throughout this section, we assume the data is a collection of image x v and caption x l pairs (x v i , xl i ) m i=1 (m is the size of the pairs) and image encoder fV and text encoder fL are given. Note that we use the same text encoder. 5.3.1 Text Knowledge Transfer For text knowledge transfer, we investigate following pre-training objectives: (1) masked language modeling; and (2) text contrastive learning. Masked Language Modeling (MLM) Following BERT [28], we select 15% of input tokens and replace them with [MASK]. Of the selected tokens, 80% are replaced, 10% are not changed and 10% are replaced by random vocabulary token. Here, we employ dynamic masking, which performs random masking and replacement during training to prevent the same masking for the same examples [74]. MLM objective is the cross-entropy loss for masked token predictions : ℓMLM(x l i ) = − log p(x l i |x masked), (5.1) 65 where xi is the i-th token and x masked is a mask. Text Contrastive Learning (TCL) Contrastive learning aims to learn representations by pulling positive pairs closer and pushing negative pairs apart. Here, we employ the contrastive framework with cross-entropy objective and in-batch negatives [18, 36]. Given a text encoder fL, and a caption x l i , we first get text representations using the encoders h l i = fL(x l i ). Following Gao, Yao, and Chen [36], we create identical positive sample h l + i by different dropout representations. 
The contrastive loss is defined as follows: ℓ l i = − log e sim(h l i ,hl+ i )/τ PN j=1 e sim(h l i ,hl j )/τ , (5.2) where N is a batch size and sim(·) represents cosine similarity, i.e., sim(u, v) = u·v/∥u∥∥v∥. τ represents a temperature parameter. 5.3.2 Cross-modal Knowledge Transfer Language models might learn additional information from visual sources such as images and captions. So we include a variety of vision-based approaches and investigate the approaches whether they can benefit from visual sources. We introduce vision-based approaches as follows. Voken Classification Vokenization [120] employs token-level text-to-image retrieval to transfer visual knowledge. It aligns language tokens to their related images (called “vokens”) to transfer visual knowledge into LMs, and call it “voken classification”. Given text x and a voken vi for the i-th token, the loss is defined as ℓ voken i = − log(p(vi |x)). (5.3) Similar to masked language modeling, it classifies each token to a corresponding voken. Vokenization trains language models with the voken classification task and MLM. 66 A girl puts an apple in her bag. A girl puts an [MASK] in her bag. Mask a token A girl puts an envelope in her bag. Top-k predictions from LM Figure 5.3: LM perturbation. We create adversarial negatives using language models. Masked Language Modeling with Visual Clues VL-BERT [116] adopts masked language modeling with visual clues in which models are given a caption with masked tokens and an image and predict the masked tokens using visual clues. VL-BERT is pre-trained on Conceptual Captions [112] as an imagecaption corpus, and BooksCorpus [157] and English Wikipedia as text-only corpora. It shows its effectiveness in many vision-language tasks. We investigate whether this model also succeed in NLP tasks and compare it with others. Cross-modal Contrastive Learning (CMCL) To harness the visual knowledge from image-caption datasets, we adopt contrastive loss on image and text vectors. Given an image encoder fV , a text encoder fL, and an image-caption pair (x v i , xl i ), we first get image and text representations using the encoders h v i = fV (x v i ), hl i = fL(x l i ). Then the contrastive learning objective contains two loss functions: an imageto-text contrastive loss ℓ (v,l) and a text-to-image contrastive loss ℓ (l,v) . The image-to-text contrastive loss is defined as follows: ℓ (v,l) i = − log e sim(h v i ,hl i )/τ PN j=1 e sim(h v i ,hl j )/τ , (5.4) 67 where N is a batch size and sim(·) represents cosine similarity. This loss encourages a closer distance between representations of aligned image-caption pairs than unaligned pairs given an image and multiple captions. Similarly, the text-to-image contrastive loss ℓ (l,v) is defined as follows: ℓ (l,v) i = − log e sim(h l i ,hv i )/τ PN j=1 e sim(h l i ,hv j )/τ . (5.5) The final loss is defined as L = 1 N X N i=1 (ℓ (v,l) i + ℓ (l,v) i ). (5.6) CLIP [98] and ConVIRT [151] also adopt contrastive learning, but we freeze the image encoder in training and use the trained text encoder for downstream tasks. CMCL with Adversarial Negative Samples (ANS) As in-batch negatives in CMCL are not challenging enough for models to distinguish, we present adversarial negative sampling strategy to improve CMCL. Given an image-caption pair (x v i , xl i ), we define a LM-perturbed sentence x l− i , which is a hard negative where n is replaced with a different word n ′ from a probability distribution of PTLMs. 
We expect the l − is syntactically correct and plausible sentence even the word n is replaced to n ′ , while it does not semantically match to the corresponding image x v i . With such hard negative, we try to make more challenging task so that models can effectively learn from the task. For example, we choose a word ‘girl’ in the sentence ‘A girl puts an apple in her bag.’ in Figure 5.3. Then we mask the word with [MASK] token to do masked token predictions by PTLMs. Then we get top-k predictions from language models and replace the masked tokens with one of the predicted ones. To avoid false negative sentences which may have the same semantics as the original sentence, we introduce an additional filtering step: if the masked predictions are synonyms 68 Model PIQA VP CSQA OBQA RiddleSense Average 64 128 64 128 64 128 64 128 64 128 64 128 - BERT-base 52.6±0.9 53.8±0.1 85.9±1.1 86.6±0.7 35.8±0.7 37.8±0.3 31.3±1.2 32.0±0.7 24.7±0.1 25.2±0.2 46.1 47.1 Caption MLM 53.1±0.2 54.3±0.3 86.5±0.3 87.3±0.4 35.7±0.3 36.7±0.1 33.4±0.6 34.2±0.3 26.3±0.1 26.5±0.2 47.0 47.8 TCL 52.6±0.5 52.9±0.6 86.4±0.1 88.0±0.1 35.7±0.2 36.1±0.3 34.2±1.4 35.2±0.7 30.3±0.5 30.7±0.4 47.8 48.5 TCL + MLM 53.6±0.7 54.6±0.2 84.2±0.2 87.6±0.3 33.6±2.2 35.1±0.6 31.8±2.3 34.3±0.5 20.6±0.0 20.6±0.0 44.7 46.4 TCL + ANS 50.0±0.7 50.5±0.6 67.3±0.4 68.2±0.7 26.8±1.2 27.5±0.5 33.4±1.1 35.0±1.0 26.1±1.7 26.5±1.8 40.7 41.5 TCL + PSA + ANS 51.1±0.1 51.2±0.4 66.0±0.0 66.0±0.0 22.7±0.9 22.9±0.1 30.2±3.1 31.8±0.4 23.5±1.2 25.2±1.5 38.7 39.4 Caption-Image Pairs VL-BERT-base 53.1±0.6 53.9±0.4 88.5±0.3 88.4±0.5 36.2±0.7 36.8±0.8 33.4±1.2 34.6±1.2 26.1±0.8 26.1±0.9 47.7 48.5 Vokenization 50.5±0.5 51.1±0.4 68.8±1.6 78.1±1.9 19.2±1.4 21.5±0.8 31.2±2.7 33.2±2.2 17.1±0.5 16.7±0.7 37.3 40.1 VidLanKD 55.0±0.4 55.6±0.5 86.7±0.5 88.5±0.5 37.1±1.0 38.6±0.5 31.8±1.3 32.6±1.0 24.4±0 24.4±0 47.0 47.9 VidLanKD variant 55.3±0.3 55.2±0.4 87.4±0.1 88.2±0.6 37.3±1.2 38.9±0.5 32.4±2.1 32.2±1.1 24.4±0.0 24.4±0.0 47.3 47.7 CMKD (VL-BERT-large) 54.7±0.5 54.5±0.2 86.5±0.8 88.4±0.4 36.7±0.4 38.5±0.4 29.8±0.8 31.7±0.2 25.2±0.1 25.2±0.0 46.5 47.6 CMCL 54.7±0.4 55.1±0.1 87.9±0.3 88.9±0.2 36.3±0.3 38.4±0.4 31.1±1.1 32.8±0.9 25.0±0.2 25.4±0.4 47.0 48.1 CMCL + ANS 55.4±0.1 55.7±0.2 88.1±0.9 88.9±0.7 37.5±0.8 39.0±0.2 32.2±0.7 32.0±0.6 27.4±0.0 27.5±0.1 48.1 48.6 CMCL + PSA + ANS 55.4±0.2 55.1±0.2 88.8±1.0 88.2±0.2 37.0±0.3 38.1±0.3 34.1±0.4 34.8±0.9 26.7±0.4 28.8±0.7 48.4 49.0 Table 5.2: Performance (accuracy) in low-resource setting. We test models on diverse datasets with low-resource learning (64 and 128 training samples). We use captions in the MS COCO dataset for text knowledge transfer methods and images and captions for cross-modal knowledge transfer methods. We get average performance on 64 and 128 training samples. Bold and underlined numbers refer to the best and second-best performance, respectively. or hypernyms of the original tokens, we discard the predictions. We use WordNet [83] to find synonyms and hypernyms. The contrastive loss with hard negative is defined as follows: − log e sim(h v i ,hl i )/τ PN j=1 e sim(h v i ,hl j )/τ + PM k=1 e sim(h v i ,hl− j )/τ , (5.7) where M is the number of hard negative samples per positive pair. This formula is only for image-to-text contrastive loss ℓ (v,l) and final loss is defined to same as equation (5.6). CMCL with Positive Sample Augmentation (PSA) In ANS, we filter perturbed sentences where the masked predictions are synonyms or hypernyms of the original tokens. 
Instead of excluding these perturbed sentences, another option is to include them as additional positive samples l + to the paired images. We name this as positive sample augmentation (PSA). It also adopts LM-perturbed negative samples as in ANS. Cross-modal Knowledge Distillation (CMKD) Cross-modal knowledge distillation is to transfer knowledge between different modalities, e.g., image modality and text modality. In this category, CMKD is 69 to transfer knowledge from a teacher model which is knowledgeable about visual information. VidLanKD [121] also utilizes a cross-modal knowledge distillation method to help with general language understanding. A teacher model is first trained using contrastive learning on a video-text dataset, and then it transfers its knowledge to a student language model using KD on a text corpus. Their contrastive learning loss (hinge loss) is defined as L = X N i [max(0, α − sim(h v i , hl i ) + sim(h v ′ i , hl i )) + max(0, α − sim(h v i , hl i ) + sim(h v i , hl ′ i ))], (5.8) where v ′ and l ′ are a random image and caption text, respectively. α is the margin between the similarities of a positive pair and a negative pair. Instead of video datasets, we use a MS COCO dataset to train a teacher model and use two versions of contrastive learning, equations (5.6) and (5.8). We call the version with equation (5.8) VidLanKD and equation (5.6) VidLanKD variant. As another version of CMKD, we consider distilling visual knowledge from a pre-trained vision-language model, VL-BERT, which is knowledgeable about grounded language. We adopt masked language modeling on Wikitext103 [80], a subset of English Wikipedia, in the knowledge distillation step. For knowledge distillation, we adopt Neuron Selectivity Transfer (NST) [46], which proves the effectiveness in VidLanKD [121]. 5.4 Experimental Settings For all the approaches, we use bert-base-uncased [28] as text encoder fL and ResNeXt101 [139] as an image encoder fV . We continue to pre-train the encoders in our experiments. For text knowledge transfer, (1) MLM follows the exact setting of codebase in huggingface† which uses dynamic masking strategy to conduct language modeling task. (2) TCL conducts contrastive learning with fL. We choose the best checkpoint by the best spearman correlation on STSb [16]. For cross-modal knowledge transfer, (1) CMKD † https://github.com/huggingface/transformers/tree/master/examples/pytorch/language-modeling 70 Model PIQA VP CSQA OBQA RiddleSense Average - BERT-base 62.5±1.3 93.1±0.4 53.2±1.2 52.2±0.5 38.9±0.9 59.9 Caption MLM 63.8±0.9 93.5±0.1 52.6±0.3 53.9±1.1 39.3±1.4 60.6 TCL 62.1±0.5 93.5±0.4 49.0±0.5 54.1±1.0 41.2±0.3 60.1 TCL + MLM 62.3±0.7 93.2±0.3 49.0±0.4 49.0±0.8 40.5±0.5 58.8 TCL + ANS 60.1±1.2 93.3±0.1 47.0±0.1 50.2±0.9 36.7±0.8 57.4 TCL + PSA + ANS 59.5±1.0 92.4±0.3 34.0±1.3 44.6±1.4 28.4±2.3 51.7 Caption-Image Pairs VL-BERT-base 63.8±1.5 93.6±0.1 50.3±1.1 49.6±2.3 39.1±1.0 59.2 Vokenization 58.4±5.1 92.7±0.3 45.0±0.2 48.1±0.8 33.5±0.7 55.5 VidLanKD 63.1±1.1 93.7±0.4 52.4±0.8 50.6±3.9 39.5±1.7 59.8 VidLanKD variant 64.1±0.2 93.8±0.3 53.6±0.5 47.9±4.3 38.8±2.0 59.6 CMKD (VL-BERT-large) 63.8±0.0 93.7±0.7 53.3±1.4 48.7±3.0 38.7±0.4 59.6 CMCL 62.7±0.1 93.3±0.3 50.8±0.9 52.3±0.7 37.6±1.0 59.2 CMCL + ANS 63.5±0.1 93.3±0.3 50.3±0.1 52.9±0.3 38.4±0.9 59.7 CMCL + PSA + ANS 63.9±0.5 94.3±0.1 50.9±0.3 52.4±1.2 39.0±0.3 60.1 Table 5.3: Performance (accuracy) in fully supervised setting. Bold and underlined numbers refer to the best and second-best performance, respectively. 
explores VL-BERT, Vokenization, and VidLanKD approaches. Here, we use VL-BERT-large model to do CMKD. We use the VL-BERT and Vokenization checkpoints from their official codebases‡ . VidLanKD trains a teacher model by two versions of contrastive learning (equations (5.6) and (5.8)) on MS COCO dataset. We call the version with equation (5.6) VidLanKD variant. We set α = 1 in VidLanKD (equation (5.8)). (2) CMCL conducts contrastive learning with fL and fV . Here, we set τ = 0.05 (equations (5.2) and (5.4)). (3) CMCL with ANS chooses three noun words or verb words to do masked prediction and use top-5 predictions from fL as replacement. We filter out synonyms and hypernyms of original words using WordNet [83]. (4) CMCL with PSA includes the perturbed sentences with synonyms and hypernyms as additional positive samples. In CMCL, we adopt ResNeXt101 [139] as an image encoder fV and BERT as a text encoder fL. TCL and CMCL train with batch size 64, maximum sequence length 20, learning rate 1e-4 for 3 epochs. For fine-tuning on downstream tasks, we do grid search on learning rates {5e-5, 1e-4, 3e-4, 4e-4, 5e-4, 6e-4} and choose the best learning rate. We set maximum epochs to 30 in low-resource and 15 in fully supervised settings. ‡ https://github.com/jackroos/VL-BERT, https://github.com/airsplay/vokenization 71 Model RTE MRPC STS-B CoLA SST-2 QNLI QQP Avg. - BERT-base 70.0 87.9 89.1 57.4 91.3 90.4 89.3 82.3 Caption MLM 62.8 87.0 89.1 53.9 92.6 91.1 90.9 81.0 TCL 58.4 83.1 88.2 55.5 91.9 91.4 90.9 79.9 TCL + MLM 54.8 81.6 87.2 53.6 91.9 90.9 89.2 78.5 TCL + ANS 56.3 83.9 87.0 51.5 91.3 91.2 89.4 78.6 TCL + PSA + ANS 52.3 75.6 81.5 17.4 90.0 85.8 88.2 70.1 Caption-Image Pairs VL-BERT-base 57.4 85.7 89.5 58.1 90.6 89.7 88.7 80.0 Vokenization 53.0 87.0 83.3 51.3 91.4 89.2 88.5 77.7 VidLanKD 67.5 87.8 89.4 57.7 90.7 90.3 88.6 81.7 VidLanKD variant 68.5 87.9 89.7 54.9 91.1 90.5 88.6 81.6 CMKD (VL-BERT-large) 68.5 88.5 89.3 55.4 90.9 89.7 88.6 81.6 CMCL 63.5 82.5 89.5 51.1 90.4 90.0 88.4 79.3 CMCL + ANS 69.6 86.8 89.4 56.1 90.7 90.5 88.6 81.7 CMCL + PSA + ANS 69.8 86.2 89.0 55.3 90.4 90.5 88.6 81.6 Table 5.4: Performance (accuracy) on GLUE benchmark. Bold and underlined numbers refer to the best and second-best performance, respectively. 5.5 Results and Analysis We analyze the main results of intermediate pre-training. Tables 5.2 and 5.3 show the main results of low-resource learning and fully supervised learning with the MS COCO captioning dataset, respectively. We train the models with a few training examples, 64 and 128, to understand the better initialization. We argue that if a model obtains better performance in the low-resource setup, then it is a faster learner and has better generalization on downstream tasks. Can text intermediate pre-training help improve text encoders? Text intermediate pre-training using MLM and TCL on a caption corpus improves the performance on downstream tasks in both lowresource and fully supervised settings. In particular, TCL shows significant improvement on OBQA and RiddleSense over BERT (p-value < 0.01). These results suggest that text intermediate pre-training on visual-related datasets helps performance on commonsense reasoning tasks. 72 Can cross-modal intermediate pre-training help transfer visual knowledge to augment text encoders? We observe that cross-modal intermediate pre-training is helpful in both fully supervised and low-resource settings (See Table 5.2 and 5.3). 
Specifically, CMKD with VidLanKD variant outperforms the baseline by 1.6% point on the PIQA dataset in fully supervised setting. CMCL also shows its effectiveness. However, we could find that it becomes more powerful when equipped with PSA and ANS. It suggests that data augmentation for positive and negative sampling is an important factor for CMCL. In low-resource setting, we find that cross-modal knowledge transfer helps better initialization and lets models learn new tasks faster. What intermediate pre-training objectives are effective for cross-modal knowledge transfer? Among various cross-modal knowledge transfer methods, we study which method is the most effective for cross-modal knowledge transfer. Overall, CMCL with PSA and ANS shows the best performance among all cross-modal methods. Interestingly, VL-BERT also shows better performance than BERT-base on all datasets in the low-resource setting. This suggests that exploiting images in masked language modeling task help transfer the knowledge to language models. What types of knowledge sources are most helpful? Here, we investigate whether using an image source in addition to a text source can further improve the model. To answer this question, we analyze methods from different types of sources: text-only and text-image pair sources. We focus on the methods that use the contrastive learning objective: TCL and CMCL. Note that these two methods share the same objective but CMCL trains on cross modalities which are images and captions while TCL only trains on captions. Overall, TCL performs slightly better than CMCL in low-resource and fully supervised settings. Interestingly, additional negative samples (ANS) and positive samples in TCL decreases the performance while they help CMCL to improve the performance. We conjecture that perturbed sentences in ANS might not be semantically negative to the original sentence so models learn from wrong labels. 73 Model PIQA VP CSQA OBQA RiddleSense 64 128 Full 64 128 Full 64 128 Full 64 128 Full 64 128 Full - BERT-base 52.6±0.9 53.8±0.1 62.5±1.3 85.9±1.1 86.6±0.7 93.1±0.4 35.8±0.7 37.8±0.3 53.2±1.2 31.3±1.2 32.0±0.7 52.2±0.5 24.7±0.1 25.2±0.2 38.9±0.9 CP. MLM 53.1±0.2 54.3±0.3 63.8±0.9 86.5±0.3 87.3±0.4 93.5±0.1 35.7±0.3 37.7±0.1 52.6±0.3 33.4±0.6 34.2±0.3 53.9±1.1 26.3±0.1 26.5±0.2 39.3±1.4 TCL 52.6±0.5 52.9±0.6 62.1±0.5 86.4±0.1 88.0±0.1 93.5±0.4 35.7±0.2 36.1±0.3 49.0±0.5 34.2±1.4 35.2±0.7 54.1±1.0 30.3±0.5 30.7±0.4 41.2±0.3 GK. MLM 53.2±0.1 53.6±0.4 64.9±0.1 86.2±0.9 87.6±0.3 93.0±0.3 34.6±0.7 35.3±1.3 51.6±0.5 31.7±0.9 32.3±1.0 53.1±0.9 25.8±0.6 26.3±0.1 39.3±0.7 TCL 56.0±1.0 56.4±0.2 64.4±0.1 88.9±0.7 89.4±0.2 93.3±0.5 37.8±1.2 38.7±0.5 51.0±0.5 31.7±0.9 32.3±1.0 52.6±0.8 27.4±0.2 28.1±0.7 40.9±0.8 BC. MLM 54.1±0.3 54.1±0.8 63.3±0.6 86.4±0.8 87.5±0.5 93.0±0.3 29.8±0.8 32.1±0.9 50.8±0.3 29.6±0.8 31.4±0.7 50.2±0.4 22.6±0.0 22.7±0.0 36.7±1.3 TCL 52.4±0.1 53.1±0.4 63.1±0.3 87.1±1.9 89.7±0.1 93.2±0.2 38.0±0.5 38.1±1.1 51.5±0.1 33.8±2.7 34.0 ±2.1 55.6±0.4 28.9±0.4 29.1±0.3 41.2±2.3 WT. MLM 52.7±0.2 53.0±0.3 63.8±0.6 85.3±2.8 88.1±0.3 93.5±0.1 33.2±1.4 34.6±0.5 52.5±0.2 32.4±2.3 33.0±0.7 52.3±0.3 24.4±0.0 24.4±0.0 39.4±2.0 TCL 52.9±0.9 53.4±0.4 62.7±0.6 67.3±0.6 68.6±0.7 93.3±0.3 31.3±1.6 32.4±0.7 48.2±0.3 31.5±3.5 33.1±0.6 53.0±0.0 24.8±1.3 24.8±0.6 36.3±1.0 Table 5.5: Results of text knowledge transfer methods with different corpora. We pre-train text knowledge transfer methods, MLM ans TCL, with different corpora. CP is MS COCO captions, GK is GenericsKB, BC is BooksCorpus, and WT is WikiText. 
Bold and underlined numbers refer to the best and second-best performance, respectively. 5.5.1 Ablation Study How do models perform on general NLU tasks? Table 5.4 presents results on GLUE benchmark. In GLUE, text intermediate pre-training methods slightly underperform the original BERT-base. We conjecture that the intermediate pre-training on caption data might sacrifice knowledge of general language understanding. Analysis on diverse text corpora Table 5.5 represents text approaches with different pre-training corpora: MS COCO captions [70, 19], GenericsKB [10], BooksCorpus [156], and WikiText103 [80]. We sample 250k sentences from each corpus for a fair comparison. We notice that caption datasets are useful on OBQA and RiddleSense datasets while GenericsKB are the most helpful on PIQA datasets. Results are expected since GenericsKB contains a lot of everyday statements that contain various types of commonsense. Different training sizes. We test different training sizes on PIQA in Fig. 5.4. In the experiment, we observe that CMCL consistently outperforms BERT on all training sizes. Additional negative sample (ANS) improves the CMCL on different training sizes, and positive sample augmentation boosts the performance of CMCL further. This suggests including perturbed sentences as positive and negative samples are useful to cross-modal knowledge transfer. 74 0 500 1000 1500 2000 Training size 51 52 53 54 55 56 57 58 ACC on PIQA BERT CMCL CMCL+ANS CMCL+PSA+ANS Figure 5.4: Results on varying training sizes. We test methods with different training sizes. 5.6 Related Work Text Knowledge enhanced methods. Recently, huge efforts on integrating knowledge into PTLMs have been made. One typical form of knowledge is a knowledge graph. There have been efforts of using knowledge graph to inject entity and relation representations, which are pre-computed from external source, into PTLMs [152, 140, 92, 44, 141]. Some other works try to retrieve or generate the sub-graph from the graph to solve the problem [68, 131]. Another existing form of knowledge is extra large-scale corpus. Works that use such corpus present knowledge-related pre-training objectives such as concept order recovering [154], entity category prediction [145] and source of knowledge prediction [132, 15]. They are mostly focused on injecting world knowledge presented in text, rather than physical and visual commonsense knowledge that can be found in images. Cross-modal knowledge enhanced methods. There is a extensive line of works for a variety of visionlanguage tasks, such as VL-BERT [116], VisualBert [63], and Uniter [21]. These models aim to improve vision-language tasks, e.g., VQA [37] and event understanding [65], and they are found to be not effective 75 in improving language tasks [120]. Another line of works is to transfer visual knowledge to language models: Vokenization [120] and VidLanKD [121]. Vokenization employs token-level text-to-image retrieval to transfer visual knowledge to language models. For this, Vokenization introduces 30k vokens and matches each token into the limited voken space. VidLanKD adopts contrastive learning to train a teacher model on video datasets and uses distillation approaches to distill visual knowledge from the teacher to a student model. 5.7 Conclusion We study whether intermediate pre-training on visual knowledge can help transfer visual knowledge into LMs. We investigate text knowledge transfer and cross-modal knowledge transfer using images and captions. 
In our empirical analysis, we observe that intermediate pre-training on captions can help improving performance and cross-modal knowledge transfer approaches consistently improve performance. When the transfer methods are equipped with additional positive and negative samples, they show better performance. Future works include improving both commonsense reasoning and general language understanding. 76 Chapter 6 Saliency-aware Knowledge Distillation for Multimodal Understanding To reduce a model size but retain performance, we often rely on knowledge distillation (KD) which transfers knowledge from a large “teacher" model to a smaller “student" model. However, KD on multimodal datasets such as vision-language tasks is relatively unexplored, and digesting multimodal information is challenging since different modalities present different types of information. In this paper, we perform a large-scale empirical study to investigate the importance and effects of each modality in knowledge distillation. Furthermore, we introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher’s behavior within each modality. The idea aims at mimicking a teacher’s modality-specific predictions by introducing auxiliary loss terms for each modality. Furthermore, because each modality has different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses. We further study a weight learning approach to learn the optimal weights on these loss terms. In our empirical analysis, we examine the saliency of each modality in KD, demonstrate the effectiveness of the weighting scheme in MSD, and show that it achieves better performance than KD on four multimodal datasets. 77 0.0 0.5 Density Multi 0.0 0.5 Image 0.0 0.5 Text Teacher Small model KD Probability Figure 6.1: Density of model outputs on Hateful-Memes: given multimodality samples as input (Multi), given only image modality as input (Image), and given only text modality as input (Text). KD denotes a student model with knowledge distillation and the small model is a student model without distillation. We observe that there is still a prediction gap between the teacher and the student trained by KD. In this paper, we study saliency explanations for each modality and propose modality-specific distillation (MSD) to minimize the gap. 6.1 Introduction Recent advances in computer vision and natural language processing are attributed to deep neural networks with a large number of layers. Current state-of-the-art architectures are getting wider and deeper with billions of parameters, e.g., BERT [30] and GPT-3 [13]. Such wide and deep models suffer from high computational costs and latencies at inference. To mitigate the heavy computational cost and the memory requirement, there have been several attempts to compress a larger model (a teacher) into a smaller model (a student) [8, 45, 106, 90, 84]. Among them, knowledge distillation (KD) [45] assumes the knowledge in the teacher as a learned mapping from inputs to outputs and transfers the knowledge from a larger model to a smaller model. Recently, KD has been explored in various studies such as improving a student model [45, 90, 106, 122, 84] and improving a teacher model itself by self-distillation [138, 56, 34]. 78 There has been a lot of interest in multimodal distillation setup such as cross-modal distillation [39, 122]. 
Multimodal problems involve relating information from multiple sources. For example, visual question answering (VQA) requires answering questions about an image [6, 37, 40, 114] and models should incorporate information from the text and image sources to answer the questions. Multimodal problems are important because many real-world problems require understanding signals from different modalities to make accurate predictions; information on the web and social media is often represented as textual and visual descriptions. Digesting such multimodal information in an effective manner is challenging due to their different types of information on each modality. In this paper, we offer a large-scale, systematic study on the effects of each modality through saliency explanations in KD. While KD approaches can be applied to multimodal applications, the student and teacher models may significantly differ in their outputs using each modality as input. We illustrate the point in Fig. 6.1. To minimize the gaps, we introduce a multimodal KD framework, modality-specific distillation (MSD), that aims to mimic the teacher’s modality-specific predictions. We show that the samples’ modalities have a different amount of information. Based on this observation, we improve the knowledge transfer by splitting the multimodality into separate modalities, using them as additional inputs, and thus distilling the modality-specific behavior of the teacher. MSD introduces auxiliary losses per modality to encourage each modality to be distilled effectively. To maximize the effect of modality-specific distillation, we investigate multiple weighting schemes to balance out the auxiliary losses. One of the weighting schemes is based on modality saliency scores that are proxy scores to modality importance. Furthermore, we leverage a meta-learning method to introduce weight-learning to automatically learn optimal weights per sample per modality. 6.2 Preliminaries In this section, we define notations and revisit conventional knowledge distillation (KD). 79 6.2.1 Problem Definition Given a trained and frozen teacher model T and a student model S, the output of our task is a trained student model. Our goal is to transfer knowledge from the teacher to the student on multimodal datasets. We let fT and fS be functions of the teacher and the student, respectively. t and s refer to the softmax output of the teacher and the student. Typically the models are deep neural networks and the teacher is deeper than the student. The function f can be defined using the output of the last layer of the network (e.g., logits). X is a multimodal (language-vision) dataset, Xt refers to only the text modality of X, Xv refers to only the image modality of X, and xi is a dataset instance. In this work, we focus on one text and one image modalities, but it is easy to extend the work to more/other modalities. 6.2.2 Conventional Knowledge Distillation In knowledge distillation [45], a student is trained to minimize a weighted sum of two different losses: (a) cross entropy with hard labels (one-hot encodings on correct labels) using a standard softmax function, (b) cross entropy with soft labels (probability distribution of labels) produced by a teacher with a temperature higher than 1 in the softmax of both models. The temperature controls the softness of the probability distributions. 
Thus, the loss for the student is defined as:

$\mathcal{L}_{\text{student}} = \lambda \mathcal{L}_{\text{CE}} + (1 - \lambda)\,\mathcal{L}_{\text{KD}}$,   (6.1)

where $\mathcal{L}_{\text{CE}}$ is a standard cross-entropy loss on hard labels, $\mathcal{L}_{\text{KD}}$ is a distillation loss, which is a cross-entropy loss on soft labels, and $\lambda \in [0, 1]$ controls the balance between hard and soft targets. To be specific, knowledge distillation [45] minimizes the Kullback-Leibler divergence between the soft targets from the teacher and the probabilities from the student. The soft targets (or soft labels) are defined as the softmax of the outputs of $f_T$ with temperature $\tau$. The distillation loss is defined as follows:

$\mathcal{L}_{\text{KD}} = \tau^2 \frac{1}{|X|} \sum_{x_i \in X} \mathrm{KL}\big(t(x_i;\tau),\, s(x_i;\tau)\big)$,   (6.2)

where $t(x_i;\tau) = \sigma\big(\frac{f_T(x_i)}{\tau}\big)$, $s(x_i;\tau) = \sigma\big(\frac{f_S(x_i)}{\tau}\big)$, and $\sigma$ is a softmax function. The temperature parameter $\tau$ controls the entropy of the output distribution (a higher temperature $\tau$ means higher entropy in the soft labels). Following Hinton, Vinyals, and Dean [45], we scale the loss by $\tau^2$ in order to keep gradient magnitudes approximately constant when changing the temperature. We omit $\tau$ for brevity.

Limitations. While this form of KD can be applied to multimodal setups, the student is trained only to directly mimic the teacher's outputs on the full multimodal input. As a result, the student and teacher models may significantly differ in their outputs given a single-modality input, i.e., in their modality-specific outputs, which may lead to inefficient distillation (Fig. 6.1). To better mimic the teacher's behaviors, we introduce a multimodal KD approach, modality-specific distillation, in the next section.

6.3 Analysis Setup
In this section, we introduce a multimodal KD approach, modality-specific distillation, to understand the importance of each modality (§6.3.1), the experimental setup (§6.3.2), and the datasets for the experiments (§6.3.3).

6.3.1 Modality-specific Distillation
The idea of MSD is to feed each modality as a separate input into the teacher and the student, and to transfer the modality-specific knowledge of the teacher to the student. Specifically, MSD introduces two loss terms, $\mathcal{L}_{\text{textKD}}$ and $\mathcal{L}_{\text{imageKD}}$, to minimize the difference between the probability distributions of the teacher and the student given each modality (assuming text and image as the only two modalities):

$\mathcal{L}_{\text{textKD}} = \tau^2 \frac{1}{|X_t|} \sum_{x_i \in X_t} \mathrm{KL}\big(t(x_i),\, s(x_i)\big)$.   (6.3)

$\mathcal{L}_{\text{imageKD}}$ is defined similarly; the input is the image modality instead. With the above two auxiliary losses, the MSD loss for the student is defined as follows:

$\mathcal{L}_{\text{MSD}} = \sum_{x_i \in X} w_i\, \mathcal{L}_{\text{KD}}(x_i) + \sum_{x_i \in X_v} w_i^v\, \mathcal{L}_{\text{imageKD}}(x_i) + \sum_{x_i \in X_t} w_i^t\, \mathcal{L}_{\text{textKD}}(x_i)$,   (6.4)

where we omit the scaling factor $\tau^2 \frac{1}{|X|}$ for brevity. $w_i, w_i^t, w_i^v \in [0, 1]$ control the balance between the three distillation losses. These weights determine the importance of each modality, and we hypothesize that the choice of weighting approach affects the student's performance. We introduce four weighting schemes for the distillation losses and discuss each of them in §6.4.

6.3.2 Experimental Setup
Through our empirical analysis, we aim to answer the following questions:
• Q1. How salient is each modality for predictions?
• Q2. Can the saliency explanations aid students?
• Q3. Can we learn a sample weighting strategy to better aid students?
• Q4. Is the student trained with the weighting strategies consistent with the teacher?
• Q5. Can MSD be applied to other distillation methods?
We first define saliency scores for modalities to investigate how salient each modality is for predictions (Q1).
Then, we analyze the influence of different weighting schemes for $w_i, w_i^t, w_i^v \in [0, 1]$ in MSD on downstream task performance (Q2 and Q3). For Q4, we examine the student model's sensitivity to changes in modalities. Lastly, we try to understand the effect of MSD in various distillation approaches (Q5). To this end, we use Conventional KD [45] as the base distillation approach for MSD. In addition, we include several distillation baselines, including Conventional KD [45], FitNet [106], RKD [90], and SP [126], for comparison. Other distillation approaches are also applicable to MSD, and we discuss the results using other KD approaches in our experiments.

To perform the analysis, we adopt VisualBERT [63], a pre-trained multimodal model, as the teacher model and TinyBERT [49] as the student model. VisualBERT consists of 12 layers and a hidden size of 768, and has 109 million parameters, while TinyBERT consists of 4 layers and a hidden size of 312, and has 14.5 million parameters. We use region features from images for both the teacher and the student and fine-tune the student on each dataset. For training the weight learner, we use the datasets' validation sets as meta data. We find the best hyperparameters on the validation set.

6.3.3 Datasets and Evaluation Metrics

To answer the questions, we select four multimodal datasets: Hateful-Memes [54], MM-IMDB [7], Visual Entailment (SNLI-VE) [137, 144], and VQA2 [37].

The Hateful-Memes dataset consists of 10K multimodal memes. The task is a binary classification problem: detecting hate speech in multimodal memes. We use Accuracy (ACC) and AUC as evaluation metrics for Hateful-Memes.

The MM-IMDB (Multimodal IMDB) dataset consists of 26K movie plot outlines and movie posters. The task involves assigning genres to each movie from a list of 23 genres. This is a multi-label prediction problem, i.e., one movie can have multiple genres, and we use Macro F1 and Micro F1 as evaluation metrics following [7].

The goal of Visual Entailment is to predict whether a given image semantically entails an input sentence. Classification accuracy over three classes ("Entailment", "Neutral", and "Contradiction") is used to measure model performance. We use accuracy as the evaluation metric following [137].

The task of VQA2 is to correctly answer a question given an image. VQA2 is built on COCO [70] and is split into train (83k images and 444k questions), validation (41k images and 214k questions), and test (81k images and 448k questions) sets. Following the experimental protocol in BUTD [5], we treat it as a classification problem and train models to predict the 3,129 most frequent answers. We test models on test-dev of the VQA2 dataset.

6.4 Modality Weighting Methods

For the analysis, we introduce three categories of weighting schemes for MSD, presented in order of complexity: a) population-based (§6.4.1) and b) saliency-based (§6.4.2) weighting approaches for the losses, and c) a weight-learning approach (§6.4.3) to find the optimal weights.

6.4.1 Population-based Weighting

Population-based weighting assigns weights depending only on the modality; we give constant weights $(w_i, w_i^v, w_i^t)$ to each loss term in equation (6.4). This weighting approach assumes the weights are determined by the modality type alone. The best weights (coefficients) for each loss term are obtained by grid search on the validation set.
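To make the objective concrete, below is a minimal PyTorch-style sketch of the conventional distillation term (Eq. 6.2) and the MSD loss (Eq. 6.4) with population-based constant weights. The `teacher`/`student` call signature, the convention of masking out a modality by passing `None`, and the default weight values are illustrative assumptions rather than the exact implementation used in our experiments; in practice the constant weights are chosen by grid search on the validation set, and the full training loss adds the hard-label cross-entropy term of Eq. (6.1).

```python
import torch
import torch.nn.functional as F

def kd_term(teacher_logits, student_logits, tau=1.0):
    """Temperature-scaled distillation term of Eq. (6.2) for one batch:
    tau^2 * KL(teacher soft targets || student predictions)."""
    t = F.softmax(teacher_logits / tau, dim=-1)           # soft targets t(x; tau)
    log_s = F.log_softmax(student_logits / tau, dim=-1)   # log s(x; tau)
    # 'batchmean' averages over the batch, matching the 1/|X| factor
    return tau ** 2 * F.kl_div(log_s, t, reduction="batchmean")

def msd_loss(teacher, student, text, image, tau=1.0, w=1.0, w_v=0.5, w_t=0.5):
    """Modality-specific distillation of Eq. (6.4) with constant (population-based)
    weights. Assumes teacher/student accept keyword inputs and that a modality is
    dropped by passing None for it."""
    with torch.no_grad():                                  # the teacher is frozen
        t_multi = teacher(text=text, image=image)
        t_image = teacher(text=None, image=image)          # image-only input
        t_text = teacher(text=text, image=None)            # text-only input
    s_multi = student(text=text, image=image)
    s_image = student(text=None, image=image)
    s_text = student(text=text, image=None)
    return (w * kd_term(t_multi, s_multi, tau)
            + w_v * kd_term(t_image, s_image, tau)
            + w_t * kd_term(t_text, s_text, tau))
```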
However, population-based weighting is limited because it does not assign finer-grained weights to each data instance; each data instance might have different optimal weights for the loss terms. This is what we pursue next with saliency-based weighting.

6.4.2 Saliency-based Weighting

While we observe prediction gaps between the teacher and the student (Fig. 6.1) on each modality, it is unclear which modality leads to such gaps and how salient each modality is for predictions. Saliency-based weighting gives a different weight to each loss term for each data sample, based on the saliency of each modality for that sample. The assumption is that each data point has different optimal weights for knowledge distillation. By assigning instance-level weights, we expect the student to better mimic the teacher's modality-specific behavior. As it is not possible to tune sample weights as separate hyperparameters, we instead propose to use simple, intuitive fixed weighting functions, described as follows. The natural next step would be to learn this weighting function alongside the rest of the model, i.e., weight learning, which we discuss further in §6.4.3.

To better understand how these modalities affect the predictions, we first define saliency scores for modalities per sample. Similar to Li, Monroe, and Jurafsky [59], we erase one of the modalities and measure the saliency score by computing the difference between the two resulting probabilities. Although saliency scores can be defined on all inputs, we limit our analysis to explanations of the different modalities in this work.

Quantifying Saliency of Modality. Given a teacher model $t$ and a multimodal dataset, we define a saliency score as follows:

$S(m) = \delta\big(t(x),\, t(x_{-m})\big)$,   (6.5)

where $m$ denotes a modality, $x_{-m}$ denotes an input after masking out the corresponding modality, and δ is a function that measures the difference between $t(x)$ and $t(x_{-m})$. We use the teacher's outputs to compute saliency scores. We introduce two saliency-based weighting approaches with different δ functions.

KL divergence-based weighting. In this weighting approach, δ is defined as the Kullback–Leibler (KL) divergence, which measures the distance between two probability distributions. Thus, δ measures the distance between predictions with both modalities and predictions after erasing one modality. The weights for the loss terms are defined as $w_i^v = g(S_{i,t})$ and $w_i^t = g(S_{i,v})$, where $g = \tanh(\cdot)$ keeps the weights in the range [0, 1]. In this strategy, we assign $w_i = 1$ to the loss term for the multimodal input. Note that in this strategy we do not explicitly use the true labels to decide the distillation weights; we use the teacher's predictions instead.

Loss-based weighting. Another form of saliency-based weighting weights the terms depending on how different the loss of predictions with one modality is from the loss of predictions with both modalities. We explicitly use the true labels to measure the loss, i.e., the cross-entropy loss. If the loss of predictions with one modality is similar to that with both modalities, then we consider the modality salient for predictions. Thus, the weights are defined as

$w_i : w_i^v : w_i^t = 1 : \dfrac{h(t(x_i))}{h(t(x_i^v))} : \dfrac{h(t(x_i))}{h(t(x_i^t))}$,   (6.6)

where $h(x) = -\sum_{j=1}^{c} y_{i,j} \log x_j$ and $y_{i,j} \in \{0, 1\}$ is the correct target for the $j$-th class of the $i$-th example. In this case, the weight $w_i$ for the multimodal term also depends on the other two weights.
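As a concrete reference, the following sketch computes per-sample saliency scores with KL divergence as δ (Eq. 6.5) and turns them into the KL divergence-based weights described above. The `teacher(text=..., image=...)` interface and the masking of a modality with `None` are the same illustrative assumptions as in the earlier sketch, and reading $S_{i,t}$ and $S_{i,v}$ as the text and image saliency scores of sample $i$ is our interpretation of the notation.

```python
import torch
import torch.nn.functional as F

def modality_saliency(teacher, text, image, eps=1e-8):
    """Per-sample saliency of the text and image modalities (Eq. 6.5) with
    delta = KL(t(x) || t(x_{-m})), i.e., how much the teacher's prediction
    changes when that modality is masked out."""
    with torch.no_grad():
        p_full = F.softmax(teacher(text=text, image=image), dim=-1)
        p_no_text = F.softmax(teacher(text=None, image=image), dim=-1)   # x_{-text}
        p_no_image = F.softmax(teacher(text=text, image=None), dim=-1)   # x_{-image}
    s_text = (p_full * (torch.log(p_full + eps) - torch.log(p_no_text + eps))).sum(-1)
    s_image = (p_full * (torch.log(p_full + eps) - torch.log(p_no_image + eps))).sum(-1)
    return s_text, s_image   # S_{i,t}, S_{i,v} for every sample i in the batch

def kl_divergence_weights(s_text, s_image):
    """KL divergence-based weighting: w_i = 1, w_i^v = tanh(S_{i,t}), w_i^t = tanh(S_{i,v})."""
    w_multi = torch.ones_like(s_text)
    w_image = torch.tanh(s_text)
    w_text = torch.tanh(s_image)
    return w_multi, w_image, w_text
```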
For loss-based weighting, in order to choose the actual weights, we add a normalization constraint such that $w_i + w_i^v + w_i^t = 1$. It is worth noting that in this weighting scheme the actual labels are directly used in deciding the weights, unlike the previous one.

6.4.3 Weight Learning

Although the aforementioned weighting schemes are intuitive, there is no reason to believe they are the optimal way of getting value out of modality-specific distillation. Moreover, it is not trivial to obtain optimal weight functions, since they depend on the dataset. Thus, we propose a weight-learning approach to find optimal weight functions. Inspired by [113], we design weight learners to find the optimal coefficients. $(w_i, w_i^v, w_i^t)$ is defined as follows:

$(w_i, w_i^v, w_i^t) = f\big(t(x_i), t(x_i^v), t(x_i^t); \Theta\big) = w(\Theta)$,   (6.7)

where Θ defines the parameters of the weight learner network, a Multi-Layer Perceptron (MLP) with a sigmoid layer, which can approximate a wide range of functions [25]. In general, the function defining the weights can depend on any input from the sample, but here we limit ourselves to the teacher's predictions.

Weight-Learning Objective. We assume that we have a small amount of unbiased meta data $\{x_i^{(meta)}, y_i^{(meta)}\}_{i=1}^{M}$, representing the meta knowledge of the ground-truth sample-label distribution, where $M$ is the number of meta samples and $M \ll N$. In our setup, we use the validation set as the meta-data set. The optimal parameter $\Theta^*$ can be obtained by minimizing the following cross-entropy loss:

$\mathcal{L}_{\mathrm{meta}}\big(w^*(\Theta)\big) = -\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{c} y_{i,j} \log s_j\big(x_i; w^*(\Theta)\big)$,   (6.8)

where $w^*$ is the optimal student parameter, defined as follows:

$w^*(\Theta) = \arg\min_{w} \mathcal{L}_{\mathrm{student}}(w, \Theta)$.   (6.9)

$w^*$ is parameterized by Θ, the weight learner's parameter. The weight learner is optimized to generate instance weights that minimize the average error of the student over the meta-data set, while the student is trained on the training set with the instance weights generated by the weight learner.

Table 6.1: Dataset Statistics.

Stat. \ Dataset   Hateful-Memes   MM-IMDB      SNLI-VE      VQA2
Type              Binary          Multilabel   Multiclass   Multiclass
# Classes         2               23           3            3,129
# Examples        10,000          25,959       565,286      1,105,904
# Training        8,500           15,552       529,527      443,757
# Validation      500             2,608        17,858       214,354
# Test            1,000           7,799        17,901       447,793

Weight-Learning Algorithm. Finding the optimal $\Theta^*$ and $w^*$ requires two nested loops; one gradient update of the weight learner requires a student trained on the training set. Thus, we adopt an online strategy following [113], which updates the weight learner with only one gradient update of the student. Algorithm 1 illustrates the learning process.

Algorithm 1: Weight-Learning Algorithm
Input: training data $D$, meta-data set $\hat{D}$, batch sizes $n, m$, learning rates α, β, max iterations $T$.
1: for $t \leftarrow 0$ to $T - 1$ do
2:   $\{x, y\} \leftarrow \mathrm{SampleMiniBatch}(D, n)$
3:   $\{x^{(meta)}, y^{(meta)}\} \leftarrow \mathrm{SampleMiniBatch}(\hat{D}, m)$
4:   $\hat{w}^{(t)}(\Theta^{(t)}) \leftarrow w^{(t)} - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_w \mathcal{L}_{\mathrm{student}}(w^{(t)}, \Theta^{(t)})$
5:   $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \beta \frac{1}{m} \sum_{i=1}^{m} \nabla_\Theta \mathcal{L}_{\mathrm{meta}}(\hat{w}^{(t)}(\Theta^{(t)}))$
6:   $w^{(t+1)} \leftarrow w^{(t)} - \alpha \frac{1}{n} \sum_{i=1}^{n} \nabla_w \mathcal{L}_{\mathrm{student}}(w^{(t)}, \Theta^{(t+1)})$
7: return network parameters $w^{(T)}, \Theta^{(T)}$

First, we sample mini-batches from the training and meta-data sets, respectively (lines 2 and 3). Then, we update the student's parameters along the descent direction of the student's loss on the mini-batch of training data (line 4). Note that the student's parameters are parameterized by the weight learner's parameters.
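The weight learner f(·; Θ) itself can be a small network. A minimal sketch is given below; the hidden size and the choice of concatenating the teacher's three prediction vectors as input features are assumptions for illustration, and training Θ still requires the alternating updates of Algorithm 1 (including differentiating through the student's one-step update), which this sketch does not implement.

```python
import torch
import torch.nn as nn

class WeightLearner(nn.Module):
    """MLP with a sigmoid output that maps the teacher's multimodal, image-only,
    and text-only predictions for a sample to (w_i, w_i^v, w_i^t), cf. Eq. (6.7)."""
    def __init__(self, num_classes, hidden_dim=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * num_classes, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),
            nn.Sigmoid(),                      # keep each weight in [0, 1]
        )

    def forward(self, t_multi, t_image, t_text):
        feats = torch.cat([t_multi, t_image, t_text], dim=-1)
        w = self.mlp(feats)                    # shape: (batch, 3)
        return w[:, 0], w[:, 1], w[:, 2]       # w_i, w_i^v, w_i^t
```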
With the updated parameters, the weight learner can be updated by moving the current parameter $\Theta^{(t)}$ along the objective gradient of equation (6.8) on a mini-batch of meta data (line 5). After updating the weight learner, the student's parameters can be updated on a mini-batch of training data (line 6).

6.5 Empirical Analysis

In this section, we revisit and discuss the questions raised in §6.3.2.

6.5.1 Hyperparameters

The teacher model is VisualBERT [63], and the student model is TinyBERT [49]. We used the MMF library (https://mmf.sh) and its pretrained checkpoints for VisualBERT, and a pretrained TinyBERT checkpoint (https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT). VisualBERT consists of 12 layers and a hidden size of 768, and has 109 million parameters, while TinyBERT consists of 4 layers and a hidden size of 312, and has 14.5 million parameters. For all experiments, we performed a grid search to find the best hyperparameters. We adopt the AdamW optimizer to train the networks. We use a linear learning rate schedule that drops to 0 at the end of training, with warmup over 10% of the maximum iterations.

Hateful-Memes. We performed a grid search over learning rates (1e-5, 3e-5, 5e-5, 1e-4), temperatures (1, 2, 4, 8), batch sizes (10, 20, 30, 40, 50, 60), and the weight learner's learning rates (1e-1, 1e-2, 1e-3, 1e-4). We set the maximum number of iterations to 5000. The balance parameter λ between cross entropy and distillation is chosen from (0.2, 0.4, 0.5, 0.6, 0.8).

MM-IMDB. For the MM-IMDB experiments, we follow a grid-search procedure similar to Hateful-Memes. The batch size is 20, the temperature is 1, and the weight learner's learning rate is 1e-4. We set the maximum number of iterations to 10000. The balance parameter λ is set to 0.5.

SNLI-VE. For Visual Entailment (SNLI-VE), the batch size is 64, the temperature is 4, the student model's learning rate is 1e-4, and the weight learner's learning rate is 1e-4. We set the maximum number of iterations to 60000. The balance parameter λ is set to 0.6.

VQA2. For VQA2, the batch size is 120, the temperature is 1, the student model's learning rate is 1e-4, and the weight learner's learning rate is 1e-4. We set the maximum number of iterations to 60000. The balance parameter λ is set to 0.8.

Figure 6.2 (saliency score vs. sample index for the image and text modalities; (a) Hateful-Memes, (b) MM-IMDB): Saliency scores in the Hateful-Memes and MM-IMDB test sets. Saliency scores of the text modality are mostly higher than those of the image modality in MM-IMDB, while Hateful-Memes does not show such a global pattern.

Table 6.2: Main Results. Mean results (±std) over five repetitions are reported. MSD outperforms all the KD approaches. Here, we use MSD on top of conventional KD [45]. Our weight-learning approach for the weights shows the best performance.
Method | Hateful-Memes (ACC / AUC) | MM-IMDB (Macro F1 / Micro F1) | SNLI-VE (ACC) | VQA2 test-dev (ACC)
Teacher | 65.28 / 71.82 | 59.92 / 66.53 | 77.57 | 70.91
Small model | 60.83 (±0.20) / 65.54 (±0.25) | 38.78 (±4.03) / 58.10 (±1.23) | 72.30 (±0.35) | 64.20 (±0.56)
Conventional KD [45] | 60.84 (±1.50) / 66.53 (±0.27) | 41.76 (±4.72) / 58.96 (±1.62) | 72.61 (±0.55) | 64.70 (±0.85)
FitNet [106] | 62.00 (±0.26) / 67.13 (±0.51) | 46.21 (±2.12) / 60.46 (±0.30) | 73.06 (±0.50) | 68.08 (±1.24)
RKD [90] | 61.43 (±0.40) / 67.03 (±0.21) | 51.16 (±1.64) / 62.52 (±0.70) | 73.09 (±0.53) | 64.22 (±0.57)
SP [126] | 61.70 (±1.10) / 66.11 (±0.45) | 49.07 (±0.82) / 61.41 (±0.34) | 73.00 (±0.98) | 64.15 (±0.81)
MSD (Population) | 62.15 (±1.71) / 67.56 (±1.21) | 51.85 (±0.34) / 62.13 (±0.19) | 73.64 (±0.54) | 64.86 (±0.63)
MSD (Saliency, KL div) | 62.78 (±1.00) / 67.94 (±0.52) | 49.20 (±1.27) / 61.84 (±0.49) | 73.34 (±0.48) | 64.93 (±0.48)
MSD (Saliency, Loss) | 63.27 (±0.45) / 67.72 (±0.82) | 51.02 (±0.70) / 62.05 (±0.45) | 73.52 (±0.54) | 64.89 (±0.58)
MSD (Weight learning) | 63.86 (±1.28) / 68.30 (±0.62) | 53.12 (±0.08) / 63.00 (±0.09) | 73.58 (±0.23) | 64.35 (±1.56)

6.5.2 Analysis

Q1. How salient is each modality for predictions? To answer this question, we visualize saliency scores for the Hateful-Memes, MM-IMDB, and SNLI-VE datasets in Figs. 6.2 and 6.3, using KL divergence as δ in Eq. (6.5). We observe that the MM-IMDB dataset shows higher saliency scores for the text modality than for the image modality, which implies that the text generally carries the important information. On the other hand, the Hateful-Memes dataset does not show such a global pattern, but one can observe some correlations for individual instances. In Fig. 6.3, we notice that saliency scores are correlated with labels in SNLI-VE. For the "Entailment" label, scores for the text modality are relatively lower, while they are higher for the "Contradiction" label, which implies that the text modality is vital for the teacher model to predict the "Contradiction" label.

Figure 6.3 (saliency score vs. sample index for the image and text modalities; (a) Entailment, (b) Neutral, (c) Contradiction): Saliency scores in the SNLI-VE dev set. We observe that saliency scores for the text modality are correlated with labels. For the "Entailment" label, scores for the text modality are relatively lower, while they are higher for the "Contradiction" label.

Q2. Can the saliency scores aid students? Table 6.2 shows our main results on the Hateful-Memes, MM-IMDB, SNLI-VE, and VQA2 datasets. The small model refers to a student model trained without knowledge distillation from the teacher. As shown, existing KD approaches improve the student model on all datasets. However, the MSD approaches improve the small model substantially. Among them, saliency-based weighting outperforms population-based weighting on the Hateful-Memes dataset. We note that population-based weighting shows good improvement, which means weighting based on the modality alone is still very effective on multimodal datasets. Also, population-based weighting outperforms saliency-based weighting on the MM-IMDB dataset, suggesting that all samples are likely to have the same preference for, or dependency on, each modality of the dataset. We will discuss the results on weight learning in Q3. Interestingly, FitNet shows the best performance on VQA2. Note that MSD here is based on Conventional KD.
We will discuss the results of MSD based on other KD approaches in Q5.

Q3. Can we learn a sample weighting strategy to better aid students? We observe that, among the weighting strategies, MSD with weight learning shows the best performance on Hateful-Memes and MM-IMDB, indicating that it finds better weights for each dataset (Table 6.2). We also find that MSD (Weight learning) shows a density curve similar to the teacher's, as shown in Fig. 6.4, which implies that it effectively mimics the teacher's predictions. However, there is a performance gap between the teacher model and the student model (KD) in predicting true labels given a multimodal sample and each of its individual modalities. For example, given only the image modality as input (the middle plot in Fig. 6.4), there is a considerable difference between the teacher and the small model in predicting benign samples. In addition, we measure the Kullback-Leibler (KL) divergence between the teacher's outputs and the other models' outputs on the MM-IMDB test set in Fig. 6.5. This measures the difference between the teacher's probability distribution and the others'. The MSD (learning) approach shows the smallest KL divergence from the teacher, which means the student trained with MSD outputs a probability distribution close to the teacher's. Notably, MSD (population) shows a smaller KL divergence than MSD (saliency), which validates that one modality is generally dominant in the MM-IMDB dataset.

Figure 6.4 (density of output probability under three input conditions for the Teacher, Small model, KD, and MSD (learning)): Density of model outputs on samples of label 0 (not hateful) on the test set of Hateful-Memes: given multimodal samples as input (Multi), given only the image modality as input (Image), and given only the text modality as input (Text). MSD with the weight-learning approach minimizes the gap between the teacher and the student trained by KD.

Figure 6.5 (bar chart of KL divergence for Multi, Image, and Text inputs, comparing the Small model, KD, MSD (population), MSD (saliency), and MSD (learning)): Kullback-Leibler divergence on the MM-IMDB test set between the teacher's outputs and other models' outputs. This is a measure of how different the teacher's probability distribution is from the other models'. The lower the divergence, the closer a model is to the teacher.

Figure 6.6 (bar chart comparing the Small model, KD, and MSD): Teacher-Student consistency ratio. We investigate the student model's sensitivity to changes in modalities. A higher ratio indicates that its sensitivity is closer to the teacher's.

Q4. Is the student with the weighting strategies consistent with the teacher? To show that our approach helps the student model to be more sensitive to important changes in modalities, we take examples from the Hateful-Memes test set and randomly replace one of the modalities with a modality from another sample. Hateful-Memes is a multimodal dataset, and changing the modalities might or might not change the final label. In this case, we do not have the ground truth, but we use the teacher's predicted label on the newly generated sample as a proxy for the ground truth and count the times that the student/small model is consistent with the teacher on these generated samples.
We define the ratio of such consistent predictions over the total number of generated samples as the "Teacher-Student consistency ratio". Note that none of the models have seen these samples during training. As can be seen from Fig. 6.6, the MSD approach with weight learning has a larger Teacher-Student consistency ratio than the small model with and without KD. This indicates that MSD not only improves accuracy but also improves the sensitivity of the student model so that it better matches the teacher on changes in modalities on unseen data. Please refer to the case study in §6.5.5.

Q5. Can this be applied to other distillation methods? We present improvements of KD approaches with and without MSD in Table 6.3. We choose the population-based weighting approach in this experiment. Here, we use MSD on top of each KD approach. Note that the MSD approach is orthogonal to existing KD approaches. The results show the benefits of the MSD method on top of other approaches; MSD improves diverse KD methods on multimodal datasets. Notably, MSD based on FitNet also improves the accuracy on the VQA2 dataset.

Table 6.3: Improvement of KD approaches with MSD. MSD improves existing KD approaches.

Method | Hateful-Memes (ACC / AUC) | MM-IMDB (Macro F1 / Micro F1) | VQA2 (ACC)
KD [45] | 60.84 / 66.53 | 41.76 / 58.96 | 64.70
+MSD | 62.15 / 67.56 | 51.85 / 62.13 | 64.86
FitNet [106] | 62.00 / 67.13 | 46.21 / 60.46 | 68.08
+MSD | 62.22 / 68.91 | 50.42 / 61.43 | 68.17
RKD [90] | 61.43 / 67.03 | 51.16 / 62.52 | 64.22
+MSD | 62.30 / 66.71 | 52.56 / 63.27 | 64.40
SP [126] | 61.70 / 66.11 | 49.07 / 61.41 | 64.15
+MSD | 62.80 / 67.30 | 53.29 / 63.21 | 64.28

Figure 6.7 (test accuracy vs. training steps for the Teacher, KD, MSD (Population), MSD (Instance), and MSD (Meta learning)): Test accuracy of a student on SNLI-VE during training, comparing knowledge distillation (KD) and modality-specific distillation (MSD) with population-based weighting, instance-wise weighting, and weight learning for the weights.

Figure 6.8 (prediction probability vs. sample index; (a) Hateful-Memes, (b) MM-IMDB): Prediction probabilities of test samples for different modalities. Black points correspond to the predictions of samples with both modalities (original input), red points to the image modality only, and blue points to the text modality only. The samples are ordered by their multimodal output probabilities. There is a strong correlation between multimodal predictions and predictions from the text modality in MM-IMDB, while there is no such global pattern in Hateful-Memes.

6.5.3 Learning Curve

The MSD approaches can also help with training speed, measured by test metrics over training steps. Fig. 6.7 shows the evolution of accuracy on the test set during training on the SNLI-VE dataset. When we train the student with MSD, training progresses faster than with KD. Since the teacher provides two additional probabilities for each modality, the student learns faster and the final performance is better than with KD. We observe a large performance increase early in training with the weight-learning approach, leading to the best accuracy. In this case, weight learning for sample weighting finds the optimal weights for each data instance, so the student quickly learns from the more important modality that is vital for the predictions.
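For reference, the overall optimization behind these curves can be summarized in a short sketch: the student minimizes the hard-label cross-entropy combined, as in Eq. (6.1), with the distillation term replaced by the MSD loss of Eq. (6.4). The sketch below reuses the msd_loss function from the earlier sketch; the data-loader format, the fixed λ, and the optimizer defaults are illustrative assumptions (the actual experiments use the dataset-specific settings of §6.5.1 and, for weight learning, the updates of Algorithm 1).

```python
import torch
import torch.nn.functional as F

def train_student(student, teacher, loader, epochs=1, lam=0.5, tau=1.0, lr=1e-4):
    """Train the student with hard labels plus modality-specific distillation:
    L_student = lam * L_CE + (1 - lam) * L_MSD (Eqs. 6.1 and 6.4)."""
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for _ in range(epochs):
        for text, image, label in loader:              # assumed batch format
            logits = student(text=text, image=image)
            ce = F.cross_entropy(logits, label)        # hard-label term
            distill = msd_loss(teacher, student, text, image, tau)
            loss = lam * ce + (1.0 - lam) * distill
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```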
6.5.4 Observation of Teacher's Predictions

Samples from multimodal datasets carry different information in each modality. Fig. 6.8 shows a teacher model's predictions for samples in the Hateful-Memes and MM-IMDB test sets. For each sample, three probabilities are calculated: 1) predictions for the sample with both of its modalities, 2) predictions with just its text modality, and 3) predictions with just its image modality. As one can see, for MM-IMDB there is a strong correlation between multimodal predictions and predictions from the text modality, indicating that in MM-IMDB text is the dominant modality. On the other hand, for the Hateful-Memes dataset there is no such global pattern, but one can observe some correlations for individual instances. This behavior is actually expected based on the way Hateful-Memes is built to include unimodal confounders [54]. Following these observations, we introduced the four weighting schemes for the distillation losses discussed in §6.4.

Figure 6.9: A multimodal violating sample (left). We further replaced its image modality with a background picture that makes it benign and examined the models on both examples (right).

Model   | Original (Hateful) | Image replaced (Benign)
Teacher | Hateful            | Benign
Small   | Hateful            | Hateful
KD      | Hateful            | Hateful
MSD     | Hateful            | Benign

6.5.5 Case Study

We demonstrate the motivation behind our work through an example. Fig. 6.9 shows an example of a multimodal sample from the Hateful-Memes test set. The sample is violating (hateful) based on both modalities together, and all models correctly predict that. To further probe the models, we replace the background image of the sample with a picture that makes the label benign. On this artificially generated sample, we notice that only the teacher and the MSD model correctly predict benign, while the other two models make wrong predictions (presumably by relying on the text only).

6.6 Related Work

Knowledge Distillation. There have been several studies of transferring knowledge from one model to another [8, 45, 106, 90, 84, 122, 34, 56]. Ba and Caruana [8] improve the accuracy of a shallow neural network by training it to mimic a deep neural network, penalizing the difference of logits between the two networks. Hinton et al. [45] introduced knowledge distillation (KD), which trains a student model with the objective of matching the softmax distribution of a teacher model at the output layer. Park et al. [90] focused on the mutual relations of data examples instead and proposed relational knowledge distillation. Tian et al. [122] proposed to distill from the penultimate layer using a contrastive loss for cross-modal transfer. A few recent papers [34, 56] have shown that distilling a teacher model into a student model of identical architecture, i.e., self-distillation, can improve the student over the teacher.

Learning for Sample Weighting. Recently, several methods were proposed to learn an adaptive weighting scheme from data to make learning more automatic and reliable, including Meta-Weight-Net [113], learning to reweight [103], FWL [27], MentorNet [48], and learning to teach [32, 136, 33]. These approaches were proposed to deal with noisy and corrupted labels and to learn optimal functions from clean datasets. They differ in that they adopt different weight functions, such as a multilayer perceptron [113], a Bayesian function approximator [27], or a bidirectional LSTM [48], and they take different inputs, such as loss values and sample features.
In our case, we adopt these ideas of meta-learning, specifically Meta-Weight-Net, and utilize them in a different context, i.e., multimodal knowledge distillation.

Bias in Multimodal Datasets. Different multimodal datasets were proposed to study whether a model uses a single modality's features and the implications for its generalization properties [1]. Different approaches were proposed to deal with such problems, where the model overfits to a single modality. Wang et al. [133] suggest regularizing the overfitting behavior to different modalities. REPAIR [67] prevents a model from learning dataset biases by re-sampling the training data. Cadene et al. [14] proposed RUBi, which uses a bias-only branch in addition to a base model during training to overcome language priors. In our study, although we do not directly deal with the overfitting phenomenon, we use different weighting schemes to better transfer the modality-specific information from the teacher to the student.

6.7 Conclusion

We studied knowledge distillation on multimodal datasets; we observed that conventional KD may lead to inefficient distillation since a student model does not fully mimic a teacher's modality-specific predictions. To better understand the knowledge from a teacher on multimodal datasets, we introduced saliency scores for a modality and modality-specific distillation; the student mimics the teacher's outputs on each modality based on saliency scores. Furthermore, we investigated weighting approaches, population-based and saliency-based weighting schemes, and a weight-learning approach for weighting the auxiliary losses to take the importance of each modality into consideration. We empirically showed that we can improve the student's performance with modality-specific distillation compared to conventional distillation. More importantly, we observed that choosing the right weighting approach boosted the student's performance. We believe that future work can expand on our methods and search the space of weighting approaches beyond multimodal setups.

Chapter 7

Conclusions

7.1 Summary

In this thesis, we presented approaches to build a reasoner that can do complex reasoning about the physical world and generalize on vision-language tasks. Specifically, in Chapter 2, we presented FewVLM, a few-shot prompt-based learner for vision-language tasks. On diverse datasets, FewVLM outperformed baselines and showed results comparable to PICa, which is 246× larger than our model. We observed that prompts are vital in zero-shot and few-shot tasks and that each pre-training objective helps different few-shot tasks. We also found that models with larger training data are not significantly affected by noisy prompts.

In Chapter 3, we proposed GRILL, a new VL model that can generalize to a variety of VL tasks including grounding tasks. The model learns object grounding and localization by introducing hybrid sequences in pre-training and easily adapts to diverse tasks by using a vision transformer for versatile image processing. To pre-train our model, we introduced our dataset using object-word alignments and pre-trained the model with masked language modeling, prefix language modeling, and the discriminative objective. In our empirical analysis, we observed that our model demonstrated good zero-/few-shot generalization on diverse tasks. We also observed that the discriminative objective and hybrid sequences in pre-training were vital for better zero-/few-shot performance.
In Chapter 4, we introduced WinoViz, a text-only question-answering dataset comprising 5,606 examples that probe language models' reasoning about the visual properties of objects under diverse contexts. Our findings revealed that large language models demonstrate effective performance overall but struggle particularly with the multi-hop version of our dataset. Performance on the multi-hop data improved with chain-of-thought prompting. Vision-language models surpass their language-only counterparts, although image-generation approaches prove ineffective for our specific task.

In Chapter 5, we studied whether intermediate pre-training on visual knowledge can help transfer visual knowledge into LMs. We investigated text knowledge transfer and cross-modal knowledge transfer using images and captions. In our empirical analysis, we observed that intermediate pre-training on captions can help improve performance and that cross-modal knowledge transfer approaches consistently improve performance. When the transfer methods were equipped with additional positive and negative samples, they showed better performance.

In Chapter 6, we studied knowledge distillation on multimodal datasets; we observed that conventional KD may lead to inefficient distillation since a student model does not fully mimic a teacher's modality-specific predictions. To better understand the knowledge from a teacher on multimodal datasets, we introduced saliency scores for a modality and modality-specific distillation; the student mimics the teacher's outputs on each modality based on saliency scores. Furthermore, we investigated weighting approaches, population-based and saliency-based weighting schemes, and a weight-learning approach for weighting the auxiliary losses to take the importance of each modality into consideration. We empirically showed that we can improve the student's performance with modality-specific distillation compared to conventional distillation. More importantly, we observed that choosing the right weighting approach boosted the student's performance.

7.2 Future Directions

This dissertation thoroughly addressed several challenges of reasoning and generalization in models. While we have made much progress, realizing this goal requires extending existing methods and addressing several limitations, which we highlight here.

Complex Reasoning with Multimodal Data. Language models have demonstrated their reasoning abilities over the real world. However, while they generally perform well, they encounter challenges with intricate reasoning tasks, as evidenced by our WinoViz dataset. We investigated the potential of pre-training on visual knowledge to enhance the transfer of such knowledge into language models. Given that humans acquire an understanding of the world through interactions with multimodal inputs like vision and speech, large models should improve their reasoning ability through multimodal data. Consequently, future endeavors will focus on enhancing language models' reasoning ability by integrating multimodal data sources such as speech and visual information, including but not limited to images, videos, and graphs.

Universal Multimodal Generation. In addition to improving language models' reasoning ability, it is necessary to build unified general-purpose models.
The advantages of constructing such unified models include: a) obviating the need for designing task-specific models, b) leveraging diverse and extensive data corpora for training, c) facilitating knowledge transfer across various tasks and modalities, and d) enabling the performance of unknown and unseen tasks during inference. Recent studies [77, 76] presented unified models that have achieved state-of-the-art performance on the GRIT benchmark. However, their models are much less reliable for certain modalities and tasks, and they still necessitate task-specific and modality-specific input and output layers, thereby requiring a larger number of parameters. Additionally, it remains unclear how a unified framework can effectively transfer knowledge across tasks and modalities. Hence, future research should conduct a thorough investigation into knowledge transfer mechanisms across tasks and modalities, as well as refine unified input and output layers.

Unification of Embedding Layers. One intriguing area involves the integration of image and text embedding layers. Image embeddings are typically obtained through CNNs [58] or Vision Transformers (ViT) [31], while text embeddings involve tokenization and the use of an embedding look-up table for tokens. However, tokenization poses challenges such as encountering tokens outside the predefined vocabulary and being susceptible to noise, where even minor text variations can result in different token sequences. Recent research efforts [107, 124] have proposed visual text representations that convert text into images and encode them using a vision encoder. These visual text representations offer notable advantages: they reduce design choices, such as tokenization methods, simplify data processing pipelines, and enhance cross-lingual generality. Nonetheless, achieving robust visual text representations may necessitate extensive training efforts.

Bibliography

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. “Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering”. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018, pp. 4971–4980. doi: 10.1109/CVPR.2018.00522. [2] Harsh Agrawal, Peter Anderson, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, and Stefan Lee. “nocaps: novel object captioning at scale”. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 2019, pp. 8947–8956. doi: 10.1109/ICCV.2019.00904. [3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. “Flamingo: a visual language model for few-shot learning”. In: ArXiv preprint abs/2204.14198 (2022). url: https://arxiv.org/abs/2204.14198. [4] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. “Spice: Semantic propositional image caption evaluation”. In: European conference on computer vision. Springer. 2016, pp. 382–398. [5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering”. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018, pp. 6077–6086.
doi: 10.1109/CVPR.2018.00636. [6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. “VQA: Visual Question Answering”. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 2015, pp. 2425–2433. doi: 10.1109/ICCV.2015.279. [7] John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, and Fabio A González. “Gated multimodal units for information fusion”. In: ArXiv preprint abs/1702.01992 (2017). url: https://arxiv.org/abs/1702.01992. 104 [8] Jimmy Ba and Rich Caruana. “Do Deep Nets Really Need to be Deep?” In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. Ed. by Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger. 2014, pp. 2654–2662. url: https: //proceedings.neurips.cc/paper/2014/hash/ea8fcd92d59581717e06eb187f10666d-Abstract.html. [9] John Bell. “Pragmatic reasoning: A model-based theory”. In: Applied Logic: How, What and Why: Logical Approaches to Natural Language (1995), pp. 1–27. [10] Sumithra Bhakthavatsalam, Chloe Anastasiades, and Peter E. Clark. “GenericsKB: A Knowledge Base of Generic Statements”. In: ArXiv preprint abs/2005.00660 (2020). url: https://arxiv.org/abs/2005.00660. [11] Yonatan Bisk, Rowan Zellers, Ronan LeBras, Jianfeng Gao, and Yejin Choi. “PIQA: Reasoning about Physical Commonsense in Natural Language”. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 7432–7439. url: https://aaai.org/ojs/index.php/AAAI/article/view/6239. [12] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. “A large annotated corpus for learning natural language inference”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015, pp. 632–642. doi: 10.18653/v1/D15-1075. [13] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. “Language Models are Few-Shot Learners”. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. 2020. url: https: //proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. [14] Rémi Cadène, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, and Devi Parikh. “RUBi: Reducing Unimodal Biases for Visual Question Answering”. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. Ed. by Hanna M. 
Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett. 2019, pp. 839–850. url: https://proceedings.neurips.cc/paper/2019/hash/51d92be1c60d1db1d2e5e7a07da55b26- Abstract.html. 105 [15] Iacer Calixto, Alessandro Raganato, and Tommaso Pasini. “Wikipedia Entities as Rendezvous across Languages: Grounding Multilingual Language Models by Predicting Wikipedia Hyperlinks”. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 3651–3661. doi: 10.18653/v1/2021.naacl-main.286. [16] Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. “SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation”. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 1–14. doi: 10.18653/v1/S17-2001. [17] Khyathi Raghavi Chandu, Yonatan Bisk, and Alan W Black. “Grounding ‘Grounding’ in NLP”. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, 2021, pp. 4283–4305. doi: 10.18653/v1/2021.findings-acl.375. [18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. “A Simple Framework for Contrastive Learning of Visual Representations”. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. Vol. 119. Proceedings of Machine Learning Research. PMLR, 2020, pp. 1597–1607. url: http://proceedings.mlr.press/v119/chen20j.html. [19] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. “Microsoft coco captions: Data collection and evaluation server”. In: ArXiv preprint abs/1504.00325 (2015). url: https://arxiv.org/abs/1504.00325. [20] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. “Uniter: Learning universal image-text representations”. In: (2019). [21] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. “Uniter: Universal image-text representation learning”. In: European conference on computer vision. Springer. 2020, pp. 104–120. [22] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. “Unifying Vision-and-Language Tasks via Text Generation”. In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. Ed. by Marina Meila and Tong Zhang. Vol. 139. Proceedings of Machine Learning Research. PMLR, 2021, pp. 1931–1942. url: http://proceedings.mlr.press/v139/cho21a.html. [23] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. “Scaling instruction-finetuned language models”. In: ArXiv preprint abs/2210.11416 (2022). url: https://arxiv.org/abs/2210.11416. [24] Kevin Crowston. “Amazon mechanical turk: A research tool for organizations and information systems scholars”. In: Shaping the Future of ICT Research. Methods and Approaches: IFIP WG 8.2, Working Conference, Tampa, FL, USA, December 13-14, 2012. Proceedings. Springer. 2012, pp. 210–221. 106 [25] Balázs Csanád Csáji et al. “Approximation with artificial neural networks”. In: Faculty of Sciences, Etvs Lornd University, Hungary 24.48 (2001), p. 7. 
[26] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning”. In: ArXiv preprint abs/2305.06500 (2023). url: https://arxiv.org/abs/2305.06500. [27] Mostafa Dehghani, Arash Mehrjou, Stephan Gouws, Jaap Kamps, and Bernhard Schölkopf. “Fidelity-Weighted Learning”. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. url: https://openreview.net/forum?id=B1X0mzZCW. [28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. doi: 10.18653/v1/N19-1423. [31] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. url: https://openreview.net/forum?id=YicbFdNTTy. [32] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. “Learning to Teach”. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. url: https://openreview.net/forum?id=HJewuJWCZ. [33] Yang Fan, Yingce Xia, Lijun Wu, Shufang Xie, Weiqing Liu, Jiang Bian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. “Learning to Teach with Deep Interactions”. In: ArXiv preprint abs/2007.04649 (2020). url: https://arxiv.org/abs/2007.04649. 107 [34] Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. “Born-Again Neural Networks”. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Ed. by Jennifer G. Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 1602–1611. url: http://proceedings.mlr.press/v80/furlanello18a.html. [35] Tianyu Gao, Adam Fisch, and Danqi Chen. 
“Making Pre-trained Language Models Better Few-shot Learners”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 3816–3830. doi: 10.18653/v1/2021.acl-long.295. [36] Tianyu Gao, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 6894–6910. doi: 10.18653/v1/2021.emnlp-main.552. [37] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. “Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering”. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017, pp. 6325–6334. doi: 10.1109/CVPR.2017.670. [38] Yuling Gu, Bhavana Dalvi Mishra, and Peter Clark. “Do language models have coherent mental models of everyday things?” In: ArXiv preprint abs/2212.10029 (2022). url: https://arxiv.org/abs/2212.10029. [39] Saurabh Gupta, Judy Hoffman, and Jitendra Malik. “Cross Modal Distillation for Supervision Transfer”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 2016, pp. 2827–2836. doi: 10.1109/CVPR.2016.309. [40] Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. “VizWiz Grand Challenge: Answering Visual Questions From Blind People”. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018, pp. 3608–3617. doi: 10.1109/CVPR.2018.00380. [41] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 8342–8360. doi: 10.18653/v1/2020.acl-main.740. [42] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 8342–8360. doi: 10.18653/v1/2020.acl-main.740. 108 [43] Lovisa Hagström and Richard Johansson. “What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge”. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 252–261. doi: 10.18653/v1/2022.acl-srw.19. [44] Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, and Tong Xu. “BERT-MK: Integrating Graph Contextualized Knowledge into Pre-trained Language Models”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020, pp. 2281–2290. doi: 10.18653/v1/2020.findings-emnlp.207. [45] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 
Asset Metadata
Creator: Jin, Woojeong (author)
Core Title: Bridging the visual reasoning gaps in multi-modal models
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-05
Publication Date: 05/17/2024
Defense Date: 05/01/2024
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag: generalization, language models, multi-modal models, OAI-PMH Harvest, reasoning
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ren, Xiang (committee chair), Liu, Yan (committee member), Mintz, Toby (committee member), Nevatia, Ramakant (committee member)
Creator Email: woojeong.jin@gmail.com, woojeong.jin@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113939700
Unique identifier: UC113939700
Identifier: etd-JinWoojeon-12917.pdf (filename)
Legacy Identifier: etd-JinWoojeon-12917
Document Type: Dissertation
Rights: Jin, Woojeong
Internet Media Type: application/pdf
Type: texts
Source: 20240517-usctheses-batch-1151 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu