Grounding Language in Images and Videos

by Arka Sadhu

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2024

Copyright 2024 Arka Sadhu

Dedication

To my advisor Prof. Ram Nevatia and my parents Mausumi Mukhopadhyay and Amitabha Sadhu.

Acknowledgements

I am very grateful for the support I have received throughout the course of my PhD journey. First, I express my sincerest gratitude to my advisor Prof. Ram Nevatia for his constant support and feedback, routinely deep and insightful discussions, and invaluable guidance. His expertise and continuous probing into understanding problems at a granular level have been inspiring and a constant source of motivation. I also thank my advisory committee: Prof. Xiang Ren for the opportunity to collaborate on multiple papers, and Prof. Toby Mintz for valuable feedback on our project. I also thank Prof. Yan Liu and Prof. Stefanos Nikolaidis for their valuable suggestions, which have refined my research.

I am extremely indebted to all my collaborators and mentors who have guided me through this journey: Kan Chen for helping me formulate my research ideas and for our weekly conversations; Aniruddh Kembhavi, Tanmay Gupta, and Mark Yatskar for their support during my internship at AI2; and Animesh Sinha, Licheng Yu, Yu Chen, and Ning Zhang for their extensive collaboration during my research internship at Meta. I am also thankful to my colleagues and peers Xuefeng Hu, Zhaoheng Zheng, and Haidong Zhu for sharing valuable insights and for countless interesting discussions.

I am indebted to my parents Mausumi Mukhopadhyay and Amitabha Sadhu for their love and encouragement. I am also grateful to my friends and colleagues for their constant support: Parth Kothari, Sudeep Salgia, Sattwik Deb Mishra, Siddhant Garg, Kalpesh Krishna, Karan Chaddha, Abhin Shah, Meet Shah, Varun Bhatt, Sourav Kundu, Swarnabha Chattaraj, Rishabh Agarwal.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Thesis Organization
  1.2 Chapter Organization
Chapter 2: Zero-Shot Grounding
  2.1 Introduction
  2.2 Related Work
  2.3 Design Considerations for ZSG
    2.3.1 ZSG Cases
    2.3.2 Limitations in Phrase Grounding Systems
    2.3.3 Model Design
  2.4 Experiments
    2.4.1 Dataset Construction
    2.4.2 Detailed Dataset Construction
      2.4.2.1 Case 0: Q ∉ W
      2.4.2.2 Case 1: A ∉ C
      2.4.2.3 Case 2, 3: ∃B objects semantically close to A
      2.4.2.4 Choice of Datasets
    2.4.3 Dataset Distributions
    2.4.4 Datasets Used
    2.4.5 Experimental Setup
    2.4.6 Results on Existing Grounding datasets
    2.4.7 Results on ZSG Datasets
    2.4.8 Ablation Study
    2.4.9 Proposals from Pre-Trained Detector(s)
    2.4.10 Image Blind and Language Blind Ablations
    2.4.11 Visualization
  2.5 Conclusion
Chapter 3: Video Object Grounding
  3.1 Introduction
  3.2 Related Work
  3.3 Method
    3.3.1 Contrastive Sampling
    3.3.2 Framework
    3.3.3 Relative Position Encoding
    3.3.4 Extended Relative Position Encoding Discussion
  3.4 Experiments
    3.4.1 Constructing ActivityNet-SRL
    3.4.2 Dynamic Contrastive Sampling
    3.4.3 Extended Dataset Construction Details
      3.4.3.1 Constructing ASRL
      3.4.3.2 Training, Validation, Test Splits
      3.4.3.3 Dataset Distribution
      3.4.3.4 Dataset Choice
    3.4.4 Experimental setup
    3.4.5 Extended Evaluation Discussion
      3.4.5.1 Concatenation Strategies with Examples
    3.4.6 Extended Implementation Details
    3.4.7 Results and Discussions
    3.4.8 Additional Experiments
    3.4.9 Visualizations
    3.4.10 Additional Visualization
  3.5 Conclusion
Chapter 4: Video Question Answering with Phrases
  4.1 Introduction
  4.2 Related Works
  4.3 Design Considerations for VidQAP
    4.3.1 Using SRLs to Generate Queries for VidQAP
    4.3.2 Evaluating Answer Phrases
    4.3.3 Model Framework
  4.4 Experiments
    4.4.1 Dataset Creation
    4.4.2 Extended Dataset Construction Details
      4.4.2.1 Semantic Role Labeling
      4.4.2.2 Construction Process
      4.4.2.3 Dataset Statistics
    4.4.3 Experimental Setup
  4.5 Implementation Details
    4.5.1 Metric Implementation
    4.5.2 Model Implementation
    4.5.3 Results and Discussions
    4.5.4 Visualization
  4.6 Conclusion
Chapter 5: VidSitu: Situation Recognition in Videos
  5.1 Introduction
  5.2 Related Work
  5.3 Semantic Roles: A Brief Summary
  5.4 VidSRL: The Task
  5.5 VidSitu Dataset
    5.5.1 Dataset Curation
    5.5.2 Extended Dataset Collection Details
      5.5.2.1 Dataset Curation
      5.5.2.2 Annotation pipeline
    5.5.3 Dataset Analysis and Statistics
    5.5.4 Additional Dataset Statistics
  5.6 Baselines
    5.6.1 Additional Baseline and Implementation Details
      5.6.1.1 Verb Prediction Models
      5.6.1.2 Argument Prediction Models
  5.7 Experiments
    5.7.1 Evaluation Metrics
    5.7.2 Extended Evaluation Metrics Details
      5.7.2.1 Co-Reference Metrics
      5.7.2.2 Evaluation of Arguments
    5.7.3 Results
      5.7.3.1 All Metrics
  5.8 Conclusion
Chapter 6: Reasoning Across Images and Videos
  6.1 Introduction
  6.2 Related Works
  6.3 Method
    6.3.1 Reasoning Across Images and Video (RAIV) Tasks
    6.3.2 Framework
  6.4 Experiments
    6.4.1 Datasets
    6.4.2 Additional Dataset Details
      6.4.2.1 Creating Datasets for RAIV
      6.4.2.2 Creating Natural Language Queries
    6.4.3 Baseline and Implementation Details
    6.4.4 Additional Implementation Details
    6.4.5 Results
    6.4.6 Ablative Study
    6.4.7 Visualization
  6.5 Conclusion
Chapter 7: Detect Anything: An Open-Vocabulary Object Detection Model with Weak Supervision
  7.1 Introduction
  7.2 Related Works
  7.3 Method
    7.3.1 Task: Weakly Supervised Open-Vocabulary Detection (WS-OVD)
    7.3.2 Method Framework
    7.3.3 Segment-Text Matching
  7.4 Experiments
    7.4.1 Datasets
    7.4.2 Additional Dataset Details
    7.4.3 Baselines and Implementation Details
    7.4.4 Additional Baseline Details
    7.4.5 Results on Open-Vocabulary Object Detection
  7.5 Results on ODinW
  7.6 Visualizations
  7.7 Conclusion
Chapter 8: Conclusions and Future Direction
Bibliography

List of Tables

2.1 Notations used to describe ZSG with examples (Fig 2.1). By close objects we mean their word embeddings are similar.
2.2 Comparison of our model with other state of the art methods. We denote those networks which use classification weights from ImageNet [206] using "cls" and those networks which use detection weights from Pascal VOC [50] using "det". The reported numbers are all Accuracy@IoU = 0.5 or equivalently Recall@1. Models marked with "*" fine-tune their detection network on the entities in the Flickr30k.
2.3 Category-wise performance with the default split of Flickr30k Entities.
2.4 Accuracy across various unseen splits. For Flickr-Split-0,1 we use Accuracy with IoU threshold of 0.5. Since Visual Genome annotations are noisy we additionally report Accuracy with IoU threshold of 0.3. The second row denotes the IoU threshold at which the Accuracy is calculated. "B" and "UB" denote the balanced and unbalanced sets.
2.5 Accuracy of various models on the balanced VG-Splits-2,3 w.r.t. the semantic distance of the referred object (A) to the closest object seen at train time. VG-2B and VG-3B refer to the balanced test set for Case 2, 3. #I-P denotes the number of image-phrase pairs in the given semantic distance range.
2.6 Ablation study: BM = Base Model, softmax means we classify only one candidate box as foreground, BCE = Binary Cross Entropy means we classify each candidate box as the foreground or background, FL = Focal Loss, Img-Resize: use images of dimension 600 × 600.
2.7 Proposal Recall Rates using top-300 proposals at IoU = 0.5 (0.3 for VG) calculated on test sets. FR: FasterRCNN, no f/t: pretrained on Pascal VOC, f/t: fine-tuned on the target set. For ReferIt we use the f/t model on Flickr30k to be consistent with QRC.
2.8 Ablation study: Language Blind (LB) and Image Blind (IB) settings using images of resolution 300 × 300. The metric reported is Accuracy@IoU=0.5 (0.3 for VG).
3.1 An illustration of semantic-role assignment to a description. Here, the actor/agent (person) performs an action/verb (wash) using some instrument (water) at some location (sink).
3.2 Number of annotated boxes in the ASRL training set.
3.3 An example of applying semantic role labeling to the video description. Each verb is treated independently of the others, and the verbs "is", "seen", "begins" are not considered. For all other verbs, the last column shows the considered input to the system.
3.4 Number of videos in train, validation, and test splits. Some instances are removed from training if they don't contain meaningful SRLs. Our test split is derived from the AE validation set.
3.5 Total number of lemmatized words (with at least 20 occurrences) in the train set of ASRL.
3.6 Comparison of VOGNet against ImgGrnd and VidGrnd. GT5 and P100 use 5 and 100 proposals per frame. Here, Acc: Grounding Accuracy, VAcc: Video Accuracy, Cons: Consistency, SAcc: Strict Accuracy (see Section 3.4.4 for details). On the challenging evaluation metrics of TEMP and SPAT, VOGNet (ours) shows significant improvement over competitive image and video grounding baselines.
3.7 Evaluation of VOGNet in the GT5 setting by training (first column) and testing (top row) on SVSQ, TEMP, SPAT respectively.
3.8 Comparison of Contrastive Sampling (CS) vs Random Sampling (Rnd) for training (rows 1, 2) and evaluation (rows 2, 3).
3.9 Training VOGNet in the SPAT setting with different numbers of concatenated videos, tested on SPAT with 4 videos.
3.10 Ablative study comparing gains from the Multi-Modal Transformer (MTx), the Object Transformer (OTx), and Relative Position Encoding (RPE). L: number of layers, H: number of heads in the Transformer. Note that VOGNet = ImgGrnd + MTx(1L,3H) + OTx(1L,3H) + RPE.
3.11 Comparing models trained with GT5 and P100. All models are tested in the P100 setting.
3.12 Ablative study on layers and heads of Transformers.
4.1 Comparison of existing datasets for VidQA with our proposed ASRL-QA and Charades-SRL-QA. Here, OE = Open-Ended, MC = Multiple Choice. "Scripts": whether answering questions requires access to scripts or subtitles. "Box": whether the dataset provides bounding box annotations. *: includes Yes/No questions.
4.2 Detailed dataset statistics for both ASRL-QA and Charades-SRL-QA with respect to different argument roles. Recall that ARG0 is not present in Charades-SRL-QA, and hence the corresponding rows are kept blank.
4.3 Comparison of our extended models for VidQAP and classification-based (CL) models across two datasets on our proposed metric. Here, the "R-" prefix implies it is the final metric computed after relative scoring and contrastive scoring with threshold 0. "BS": BertScore, "Cons": Consistency on BertScore, B@2: Sentence BLEU-2, R: ROUGE, M: METEOR, C: CIDEr. Reported numbers are on the test set. For classification models, the number within the parenthesis denotes the size of the fixed vocabulary of phrases. Best result, second-best result.
4.4 Comparison of our extended models per SRL. All reported scores are R-BS: BertScore computed after relative scoring and contrastive scoring with threshold 0.
4.5 BertScore metrics computed directly on answer phrases. Rel Score: after relative scoring. CS@T: contrastive scoring with threshold T.
4.6 Effect of adding region proposals. All reported scores are R-BS. Best result, second-best result.
5.1 A non-exhaustive summary of video understanding tasks, required annotations, and benchmarks.
5.2 10A and 20A denote 10 and 20 annotations respectively. Majority denotes choosing the most frequent verbs for the validation set.
5.3 Statistics on splits of VidSitu. Note that VidSitu contains multiple verb and role annotations for the val and test sets for accurate evaluation.
5.4 Dataset statistics across video description datasets. We highlight key differences from previous datasets such as explicit SRL, co-reference, and event-relation annotations, and greater diversity and density of verbs, entities, and semantic roles. For a fair comparison, for all datasets we use a single description per video segment when more than one is available.
5.5 The distribution of Event Relations before and after filtering by taking the consensus of at least two workers, i.e. we consider only those instances where two workers agree on the event relation when given the verb.
5.6 CIDEr score for all collected Arguments with 5 annotations on 100 videos.
5.7 Semantic role prediction and co-referencing metrics. Vis. denotes the visual features used (✗ if not used), and Enc. denotes whether video features are contextualized. C: CIDEr, R-L: ROUGE-L, C-Vb: CIDEr scores averaged across verbs, C-Arg: CIDEr scores averaged over arguments. Lea-S: Lea-soft. See Section 5.7.1 for details.
5.8 Verb classification metrics. Acc@K: Event Accuracy considering 10 ground-truths and K model predictions. Rec@K: Macro-Averaged Verb Recall with K predictions. Kin. denotes whether Kinetics is used.
5.9 Event relation classification metrics. Macro-Averaged Accuracy on the Validation and Test Sets. We evaluate only on the subset of data where two annotators agree.
5.10 Semantic Role Prediction on the Validation Set. B@1: BLEU-1, B@2: BLEU-2, M: METEOR, R: ROUGE-L, C: CIDEr, Metric-Vb: Macro-Averaged over Verbs, Metric-Arg: Macro-Averaged over Arguments, Metric-Argi: Metric computed only for the particular argument.
5.11 Semantic Role Prediction on the Test Set. B@1: BLEU-1, B@2: BLEU-2, M: METEOR, R: ROUGE-L, C: CIDEr, Metric-Vb: Macro-Averaged over Verbs, Metric-Arg: Macro-Averaged over Arguments, Metric-Argi: Metric computed only for the particular argument.
6.1 Dataset Statistics for RAIV datasets. U-Im and U-S denote unique numbers of images and sentences, respectively. I-{Tr, Val, Test, Tot} denotes the number of instances. Note that in template-based queries some sentences are duplicates.
6.2 Accuracy@1 of fine-tuned pre-trained models on NLVR2 and RAIV datasets. All models are obtained from METER. Pre-Training refers to data used for pre-training. TSP Data refers to data used for task-specific pre-training, which is obtained from COCO and VATEX. (T) and (G) refer to whether the statements are obtained via a template or generated via a Language Model. NLVR2 refers to the NLVR2-dev set.
6.3 Accuracy@1 of fine-tuned pre-trained models on the Reasoning Task of RAIV datasets.
6.4 Accuracy@1 across RAIV datasets using image pre-trained baselines with and without Task-Specific Pre-Training (TSP), which uses data from COCO + VTX.
6.5 Accuracy@1 for different fusions (early, mid, late) with varying amounts of data from IP2P.
6.6 Accuracy@1 for Random vs Conditional Sampling for RAIV and Reasoning tasks. By default, conditional sampling is used for task-specific pre-training.
7.1 Comparison of different Segment-Text Matching approaches.
7.2 Dataset Statistics. COCO and LVIS have the same training image set, so their statistics are identical. U-NP refers to Unique Noun-Phrases.
7.3 Results on OVD-COCO. Here, M-R50 refers to Faster-RCNN [199] or Mask-RCNN [73] with a Res50 [75] backbone.
7.4 LVIS object detection results. * denotes using CLIP RN50, ** denotes using CLIP RN50x4. r, c, f refer to rare, common, frequent respectively.
7.5 Box AP on LVIS with the FBase Split. Numbers for ViLD and EZAD on the FBase split are taken from [151].
7.6 Acc@1 on RefCoCo+. "FT": Fine-tuned, "ZS": Zero-shot.
7.7 Ablation for different CLIP models.
7.8 Fusion Ablation. E: Ensemble, LF: Late Fusion, EF: Early Fusion.
7.9 Results on ODinW. All numbers are AP50. wo/P and w/P refer to without and with prompting, respectively.

List of Figures
2.1 Illustration of the key difference between the current scope of phrase grounding and the proposed zero-shot grounding. The query word is italicized in all cases. (a)-(f) denote the image-query pairs input to the system. (a) and (b) are examples of training images. A test image-query pair for phrase grounding could be (c). Zero-shot grounding can additionally be tested on (d), (e) and (f), in which "chair", "minivan" and "moped" are object categories not annotated in the training data. (f) additionally contains a "car" object, which is a trained category, indicating that both novel and related trained-category objects may be present in a test image.
2.2 Previous phrase grounding systems (a) produce a small subset of proposals without considering the query, restricting it to the entities of the detection network. Our system (b) considers dense proposals, looks at the query to disambiguate, and learns directly from the grounding dataset.
2.3 A schematic of the ZSGNet architecture. Input to the system is an image-query pair. A deep network is used to produce K image feature maps taken at different resolutions. The anchor generator uses the image size to produce anchors at different scales and resolutions. We append the anchor centers at each cell of each feature map. The query phrase is encoded using a bidirectional LSTM (Bi-LSTM) and the language feature obtained is appended at every cell location of every feature map along the channel dimension. The resulting multi-modal feature maps are input to a Fully Convolutional Network (FCN) block to output a prediction score and regression parameters, which are trained using focal loss (Lpred) and SmoothL1 loss (Lreg) respectively.
2.4 Category-wise distribution of various unseen splits. First row: training, validation and test set splits for Case 0; second row: unbalanced training and test sets for Case 2 and Case 3; third row: balanced training and test sets for Case 2 and Case 3. In a row, the colors represent the same entities or the same clusters.
2.5 A few grounding visualizations. In all cases, red denotes the ground-truth box; green is the box predicted by ZSGNet. Row 1: Flickr30k, ReferIt; Row 2: Flickr-Split-0, 1; Row 3: VG-Split-2, 3. In Row 3, the query word Q is emphasised and the closest seen object is provided in parenthesis. The last column shows incorrect predictions.
3.1 Illustration of different formulations of VOG when the same query and video frames are used. (a) [306] treats each query word independently and doesn't distinguish between different instances of the same object. (b) [35] makes this distinction using independent spatio-temporal tubes. Ours (c) involves localizing only those objects which are being referenced in the query and requires additional disambiguation using object relations.
3.2 (a) illustrates contrastive sampling based on semantic roles. Q1 contains a single agent ("man") and a single patient ("dog"). We use the SRL structure Arg0-Verb-Arg1 but replace one queried object (Q2, Q4) or action (Q3). (b) shows temporal concatenation where we resize each video to the same width and height. (c) shows spatial concatenation where we resize the height and sample a fixed number of frames across the videos. (d) shows an unreasonable spatial concatenation, as videos have a top-down order ("ocean" is always below "sky").
3.3 An overview of VOGNet. It takes a video-query pair as an input. A visual encoder extracts object features for each frame and concatenates them with segment features (rgb+flow). A language encoder encodes the whole query with a BiLSTM [217, 81] and then maintains a separate encoding for each phrase in the query (Eq. 3.1). A Transformer [246] is first applied to the visual features to model object relations. These self-attended visual features are fused with the language features. Finally, a separate transformer models the interaction among the fused multi-modal features, followed by a 2-layer MLP. VOGNet is trained with a Binary Cross Entropy (BCE) loss.
3.4 Train, val and test splits for AC, AE, ASRL.
3.5 Number of noun phrases for each role.
3.6 Number of groundable objects for each role.
3.7 Frequently appearing SRL structures.
3.8 Top-20 lemmatized Verb.
3.9 Top-20 lemmatized Arg0.
3.10 Top-20 lemmatized Arg1.
3.11 Top-20 lemmatized Arg2.
3.12 Top-20 lemmatized ArgM-LOC.
3.13 SVSQ: Illustration of the ground-truth annotations for the "man" (green) obtained from AE. The red boxes show equally correct boxes for "man" but are not annotated. As a result, we only consider the third frame to compute the accuracy of the predicted box.
3.14 TEMP: Two videos are concatenated along the time dimension (we show 2 frames from each video) with the description "man throwing a ball", and we are considering the object "man". If the predicted box is within the same video as the ground-truth but the frame doesn't have any annotation (red box), we ignore it. However, if the predicted box belongs to another video (yellow boxes), we say the prediction is incorrect.
3.15 SPAT: Similar to the previous case, we have the same description of "man throwing a ball" and we consider the object "man", but the videos are concatenated along the width dimension (we show 2 frames in the concatenated video). Again, if the predicted box lies in the same video as the ground-truth (red box), we ignore it. If the predicted box is in another video (yellow boxes), the predictions are deemed incorrect.
3.16 Left (L): concatenated using SPAT with query: [Arg0: The man] [Verb: throws] [Arg1: the ball] [Arg2: in the air]. Middle (M): concatenated using TEMP with query: [Arg1: He] [Verb: spins] [ArgM-DIR: around the board]. Right (R): concatenated using SPAT with query: [Arg0: He] [Verb: attaches] [Arg1: a dummy]. In L, R the top-two and bottom-two frames are seen concatenated. In M, there are four frames following the order tl-bl-tr-br. In L, M our model VOGNet correctly finds the referred objects ("man", "ball", "boy", "board"). In R, VOGNet is unable to find "dummy" and ends up localizing the incorrect person.
3.17 VOGNet predictions with the TEMP strategy in the GT5 setting. We show two frames from each video, but the model looks at F=40 frames.
3.18 VOGNet predictions with the SPAT strategy in the GT5 setting. We show two frames from each video, and each frame contains 4 videos concatenated together.
3.19 Incorrect prediction of VOGNet for the SPAT strategy.
4.1 Previous methods formulate VidQA as an N-way classification task. The questions are converted via a question generation tool (Q1, Q2) or a masking-out strategy (Q3). However, such QA has a theoretical recall upper bound when the correct answer is not among the choice list. In comparison, we propose a free-form text generation task which does not suffer from such a limitation (Q4-Q7).
4.2 Illustration of our query generation process. In (a) DIR is ignored from both Query and Answers. In (b) the question is removed from the validation set since at most two arguments from the considered set are present.
4.3 Illustration of the Relative Metric Computation. "moves" is the ground-truth answer and "lifts" is a model's prediction. The Relative Metric compares the relative improvement from using the model's prediction as compared to an empty string.
4.4 Illustration of the Contrastive Sampling Process. For the same query expression, we retrieve two videos with different answers. The model is required to correctly answer both the original and the contrastive sample query.
4.5 Schematic of the various models used to benchmark VidQAP. Input Query: "A person picks up ". Ground-Truth Answer: "a pair of shoes". (a) Lang-QAP is a language-only model which encodes the query input and passes it to a decoder. (b) BUTD-QAP uses the pooled feature representation from the language encoder and attends over the visual features. (c) VOG-QAP uses an additional phrase encoder and applies a Transformer over the multi-modal features. (d) MTX-QAP consumes both the language and visual features with a multi-modal transformer.
4.6 Lemma distribution for both ASRL-QA and Charades-SRL-QA. The number of instances across the whole dataset is given in the parenthesis of each lemmatized noun or verb.
4.7 Queries of type ARG0 and V on ASRL-QA.
4.8 Queries of type ARG1 and ARG2 on ASRL-QA.
4.9 Queries of type ARGM-LOC on ASRL-QA.
5.1 A sample video and annotation from VidSitu. The figure shows a 10-second video annotated with 5 events, one for each 2-second interval. Each event consists of a verb (like "deflect") and its arguments (like Arg0 (deflector) and Arg1 (thing deflected)). Entities that participate in multiple events within a clip are co-referenced across all such events (marked using the same color). Finally, we relate all events to the central event (Event 3). The video can be viewed at: https://youtu.be/3sP7UMxhGYw?t=20 (from 20s-30s).
5.2 Bar graph showing the number of unique verbs with respect to the rank of the video segment as computed via our heuristic based on predicted labels from a SlowFast Network [52] trained on AVA [66].
5.3 Illustration of our annotation interface. (a) depicts the initial screen an annotator sees. In the first step, one needs to watch the entire 10-second video. (b) depicts the second step of choosing a verb from a drop-down which contains verb senses obtained from PropBank. After selecting a verb, an example usage is shown along with the corresponding argument roles which need to be filled. (c) depicts filling the argument slots for each verb, which can be phrases of arbitrary length. Each filled-in phrase can be re-used in a subsequent slot, to enforce co-reference of the entities. (d) shows the final step of choosing event relations once all the arguments for all events are filled. The event relations should be classified based on causality and contingency for Events 1, 2, 4, 5 with respect to Event 3.
5.4 Data analysis. An analysis of VidSitu in comparison to other large-scale relevant video datasets. We focus on the diversity of actions and entities in the dataset (a and b), the complexity of the situations measured in terms of the number of unique verbs and entities per video (c and d), and the richness of annotations (e and f).
5.5 Distribution of the 100 most frequent verbs (a), genre tuples (b), and movies (c). Note that for (a), the count represents the number of events belonging to the particular verb, whereas for (b), (c) it represents the number of video segments belonging to a particular genre or movie.
5.6 50 most frequent words (after removing stop-words) for Arg0, Arg1, Arg2, ALoc (location), ADir (direction) and AMnr (manner).
5.7 Models. The figure illustrates our baselines for verb, semantic role, and event prediction using state-of-the-art network components such as the SlowFast [52] network for video feature extraction, transformers [245] for encoding events in a video and verb-conditional decoding of roles, and the RoBERTa [149] language encoder for event-relation prediction.
6.1 Existing fine-tuning tasks such as (b) VQA operate on a single image, which is similar to pre-training objectives (a) such as Masked-Language Modeling or Image-Text Matching. Here, we expand the scope to include reasoning (c) across images, (d) across an image and a video, and (e) across videos. Here "A" denotes the answer (True/False), and "R" denotes the reason.
6.2 Sentence generation for RAIV tasks. Given images from ImSitu (the same process applies for videos from VidSitu) along with their SRLs, we find the common object (in this case bread) and use them along with True/False templates to generate sentences.
6.3 A schematic of the framework for the RAIV task. Input is a pair of images, videos, or an image and a video (shown here) with a text. The visual inputs (denoted by I1 and I2) are first patchified, then appended with a temporal position embedding denoting the frame number. We note that images are considered to be single-frame videos. Then, we add the ID embedding denoting whether it is the first or the second visual input. This is input into two METER models (shared weights) which take both vision patches and text as input. The appended CLS from both inputs are concatenated and a Binary Cross Entropy loss is used given the ground-truth.
6.4 Example creation of generating template-based queries.
6.5 Illustration of generating queries for the IP2P dataset. For a given image pair, we have (a) the sentence queries used to generate the image via Stable Diffusion, (b) using an LLM to generate a True statement, and (c) a False statement obtained by changing the edit caption.
6.6 Model predictions vs ground-truth for template-based ("T") validation datasets. (a) Im-Im (T), (b) Im-Vid (T), (c) Vid-Vid (T). PrA and GtA refer to predicted and ground-truth answers respectively. PrR and GtR refer to predicted and ground-truth reasoning respectively.
6.7 Model predictions vs ground-truth for the IP2P dataset. For a given pair of images, both the chosen True and False sentences are shown.
6.8 Model predictions vs ground-truth for generated-query ("G") validation datasets. (a) Im-Im (G), (b) Im-Vid (G), (c) Vid-Vid (G). PrA and GtA refer to predicted and ground-truth answers respectively.
7.1 (a) The standard OVD task. Given an annotated training set with base classes of "Dog" and "Person", the goal is to detect novel objects not seen during training, such as "Suitcase" and "Conveyor", in addition to the seen classes. (b) The standard OVD task, dubbed OVD-BT. A model is trained on an annotated detection dataset with "base" classes and at test time we are provided the set of "novel" classes. (c) Our proposed Weakly-Supervised OVD (WS-OVD) task. The training set contains only raw images without any notion of base and/or novel classes, instead having a large vocabulary. At test time, the goal is to find all objects given a class set by the user.
7.2 A schematic framework of our Detect-Anything Model (DAM). First, SAM [111] is applied to obtain all the image segments. Then, a segment-text matching step is performed via CLIP [191] (details in Section 7.3.3). Then the features of the SAM model are aligned with those of the CLIP text features. Finally, at inference time, one can use the best matching text features to obtain the class labels for all image segments.
7.3 An example from the GLG1 dataset. We sample the tags "dog" and "apple", which are used to create a caption using a fixed template. The boxes are generated at random, used as input, and overlaid on the images for visualization purposes only. The same parameters are used for both (a) and (b), with the only difference being the random seed. Only (a) passes both the CLIPScore filter and the obtained tag list.
7.4 Example from the GLG2 dataset. Given the original image in (a), we use its caption and sample two tags also found in the caption to generate new images (b), (c), and (d). Bounding boxes are randomly sampled (kept the same for (c) and (d)). Even though the resulting images don't exactly match the captions, they are still valuable for detection.
7.5 Top row: first is the original image, second is the COCO ground-truth. Bottom row: first is the raw segmentation output from SAM, second is the output provided in open-space, third is the output provided in COCO-vocabulary.
7.6 Top row: first is the original image, second is the COCO ground-truth. Bottom row: first is the raw segmentation output from SAM, second is the output provided in open-space, third is the output provided in LVIS-vocabulary.
Abstract

While machine learning research has traditionally explored image, video and text understanding as separate fields, the surge in multi-modal content in today's digital landscape underscores the importance of computational models that adeptly navigate complex interactions between text, images and videos. This dissertation addresses the challenge of grounding language in visual media: the task of associating linguistic symbols with perceptual experiences and actions. The overarching goal of this dissertation is to bridge the gap between language and vision as a means to a "deeper understanding" of images and videos, allowing the development of models capable of reasoning over longer time horizons such as hour-long movies, a collection of images, or even multiple videos.

A pivotal contribution of my work is the use of Semantic Roles for images, videos and text. Unlike previous works that primarily focused on recognizing single entities or generating holistic captions, the use of Semantic Roles facilitates a fine-grained understanding of "who did what to whom" in a structured format. It retains the advantages of free-form language phrases while at the same time being comprehensive and complete like entity recognition, thus enriching the model's interpretive capabilities.

In this thesis, we introduce the various vision-language tasks developed during my Ph.D. These include grounding unseen words, spatio-temporal localization of entities in a video, video question answering, visual semantic role labeling in videos, reasoning across more than one image or video, and finally, weakly-supervised open-vocabulary object detection. Each task is accompanied by the creation and development of dedicated datasets, evaluation protocols, and model frameworks. These tasks aim to investigate a particular phenomenon inherent in image or video understanding in isolation, develop corresponding datasets and model frameworks, and outline evaluation protocols robust to data priors. The resulting models can be used for other downstream tasks, like obtaining common-sense knowledge graphs from instructional videos, or to drive end-user applications like Retrieval, Question Answering, and Captioning. By facilitating the deeper integration of language and vision, this dissertation represents a step forward toward machine learning models capable of a finer understanding of the world around us.

Chapter 1
Introduction

I posit that one of the most important use cases of AI in this evolving digital landscape is the ability to process the ever-growing volumes of data generated in the form of images, videos, and text. From real-time video content on streaming platforms to the rich, varied imagery found in online retail catalogs, the ability of AI to handle and interpret this data profoundly influences numerous sectors. As the volume and variety of such digital media continue to grow, developing methods that can effectively synthesize and interpret these complex data sets becomes essential for the advancement of intelligent systems.
Historically, the fields of vision and language processing have developed along separate trajectories, each confined to its own methodologies and frameworks. This separation significantly limits the ability of systems to handle open and large-scale vocabularies that are reflective of real-world scenarios. Current models, often restricted to fixed, closed vocabularies, fail to capture the vast and dynamic nature of human language intertwined with the visual context.

Naive techniques in vision-language integration typically lack the sophistication needed to understand complex interactions between textual and visual elements. Simple correlational methods or direct annotation techniques are insufficient for parsing nuanced relationships or for extracting deeper semantic meanings. Such approaches fall short in scenarios requiring detailed comprehension of the context, intent, and subtleties present in multi-modal data, necessitating more advanced, innovative solutions.

While advanced systems designed to integrate vision and language offer sophisticated capabilities, they are not immune to failures. These complex systems can often misinterpret or completely overlook nuances in multi-modal data, leading to inaccuracies that can have cascading effects on their performance. The challenge is not only in building models that can handle the intricacies of vision and language but also in developing mechanisms to diagnose and rectify errors at a granular level. Such diagnostics are crucial for refining the system's ability to understand and interact with the world accurately. This necessitates a detailed examination of how these models process and analyze data, ensuring they can reliably perform in diverse and dynamic real-world scenarios.

This thesis takes a step towards enhancing vision-language systems, addressing distinct challenges essential for deeper multi-modal understanding. A key issue in existing vision-based, and even vision-language, systems is that the tasks they consider either lack flexibility (such as pure image/action classification) or lack completeness (such as image/video captioning). To mitigate this issue, we propose to utilize Semantic Roles, which answer the question "who did what to whom", leading to descriptions that are both flexible and complete in a structured format (a minimal sketch of such a role frame is given after the list below). However, obtaining such Visual Semantic Roles is challenging; thus we approach it in the following steps.
1. Establishing Foundations in Video Understanding: Understanding Nouns, Object Entities, and Object Relations via Natural Language. Our first task is to understand the various objects in an image or video. Crucially, we aim to associate and find the objects with just natural language instead of a fixed vocabulary set. To this end, we introduce a zero-shot grounding system [211] which can learn to identify objects even if the referred objects were never seen during training. The task also requires the model to uniquely identify the referred object instance given the natural language query. This involves object relations such as "person"-"holding"-"bottle". We introduce the task of video object grounding [209] to extend this grounding task to videos.

2. Interpreting Actions and Relationships in Videos: Understanding Verbs, Actions, and Object Interactions via Natural Language. Once we have a notion of objects, we move our attention to the key components of video understanding, namely the role of actions and how the objects related to an action interact with each other across time. Instead of relying on a single-label answer, as in action classification, we propose the use of free-form phrases in natural language, which allow for both richer latent representations and more informative answers. One way to handle this is through a question-answering framework; to this end, we introduce Video QA with phrases [208], where we pose Video QA as the generation of answer phrases. However, Video QA requires a particular question to be provided. Thus, we extend this to the Situation Recognition task [212], where the model is only provided a video and the task is to obtain all the verbs, the semantic roles for each of the verbs, and the event relations between them.

3. Advancing Image and Video Understanding via Visual Contexts through Reasoning and Open-Vocabulary Detection. Once we have an understanding of videos, we wish to expand across, as well as refine within, visual contexts. For expanding visual contexts, we introduce the task of reasoning across images and videos [214], where the model has to not just understand a single video in isolation but also reason across videos and images. For refining visual contexts, we tackle the task of detecting objects without a fixed vocabulary set in a weakly-supervised manner [213], such that we are not bound to the available annotated datasets. In this setting, we are not confined to any existing vocabulary set and further assume that no bounding-box annotations are available. This allows us to create models that can be directly trained over raw images or image-caption pairs.
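To make the Semantic Role structure referenced above concrete, the sketch below shows a PropBank-style role frame for a simple description, in the spirit of the washing example used later in Chapter 3 (an agent washing something with water at a sink). The caption, the role fillers, and the dictionary layout are purely illustrative assumptions, not the thesis' actual data format.

```python
# Illustrative sketch only: a PropBank-style semantic-role frame for a
# description, capturing "who did what to whom, with what, and where".
caption = "A person washes a dog with water in the sink."

role_frame = {
    "V": "washes",              # action / verb
    "ARG0": "a person",         # agent: who performs the action
    "ARG1": "a dog",            # patient: what the action is applied to
    "ARG2": "with water",       # instrument
    "ARGM-LOC": "in the sink",  # location modifier
}

# Grounding such a frame then amounts to attaching a bounding box (or, for
# videos, a spatio-temporal tube) to every role filler that names a visible
# entity; the verb itself is not localized.
groundable = {role: phrase for role, phrase in role_frame.items() if role != "V"}
print(groundable)
```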
To this end, we propose using language generation modules to allow free-form phrases. To circumvent the challenges of evaluating free-form phrases, we propose relative scoring and contrastive scoring. We show that generative models perform much better, especially when the possible phrase list becomes very large. While Video Question Answering with phrases does expand our possibilities, it still requires the user to provide a suitable question. However, the end-user is usually interested in understanding a particular given video. This brings us to the main contribution of the thesis in Chapter 5, Situation Recognition in Videos. Given a 10-second-long video, the model is tasked with identifying a suitable verb for each 2-second clip, providing semantic roles for the chosen verb of each clip while maintaining co-referencing for the referred objects, and finally providing an event relation across the clips. We contribute a new manually annotated dataset which is rich, diverse and complex, and new evaluation metrics to account for long-tail distributions and co-referencing. We further provide Transformer-based encoder-decoder models for this end-to-end task.

Advancing Video Understanding via Visual Contexts through Reasoning and Open Vocabulary Detection. In the previous two segments, we looked at both object entities as well as their interactions with human actions and event relations. Here, we expand this notion to two tasks: Reasoning and Open-Vocabulary Object Detection. We first explore whether existing image-video-language models can reason across visual contexts. In Chapter 6, we introduce the task of Reasoning across Images and Videos [214], where instead of a single image or video, we are given a pair of images, an image and a video, or a pair of videos, alongside a natural language query, and the task is to compare the provided visual inputs and produce a true or false answer. Further, we require the model to provide a reason for its answer. We find that instead of naively using existing vision-language pre-training, a second-stage pre-training significantly improves model performance. Finally, in Chapter 7, we investigate the task of open-vocabulary object detection where the model is required to detect novel classes. Crucially, we explore the weakly supervised setting where no annotated images are available; thus, there is no notion of base or novel classes. We suggest three methods, namely using an image tagger, image captions, or generative diffusion models, to obtain pseudo-ground-truth data using the Segment Anything Model [111], which is then utilized for the detection task. In our resulting Detect Anything Model (DAM) [213], we find that each method complements the others, leading to state-of-the-art novel-class detection accuracy even in the absence of training data.

The relevant publication list for each chapter is provided below:

1. Chapter 2: Zero-Shot Grounding of Objects from Natural Language Queries [210] (ICCV’2019)
2. Chapter 3: Video Object Grounding using Semantic Roles in Language Description [209] (CVPR’2020)
3. Chapter 4: Video Question Answering with Phrases via Semantic Roles [208] (NAACL’2021)
4. Chapter 5: Visual Semantic Role Labeling for Video Understanding [212] (CVPR’2021)
5. Chapter 6: Leveraging Task-Specific Pre-Training To Reason Across Images and Videos [214] (WACV’2024)
6.
Chapter 7: Detect Anything: An Open-Vocabulary Object Detection Model with Weak Supervision [213]

1.2 Chapter Organization

Given the diverse tasks considered in the works covered in this dissertation, we don’t have a unified “previous works” section. Instead, each chapter is intended to be self-contained and all relevant prior works are provided in the corresponding chapter itself. Some sections in the chapters are marked as “Additional Details” or “Extra Details”. These are provided mainly for completeness but are not required to understand the main chapter. Some experiments which are not integral to the main message of the chapter are also marked as such. Each chapter follows the rough outline: Introduction, Related Works, Method, Experiments and Results, Conclusion. An additional Task section is provided before the Method section if the formulation is new and hasn’t been explored in previous works.

Chapter 2
Zero-Shot Grounding

2.1 Introduction

Detecting objects in an image is a fundamental objective of computer vision. A variation of this task is phrase grounding (also called visual grounding and referring expressions) where the objective is to detect objects referenced by noun phrases in a text query [86, 200, 32, 277]. It can be directly applied to other tasks such as visual question answering [7, 299] and image retrieval [30] and has thus garnered wide interest. While existing phrase grounding systems accept novel query phrases as inputs, they are limited to the nouns encountered in the training data (i.e. the referred object types need to have been “seen” in training images before). As an important extension, we define zero-shot grounding (ZSG) to allow the use of phrases with nouns that the grounding system has not encountered in the training set before. Fig 2.1 illustrates this concept with examples.

To enable grounding of novel object categories, we need to relate the appearance of referred objects to their linguistic descriptions. Current state-of-the-art phrase grounding systems [277, 41, 32, 284, 183] rely on an explicit object detector to obtain proposed object bounding boxes and their ROI-pooled features as a pre-processing step. This essentially limits these systems to a fixed set of object categories that the detector was trained on. In ZSG, we need to have a reasonable proposal box for the novel object, classify it as foreground and regress the box to be a more accurate spatial fit. In traditional phrase grounding, a key challenge is to disambiguate between similar objects using the query phrase, but ZSG requires us to also first find likely image regions that may contain the referenced objects.

Figure 2.1 (image-query pairs: (a) red car, (b) blue shirt, (c) blue car, (d) blue chair, (e) red minivan, (f) silver moped): Illustration of the key difference between the current scope of phrase grounding and the proposed zero-shot grounding. The query word is italicized in all cases. (a)-(f) denote the image-query pairs input to the system. (a) and (b) are examples of training images. A test image-query pair for phrase grounding could be (c). Zero-shot grounding can additionally be tested on (d), (e) and (f), in which “chair”, “minivan” and “moped” are object categories not annotated in the training data. (f) additionally contains a “car” object, which is a trained category, indicating that both novel and related trained-category objects may be present in a test image.
To address the above issues, we replace the traditional two-stage approach, where the first stage generates proposal bounding boxes and the second stage does the classification, by a single-stage network with dense proposals; we call this network ZSGNet. It takes combined language query features and visual features from the image proposals and predicts classification scores and regression parameters. The system is trained directly on the grounding training data, in an end-to-end manner, and does not utilize any externally trained object detector. We show that, besides enabling grounding of novel categories, it does not degrade performance on learned categories even though our method does not utilize external training data. Moreover, our design is computationally efficient, especially during inference, owing to its single-stage architecture akin to SSD [147].

Evaluating the performance of a ZSG method is complex due to the influence of the relationship of the new query category to the learned categories. To make the evaluations and distinctions clearer, we define four specific cases for different conditions: (i) when the query word is novel (Fig 2.1 d-f); (ii) when the referred object belongs to a novel category (Fig 2.1-d); (iii) when the referred object is “similar” to objects seen during training but none of the latter are present (Fig 2.1-e); (iv) when at least one similar object also exists in the test image (Fig 2.1-f) (more details in Section 2.3.1). To support evaluation of zero-shot grounding for the four cases, we introduce new datasets which are sub-sampled from the existing Visual Genome [117] and Flickr30k Entities [184]. We create examples of the four cases outlined above (dataset creation details are in Section 2.4.1, experiments on these datasets are in Section 2.4.7).

Our contributions can be summarized as follows: (i) we introduce the problem of zero-shot grounding, (ii) we propose a simple yet effective architecture, ZSGNet, to address limitations of current phrase grounding systems for this task, (iii) we create new datasets suitable for evaluating zero-shot grounding and (iv) we evaluate performance on these datasets and show the effectiveness of our approach. Our code and datasets are publicly released at https://github.com/TheShadow29/zsgnet-pytorch.

2.2 Related Work

Phrase grounding: Extensive work in creating grounding datasets like Flickr30k, ReferIt, RefCoCo, RefCoCo+, RefCoCog, Visual Genome, GuessWhat [184, 101, 279, 162, 117, 40, 275] has been crucial to the success of phrase grounding. Early works use a reconstruction-based approach [200] or integrate global context with the spatial configurations [86]. Recent approaches [32, 284, 183] learn directly in the multi-modal feature space and use attention mechanisms [277, 41], which have also been extended to phrase grounding in dialogue systems [40, 312]. A few approaches also look at unsupervised learning using variational context [294] and semi-supervised learning via gating mechanisms [31]. The above techniques use an object detector like Faster R-CNN [198] or Mask R-CNN [73] as a pre-processing step to get the bounding boxes and ROI-pooled features, which effectively limits them to the object categories of the detector. We combine the detection and the grounding networks and learn directly from the grounding dataset, and thus no pre-processing step is involved.
Multi-modal feature representation has many flavors like linear transformation, concatenation, Hadamard product [105] and bilinear pooling [146], and has shown success in vision-language tasks like VQA [54, 283, 282, 16, 17], Scene Graph Generation [135, 270] and Image Captioning [155]. We stick to feature concatenation for simplicity and a fair comparison with previous works in phrase grounding.

Single stage networks used in object detection are popular for their real-time inference speed. Prominent works include SSD [147], YOLO [194, 195, 196] and more recently FPN [142] and RetinaNet [143]. In this work, we combine a single-stage detection network directly into the grounding framework; besides enabling zero-shot grounding, it also results in highly efficient inference.

Zero-shot grounding is unexplored but there are a few similar works. [68] aims at open-vocabulary object retrieval, though it still assumes the entity of the referred object is seen at train time. Recently, [13] proposed zero-shot detection (ZSD) where they consider a set of unseen classes for which no bounding box information is available at train time. At test time, all the objects including the unseen classes must be detected. However, the set of background classes is needed prior to training, which is not needed in ZSG.

2.3 Design Considerations for ZSG

We first discuss the zero-shot grounding cases and then describe the limitations in extending current phrase grounding systems to ZSG. Finally, we present a new architecture to address the limitations.

Table 2.1: Notations used to describe ZSG with examples (Fig 2.1). By close objects we mean their word embeddings are similar.
Notation | Meaning | Example
T | Test image | Fig 2.1-(f)
P | Test query phrase | silver moped
A | Referred object at test time | Moped
Q | Word in P referring to A | moped
B | Set of objects close to A and seen during training | {Car}
C | Set of categories seen during training | {Vehicles, Clothing}
W | Set of words seen during training | {red, blue, car, shirt}

2.3.1 ZSG Cases

We now describe the four cases for zero-shot grounding in detail. For brevity, we use the notations in Table 2.1. Each case defines the scope for what is classified as a zero-shot grounding example. Further, we assume that Q (the word which refers to the object in the image) is not an OOV (out-of-vocabulary) word, which is reasonable if we use word embeddings trained on a large language corpus.

Case 0: Q ∉ W. The query noun, Q, has not been included in any training example before. We only look at the lemmatized word, so synonyms are considered to be different (novel) words. Fig 2.1(d)-(f) are examples of this case. Fig 2.1-c with the phrase “blue automobile” would also be considered zero-shot since we haven’t seen the word “automobile” before even though it is a synonym of “car”.

Case 1: A ∉ C. Here, we assume that objects seen at train time belong to a set of pre-defined categories and the referred object A doesn’t belong to these categories. In Fig 2.1-d, “chair” is considered zero-shot as this category was not seen at train time.

Case 2: ∃B but ∀b ∈ B we have b ∉ T. Here, objects that are semantically close (similar) to the referred object A are present in the training set but not in the test image. Fig 2.1-e is an example, as “minivan” (a novel object) is semantically close to “car” (seen in the train set) but there is no other similar object like “car” in the test image.

Case 3: ∃B and ∃b ∈ B such that b ∈ T.
Same as Case 2 but at least one of the objects semantically close (similar) to A is also present in the test image. An example is Fig 2.1-f, containing “moped” (a novel object) and “car” (seen in the training set), which are semantically close.

For Case 2 and Case 3, there can be multiple interpretations of being “semantically close”. Here, we assume two objects are “close” if their word embeddings are similar. In our implementation, we cluster the word embeddings of the objects, and objects belonging to the same cluster are considered semantically close (more details in Section 2.4.1).

2.3.2 Limitations in Phrase Grounding Systems

Prior works view phrase grounding either as an entity selection problem [277, 41] or as regressing sparse proposals to a tighter bounding box [32, 284, 183]. In either case, given an image I, we have N candidate boxes and their ROI-pooled features $\{o_i\}_{i=1}^{N}$. Given a query phrase P, the problem reduces to finding a good candidate box, with a possible additional regression step.

Grounding systems using this framework don’t have a mechanism to generalize to object categories not in the detection dataset. Consider a novel category X whose instances may be present in the training images but not annotated. The object detector learns to classify X as background and this error is propagated to the grounding system. [32, 183] suggest fine-tuning the detector on the grounding categories (entities) but the grounding datasets are not densely annotated, i.e. not all object instances of X in every image are annotated. Additionally, some grounding datasets like ReferIt [101] don’t have entity information so fine-tuning is not feasible. Object detectors also favor features invariant to intra-class changes, but grounding systems need to capture intra-class variations as well.

Figure 2.2 ((a) vanilla two-stage phrase grounding system, (b) our single-stage phrase grounding system): Previous phrase grounding systems (a) produce a small subset of proposals without considering the query, restricting them to the entities of the detection network. Our system (b) considers dense proposals, looks at the query to disambiguate and learns directly from the grounding dataset.

2.3.3 Model Design

We propose the following new formulation: given an image I with fixed candidate boxes (also called anchor boxes or dense proposals) $DP = [dp_1, \ldots, dp_N]$ and a query phrase P, the task is to choose the best candidate box $dp_i$ and regress it to a tight bounding box $b_i$. Since our candidate boxes depend only on the size of the image, we can use any image encoder to compute the visual features at run-time and remove the need for a pre-trained object detector, as illustrated in Fig 2.2. This design is similar to the single-shot architectures used in object detection [147, 143].

Framework: Our model consists of a language module to encode the query phrase and a visual module to obtain image feature maps, followed by fully convolutional networks to output a 5-d vector for each candidate box: one dimension for the score and the rest for regressing to a tighter bounding box. Fig 2.3 provides an overview of our proposed architecture. ZSGNet directly learns about the entities in a grounding dataset in an end-to-end fashion.
Moreover, since a query phrase refers to a particular object with possibly different attributes, the visual features are no longer invariant to intra-class changes. This way, we address the limitations posed by previous systems. Finally, owing to its single-stage approach, the network is computationally efficient.

Figure 2.3: A schematic of the ZSGNet architecture. Input to the system is an image-query pair. A deep network is used to produce K image feature maps taken at different resolutions. The anchor generator uses the image size to produce anchors at different scales and resolutions. We append the anchor centers at each cell of each feature map. The query phrase is encoded using a bidirectional LSTM (Bi-LSTM) and the language feature obtained is appended at every cell location of every feature map along the channel dimension. The resulting multi-modal feature maps are input to a Fully Convolutional Network (FCN) block to output a prediction score and regression parameters, which are trained using focal loss ($L_{pred}$) and Smooth-L1 loss ($L_{reg}$) respectively.

Language module consists of an embedding layer followed by a Bi-LSTM [81, 216] to encode the input query phrase. Given a query phrase $P = \{w_i\}_{i=1}^{n}$ we use GloVe vectors [182] to encode each word in P as a word embedding vector $w_i \in \mathbb{R}^{d_q}$, where $d_q$ is the dimension of the embedding vector. These are fed into a Bi-LSTM [81, 216]. We use the normalized last hidden state vector $\hat{h} \in \mathbb{R}^{2 d_l}$ of the Bi-LSTM as the query feature, where $d_l$ is the dimension of the hidden layer of a single LSTM.

Visual Module consists of an image encoder to produce K feature maps $\{v_i\}_{i=1}^{K}$ at different resolutions. We use ResNet-50 [76] with FPN [142] as our default image encoder. We first normalize the visual feature maps along the channel dimension. Then we expand the language feature to the same spatial dimensions as the visual feature maps and concatenate it along the channel dimension with each normalized visual feature map $\hat{v}_i$. Finally, we append the normalized locations of the feature maps ($C_x, C_y = [c_x/W, c_y/H]$) along the channel dimension to aid in location-based grounding (phrases which contain location information) and obtain the multi-modal feature maps $m_i$. At a particular index of the $i$-th feature map (indexed by $x, y$) we have

$m_i[x, y] = [\hat{v}_i[x, y];\ \hat{h};\ C_x;\ C_y]$   (2.1)

where ; denotes the concatenation operation.

Anchor Matching: Following [143] we match 9 candidate boxes to every index of a feature map. We use a fully convolutional network (FCN) to process the multi-modal features and output a 5-dimensional vector (score and regression parameters) for each box. In the anchor matching step, we use an IoU threshold of 0.5 (found ideal via experimentation).

Loss Function: For the binary classification into foreground and background we use the focal loss as described in [143]. For regressing to a tighter box we use the same encoding scheme as [198] with smooth-L1 loss. Let $dp_j$ denote the $j$-th anchor and $gt$ denote the ground truth bounding box. Let

$g_{dp_j} = \mathbb{1}[\mathrm{IoU}(dp_j, gt) \geq 0.5]$   (2.2)
$G = \{dp_j \mid g_{dp_j} = 1\}$   (2.3)

Here $\mathbb{1}$ denotes the indicator random variable.
Thus, $g_{dp_j} = 1$ means the candidate box $dp_j$ matches the ground-truth box, and G is the set of all such candidate boxes. Now, denoting the focal loss [143] with default parameters ($\alpha = 0.25, \gamma = 2$) by $L_F$ and the predicted score for the box $dp_j$ by $p_{dp_j}$, we get

$L_{pred} = \frac{1}{|G|} \sum_{j=1}^{|DP|} L_F(p_{dp_j}, g_{dp_j})$   (2.4)

Similarly, denoting the Smooth-L1 loss by $L_S$, the predicted regression parameters by $r_{dp_j}$ and the ground-truth regression parameters by $gt_{dp_j}$, we get

$L_{reg} = \frac{1}{|G|} \sum_{j=1}^{|DP|} g_{dp_j} L_S(r_{dp_j}, gt_{dp_j})$   (2.5)

The final loss is calculated as $L = L_{pred} + \lambda L_{reg}$. Here $\lambda$ is a hyper-parameter (we set $\lambda = 1$).

Training: We match the candidate boxes (anchors) to each feature map generated by the feature encoder. We classify each candidate box as foreground or background using a prediction loss ($L_{pred}$) and regress it to get a tighter box ($L_{reg}$). Foreground means that the candidate box shares IoU ≥ 0.5 with the ground truth box. For the regression loss we only consider the foreground candidate boxes.

Testing: At test time, we choose the candidate box with the highest score and use its regression parameters to obtain the required bounding box.

2.4 Experiments

This section describes the dataset construction methods, followed by experiments and visualization.

2.4.1 Dataset Construction

We sub-sample Flickr30k [184] and Visual Genome [117] to create datasets for the cases described in Section 2.3.1. Flickr30k Entities contains 5 sentences per image, with every sentence containing on average 3.6 queries, and has bounding box information for the referred object and its category (e.g. “people”, “animal”). Visual Genome (VG) has a scene graph for every image. The objects in the scene graph are annotated with bounding boxes, region descriptions and a synset (obtained from [167]).

We briefly describe the steps taken to create the ZSG datasets (more details can be found in Section 2.4.2). We follow the notation described in Table 2.1.

Case 0 is sampled from Flickr30k Entities [184]. We need to ensure that Q ∉ W. For this, we first obtain lemmatized representations of the query words. As the query phrases are noun phrases of a complete annotated sentence, the query word Q referring to the noun is almost always the last word of the query phrase P, and we take it to be so. We do a 70:30 split of the extracted words to obtain “included” and “excluded” word lists respectively. We then create a training set from the included list and validation and test sets from the excluded list, removing images that have overlap between the train, validation or test lists. We call the resulting split Flickr-Split-0.

Case 1 is also sampled from Flickr30k Entities [184], but this time we also use the predefined entity information. We need the referred object A to belong to a category which is not in C. Flickr30k has 7 common object categories (e.g. “people”, “animals”) and one category called “other” for objects which do not belong to the seven categories. We take images with “other” objects and split them evenly to create validation and test sets. The remaining images comprise the train set; we remove any box annotations that belong to the “other” category to create Flickr-Split-1.

Case 2 and Case 3 are sampled from Visual Genome [117]. In addition to region phrases, Visual Genome also provides entity names mapped to synsets in WordNet [167]. We count all the objects present in the dataset, choose the top I (= 1000) objects and get their word embeddings, skipping those without an embedding (we use GloVe [182] trained on the Common Crawl corpus).
We apply K-Means clustering (K = 20) to cluster similar words. We sort the words in each cluster k by their frequency in the dataset and take the top half to be in the seen-objects set ($S_k$) and the bottom half to be in the unseen-objects set ($U_k$). If an image contains an object $o_i$ such that $o_i \in U_k$ and another object $o_j \in S_k$ then it is an example of Case 3. If no such $o_j$ exists then it is Case 2. Finally, we take the union of images of the two cases to constitute the test set. We call the resulting splits VG-Split-2 and VG-Split-3. This design ensures that for both Cases 2 and 3 the referred object A (in the test set) has a set of objects B (in the training set) which are in the same semantic cluster.

We consider the remaining images to be candidates for the training set and include them if they contain at least one object $o_i$ in S (where $S = \cup_k S_k$), and remove the annotations for any object $o_j \in U$ (where $U = \cup_k U_k$). This ensures that the training set contains objects in S and does not contain any objects in U. However, such a training set turns out to be extremely imbalanced with respect to the clusters, as clusters containing common entities such as “person” are much more prevalent than clusters containing “cakes”. We balance the training set following a simple threshold-based sampling strategy (details in Section 2.4.2) which results in most clusters (except 2) having a similar number of query phrases. We follow the same strategy to create balanced test splits of VG-Split-2 and VG-Split-3.

2.4.2 Detailed Dataset Construction

We re-use the notation introduced in Table 2.1. We use Flickr30k for Cases 0, 1 and Visual Genome for Cases 2, 3 (reasons detailed in Section 2.4.2.4).

2.4.2.1 Case 0: Q ∉ W

This is sampled from Flickr30k Entities [184]. In Flickr30k Entities each image has 5 associated sentences. The noun phrases (query phrases) in each sentence are annotated with bounding box information. Note that query phrases in different sentences could refer to the same bounding box. Finally, each bounding box has an associated “entity”, which we exploit in Case 1.

For Case 0, we consider the last word in the query phrase and use the lemmatized representation obtained from spaCy [83]. This means that words like “car” and “cars” are considered the same. However, this doesn’t consider synonyms, so “automobile” and “cars” are considered different.

We sort the lemmatized words in descending order of frequency and consider the top I = 1000 words to be always seen. This is reasonable for words like “woman”, “sky” etc. Of the remaining words we do a 70:30 split and consider the first part to be in the include (seen) list (S) and the rest to be in the exclude (unseen) list (U). Note that even though S and U are disjoint, they may still contain a few similar words. The resulting include list contains 7k words and the exclude list contains 3k words.

For the test set we use only those images whose annotations have query word Q ∈ U. For the training set we consider the remaining images and remove annotations which have query word Q ∈ U. We also ensure that there is no overlap between the train and test images. The resulting split is called Flickr-Split-0. The main motivation behind Case 0 (Novel Words) is to see how well our model can perform without explicitly having seen the word during training.

2.4.2.2 Case 1: A ∉ C

This is also sampled from Flickr30k Entities [184]. We use the entity information of each noun phrase (query phrase) provided in the dataset.
The entities provided are “people”, “clothing”, “bodyparts”, “animals”, “vehicles”, “instruments”, “scene” and “other”. “Other” is used to denote those phrases which cannot be categorized into one of the remaining entities. We extract all the images with at least one phrase belonging to the “other” category. Of these, we randomly sample 50% and use them for testing. Of the remaining images, we remove the annotations with the “other” category and use them for training. The main motivation behind Case 1 is to see how well the model generalizes to novel object categories.

2.4.2.3 Case 2, 3: ∃B objects semantically close to A

The two cases share the same training images but different test images. We sample the images and queries from the Visual Genome dataset [117]. The dataset creation process has three major components: (i) cleaning the annotations to make them consistent, (ii) clustering the objects and creating the train/test splits to satisfy the dataset properties of Cases 2, 3, and (iii) balancing the resulting splits.

Cleaning Visual Genome Annotations: In Visual Genome each image has an average of 200 phrases. A phrase refers to a single object but may contain multiple objects in it. Consider the phrase “man holding a pizza”; it is not directly specified whether the referred object is a “man” or a “pizza”, but there will be a bounding box in the image corresponding to the referred object; let us call it the phrase BB. We need to infer the synset for this phrase BB. In addition, for each image, there are also annotated bounding boxes for each object type that appears in any of the phrases; in our example, there would be annotations for “man”, “pizza” and other objects that may appear in other phrases. To identify the synset for a phrase BB, we find the object bounding box that it has the maximum IoU with and use the object label associated with that bounding box. Another difficulty is that if the same object instance is referred to in different phrases, it will have a different phrase BB associated with it. For consistency, we choose one and apply it to all phrases. In our implementation, we apply a non-maxima suppression algorithm; even though there are no scores associated with the boxes, the algorithm selects one among highly overlapping alternatives. This step provides us with a consistent set of annotations. Even though the resulting annotations are consistent, they are still spatially imprecise. For this reason, we recommend measuring detection accuracy with an IoU threshold of 0.3 instead of the more common value of 0.5.

Figure 2.4 (panels: (a) Case 0 training set, (b) Case 0 validation set, (c) Case 0 test set, (d) Case 2, 3 unbalanced training set, (e) Case 2 unbalanced test set, (f) Case 3 unbalanced test set, (g) Case 2, 3 balanced training set, (h) Case 2 balanced test set, (i) Case 3 balanced test set): Category-wise distribution of the various unseen splits. First row: training, validation and test set splits for Case 0; second row: unbalanced training and test sets for Case 2 and Case 3; third row: balanced training and test sets for Case 2 and Case 3. In a row, the colors represent the same entities or the same clusters.

Clustering Objects: Once we have a clean and consistent set of annotations, we sort all the objects (nearly 5k objects) by their number of appearances in the images. However, the objects at the tail end of the distribution are very infrequent, so we consider only the top 1k objects. A few of these don’t have a corresponding word embedding (not available in spaCy [83]) so we discard them.
This results in a total of 902 objects. Next, we cluster the GloVe [182] word embeddings of the objects using K-Means clustering (with K = 20). We sort the objects in each cluster in descending order with respect to their frequency. For a particular cluster k, we consider the first half to be “seen” ($S_k$) and the other half to be “unseen” ($U_k$). This gives us a total of 445 seen objects and 457 unseen objects.

For a given cluster k we consider all the images which have at least one object $o_i \in U_k$ to be test images. If there is another object $o_j$ in the same image such that $o_j \in S_k$, we put this image-query pair into Case 3, else into Case 2. For the remaining images, we remove annotations for any object $o_i \in \cup_k U_k$, ensure there is at least one object $o_i \in \cup_k S_k$, and use these to form the training set. However, by construction the training set turns out to be imbalanced with respect to the clusters.

Balancing the Dataset: To address the above issue we use the following balancing strategy. We use the Zipf’s law approximation that freq × rank ≈ C; that is, as the rank of the cluster increases, the number of annotations for that cluster decreases in a hyperbolic way. We use this to calculate an approximate mean over the clusters. Finally, we also consider 2 × min_cluster_freq and take the max of the two. Thus, we have an approximate threshold at which we would like to sample. If for a particular cluster this threshold is more than the number of annotations in that cluster, we leave that cluster as it is; else we randomly sample n = threshold annotations. Note that balancing is only done with respect to the clusters and not with respect to the object names. Using this balancing strategy we get a balanced train set. We use 25% of it for validation. For the test sets we keep both balanced and unbalanced versions. The main motivation for Cases 2, 3 is to see how well the model generalizes to novel objects, how this depends on the semantic distance to the “seen” objects, and whether the model can disambiguate the novel objects from the “seen” objects.

2.4.2.4 Choice of Datasets

We note that Flickr30k Entities doesn’t provide synset information, which is important to disambiguate synonym cases; hence it cannot be used for Cases 2, 3. Visual Genome doesn’t contain broad categories like “vehicles”; hence it cannot be used for Case 1. For Case 0, we could use Visual Genome as well; however, we choose Flickr30k Entities due to its precise bounding boxes.

2.4.3 Dataset Distributions

We provide statistics for each dataset in Fig 2.4. For Case 0 we show the entity-wise distribution (a), (b), (c). In particular we note that the “other” category occupies a larger fraction of the validation and test sets. This is because the “other” category has a more diverse vocabulary and encompasses a larger part of the exclude vocabulary list. For Case 1, since it only has the “other” category in its validation and test sets, the entity-wise distributions are not meaningful and we don’t include them here. For Cases 2, 3 we show the distributions with respect to the clusters formed via K-Means for both the unbalanced [(d), (e), (f)] and balanced cases [(g), (h), (i)]. We don’t train on the unbalanced set but do test on the unbalanced set as well. Note that the distribution across clusters in the balanced sets is uniform, which means our balancing strategy was successful.

Dataset Caveats: (i) We note that polysemy is not taken care of, i.e. the same word can have different meanings. (ii) Neither Visual Genome nor Flickr30k is a true referring-expressions dataset, i.e.
the query phrase may not always uniquely identify an object.

2.4.4 Datasets Used

Flickr30k Entities [184] contains 30k images, each with 5 sentences, and each sentence has multiple query phrases. We use the same splits as [32, 86, 200].

ReferIt (RefClef) [101] is a subset of ImageCLEF [49] containing 20k images with 85k query phrases. We use the same split as [32, 86, 200].

Flickr-Split-0: We create an unseen split of Flickr30k based on the method outlined in Section 2.4.1. It contains 19K train images with 11K queries, 6K validation images with 9K queries and 6K test images with 9K queries.

Flickr-Split-1: This split of Flickr30k has the “other” category only in the validation and test images. It contains 19k training images with 87k query phrases, and 6k images with 26k query phrases each for validation and test.

VG-Split: We use a balanced training set (as described in Section 2.4.1) containing 40K images and 264K query phrases. We use a subset (25%) for validation. VG-Split-2 contains 17K images and 23K query phrases in the unbalanced set, and 10K images and 12K query phrases in the balanced set. VG-Split-3 contains 41K images with 68K query phrases in the unbalanced set, and 23K images and 25K query phrases in the balanced set.

2.4.5 Experimental Setup

Evaluation Metric: We use the same metric as in [32]. For each query phrase, we assume that there is only one ground truth bounding box. Given an image and a query phrase, if the IoU of our predicted bounding box and the ground truth box is more than 0.5 we mark it as correct. However, in the case of the Visual Genome splits, the annotations are not precise, so we use 0.3 as the threshold. The final accuracy is averaged over all image-query phrase pairs.

Baselines: To explicitly compare the performance of dense proposals, we create a new baseline QRG based on QRC [32] which uses GloVe embeddings instead of embeddings learned from the data. We benchmark it on Flickr30k to show there is no drop in performance compared to QRC. We further use it as a strong baseline on the unseen splits. In all cases, we use a Faster R-CNN [198] pretrained on Pascal VOC [50] and fine-tune it on the target dataset. For Flickr-Split-0 we fine-tune on all the entities; for Flickr-Split-1 we fine-tune on all entities except “other”. We use the top-100 box predictions provided by the fine-tuned network after applying non-maxima suppression, to be consistent with the implementation in [32]. For the VG-Split, we train on all the seen classes, i.e. the union of all seen objects in every cluster ($\cup_k S_k$). In this case, we don’t use non-maxima suppression and instead consider all the 300 boxes provided by the fine-tuned region-proposal network.

Implementation details: We train ZSGNet and the baseline models till the validation accuracy saturates and report our values on the test set. We found Adam [107] with learning rate 1e-4 for 20 epochs to be sufficient in most cases. For ZSGNet, to generate image feature maps at different scales, we use two variations: (i) SSD [147] with a VGG network, (ii) RetinaNet [143] with a ResNet-50 [76] network. Note that these are not pretrained on any detection dataset. Initially, we resize the image to 300 × 300 for faster training and later retrain with image size 600 × 600, which gives a consistent 2-3% improvement. We note that while image augmentations (like horizontal flipping) are crucial for object detectors, they are harmful for grounding, as the query phrases often have location information (like “person standing on the left”, “person to the right of the tree”).
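Before turning to the results, the training objective of Section 2.3.3 (Eqs. 2.2-2.5) can be summarized in code. The following is only a minimal sketch, not the released ZSGNet implementation: the tensor shapes, the single ground-truth-box assumption and the helper name grounding_loss are illustrative, and the regression targets are assumed to already follow the box encoding of [198].

```python
import torch
import torch.nn.functional as F

def grounding_loss(scores, box_deltas, iou_with_gt, gt_deltas,
                   alpha=0.25, gamma=2.0, lam=1.0, iou_thresh=0.5):
    """scores: (N,) predicted logits for N anchors; box_deltas: (N, 4) predicted
    regression parameters; iou_with_gt: (N,) IoU of each anchor with the single
    ground-truth box; gt_deltas: (N, 4) target regression parameters."""
    targets = (iou_with_gt >= iou_thresh).float()      # g_{dp_j} of Eq. 2.2
    num_pos = targets.sum().clamp(min=1.0)             # |G| of Eq. 2.3

    # Binary focal loss over every anchor, normalized by |G| (Eq. 2.4).
    p = torch.sigmoid(scores)
    pt = torch.where(targets > 0, p, 1 - p)
    alpha_t = torch.where(targets > 0,
                          torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    bce = F.binary_cross_entropy_with_logits(scores, targets, reduction="none")
    l_pred = (alpha_t * (1 - pt) ** gamma * bce).sum() / num_pos

    # Smooth-L1 regression only over foreground (matched) anchors (Eq. 2.5).
    reg = F.smooth_l1_loss(box_deltas, gt_deltas, reduction="none").sum(dim=1)
    l_reg = (targets * reg).sum() / num_pos

    return l_pred + lam * l_reg                        # L = L_pred + lambda * L_reg
```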
2.4.6 Results on Existing Grounding Datasets

Table 2.2 compares ZSGNet with prior works on Flickr30k Entities [184] and ReferIt [101]. We use “det” and “cls” to denote models using Pascal VOC [50] detection weights and ImageNet [42, 206] classification weights. Networks marked with “*” fine-tune their object detector, pretrained on Pascal VOC [50], on the fixed entities of Flickr30k [184]. However, such information is not available in the ReferIt dataset, which explains the ∼9% increase in performance of ZSGNet over the other methods. This shows that our model learns about the entities directly from the grounding dataset.

Table 2.2: Comparison of our model with other state-of-the-art methods. We denote networks which use classification weights from ImageNet [206] with “cls” and networks which use detection weights from Pascal VOC [50] with “det”. The reported numbers are all Accuracy@IoU = 0.5, or equivalently Recall@1. Models marked with “*” fine-tune their detection network on the entities in Flickr30k.
Method | Net | Flickr30k | ReferIt
SCRC [86] | VGG | 27.8 | 17.9
GroundeR (cls) [200] | VGG | 42.43 | 24.18
GroundeR (det) [200] | VGG | 48.38 | 28.5
MCB (det) [54] | VGG | 48.7 | 28.9
Li (cls) [128] | VGG | - | 40
QRC* (det) [32] | VGG | 60.21 | 44.1
CITE* (cls) [183] | VGG | 61.89 | 34.13
QRG* (det) | VGG | 60.1 | -
ZSGNet (cls) | VGG | 60.12 | 53.31
ZSGNet (cls) | Res50 | 63.39 | 58.63

For Flickr30k we also report entity-wise accuracy in Table 2.3 and compare against [32, 183]. We don’t compare to the full model in [32] since it uses additional context queries from the sentence for disambiguation. As these models use object detectors pretrained on Pascal VOC [50], they have somewhat higher performance on classes that are common to both Flickr30k and Pascal VOC (“animals”, “people” and “vehicles”). However, on classes like “clothing” and “bodyparts” our model shows much better performance; this is likely because “clothing” and “bodyparts” appear along with the “people” category, and so the other methods choose the “people” category. Such biases are not exhibited in our results as our model is category agnostic.

Table 2.3: Category-wise performance with the default split of Flickr30k Entities.
Method | Overall | people | clothing | bodyparts | animals | vehicles | instruments | scene | other
QRC - VGG (det) | 60.21 | 75.08 | 55.9 | 20.27 | 73.36 | 68.95 | 45.68 | 65.27 | 38.8
CITE - VGG (det) | 61.89 | 75.95 | 58.50 | 30.78 | 77.03 | 79.25 | 48.15 | 58.78 | 43.24
ZSGNet - VGG (cls) | 60.12 | 72.52 | 60.57 | 38.51 | 63.61 | 64.47 | 49.59 | 64.66 | 41.09
ZSGNet - Res50 (cls) | 63.39 | 73.87 | 66.18 | 45.27 | 73.79 | 71.38 | 58.54 | 66.49 | 45.53

Table 2.4: Accuracy across the various unseen splits. For Flickr-Split-0, 1 we use Accuracy with an IoU threshold of 0.5. Since the Visual Genome annotations are noisy, we report Accuracy for the VG splits with IoU thresholds of both 0.3 and 0.5. “B” and “UB” denote the balanced and unbalanced sets.
Method | Net | Flickr-Split-0 (0.5) | Flickr-Split-1 (0.5) | VG-2B (0.3 / 0.5) | VG-2UB (0.3 / 0.5) | VG-3B (0.3 / 0.5) | VG-3UB (0.3 / 0.5)
QRG | VGG | 35.62 | 24.42 | 13.17 / 7.64 | 12.39 / 7.15 | 14.21 / 8.35 | 13.03 / 7.52
ZSGNet | VGG | 39.32 | 29.35 | 17.09 / 11.02 | 16.48 / 10.55 | 17.63 / 11.42 | 17.35 / 10.97
ZSGNet | Res50 | 43.02 | 31.23 | 19.95 / 12.90 | 19.12 / 12.37 | 20.77 / 13.77 | 19.72 / 12.82
Table 2.5: Accuracy of various models on the balanced VG-Splits-2, 3 with respect to the semantic distance of the referred object (A) to the closest object seen at train time. VG-2B and VG-3B refer to the balanced test sets for Cases 2, 3. #I-P denotes the number of image-phrase pairs in the given semantic-distance range.
Semantic distance | 3-4 | 4-5 | 5-6 | 6-7 | 7-8
VG-2B #I-P | 310 | 1050 | 3543 | 5321 | 1985
VG-2B QRG (VGG) | 25.16 | 16.67 | 15.16 | 10.96 | 12.54
VG-2B ZSGNet (VGG) | 28.71 | 21.52 | 19.02 | 15.37 | 14.76
VG-2B ZSGNet (Res50) | 31.94 | 25.14 | 21.99 | 17.89 | 17.98
VG-3B #I-P | 974 | 3199 | 7740 | 9873 | 3765
VG-3B QRG (VGG) | 23.1 | 20.13 | 14.73 | 12.19 | 11.24
VG-3B ZSGNet (VGG) | 23.82 | 25.73 | 17.16 | 16 | 14.56
VG-3B ZSGNet (Res50) | 29.57 | 27.85 | 21.3 | 18.77 | 16.71

2.4.7 Results on ZSG Datasets

Table 2.4 shows the performance of our ZSGNet model compared to QRG on the four unseen splits described in Sections 2.4.1 and 2.4.4. Across all splits, ZSGNet shows 4-8% higher performance than QRG even though the latter has seen more data (its object detector is pretrained on Pascal VOC [50]). Next, we observe that the accuracies obtained on Flickr-Split-0, 1 are higher than on the VG splits, likely due to the larger variation of the referred objects in Visual Genome. Finally, the accuracy remains the same across the balanced and unbalanced sets, indicating the model performs well across all clusters, as our training set is balanced.

Table 2.6: Ablation study. BM = Base Model; Softmax means we classify only one candidate box as foreground; BCE = Binary Cross Entropy, i.e., we classify each candidate box as foreground or background; FL = Focal Loss; Img-Resize: use images of dimension 600 × 600.
Model | Accuracy on RefClef
BM + Softmax | 48.54
BM + BCE | 55.20
BM + FL | 57.13
BM + FL + Img-Resize | 61.75

We also study the relationship between accuracy and the distance of the referred object at test time (A) from the training set. We re-use the clusters obtained while creating the VG-Split and consider the closest seen object which lies in the same cluster as the referred object. For every unseen object $o_i$ in a particular cluster k ($o_i \in U_k$) we find the closest seen object in the same cluster, $o_j = \mathrm{argmin}_{o_j \in S_k} \mathrm{dist}(o_i, o_j)$. For the dist calculation, we use the GloVe embeddings [182] corresponding to the objects and take the L2-norm of their difference. We group the distances into 5 intervals of unit length and report, in Table 2.5, the accuracy over the subset of examples whose distance from the training set lies in each interval. We note a few examples from various intervals, using the notation (A, b), i.e. a tuple of the referred object and the closest seen object in the same cluster. Examples for the interval 3-4 are {(bouquet, flower), (cupcake, cake), (tablecloth, curtain)} and for 7-8 are {(printer, paper), (notebook, pen), (tattoo, poster)}. As expected, the accuracy declines with the semantic distance, but smoothly, i.e. there is no sudden drop in performance.

2.4.8 Ablation Study

We show the performance of our model with different loss functions, using the base model of ZSGNet on the validation set of ReferIt [101], in Table 2.6. Note that using the softmax loss by itself places us higher than the previous methods. Further, using Binary Cross Entropy loss and Focal Loss [143] gives a significant (7%) performance boost, which is expected in a single-shot framework. Finally, image resizing gives another 4% increase.
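The semantic-distance analysis behind Table 2.5 is straightforward to reproduce. Below is a minimal sketch, assuming embed maps object names to their GloVe vectors and cluster_seen maps a cluster id to its list of seen objects; both names and the example format are illustrative, not taken from the released code.

```python
import numpy as np

def closest_seen_distance(unseen_obj, seen_objs, embed):
    """L2 distance from an unseen object to its closest seen object in the
    same cluster, using pre-trained word vectors (embed: name -> vector)."""
    return min(np.linalg.norm(embed[unseen_obj] - embed[s]) for s in seen_objs)

def accuracy_by_distance(examples, embed, cluster_seen, lo=3, hi=8):
    """examples: iterable of (referred_object, cluster_id, is_correct) tuples.
    Groups examples into unit-length distance intervals [3,4), ..., [7,8)
    and reports (accuracy, count) per interval, as in Table 2.5."""
    buckets = {d: [] for d in range(lo, hi)}
    for obj, cid, correct in examples:
        dist = closest_seen_distance(obj, cluster_seen[cid], embed)
        d = int(dist)
        if lo <= d < hi:
            buckets[d].append(float(correct))
    return {(d, d + 1): (float(np.mean(v)) if v else None, len(v))
            for d, v in buckets.items()}
```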
Table 2.7: Proposal recall rates using the top-300 proposals at IoU = 0.5 (0.3 for VG), calculated on the test sets. FR: Faster R-CNN; no f/t: pretrained on Pascal VOC; f/t: fine-tuned on the target set. For ReferIt we use the model fine-tuned on Flickr30k to be consistent with QRC.
Model | Flickr30k | ReferIt | Flickr case0 | Flickr case1 | VG 2B | VG 3B
FR (no f/t) | 73.4 | 25.4 | 64.95 | 62.9 | 15.87 | 13.92
FR (f/t) | 90.85 | 58.35 | 85.18 | 74.85 | 26.17 | 25.07

Table 2.8: Ablation study: Language Blind (LB) and Image Blind (IB) settings using images of resolution 300 × 300. The metric reported is Accuracy@IoU = 0.5 (0.3 for VG).
Model | Flickr30k | ReferIt | Flickr case0 | Flickr case1 | VG 2B | VG 3B
LB | 0.008 | 0.0042 | 0.009 | 0.0024 | 0.0084 | 0.0093
IB | 28.07 | 24.75 | 24.42 | 17.15 | 9.5 | 9.27

2.4.9 Proposals from Pre-Trained Detector(s)

A crucial difference between ZSGNet and prior work is the removal of proposals obtained from a pretrained network. To explicitly analyze the errors caused by missing proposals, we calculate the proposal recall.

Proposal Recall: We measure the recall rates (@300) of the region proposal network (RPN) from a Faster R-CNN [198] pretrained on Pascal VOC [50] and fine-tuned on the target dataset in Table 2.7. For ReferIt [101] we use the model fine-tuned on Flickr30k Entities [275] to be consistent with QRC [32]. We note that (i) proposal recall significantly improves when we fine-tune on the target dataset, (ii) the performance of QRG on Flickr30k, case0 and case1 follows the same trend as the proposal recall, and (iii) proposal recall is significantly smaller on Visual Genome [117] due to (a) the large number of classes in Visual Genome and (b) the “unseen” classes being treated as negatives during training. These recall scores motivate the use of dense proposals for zero-shot grounding.

Figure 2.5 (queries shown include “a group of older men”, “a red beanie cap”, “rightmost animal”, “nightstand between beds”, “microphones”, “a cigar”, “a two-seat kayak”, “a handstand”, “couch”, “countertop”, “a rocky cliff (hill)”, “large boulders (rock)”, “stairway (wall)”, “shorts (person)”, “planter (plant)”): A few grounding visualizations. In all cases, red denotes the ground truth box; green is the box predicted by ZSGNet. Row 1: Flickr30k, ReferIt; Row 2: Flickr-Split-0, 1; Row 3: VG-Split-2, 3. In Row 3, the query word Q is emphasized and the closest seen object is provided in parentheses. The last column shows incorrect predictions.

2.4.10 Image Blind and Language Blind Ablations

Model Ablations: We ablate our model in two settings: language blind (LB), where the model sees only the image and not the query, and image blind (IB), where the model considers the query but not the image. We provide the results obtained after retraining the model in Table 2.8. In the LB case, our model sees multiple correct solutions for the same image and therefore gives a random box output, leading to a very low accuracy across all datasets. In the IB case, our model learns to always predict a box in the center. We note that the referred object often lies in the center of the image for Flickr30k and ReferIt. This is because Flickr30k Entities contains queries derived from captions, which refer to the central part of the image, and ReferIt is a two-player game with a high chance of referring to the central object, leading to a relatively high accuracy of 25-30%. However, this is substantially lower for Visual Genome (9-10%), which has denser object annotations.

2.4.11 Visualization

To qualitatively analyze our model we show a few visualizations in Fig 2.5. The first row shows grounding results on Flickr30k (first, second column) and ReferIt (third, fourth column).
Our model learns about attributes (“red”), locations (“leftmost”) and entities (“cap”, “nightstand”) and predicts very tight bounding boxes. In the last column our model incorrectly predicts only one “microphone”. The second row shows Flickr-Split-0 (first, second column) and Flickr-Split-1 (second, third column) predictions. The query phrases “cigar” and “kayak” are never encountered in the training set, though close synonyms like “cigarettes” and “kayakers” are; i.e., our model generalizes to words it has not explicitly seen before. This generalization can be attributed to the pre-trained GloVe [182] embeddings. On Flickr-Split-1 the model predicts a good bounding box even though the referred object lies in a different category. However, when there are too many objects in the image the model gets confused (last column). The third row shows predictions on VG-Split-2 (first, second column) and VG-Split-3 (third, fourth column). Additionally, we italicize the query word Q which refers to the object A in the image and mention the closest object encountered during training in parentheses. On VG-Split-2 our model effectively utilizes word embedding knowledge and performs best when the closest object seen during training is visually similar to the referred object. On VG-Split-3 our model additionally needs to disambiguate between a seen object and the referred object, and performs well when they are visually distinct, like “shorts” and “person”. However, when the two are visually similar, as in the case of “planter” and “plants”, our model incorrectly grounds the seen object.

2.5 Conclusion

In this work, we introduce the task of zero-shot grounding (ZSG) which aims at localizing novel objects from a query phrase. We outline four cases of zero-shot grounding to perform finer analysis. We address the limitations posed by previous systems and propose a simple yet effective architecture, ZSGNet. Finally, we create new datasets by sub-sampling existing datasets to evaluate each of the four grounding cases. We verify that our proposed model ZSGNet performs significantly better than existing baselines in the zero-shot setting.

Chapter 3
Video Object Grounding

3.1 Introduction

Grounding objects in images [279, 32, 277] and videos [103, 306, 35] from natural language queries is a fundamental task at the intersection of Vision and Language. It is a building block for downstream grounded vision+language tasks such as Grounded-VQA [310, 297, 123, 125, 58], Grounded-Captioning [156, 154, 158, 305] and Grounded Navigation [85]. In this work, we address the task of Video Object Grounding (VOG): given a video and its natural language description, we aim to localize each referred object. Different from prior VOG methods which find objects from query mentions [306] or distinguish spatio-temporal tubes from a referring expression [35], we formulate VOG as localizing only the specific referred objects in the query. Prior work has focused on attending to each object in isolation; our formulation additionally requires incorporating object-object relations in both time and space. Figure 3.1 illustrates the key differences.

Despite the importance of associating natural language descriptions with objects in videos, VOG has remained relatively unexplored due to two practical requirements: (i) a large-scale video dataset with object-level annotations, and (ii) the videos should contain multiple instances of the same object category so that making a distinction among them becomes necessary.
Recently, [305] released the ActivityNet-Entities dataset, which contains bounding box annotations relating the noun phrases of the video descriptions [116] to the corresponding object instances in ActivityNet [23] videos. Despite its scale, a majority of the videos in ActivityNet contain single instances of various objects. For instance, in Figure 3.1 “ball” can be localized simply using an object detection system such as Faster R-CNN [198] without relating “ball” to the “man” or the “kids”.

Figure 3.1 (query: “The man passes a ball to a group of kids”, with roles Arg0-Verb-Arg1-Arg2): Illustration of different formulations of VOG when the same query and video frames are used. (a) [306] localizes individual query words (“man”, “ball”, “kids”) independently and doesn’t distinguish between different instances of the same object. (b) [35] makes this distinction by localizing the spatio-temporal tube uniquely identified by the query (“man passing the ball”). Ours (c) involves localizing only those objects which are referenced in the query (“man”, “ball”, “group of kids”) and requires additional disambiguation using object relations.

We mitigate this absence of multiple object instances in two steps. First, we sample contrasting examples from the dataset; these are examples that are similar to, but not exactly the same as, what is described by the language query. To sample contrasting examples, we obtain semantic roles (SRLs) by running a state-of-the-art Semantic Role Labeling (SRL) system [224] on the language descriptions. SRLs answer the high-level question of “who (Arg0) did what (Verb) to whom (Arg1)” [231]. We sample videos with descriptions that have the same semantic-role structure as the queried description, but where a role is realized by a different noun or verb. In the next step, we need to present the contrasting videos to a model. If the contrasting samples are processed independently, a model could easily “cheat” and find the associated video by simply adding the object detection and action recognition scores as per the query. To prevent this, we propose novel spatial and temporal concatenation methods to merge contrasting samples into one video. With contrasting objects and their relations in the same video, the model is forced to encode object relations in order to ground the referred objects (details in Section 3.3.1).

Clearly, encoding object relations is of primary importance for VOG. Recently, [62] and [305] showed promising results using self-attention [246] to encode object relations. However, there are two issues in directly adapting self-attention on objects for VOG. First, such object relations are computed independently of the language, creating ambiguities when two objects have multiple relations. For instance, in Figure 3.1 “The man is playing with a group of kids” is an accurate description for the same video, but the queried relation between “the man” and the “kids” is different. Second, the transformer module for self-attention [246] expects positional encodings for its input, but absolute positions are not meaningful in a video.
Essentially, RPE biases the model to weigh related objects based on their proximity (details on model architecture in Section 3.3.2). To evaluate our models, we contribute ActivityNet-SRL which adds semantic roles to the descriptions [116] and aligns with the noun-phrase annotations in [305]. We further show by pre-computing lemmatized noun-phrases, contrastive sampling process can be used in training (details on dataset construction in Section 3.4.1,3.4.2). Our contributions are three-fold: (i) we explore VOG and propose contrastive sampling with temporal and spatial concatenation to allow learning object relations (ii) we design VOGNet which extends selfattention to encode language-dependent object relations and relative position encodings (iii) we contribute ActivityNet-SRL as a benchmark for VOG. Our code and dataset are publicly available∗ . 3.2 Related Work Grounding objects in images is a heavily studied topic under referring expression [279, 277, 162, 100] and phrase localization [32, 200, 184, 183, 210]. In contrast, grounding objects in videos has garnered less interest. Apart from [306, 35], [103] enforces temporal consistency for video object segmentation and requires the target to be in each frame and [87] use structured representations in videos and language for co-reference resolution. Different from them, our proposed formulation of VOG elevates the role of object relations and supports supervised training due to use of a larger dataset. Object relations is also fairly well-studied in images under scene-graph generation [269, 135, 117, 170] and human-object interaction [29, 27, 204, 55, 63, 311] and referring relations [115]. However, a majority of ∗ https://github.com/TheShadow29/vognet-pytorch 37 the relations are spatial (“to-the-left-of”, “holding”) with considerable biases caused due to co-occurrence [291]. On the video side, it has been explored for spatio-temporal detection [236, 15, 62]. In particular, [62] showed self-attention using transformers [246] to be more effective than relation-networks [215] based detectors [236]. For VOG, relation networks would not be effective due to high memory requirements and thus we only explore self-attention mechanism. Different from [62], we use bottom-up features [6] which don’t maintain any order. As an alternative, we employ relative position encoding. Video relation detection [218, 218, 242] is closely related to VOG where relations between two objects need to detected across video frames. However, the metrics used (recall@50/100) are difficult to interpret. Moreover, densely annotating the relations is expensive and results in less diverse relations. In contrast, ours uses sparsely annotated frames and leverages off-the-shelf SRL systems. Visual Semantic Role Labeling in images has focused on situation recognition [274, 272, 229]. To annotate the images, [274] employed FrameNet[53] annotations and [229] shows using semantic parsers on image captions significantly reduces annotation cost. We instead PropBank annotations [175] which is verb-oriented and thus more suited to video descriptions. Finally, our use of semantic roles is guided by contrastive sampling and not assigning semantic roles to visual entities. Contrastive Training via max-margin loss has been commonly used in vision+language tasks [277, 97, 306, 295]. Here, we don’t use contrastive losses, instead, the concatenation of the videos directly informs us which objects are related. As such, we train using binary cross-entropy. 
3.3 Method

We describe our sampling and concatenation process, which enables learning object relations for VOG (Section 3.3.1), followed by details of VOGNet (Section 3.3.2) and the relative position encoding scheme (Section 3.3.3).

Figure 3.2: (a) Contrastive sampling based on semantic roles, with queries Q1: “man petting dog”, Q2: “woman petting dog”, Q3: “man picking up dog”, Q4: “man petting cat”. Q1 contains a single agent (“man”) and a single patient (“dog”). We use the SRL structure Arg0-Verb-Arg1 but replace one queried object (Q2, Q4) or action (Q3). (b) Method 1: temporal concatenation, where we resize each video to the same width and height. (c) Method 2: spatial concatenation along the width, where we resize the height and sample a fixed number of frames across the videos. (d) Spatial concatenation along the height is not reasonable, as videos have a top-down order (“ocean” is always below “sky”).

Agent     Verb      Patient    Modifier      Instrument
Person    washes    cups       in a sink     with water
Arg0      Verb      Arg1       ArgM-Loc      Arg2

Table 3.1: An illustration of semantic-role assignment to a description. Here, the actor/agent (person) performs an action/verb (wash) using some instrument (water) at some location (sink).

3.3.1 Contrastive Sampling

Most large-scale video datasets [99, 23, 2] are curated from Internet sources like YouTube, which rarely contain multiple instances of the same object in the same video. VOG on such datasets can be trivially solved using object detection. To mitigate this issue, we propose a two-step contrastive sampling method. First, we assign semantic role labels (SRLs) to every language description of the videos (see Table 3.1) and sample other descriptions by replacing each role in a one-hot style (Figure 3.2(a)). In the second step, we aggregate our samples. One simple method is to present each video separately, similar to standard multiple-choice in Question-Answering tasks [290, 288, 123]; we call this the “separate” (SEP) strategy (i.e. the videos are viewed separately). However, SEP doesn’t force learning object relations, as one could independently add the scores for each referred object. For instance, in Figure 3.2-(a) we can score “man”, “petting”, “dog” individually and choose the objects in the video with the highest aggregate score, essentially discarding object relations. Alternatively, we generate new samples by concatenation along the time axis (TEMP) or the width axis (SPAT). For TEMP, we resize the sampled videos to have the same width and height (Figure 3.2(b)). For SPAT, we resize the height dimension and uniformly sample F frames for each video (Figure 3.2(c)). Generally, it is not reasonable to concatenate along the height dimension as most real-world images obey up-down order (“sky” is on the top while “ocean” is below) but not left-to-right order (Figure 3.2(d)). Such concatenated videos, by construction, have multiple instances of the same object category. To associate an instance described in the language query to its bounding box in the video, a model would need to disambiguate among similar object instances by exploiting their relations to the other objects. E.g., in Figure 3.2(c) “man” or “dog” cannot be uniquely identified without considering other objects.
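To make the two concatenation strategies concrete, the following is a minimal sketch (PyTorch-style Python; the function and tensor names are illustrative assumptions, not the released implementation) of how k sampled clips could be merged once they have been resized and sub-sampled to F frames each. In practice the same effect could equivalently be obtained at the proposal level, by offsetting frame indices (TEMP) or box x-coordinates (SPAT), instead of concatenating raw pixels.

    import torch

    def temporal_concat(clips):
        # clips: list of k tensors, each of shape (F, 3, H, W),
        # already resized to a common height H and width W.
        # TEMP: place the clips one after another along the time axis,
        # giving a single video with k*F frames.
        return torch.cat(clips, dim=0)          # (k*F, 3, H, W)

    def spatial_concat(clips):
        # clips: list of k tensors, each of shape (F, 3, H, W),
        # resized to a common height H and uniformly sub-sampled to F frames.
        # SPAT: place the clips side by side along the width axis,
        # giving F frames of size H x (k*W).  Height-wise concatenation is
        # avoided because scenes obey an up-down order (sky above ocean).
        return torch.cat(clips, dim=3)          # (F, 3, H, k*W)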
Caveats: (i) in TEMP, one could use an activity proposal network like [141, 57] and bypass the problem altogether; (ii) in SPAT, uniformly sampling F frames from two different videos would result in different parts of the image moving faster or slower and could partially affect our results.

3.3.2 Framework

Notation: We are given a video V sampled with F frames and a language description L with k roles. In general, not all k roles can be visually grounded in the video; however, this information is not known a priori. Given P proposals for each frame using an object detector, we denote O = {p_{i,j}} (the i-th proposal in the j-th frame) as the set of proposals in the video. In VOG we learn the mapping H : (V, O, L) → [{p*_l}_{j=1}^{F}]_{l=1}^{k} where p* ∈ O. That is, for each of the k roles, we output a proposal p* in each frame. We allow p* = φ if the object is not visible in a particular frame, or the object cannot be localized.

Figure 3.3: An overview of VOGNet. It takes a video-query pair as an input. A visual encoder extracts object features for each frame and concatenates them with segment features (rgb+flow). A language encoder encodes the whole query with a BiLSTM [217, 81] and then maintains a separate encoding for each phrase in the query (Eq. 3.1). A Transformer [246] is first applied to the visual features to model object relations. These self-attended visual features are fused with the language features. Finally, a separate transformer models the interaction among the fused multi-modal features, followed by a 2-layer MLP. VOGNet is trained with a Binary Cross Entropy (BCE) loss.

We build a VOGNet framework that contains a Language Module to encode the query descriptions at the phrase level, a Visual Module to encode the object and frame-level features in the video, and a Multi-Modal Relation Module to encode both language-independent and language-dependent object relations. Figure 3.3 gives an overview of VOGNet.

Language Module first encodes the query q = {w_i}_{i=1}^{n} as n hidden vectors [h_1, . . . , h_n] with a BiLSTM [81, 217]. The j-th Semantic Role Label (SRL) in query q, Arg_j, spanning a set of words S_j (e.g., in Figure 3.3, Arg0 includes the words S_0 = {“The”, “man”}), is encoded as

    \tilde{q}_j = M_q\big(G(\{\delta(w_i \in S_j) \cdot h_i\}_{i=1}^{n})\big)        (3.1)

where δ(·) is an indicator function and G(·) is an aggregation function. In VOGNet, we set G as the concatenation of the first word and the last word for each SRL, followed by M_q, which denotes a multi-layer perceptron (MLP).

Visual Feature Extraction: An off-the-shelf object detector [198] returns P proposals for each frame. Let p_{i,j} be the i-th proposal in the j-th frame and v_{i,j} ∈ R^{d_v} be its ROI-pooled feature. Similarly, an action classifier returns temporal features containing image-level and flow-level features of the video. In general, the number of frames considered by the action classifier could be greater than F. We consider the local segment feature corresponding to the F frames to get s_j ∈ R^{d_s}, and append it to each proposal feature in the j-th frame. The final visual feature is \hat{v}_{i,j} = M_v(v_{i,j} || s_j), where M_v is an MLP.
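Returning to the language module, the sketch below (a hypothetical module with assumed tensor names, operating on a single unpadded query for brevity; not the exact released implementation) encodes each SRL phrase by concatenating the BiLSTM hidden states of its first and last words and passing them through an MLP, mirroring the choice of G and M_q in Eq. 3.1.

    import torch
    import torch.nn as nn

    class PhraseEncoder(nn.Module):
        # Encodes each semantic-role phrase of a query (Eq. 3.1):
        # G = concatenation of the first- and last-word hidden states,
        # followed by an MLP M_q.
        def __init__(self, vocab_size, emb_dim=512, hid_dim=1024, out_dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.bilstm = nn.LSTM(emb_dim, hid_dim // 2, num_layers=2,
                                  bidirectional=True, batch_first=True)
            self.m_q = nn.Sequential(nn.Linear(2 * hid_dim, out_dim), nn.ReLU())

        def forward(self, word_ids, srl_spans):
            # word_ids:  (n,) token ids of the query
            # srl_spans: list of (start, end) word indices (inclusive), one per SRL
            h, _ = self.bilstm(self.emb(word_ids).unsqueeze(0))  # (1, n, hid_dim)
            h = h.squeeze(0)                                     # (n, hid_dim)
            phrase_feats = [self.m_q(torch.cat([h[s], h[e]], dim=-1))
                            for (s, e) in srl_spans]
            return torch.stack(phrase_feats)                     # (k, out_dim)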
Object Transformer is a transformer [246] that applies self-attention over the proposal features \hat{v}_{i,j}, i.e. self-attention is applied to P×F proposals. We denote the self-attended visual features as \hat{v}^{sa}_{i,j}. A similar module is used in [305], but there are two differences: first, \hat{v}_{i,j} contains additional segment features; second, absolute positions are replaced with relative position encoding (Section 3.3.3).

Multi-Modal Transformer: We concatenate the self-attended visual features \hat{v}^{sa} and the language features \tilde{q} to get multi-modal features m, where m[l, i, j] = [\hat{v}^{sa}_{i,j} || \tilde{q}_l]. We apply self-attention with relative position encoding to get self-attended multi-modal features m^{sa}. However, due to hardware limitations, it is extremely time consuming to perform self-attention over all the proposals, especially when P×F×k is large. Thus, we perform this self-attention per frame, i.e. self-attention is applied to P×k features F times. Subsequently, m^{sa} is passed through a 2-layer MLP to obtain a prediction \tilde{m}^{sa} for each proposal-role pair.

Loss Function: Let L_g be the set of groundable roles, i.e. roles that have a corresponding bounding box. A proposal-role pair is considered positive if the proposal has IoU ≥ 0.5 with the ground-truth box and negative otherwise. We train using a Binary Cross Entropy (BCE) loss and average over the phrases with a bounding box:

    L_{pred} = \frac{1}{|L_g|} \sum_{l_g \in L_g} \mathrm{BCE}\big(\tilde{m}^{sa}[l_g, i, j],\, \mathrm{gt}[l_g, i, j]\big)        (3.2)

Minor changes for SEP: When training and evaluating models using the SEP strategy, we have access to the individual videos. Here, we use the temporal features to learn a Verb score which can be used to disambiguate between videos with the same objects but different verbs. In general, this didn’t translate to other strategies and hence it is not included in our framework.

3.3.3 Relative Position Encoding

Relative Position Encoding (RPE) uses the relative distances between two proposals as an additional cue for attention. We denote the normalized position of the proposal p_{a,b}, whose 5d coordinate is [x_tl, y_tl, x_br, y_br, j], by pos_{a,b} = [x_tl/W, y_tl/H, x_br/W, y_br/H, j/F]. We encode the relative distance between two proposals A and B as Δ_{A,B} = M_p(pos_A − pos_B), where M_p is an MLP. Let the Transformer contain n_l layers and n_h heads; then Δ_{A,B} ∈ R^{n_h}. When self-attention is applied to a batch,

    A(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V        (3.3)

We change this to

    A(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T + \Delta[h]}{\sqrt{d_k}}\right) V        (3.4)

Note that Δ[h] has the same dimensions as QK^T, leading to a simple matrix addition. That is, our relative position encoding (RPE) encodes the distance between each proposal pair, and this encoding is different for each head. Intuitively, RPE biases the self-attention to weigh the contribution of other objects relative to their proximity. Our solution is based on prior work [221] but differs in two key aspects: (i) the relative positions are not taken from embedding layers but rather modeled by an MLP that encodes the position difference; (ii) our relative encoding is different for different heads. Another way to extend [221] to the visual setting would be to categorize the distances into multiple bins and learn an encoding for each bin. We leave this study for future work.

Caveat: While we resolve the issue of adding RPE to the transformer network efficiently, the computation of Δ_{i,j} remains expensive as it requires O(n^2) difference computations and is a bottleneck of our proposed solution.
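The following is a minimal sketch of how the per-head bias of Eq. 3.4 could be added to standard scaled dot-product attention (PyTorch-style; the module and tensor names are illustrative assumptions, not the released code). The O(T^2) pairwise position difference in the middle is exactly the bottleneck noted above.

    import torch
    import torch.nn as nn

    class RelPosAttention(nn.Module):
        # Scaled dot-product attention with a relative position bias
        # (Eq. 3.4): Delta[h] is added to Q K^T before the softmax,
        # with a separate scalar bias per head and proposal pair.
        def __init__(self, d_model, n_heads):
            super().__init__()
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.wq = nn.Linear(d_model, d_model)
            self.wk = nn.Linear(d_model, d_model)
            self.wv = nn.Linear(d_model, d_model)
            # M_p: maps a 5d position difference to one scalar per head
            self.m_p = nn.Sequential(nn.Linear(5, 64), nn.ReLU(),
                                     nn.Linear(64, n_heads))

        def forward(self, x, pos):
            # x:   (B, T, d_model) proposal features
            # pos: (B, T, 5) normalized [x_tl, y_tl, x_br, y_br, frame]
            B, T, _ = x.shape
            def split(t):
                return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
            # Delta: (B, n_heads, T, T).  The O(T^2) pairwise difference
            # below is the stated bottleneck of this scheme.
            diff = pos.unsqueeze(2) - pos.unsqueeze(1)          # (B, T, T, 5)
            delta = self.m_p(diff).permute(0, 3, 1, 2)          # (B, H, T, T)
            attn = (q @ k.transpose(-2, -1) + delta) / self.d_head ** 0.5
            out = attn.softmax(dim=-1) @ v                      # (B, H, T, d_head)
            return out.transpose(1, 2).reshape(B, T, -1)        # (B, T, d_model)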
3.3.4 Extended Relative Position Encoding Discussion

In this section, we describe the challenges of using relative position encoding, followed by an overview of the method used in [221], and finally show how we adapt their formulation to our setting. For an overview of the technical details of the Transformer [246], we refer to the following well-written blogs: “The Annotated Transformer” (https://nlp.seas.harvard.edu/2018/04/03/attention.html), “The Illustrated Transformer” (http://jalammar.github.io/illustrated-transformer/), and “Transformers From Scratch” (http://www.peterbloem.nl/blog/transformers).

In general, a Transformer performs self-attention with multiple heads and multiple layers. For a particular head, to compute self-attention, it derives the query Q, key K and value V from the input X itself as follows:

    Q = W_q X, \quad K = W_k X, \quad V = W_v X        (3.5)

Using the derived Q, K, V triplet, it assigns new values to each input X using attention A given by

    A(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d_z}}\right) V        (3.6)

Here Q, K, V are each of shape B×T×d_z, where B is the batch size, T is the sequence length, and d_z is the dimension of each vector. The attention A can be computed efficiently using batch matrix multiplication, since the multiplication QK^T and the subsequent multiplication with V share the common B×T. For instance, when computing QK^T we perform batch matrix multiplication of B×T×d_z with B×d_z×T, resulting in B matrix multiplications that give a B×T×T matrix. Since the attention mechanism itself doesn’t encode the positions of the individual T vectors, it is insensitive to the order of the T inputs. To address this, a position encoding is added to each of the T inputs to make the transformer dependent on the order of inputs. [221] follows up by using an additional relative position encoding. They define two new matrices a^K_{i,j} and a^V_{i,j} (both of shape B×T×T×d_z) and change the attention equation as follows:

    A(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{Q(K^T + a^K_{i,j})}{\sqrt{d_z}}\right) (V + a^V_{i,j})        (3.7)

As [221] notes, this removes the computational efficiency of the original transformer due to the computation of a^K_{i,j} for all pairs, and, more importantly, the efficient batch matrix multiplication cannot be used due to the addition of a^K_{i,j} to K, making it of shape B×T×T×d_z. To resolve this, they propose the following equivalent formulation for computing QK^T (similarly for multiplying V):

    Q(K^T + a^K_{i,j}) = QK^T + Q a^K_{i,j}        (3.8)

Such a formulation removes the additional time to compute K + a^K_{i,j}, which would otherwise be a major bottleneck. There are two related challenges in adopting it to the visual domain: (i) the positions are continuous rather than discrete; (ii) both a^K_{i,j} and a^V_{i,j} are d_z-dimensional vectors, which is a highly over-parameterized version of the 5d position vector (d_z ≫ 5). To address (i) we use an MLP M_p to encode the 5d position, which is a reasonable way to encode continuous parameters. For (ii) we change Eq. 3.7 to

    A(Q, K, V) = \mathrm{SoftMax}\!\left(\frac{QK^T + \Delta}{\sqrt{d_z}}\right) V        (3.9)

Here Δ is of shape B×T×T, the same as QK^T, and Δ is computed from the relative positions of two object proposals p_i, p_j as the scalar Δ_{i,j} = M_p(p_i − p_j). For added flexibility, we let Δ_{i,j} ∈ R^{n_h}, where n_h is the number of heads, allowing us to use a different Δ for each head. As mentioned in Section 3.3.3, the computation of p_i − p_j for every pair remains the major bottleneck of our proposed relative position encoding.

Arg0      Arg1      Arg2     ArgM-Loc
42472     32455     9520     5082

Table 3.2: Number of annotated boxes in the ASRL training set.
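Returning to Eq. 3.8, a small numerical check illustrates why the decomposition holds (illustrative only; a_k is a random tensor standing in for the learned relative-position key embeddings of [221]):

    import torch

    B, T, dz = 2, 5, 8
    Q = torch.randn(B, T, dz)
    K = torch.randn(B, T, dz)
    a_k = torch.randn(B, T, T, dz)   # a^K_{i,j}: one d_z vector per (i, j) pair

    # Left-hand side of Eq. 3.8: add a^K_{i,j} to key j before the dot product
    # with query i.  K is expanded so that entry [b, i, j, :] equals K[b, j, :].
    lhs = torch.einsum('bid,bijd->bij',
                       Q, K.unsqueeze(1).expand(B, T, T, dz) + a_k)

    # Right-hand side: the usual Q K^T (efficient batched matmul) plus a
    # separate correction term Q a^K.
    rhs = Q @ K.transpose(1, 2) + torch.einsum('bid,bijd->bij', Q, a_k)

    assert torch.allclose(lhs, rhs, atol=1e-5)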
3.4 Experiments We briefly describe the dataset construction (see Section 3.4.3 for more details) followed by experimental setup, results and visualizations. 3.4.1 Constructing ActivityNet-SRL Our proposed dataset ActivityNet-SRL (ASRL) is derived from ActivityNet [23], ActivityNet-Captions (AC) [116] and ActivityNet-Entities (AE) [305]. There are two key steps in creating ASRL: (i) add semantic role labels (SRLs) to the descriptions in AC and filter it using heuristics (ii) add lemmatized words for each groundable phrase labeled as a semantic role for efficient contrastive sampling. For (i) we apply [224], a BERT-based [44] semantic-role labeling system to the video descriptions in AC. We use the implementation provided in [60] trained on OntoNotes5[188] which uses the PropBank annotation format [175]. The obtained semantic-roles are cleaned using heuristics like removing verbs without any roles usually for “is”, “are” etc. In general, each description contains multiple “verbs” and we treat them separately. For (ii) we utilize bounding box annotations in AE. First, we align the tokens obtained from the SRL system with the tokens of AE using [83]. Then, for each phrase labeled with a semantic role, we check if the corresponding phrase in AE has a bounding box and mark the phrase as being groundable or not. Since AE provides object names derived from the noun-phrases parsed using [161] we use them as the lemmatized word for the phrase. Table 3.2 shows the top-4 semantic roles with bounding box annotations in the training set of ActivityNet-Entities. We confine to this set of SRLs for contrastive sampling. 46 For training, we use the training set of ActivityNet which is the same as AC and AE. However, to create test set for AE, we need the ground-truth annotations which are kept private for evaluative purposes. As an alternative, we split the validation set of AE equally to create our validation and test set. When contrastive sampling is used in training, we only sample from the train set. However, since the size of validation and test sets is reduced, it is difficult to find contrastive examples. As a remedy, we allow sampling of contrastive examples from the test set during validation and vice versa for testing but never used in training. 3.4.2 Dynamic Contrastive Sampling While Contrastive Sampling is mainly used to create the validation and test sets to evaluate VOG, it can also be used for training where speed is the bottleneck. Given a particular description belonging to training index T containing roles R = [r1, . . . , rk] with the corresponding lemmatized words S = [s1, . . . , sk] we need to efficiently sample other descriptions with the same semantic-roles but containing one different lemmatized word. That is, we need to sample indices Ti whose lemmatized words are Si = [s1, . . . , s′ i , . . . sk] for every 1 ≤ i ≤ k. To address this, we first create a separate dictionary Di for each semantic role ri containing a map from the lemmatized words to all the annotation indices where it appears as ri . Given S, we can efficiently obtain Ti by randomly sampling from the set Ei = ∩j∈{1...k},j̸=iDj (sj ). Due to hardware limitations, we restrict k ≤ 4. For k > 4, we randomly drop k − 4 indices. If k < 4, then we randomly sample a training index Tj with the only restriction that the T and Tj describe different videos. 
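The lookup described above can be sketched in plain Python as follows (the data layout and names are illustrative assumptions, and the k ≤ 4 truncation described above is omitted; the released code may differ):

    import random
    from collections import defaultdict

    def build_role_maps(annotations):
        # annotations: list of dicts, one per description, e.g.
        #   {'idx': 17, 'roles': ['Arg0', 'V', 'Arg1'],
        #    'lemmas': ['man', 'pet', 'dog']}
        # D[r][lemma] = set of annotation indices where `lemma` fills role r
        D = defaultdict(lambda: defaultdict(set))
        for ann in annotations:
            for role, lemma in zip(ann['roles'], ann['lemmas']):
                D[role][lemma].add(ann['idx'])
        return D

    def contrastive_samples(ann, D):
        # For each role r_i, sample an index from E_i, the intersection over
        # j != i of D[r_j][s_j], i.e. descriptions that keep every other
        # role's lemma fixed but may realize r_i differently.
        samples = []
        roles, lemmas = ann['roles'], ann['lemmas']
        for i in range(len(roles)):
            candidates = None
            for j, (role_j, lemma_j) in enumerate(zip(roles, lemmas)):
                if j == i:
                    continue
                idxs = D[role_j][lemma_j]
                candidates = set(idxs) if candidates is None else candidates & idxs
            candidates = (candidates or set()) - {ann['idx']}
            if candidates:
                samples.append(random.choice(sorted(candidates)))
        return samples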
3.4.3 Extended Dataset Construction Details We derive ActivityNet-SRL from ActivityNet-Entities (AE) [305] and ActivityNet-Captions (AC) [116] (Section 3.4.3.1), provide the train, valid, and test split construction and statistics (Section 3.4.3.2), show the 47 Sentence: A woman is seen speaking to the camera while holding up various objects and begins brushing her hair. Verb Semantic-Role Labeling Considered Inputs is A woman [V: is] seen speaking to the camera while holding up various objects and begins brushing her hair. x seen A woman is [V: seen] [ARG1: speaking to the camera while holding up various objects and begins brushing her hair] x speaking [ARG0: A woman] is seen [V: speaking] [ARG2: to the camera] [ARGM-TMP: while holding up various objects] and begins brushing her hair . A woman speaking to the camera while holding up various objects holding [ARG0: A woman] is seen speaking to the camera while [V: holding] [ARGM-DIR: up] [ARG1: various objects] and begins brushing her hair . A woman holding up various objects begins [ARG0: A woman] is seen speaking to the camera while holding up various objects and [V: begins] [ARG1: brushing her hair] x brushing [ARG0: A woman] is seen speaking to the camera while holding up various objects and begins [V: brushing] [ARG1: her hair] A woman brushing her hair Table 3.3: An example of applying semantic role labeling to the video description. Each verb is treated independent of each other and the verbs “is”, “seen”, “begins” are not considered. For all other verbs, the last column shows the considered input to the system distribution of the dataset (Section 3.4.3.3) and finally compare ActivityNet against other datasets with object annotations (Section 3.4.3.4). 3.4.3.1 Constructing ASRL We first use a state-of-the-art BERT [44] based semantic role labeling system (SRL) [224] to predict the semantic roles of the video descriptions provided in AC. For SRL system, we use the implementation provided in AllenNLP [60] ¶ . It is trained on OntoNotes 5 [188] which uses PropBank annotations [175]. PropBank annotations are better suited for Verb oriented descriptions. The system achieves 86.4% on OntoNotes5. To ensure the quality, we randomly picked 100 samples and looked at the various labeled roles. We found a majority of these to be unambiguous and satisfactory. The few that were not found were removed by the following heuristics: (i) in a sentence like “Man is seen throwing a ball”, we remove the “seen” verb even though it is detected because “seen” verb doesn’t provide any extra information (ii) similarly we also ¶ see https://demo.allennlp.org/semantic-role-labeling for a demo 48 remove single verbs like “is”, “was” which are already considered when some other verb is chosen (iii) finally, in a small number of cases, no semantic-roles could be found, and such cases were discarded. In general, each description can contain multiple verbs, in such cases, we treat each verb separately. Table 3.3 shows this with an example. Once we have all the SRL annotated, we align them with the annotations of AE. This is non-trivial due to mis-match between the tokenization used by AE (which is based on Stanford Parser [161]) compared to the tokenization used in AllenNLP [60]. Thus, we utilize the Alignment function provided in spacy v2 [83] to align the tokens from the two systems. 
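Before moving on, the verb-filtering heuristics above can be expressed compactly. The sketch below assumes SRL output in the usual per-predicate BIO-tag format and is illustrative only; in particular, the set of auxiliary verbs is an assumption based on the examples in Table 3.3 rather than the exact list used here.

    AUX_VERBS = {'is', 'are', 'was', 'were', 'be', 'been', 'seen', 'begin', 'begins'}

    def keep_predicate(verb, tags):
        # tags: BIO labels such as 'B-ARG0', 'I-ARG0', 'B-V', 'O', ...
        # Heuristics (i)/(ii): drop auxiliary/light verbs like "is", "seen".
        if verb.lower() in AUX_VERBS:
            return False
        # Heuristic (iii): drop predicates with no argument roles at all.
        has_args = any(t != 'O' and not t.endswith('-V') for t in tags)
        return has_args

    def extract_structures(srl_output):
        # srl_output: {'words': [...], 'verbs': [{'verb': ..., 'tags': [...]}]}
        # Each remaining predicate is treated as a separate query (Table 3.3).
        return [pred for pred in srl_output['verbs']
                if keep_predicate(pred['verb'], pred['tags'])]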
To provide bounding box information to each role, we look at the tokens within the boundaries of the semantic role, and if any of them has been assigned a bounding box, we mark the semantic-role groundable, and assign it the corresponding bounding box. Figure 3.5 shows the most common considered roles followed by Figure 3.6 depicting the most common roles which have a bounding box annotations (groundable roles). Note that a particular role could be considered multiple times, e.g. in Table 3.3 “A woman” is considered for each of the verbs “speaking”, “holding”, “begins”, “brushing” i.e. some of the roles (in particular Arg0) are counted more than once. While some roles like ArgM-TMP and ArgM-DIR appear more often than ArgM-LOC (see Figure 3.5), the number of groundable instances for the latter is much higher as locations are generally easier to localize. Further, note that Verb doesn’t refer to an object and hence doesn’t have any corresponding bounding boxes. After having matched the annotated semantic roles with the bounding box annotations from AE, we lemmatize the arguments and create a dictionary for efficient contrastive sampling (as described in Section 4.2 in the main paper). To obtain the lemmatized words, we use the object-name annotations given in AE which are themselves derived from stanford parser [161]. To lemmatize the verbs, we use the inbuilt lemmatizer in spacy [83]. 49 Train Val Train Val Test Train Val Test ActivityNet Captions ActivityNet Entities ActivityNet SRL Figure 3.4: Train, val and test splits for AC, AE, ASRL. 3.4.3.2 Training, Validation, Test Splits Once the roles and lemmatized words have been assigned, we need to create a train, validation and test splits. Train: We keep the same train split as AC, AE, ActivityNet. This allows using activity classification networks like TSN [253] trained on ActivityNet. Validation and Test: Creating the validation and test splits is non-trivial. Since the test split of AC is kept private, AE uses half of validation split of AC as its test split which is again kept private. Thus, we divide the existing validation set into two to create the validation and test set for ASRL (see Figure 3.4 for an illustration of deriving the splits, and Table 3.4 for the exact numbers). Dividing the original validation set implies high miss-rate (i.e. not enough examples to sample contrastive examples). To address this, we allow contrastive sampling from the test set during validation and vice-versa during testing for more robust evaluation. 3.4.3.3 Dataset Distribution Figure 3.7 highlights the distributions of the semantic-role-structures (i.e. the order of the semantic role labels) found in the sentences. It is interesting to note Arg0-Verb-Arg1 far outnumbers all competing structures. This also motivates the choice of considering k=4 videos at a time (if structure contains 3 roles, we can sample 3 more videos). 
Figure 3.5: Number of Noun-Phrases for each role (bar chart over the argument roles V, ARG0, ARG1, ARG2, ARGM-TMP, ARGM-DIR, ARGM-LOC, ARGM-MNR, ARGM-ADV, ARGM-PRP, ARG4, ARGM-PRD, ARG3).

Figure 3.6: Number of groundable objects for each role.

Figure 3.7: Frequently appearing SRL-Structures (ARG0-V-ARG1 is by far the most frequent).

We look at the total number of lemmatized words in Table 3.5 and the most frequent (top-20) lemmatized words for each role with their frequencies: (i) Verb in Figure 3.8, (ii) Arg0 in Figure 3.9, (iii) Arg1 in Figure 3.10, (iv) Arg2 in Figure 3.11. The higher number of verbs (Table 3.5) shows the diversity in the captions, and their distribution is reasonably balanced (Figure 3.8), which we attribute to the curation process of ActivityNet [23]. In comparison, Arg0 is highly unbalanced as agents are mostly restricted to “people”. We also observe that “man” appears much more often than “woman”/“she”. This indicates gender bias in video curation or video description. Another interesting observation is that the “person” class dominates in each of the argument roles, which suggests “person-person” interactions are more commonly described than “person-object” interactions.

Figure 3.8: Top-20 Lemmatized Verbs.
Figure 3.9: Top-20 Lemmatized Arg0.
Figure 3.10: Top-20 Lemmatized Arg1.
Figure 3.11: Top-20 Lemmatized Arg2.
Figure 3.12: Top-20 Lemmatized ArgM-LOC.

          Training    Validation    Testing
AC        37421       17505         -
AE        37421       8774          8731
ASRL      31718       3891          3914

Table 3.4: Number of Videos in the train, validation, and test splits (the AC test split is kept private). Some instances are removed from training if they don’t contain meaningful SRLs. Our test split is derived from the AE validation set.
3.4.3.4 Dataset Choice Existing Datasets: (As of Nov 2019) Other than ActivityNet, there are three video datasets which have visual and language annotations in frames namely EPIC-Kitchens [37], TVQA+ [125] and Flintstones [71]. We consider the pros and cons of each dataset. EPIC-Kitchens contains ego-centric videos related to kitchen activity. It provides object level annotations, with narrative descriptions. While the annotations are rich, the narrative descriptions are too short in length (like “open the fridge” or “cut the vegetable”) and the actors Arg0 are not visible (ego-centric). 54 V Arg0 Arg1 Arg2 ArgM-LOC 338 93 281 114 59 Table 3.5: Total number of lemmatized words (with at least 20 occurrence) in the train set of ASRL. TVQA+ is a question-answering dataset subsampled from TVQA [123] with additional object annotations. While the videos are themselves rich in human activities, the questions are heavily dependent on the sub-titles which diminishes the role of actions. Flintstones is a richly annotated dataset containing clips from the cartoon Flintstones. The frames are 2-4 seconds long with 1-4 sentence description of the scene. With the objects in cartoons easier to identify it also serves as a diagnostic dataset for video understanding. However, the provided descriptions are less verb oriented and more image/scene-oriented due to shorter clips. In contrast, ActivityNet contains longer videos, as a result the corresponding descriptions in ActivityNet Captions capture verbs over an extended period of time. While the object annotations are richer in EPIC-Kitchens, TVQA+ and Flintstones, the richer verb-oriented language descriptions make it more suitable for our task. Using Natural Videos for evaluation: Our test data is generated “synthetically” by contrastive sampling followed by SPAT and TEMP strategies. An alternative evaluation protocol would be to test on naturally occurring videos. We discuss the challenges in obtaining such a dataset. Recall that in our formulation of VOG a model needs to understand the relations among various objects prior to localizing them. For instance, to evaluate if a model understands “man petting a dog” (example from Fig 3.2a Q1,), we need contrastive examples Q2: “X petting a dog”,Q3: “man X a dog”,Q4: “man petting X” in the same video. In the absence of any of these examples, it is hard to verify that the model indeed understands to query. (e.g. without Q3, “man” and “dog” could be localized without understanding 55 “petting”). Creating such a test set from web sources is impractical at present because there is no large-scale densely annotated video dataset to isolate such contrastive videos. A different (and quite expensive) method would be crowd-sourcing the video creation process by handing out detailed scripts to be enacted [226]. Here we would need to perform an additional “domain adaptation” step since we would be training and testing on different sources of videos (“YouTube” → “CrowdSourced”). This makes it challenging to attribute the source of error i.e. whether the reduced performance is due to poor generalization of object interactions or due to domain shift in the data. In practice, SPAT and TEMP strategies when applied to contrastive videos from ActivityNet are effective proxies to obtaining naturally occurring contrastive examples from the web. This is validated by the drop from SVSQ to SPAT and TEMP (Table 3.6). 
3.4.4 Experimental setup Dataset Statistics: In total, ASRL contains 39.5k videos with 80k queries split into training, validation, and testing with 31.7k, 3.9k, 3.9k videos and 63.8k, 7.9k, 7.8k queries. Each video contains around 2 queries containing 3.45 semantic roles and each query has around 8 words. Evaluation Metrics: We compute the following four metrics: (i) accuracy: correct prediction for a given object in a query (recall that a query has references to multiple objects) (ii) strict accuracy: correct prediction for all objects in the query (iii) consistency: the predictions for each object lie in the same video (iv) video accuracy: predictions are consistent and lie in the correct video. While strict accuracy is the most important metric to note for VOG, other metrics reveal useful trends helpful for model diagnosis and building robust VOG models and datasets. Metric Computation: In AE, the noun phrases are only localized in the frame where it is most easily visible. This complicates the evaluation process when the same objects appear across multiple frames (a common occurrence). Thus, we select the highest-scoring proposal box for each role in the query in every 56 frame and set a score threshold. Given a phrase referring to a groundable object, we consider the prediction correct when the predicted box in an annotated frame has an IoU ≥ 0.5 with a ground-truth box. This allows us to compute accuracy in a single video single query (SVSQ) setting. For SEP, TEMP, SPAT we have additional information about which video frames and proposal boxes are not ground-truths. To evaluate SEP: we check if the predicted video is correct (which gives us video accuracy), and if so compute the accuracy similar to SVSQ. In TEMP and SPAT, for a given role if the predicted boxes not belonging to the ground-truth video have a score higher than a threshold, then the prediction for the role is marked incorrect. If the boxes are in the ground-truth video, we evaluate it similar to SVSQ (see Section 3.4.5 for examples of each strategy). Baselines: Prior work on VOG cannot be evaluated on ASRL due to their restrictive formulations. For instance, [306] grounds all objects when using TEMP and SPAT resulting in 0 accuracy and [35] needs spatio-temporal tubes. Recently, [305] proposed GVD, a model for grounded video description. GVD calculates its grounding accuracy by feeding the ground-truth description into a captioning system and finding the highest scored objects. However, this is not applicable to our task because it considers the language in a sequential manner. For an input query “man throwing ball”, GVD would ground “man” without looking at the remaining description and thus fail at grounding in our proposed contrastive setting. As an alternative, we propose two competitive baselines: (i) ImgGrnd: an image grounding system which treats each frame independently and does not explicitly encode object relations. (ii) VidGrnd: a video grounding system based on GVD using an object transformer to encode object relations. For fair comparisons, we use the same language features, visual features (the proposal and segment features) for both ImgGrnd and VidGrnd 57 Implementation details: We re-use the extracted visual features provided by [305] for AE. The object proposals and features are obtained from a FasterRCNN [198] trained on visual genome [117]. Segment features (both RGB and Flow features) are obtained using TSN [253] trained on ActivityNet [23]. 
For each video, F=10 frames are uniformly sampled and for each frame, we consider P=100 proposals which gives a recall of 88.14%. However, training with 100 proposals is time-consuming and computationally expensive. Instead, we introduce GT5 setting where we use exactly 5 proposals per frame. In unannotated frames, it includes the highest-scoring proposals, and for annotated frames, for each ground-truth box, it prioritizes the proposal having the highest IoU. GT5 maintains a similar recall score (86.73%), and allows experimenting with more variations and sets upper performance bound. For self-attention, both Object Transformer (OTx) and Multi-Modal Transformer (MTx) use multi-head attention [246] with nl=1 layer and nh=3 heads unless mentioned otherwise. In general, Object Transformer (OTx) applies self-attention across all proposals and frames whereas the Multi-Modal Transformer (MTx) applies self-attention to each frame separately due to higher computation load. We train all models until the validation accuracy saturates. For SEP, TEMP, SPAT we found 10 epochs with batch size 4 for GT5 and 2 for P100, using Adam with learning rate 1e −4 to be sufficient for most models. For SVSQ, we set batch size 4 for all models. We use the model with the highest validation accuracy for testing. We set the threshold used in evaluating TEMP and SPAT as 0.2 for GT5 and 0.1 for P100 across all models. More implementation details are provided in Section 3.4.6. 3.4.5 Extended Evaluation Discussion We use the following evaluation metrics: 1. Accuracy: correct box is predicted for the given phrase in a sentence (a sentence has multiple phrases) 2. Strict Accuracy: correct box is predicted for all the phrases in the sentence 58 3. Consistency: predicted boxes (for all the phrases) belong to the same video, even if they are incorrect 4. Video Accuracy: the predicted boxes are consistent, and the chosen video is also correct. Since there is only one video in SVSQ, both consistency and video accuracy are not meaningful. Similarly, we first choose a video in SEP, it is trivially consistent. As mentioned earlier, the bounding box annotations in AE is sparse, the object has a bounding box in only one frame in the video where it is most clearly visible. Since such sparse annotations complicate the computation of the above metrics, we describe their computation for each case. 3.4.5.1 Concatenation Strategies with Examples SVSQ: We have a video with F frames, however, for each object, the bounding boxes are available in exactly one frame. Moreover, this annotated frame could be different for every object (the guideline provided in AE [305] is to annotate in the frame where it is most clearly visible). As a result, we cannot be sure if the same object appears in a frame where it is not annotated. To address this, we require the model to predict exactly one bounding box in every frame. During evaluation, we consider only the annotated frame for a given object. If in this annotated frame, there is a predicted bounding box with IoU ≥ 0.5, we consider the object correctly predicted as illustrated in Figure 3.13. This gives us Accuracy for SVSQ. Strict Accuracy can be easily computed by considering all the phrases in the query i.e. the predicted boxes for each phrase should have IoU ≥ 0.5 with the ground-truth boxes. SEP: We have k videos and we choose one of these k videos which gives us the Video Accuracy. If the chosen video is correct, we then apply scoring based on SVSQ otherwise mark it incorrect. 
Accuracy and Strict Accuracy computation is the same as for SVSQ.

TEMP: We have k videos concatenated temporally. In other words, we have kF frames in total, of which we know (k − 1)F frames don’t contain the queried object. Thus, if among the (k − 1)F frames not containing the queried object a predicted box has a score greater than a certain threshold, we mark the prediction incorrect. For the F frames belonging to the queried video, we use the same method as for evaluating SVSQ. This is illustrated in Figure 3.14.

SPAT: In SPAT, we have k videos concatenated along the width axis. That is, we have F frames, each of size kW × H (here W, H are the width and height of a single video). In each of the F frames, there should not be a predicted box outside the boundaries of the correct video with a score greater than some threshold, and for the annotated frame the predicted box should have IoU ≥ 0.5. This is illustrated in Figure 3.15.

Figure 3.13: SVSQ: Illustration of the ground-truth annotations for the “man” (green) obtained from AE. The red boxes show equally correct boxes for “man” but are not annotated. As a result, we only consider the third frame to compute the accuracy of the predicted box.

Figure 3.14: TEMP: Two videos are concatenated along the time dimension (we show 2 frames from each video) with the description “man throwing a ball”, and we are considering the object “man”. If the predicted box is within the same video as the ground-truth but the frame doesn’t have any annotation (red box), we ignore it. However, if the predicted box belongs to another video (yellow boxes), we say the prediction is incorrect.

Figure 3.15: SPAT: Similar to the previous case, we have the same description of “man throwing a ball” and we consider the object “man”, but the videos are concatenated along the width dimension (we show 2 frames of the concatenated video). Again, if the predicted box lies in the same video as the ground-truth (red box), we ignore it. If the predicted box is in another video (yellow boxes), the prediction is deemed incorrect.

For the TEMP and SPAT strategies, Consistency is computed by how often the various objects are grounded in the same video. Video Accuracy is derived from Consistency and is marked correct only when the correct video is considered. Finally, Strict Accuracy measures when all the phrases in the query are correctly grounded.

Selecting Threshold for Evaluation: As noted earlier, we pose the proposal prediction as a binary classification problem; if a proposal has a score higher than a threshold (a hyper-parameter tuned on the validation set), it is considered a predicted box. For evaluation, we consider only the boxes which have the highest score in each frame. But in both the SVSQ and SEP cases there is no incentive to set a threshold (>0), as the false positives cannot be identified in the same video. On the other hand, in both the TEMP and SPAT cases, false positives can be identified since we are sure boxes in a different video are negatives.

3.4.6 Extended Implementation Details

ImgGrnd is an image grounding system that considers each frame separately. It concatenates the language features to the visual features of each object, which are then used to predict whether the given object is correct. More specifically, given \tilde{q}_l (Eq. 3.1) and the visual features \hat{v}_{i,j}, we concatenate them to get the multi-modal features m_IG, where m_IG[l, i, j] = [\hat{v}_{i,j} || \tilde{q}_l]. These are passed through a two-layer MLP classifier and trained using a BCE loss.
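As an illustration, the ImgGrnd head just described could look roughly as follows (a PyTorch-style sketch with assumed tensor names and dimensions; not the released implementation):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImgGrndHead(nn.Module):
        # Frame-wise grounding head: concatenate each phrase feature with
        # each proposal feature and score the pair with a 2-layer MLP.
        def __init__(self, d_vis=1024, d_lang=256, d_hid=512):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(d_vis + d_lang, d_hid),
                                     nn.ReLU(), nn.Linear(d_hid, 1))

        def forward(self, v_hat, q_tilde):
            # v_hat:   (P, F, d_vis) proposal (+segment) features
            # q_tilde: (k, d_lang)   one feature per SRL phrase
            P, Fr, dv = v_hat.shape
            k, dq = q_tilde.shape
            m = torch.cat([v_hat.unsqueeze(0).expand(k, P, Fr, dv),
                           q_tilde.view(k, 1, 1, dq).expand(k, P, Fr, dq)],
                          dim=-1)
            return self.mlp(m).squeeze(-1)              # (k, P, F) logits

    def grounding_loss(logits, gt, groundable):
        # gt: (k, P, F) binary targets (IoU >= 0.5 with the ground-truth box);
        # groundable: (k,) boolean mask of roles with a box (L_g in Eq. 3.2).
        per_role = F.binary_cross_entropy_with_logits(
            logits, gt.float(), reduction='none').mean(dim=(1, 2))
        return per_role[groundable].mean()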
In essence, ImgGrnd can be derived from VOGNet by removing the object transformer and the multi-modal transformer. VidGrnd is a video grounding system which builds upon ImgGrnd. Specifically, it has an object transformer to encode the language-independent relations among the objects. More formally, given vˆi,j we apply object transformer to get vˆ sa i,j . The remaining steps are the same as ImgGrnd. We concatenate the language features q˜j with the self-attended object features vˆ sa i,j to get the multi-modal features mV G where mV G[l, i, j] = [ˆv sa i,j ||q˜j ]. After passing through a 2 layer MLP classifier, it is trained using BCE Loss. In essence, VidGrnd can be derived from VOGNet by removing the multi-modal transformer altogether and the relative position encoding from object transformer. VOGNet: Our models are implemented in Pytorch [178]. VOGNet SPAT using GT5 takes nearly 25-30 mins per epoch (batch size 4), compared to 3 hours per epoch for P100 (batch size 2). All models are trained for 10 epochs (usually enough for convergence). All experiments can be run on a single 2080Ti GPU. Language Feature Encoding: We use a Bi-LSTM [81, 217] ( fairseq [174] implementation). The words are embedded in R 512 and the Bi-LSTM contains 2 layers with hidden size of 1024, max sequence length of 20, and Mq with input/output size of 256. Visual Feature Encoding: The object features are obtained from a FasterRCNN [198] with ResNext [262] pre-trained on Visual Genome [117]. Each object feature is 2048d vector. The image level features (2048d) and optical flow (1024d) are extracted using resnet-200 [76] and TVL1 [285] respectively and are encoded using temporal segment networks [253]. They are concatenated to give segment features for each frame which are 3072d vector. We project both object and segment features into 512d vectors and then concatenate them to get 1024d vector for each object. Object Transformer uses 3 heads and 1 layer with each query, key, value of 1024d (full feature dimension which is divided by number of heads for multi-headed attention). 62 Model SVSQ SEP TEMP SPAT Acc SAcc Acc VAcc SAcc Acc VAcc Cons SAcc Acc VAcc Cons SAcc GT5 ImgGrnd 75.31 56.53 39.78 51.14 30.34 17.02 7.24 34.73 7.145 16.93 9.38 49.21 7.02 VidGrnd 75.42 57.16 41.59 54.16 31.22 19.92 8.83 31.70 8.67 20.18 11.39 49.01 8.64 VOGNet 76.34 58.85 42.82 55.64 32.46 23.38 12.17 39.14 12.01 23.11 14.79 57.26 11.90 P100 ImgGrnd 55.22 32.7 26.29 46.9 15.4 9.71 3.59 22.97 3.49 7.39 4.02 37.15 2.72 VidGrnd 53.30 30.90 25.99 47.07 14.79 10.56 4.04 29.47 3.98 8.54 4.33 36.26 3.09 VOGNet 53.77 31.9 29.32 51.2 17.17 12.68 5.37 25.03 5.17 9.91 5.08 34.93 3.59 Table 3.6: Comparison of VOGNet against ImgGrnd and VidGrnd. GT5 and P100 use 5 and 100 proposals per frame. Here, Acc: Grounding Accuracy, VAcc: Video accuracy, Cons: Consistency, SAcc: Strict Accuracy (see Section 3.4.4 for details). On the challenging evaluation metrics of TEMP and SPAT, VOGNet (ours) shows significant improvement over competitive image and video grounding baselines. SVSQ TEMP SPAT Acc SAcc Acc SAcc Acc SAcc SVSQ 76.38 59.58 1.7 0.42 2.27 0.6 TEMP 75.4 57.38 23.07 12.06 18.03 8.16 SPAT 75.15 57.02 22.6 11.04 23.53 11.58 Table 3.7: Evaluation of VOGNet in GT5 setting by training (first column) and testing (top row) on SVSQ, TEMP, SPAT respectively Multi-Modal Transformer also uses 3 heads and 1 layer but the query, key, value are 1280d vectors (additional 256 due to concatenating with the language features). 
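For reference, the dimension bookkeeping above can be traced with a small sketch of the projection step (a hypothetical module; the exact layer shapes in the released code may differ): 2048-d proposal features and 3072-d segment features (2048-d RGB plus 1024-d flow) are each projected to 512-d and concatenated into the 1024-d visual feature, and appending the 256-d phrase feature yields the 1280-d input of the multi-modal transformer.

    import torch
    import torch.nn as nn

    class FeatureProjector(nn.Module):
        # Projects proposal and segment features to 512-d each and
        # concatenates them into the 1024-d per-object visual feature.
        def __init__(self, d_prop=2048, d_seg=3072, d_out=512):
            super().__init__()
            self.prop_proj = nn.Linear(d_prop, d_out)
            self.seg_proj = nn.Linear(d_seg, d_out)

        def forward(self, prop_feat, seg_feat):
            # prop_feat: (..., 2048) ROI-pooled FasterRCNN features
            # seg_feat:  (..., 3072) TSN segment features (rgb || flow)
            return torch.cat([self.prop_proj(prop_feat),
                              self.seg_proj(seg_feat)], dim=-1)   # (..., 1024)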
3.4.7 Results and Discussions In Table 3.6, we compare VOGNet against two baselines ImgGrnd and VidGrnd across GT5 (5 proposal boxes per frame) and P100 (100 proposal boxes per frame). Comparison of Strategies: We note that in the SVSQ column, all the models perform comparably. However, these results fail to generalize to other cases which indicates that evaluating on SVSQ is insufficient. Next, the SEP column shows that models can distinguish contrastive samples by considering the contribution of each object independently with very high accuracy and can easily distinguish similar examples achieving ≈ 50% on video accuracy even in the P100 setting. Such cues are not present in SPAT and TEMP where the model is given a single video and single query but now the video contains more than 63 SEP TEMP SPAT Train Test Acc VAcc Acc Acc Rnd CS 44.9 57.6 22.89 22.72 CS+Rnd CS 44.8 56.94 23.07 23.53 CS+Rnd Rnd 57.44 74.1 36.48 36.05 Table 3.8: Comparison of Contrastive Sampling (CS) vs Random Sampling (Rnd) for training (row-1,2) and evaluation (row-2,3). #vids #epochs Acc VAcc Cons SAcc 2 20 20.18 10.18 52.45 8.84 3 13 21.7 13.33 55.55 10.68 5 8 23.34 14.53 56.51 11.71 Table 3.9: Training VOGNet in SPAT setting with different number of concatenated videos and tested on SPAT with 4 videos. one actor performing some action. The performance in both SPAT and TEMP is still very low (strict accuracy for P100 is <5%), which suggests that VOG remains an extremely challenging problem for current state-of-art models. Comparison with Baselines: For both TEMP and SPAT, we find ImgGrnd performs relatively well (≈17% in GT5) despite not using any object relations. This is likely because the model can exploit attribute information (such as “red shirt”) in the phrases. VidGrnd which uses language independent object relations obtains gains of 2−3%. Finally, VOGNet, which additionally uses language-dependent object relations, outperforms VidGrnd by another 3−4%. GT5 vs P100: We observe that both GT5 and P100 follow similar patterns across metrics suggesting GT5 is a good proxy to explore more settings. For the remaining experiments, we consider only the GT5 setting. Performance across Strategies: Table 3.7 shows that VOGNet trained in SPAT and TEMP settings performs competitively on SVSQ (maintaining ≈75% accuracy). However, the reverse is not true i.e. models trained on SVSQ fail miserably in SPAT and TEMP (accuracy is <3%). This suggests that both TEMP and 64 SPAT Acc VAcc Cons SAcc ImgGrnd 17.03 9.71 50.41 7.14 +OTx(1L, 3H) 19.8 10.91 48.34 8.45 +RPE 20.2 11.66 49.21 9.28 +MTx(1L, 3H) 19.23 10.49 48.19 8.14 +RPE 19.09 10.46 50.09 8.23 +OTx(3L, 6H) 21.14 12.1 49.66 9.52 +OTx + MTx + RPE 23.53 14.22 56.5 11.58 VOGNet +MTx(3L,6H) 24.24 15.36 57.37 12.52 +OTx(3L,6H) 24.99 7.33 66.29 14.47 Table 3.10: Ablative study comparing gains from Multi-Modal Transformer (MTx) and Object Transformer (OTx) and Relative Position Encoding (RPE). L: Number of Layers, H: Number of Heads in the Transformer. Note that VOGNet = ImgGrnd +MTx(1L,3H) +OTx(1L,3H) + RPE SPAT moderately counter the bias caused by having a single object instance in a video. Interestingly, while VOGNet trained on TEMP doesn’t perform well on SPAT (performance is worse than VidGrnd trained on SPAT), when VOGNet is trained on SPAT and tested on TEMP it significantly outperforms VidGrnd trained in TEMP. This asymmetry is possibly because the multi-modal transformer is applied to individual frames. 
Contrastive Sampling: Table 3.8 compares Contrastive Sampling (CS) to a Random Sampling (RS) baseline for evaluation and training. Using RS for validation, SEP video accuracy is very high 75% implying that CS is a harder case; similarly, we find higher performance in both TEMP and SPAT cases. Interestingly, using RS for training is only slightly worse for SPAT, TEMP while outperforming in SEP. Thus, CS in SPAT and TEMP helps learn better object relations, but random sampling remains a very competitive baseline for training. Table 3.9 shows that using more videos in training helps; we use 4 videos due to GPU memory considerations and training time. Ablation Study: In Table 3.10 we record the individual contributions of each module in SPAT. We observe: (i) self-attention via object is an effective way to encode object relations across frames (ii) multimodal transformer applied on individual frames gives modest gains but falls short of object transformer due to lack of temporal information (iii) relative position encoding (RPE) boosts strict accuracy for both 65 Model Train SVSQ SEP TEMP SPAT Acc SAcc Acc VAcc SAcc Acc VAcc Cons SAcc Acc VAcc Cons SAcc ImgGrnd GT5 46.31 24.83 20.55 47.49 9.92 8.06 2.68 25.35 2.68 4.64 2.47 34.17 1.31 P100 55.22 32.7 26.29 46.9 15.4 9.71 3.59 22.97 3.49 7.39 4.02 37.15 2.72 VidGrnd GT5 43.37 22.64 22.67 49.6 11.67 9.35 3.37 28.47 3.29 5.1 2.66 33.6 1.74 P100 53.30 30.90 25.99 47.07 14.79 10.56 4.04 29.47 3.98 8.54 4.33 36.26 3.09 VOGNet GT5 46.25 24.61 24.05 51.07 12.51 9.72 3.41 26.34 3.35 6.21 3.40 39.81 2.18 P100 53.77 31.9 29.32 51.2 17.17 12.68 5.37 25.03 5.17 9.91 5.08 34.93 3.59 Table 3.11: Comparing models trained with GT5 and P100. All models are tested in P100 setting. transformers (iv) object transformer with 3 layers and 6 heads performs worse than using a single multimodal transformeri.e. adding more layers and attention heads to object transformer is not enough (v) using both object and multi-modal transformers with more layers and more heads gives the best performing model. 3.4.8 Additional Experiments We perform two additional experiments: (i) if the representations learned in GT5 transfer to the more general case of P100 (ii) the effect of adding more heads and layers to the object transformer (OTx) and multi-modal transformer (MTx). GT5 models in P100 setting: In Table 3.11 we compare the models ImgGrnd, VidGrnd, and VOGNet trained in GT5 and P100 and tested in P100 setting to calculate the transfer-ability of GT5 setting. While testing in P100, for TEMP and SPAT, we set the threshold for models trained in GT5 as 0.5 which is higher than the threshold used when testing in GT5 (0.2). This is expected as a lower threshold would imply a higher chance of a false positive. In general, the drop from P100 to GT5 is significant (a 15−25% drop) for almost all models suggesting training with just ground-truth boxes is insufficient. Nonetheless, since the relative drops are same across models, GT5 remains a valuable proxy for carrying out larger number of experiments. 
66 SPAT Acc VAcc Cons SAcc ImgGrnd 17.03 9.71 50.41 7.14 +OTx (1L, 3H) 19.8 10.91 48.34 8.45 +OTx (2L, 3H) 20.8 11.38 49.45 9.17 +OTx (2L, 6H) 21.16 12.2 48.86 9.58 +OTx (3L, 3H) 20.68 11.34 48.66 9.19 +OTx (3L, 6H) 21.14 12.1 49.66 9.52 VOGNet 23.53 14.22 56.5 11.58 +MTx (2L,3H) 23.38 14.78 55.5 11.9 +MTx (2L,6H) 23.96 14.44 55.5 11.59 +MTx (3L,3H) 24.53 14.84 56.19 12.37 +MTx (3L,6H) 24.24 15.36 57.37 12.52 +OTx(3L,6H) 24.99 17.33 66.29 14.47 Table 3.12: Ablative study layers and heads of Transformers. Transformer Ablation: In Table 3.12 we ablate the object transformer and the multi-modal transformer with number of layers and heads. It is interesting to note adding more heads better than adding more layers for object transformer, while in the case of multi-modal transformer both number of heads and number of layers help. Finally, we find that simply adding more layers and heads to the object transformer is insufficient, as a multi-modal transformer with 1 layer and 3 heads performs significantly better than the object transformer with 3 layers and 6 heads. 3.4.9 Visualizations Time Figure 3.16: Left(L): concatenated using SPAT with query: [Arg0: The man] [Verb: throws] [Arg1: the ball] [Arg2: in the air]. Middle(M): concatenated using TEMP with query: [Arg1: He] [Verb: spins] [ArgM-DIR: around the board]. Right(R): concatenated using SPAT with query: [Arg0: He] [Verb: attaches] [Arg1: a dummy]. In L, R the top-two and bottom-two frames are seen concatenated. In M, there are four frames following the order: tl-bl-tr-br. In L,M our model VOGNet correctly finds the referred objects (“man”, “ball”, “boy”, “board”). In R: VOGNet is unable to find “dummy” and ends up localizing the incorrect person. 67 (a) Query: A woman standing on a sidewalk. From left to right, other videos are: (1): A woman standing in kitchen (2): A man solving a puzzle (3): Men standing on sidewalk. Our model disambiguates the two “sidewalks”, as well as the “woman” and localizes them in the same video. Here (2) is a randomly sampled (“woman”, “sidewalk” only have “stand” relation). (b) Query: The ball hits the pins creating a strike. From left to right, other videos are: (2): The girl with the ball hits it (3): A bowling ball hits the pins. (4): He uses razor to trim. While our model correctly chooses the correct frame, we note (3) is very close to (1) in terms of description. Here, our sampling method fails by providing “too” similar videos. Figure 3.17: VOGNet predictions TEMP strategy in GT5 setting. We show two frames from each video, but the model looks at F=40 frames. For qualitative analysis, we show the visualizations of SPAT and TEMP strategies in Figure 3.16. In the interest of space, we use k=2 contrastive sampling (visualizations with k=4 are provided in the Section 3.4.10). In the first image, the videos are concatenated along the width axis and both contain a “man” and “ball”. Our model correctly identifies which “ball” is being thrown into the air and by whom. Note that only viewing the last frame doesn’t uniquely identify if the “man” visible in the current frame has thrown the ball. In general, our SPAT model performed with high consistency i.e. it chose objects nearer to each other which we attribute to RPE. In the second image, the videos are concatenated along the time-axis and in both videos, the person “spins” something. Using “board” as an additional cue, our model correctly finds both “the person” and the “board that he spins”. 
Our TEMP model performs slightly worse than SPAT model possibly because encoding temporal information is more challenging. Finally, in the third image, our model grounds “he” incorrectly likely due to not being able to ground “dummy”. 3.4.10 Additional Visualization In general, contrastive examples differ in exactly one part of the phrase. However, we observed that some contrastive examples were very difficult to distinguish. We identify two reasons: (i) Considering only one 68 TIME (a) Query: He pours the mixed drink [Arg3: to the cup]. Left-to-right other videos are: (2): Two men drinking an energy drink. (3): A drink poured into martini glass. (4): A young man pours oil into the pan. Our model finds the “man” and the “mixed drink” correctly but fails to localize the “cup” due to small number of queries containing Arg3. TIME (b) Query: The man riding the bike. Left-to-right other videos are: (2): Men put the other bike down. (3): A boy rides his skateboard. (4): We see the boy riding his dirtbike. Here, our model correctly distinguishes among the bikes, and who is riding what. Figure 3.18: VOGNet predictions SPAT strategy in GT5 setting. We show two frames from each video, and each frame contains 4 videos concatenated together. verb in the query becomes restrictive. For instance, in Figure 3.17-(b) video (3), the complete description has “the bowling ball that goes around the ring and then hits the pins” and the initial part of it going around the ring is lost. (ii) Language ambiguity of the form “person playing guitar” vs “person practicing guitar”, while “playing” and “practicing” have distinct meanings, in some situations they can be used interchangeably. 69 TIME Query: A man and two kids building a sand castle. Left-to-right other videos: (1): A group of kids trying to build a sand castle. (3): They are driving through the sand. (4): She is building a castle. In the first frame, the ground-truths are marked in light-green and orange and in the second frame is our model’s incorrect prediction. It is unable to distinguish based on “man” due to influence of “kids” in the agent. Figure 3.19: Incorrect prediction of VOGNet for SPAT strategy We now visualize a few examples for TEMP and SPAT in Figure 3.17, 3.18, 3.19. All visualizations are obtained using VOGNet trained in GT5 setting. For each case, we show 2 frames from each video and color-code the arguments in the given query (Arg0 is Green, Verb is Red, Arg1 is Blue, Arg2 is Magenta) Remaining arguments are mentioned in the query (like in Figure 3.18 (a)). For TEMP, since objects are not being considered independent of each other, the model doesn’t ground objects which are present in the query but not related. For instance in Figure 3.17-(a), even though “woman” and “sidewalk” are separately present in two other videos, these are given very low score. Similarly, in Figure 3.17-(b), “ball” in (2) is not grounded which is not related to the query. These suggest VOGNet is able to exploit the cues in the language query to ground the objects and their relations in the visual domain. For SPAT, in Figure 3.18-(a) our model finds the correct video. It is able to differentiate among someone pouring drink into a glass (2), someone pouring oil (3), or someone drinking the drink (1). However, it is unable to find the “cup” which we attribute to the smaller number of examples containing Arg3 which is 70 limited to verbs like “pour”. 
In Figure 3.18-(b) our model correctly finds both “man” and the “bike” that he is riding and distinguishes between “ride” and “put”, “bike” and “skateboard” (3). Finally, in Figure 3.19, we find the language ambiguity of “trying to build” and “building” which are synonymously used. While our model is able to distinguish (4) by its agent “she” compared to “man and two kids”, it is unable to make the distinction between “a man and two kids” and “a group of kids” (1). We attribute this to the use of a single embedding for each role (here Arg0) and not differentiating among the various objects in that role. 3.5 Conclusion In this work, we analyze the problem of VOG which aims to localize the referred objects in a video given a language query. We show that semantic-role labeling systems can be used to sample contrastive examples. We then enforce that the model views the contrastive samples as a whole video so that the model explicitly learns object relations. We further propose an additional self-attention layer to capture language dependent object relations along with a relative position encoding. Finally, we validate our proposed model VOGNet on our dataset ActivityNet-SRL which emphasizes the role of object interactions. 71 Chapter 4 Video Question Answering with Phrases. 4.1 Introduction Given a video, Video Question Answering (VidQA) requires a model to provide an answer to a video related question. However, existing works treat VidQA as an N-way (N∼1k) classification task across a fixed set of phrases. Models trained under such formulations are strictly restricted in their recall rate, generalize poorly, and have severe limitations for end-user applications. In this work, we introduce Video Question Answering with Phrases (VidQAP) which treats VidQA as a fill-in-the-phrase task. Instead of a question, the input to VidQAP consists of a query expression with a query-token. Then, given a video, VidQAP requires replacing query-token with a sequence of generated words. To generate a query, we leverage video descriptions and assign semantic roles to each phrase in these descriptions. Replacing a particular semantic-role with a query token produces a query-answer pair. We illustrate this in Figure 4.1 (details in Section 4.3.1). While free-form answer generation is highly desirable, evaluating them is non-trivial due to two main challenges. First, existing language generation metrics like BLEU [177] or BERTScore [301] operate on sentences rather than phrases. When applied to short phrases, in the absence of context, even close matches like “A person” and “The man” would be falsely rejected due to no n-gram overlap or poor contextual 72 Video description: A man on top of a building throws a bowling ball towards the pins Q4: throws a bowling ball towards the pins. Model’s generated answer: A man standing on a house Correct answer: A man on top of a building Q5: A man on top of a building a bowling ball towards the pins. Model’s generated answer: throws Correct answer: throws Q6: A man on top of a building throws towards the pins. Model’s generated answer: a ball Correct answer: a bowling ball Q7: A man on top of a building throws a bowling ball Model’s generated answer: towards some bottles Correct answer: towards the pins (b) Free-form Answer Generation Semantic Roles: ARG0 V ARG1 ARG2 Q1: Who throws a bowling ball towards the pins? 
Model’s Top Predictions: A: A man B: A man under the tree C: A person D: This boy Correct Answer: A man on top of a building Q2: Does a man on top of a building throw a bowling ball towards to pins? Model’s Top Predictions: A: Yes B: No C: Maybe Correct Answer: Yes Q3: A man on top of a building throws a bowling ball towards the ___. Model’s Top Predictions: A: field B: pins C: basket D: man Correct Answer: pins (a) N-way Classification of Phrases Figure 4.1: Previous methods formulate VidQA as a N-way classification task. The questions are converted via question generation tool (Q1, Q2) or masking-out strategy (Q3). However, such QA has a theoretical recall upper bound when the correct answer is not among the choice list. In comparison, we propose a free-form text generation task which do not suffer such limitation (Q4-Q7) 73 Dataset Source #Clips Clip Duration(s) #QA-Pairs # QA / Clip Task Type Scripts Box QA Pair Creation Movie-QA Movies 6771 202.7 6462 0.95 MC ✓ ✗ Human Movie-FIB Movies 128,085 4.8 348,998 2.72 OE ✗ ✗ Automatic VideoQA* Internet videos 18100 45 174,775 9.66 OE ✗ ✗ Automatic MSVD-QA Internet videos 1,970 9.7 50,505 25.64 OE ✗ ✗ Automatic MSR-VTT-QA Internet videos 10,000 14.8 243,680 24.37 OE ✗ ✗ Automatic TGIF-QA Tumblr GIFs 62,846 3.1 139,414 2.22 OE+MC ✗ ✗ Human+Automatic TVQA TV Show 21,793 76 152,545 7 MC ✓ ✗ Human TVQA+ TV Show 4200 61.5 29,383 7 MC ✓ ✓ Human ActivityNet-QA* Internet videos 5800 180 58000 10 OE ✗ ✗ Human ASRL-QA Internet videos 35805 36.2 162091 5.54 OE + Phrase ✗ ✓ Automatic Charades-SRL-QA Crowd-Sourced 9513 29.85 71735 7.54 OE + Phrase ✗ ✗ Automatic Table 4.1: Comparison of Existing datasets for VidQA with our proposed ASRL-QA and Charades-SRL-QA. Here, OE = OpenEnded, MC = Multiple Choice. “Scripts”: if answering questions requires access to scripts or subtitles. “Box”: if dataset provides bounding box annotations. *: Includes Yes/No questions embeddings. Second, natural language questions often have strong language priors making it difficult to ascertain if the model retrieved information from the video. To propose a reasonable evaluation metric, we revisit our fill-in-the-phrase formulation. Since we know where exactly the generated answer fits in the original query, we can create a complete sentence. With this key insight, we propose relative scoring: using the description as reference sentence, we compute the metrics once by replacing the query-token once with the predicted answer phrase and once with an emptystring. The model’s performance is measured by the relative improvement from the predicted answer compared to the empty string. In particular, substituting the answer phrase in the query expression allows the computing the contextual embeddings required by BERTScore. To mitigate the language-bias issue, we emulate the procedure proposed by [65] where for a given question, another image (or video in our case) is retrieved which has a different answer for the same question. To retrieve such a video, we use a contrastive sampling method [207] over the dataset by comparing only the lemmatized nouns and verbs within the semantic roles (SRLs). We then propose contrastive scoring to combine the scores of the two answer phrases obtained from the contrastive samples (details on evaluation in Section 4.3.2). To investigate VidQAP, we extend three vision-language models namely, Bottom-Up-Top-Down [5], VOGNet [207] and a Multi-Modal Transformer by replacing their classification heads with a Transformer 74 [244] based language decoder. 
To facilitate research on VidQAP we construct two datasets ActivityNetSRL-QA (ASRL-QA) and Charades-SRL-QA and provide a thorough analysis of extended models to serve as a benchmark for future research (details on model framework in Section 4.3.3 and dataset creation in Section 4.4.1). Our experiments validate the merits of moving away from N-way classification, and further show even among sequence generation models there exists a large disparity in performance across semantic-roles (i.e. queries for some roles can be answered very easily compared to other roles). Moreover, certain roles hardly benefit from vision-language models suggesting room for improvement. Finally, we investigate the effects of relative scoring and contrastive scoring for VidQAP with respect to BertScore. Our contributions in this work are two-fold: (i) we introduce VidQAP and propose a systematic evaluation protocol to leverage state-of-art language generation metrics and reduce language bias (ii) we provide extensive analysis and contribute a benchmark on two datasets evaluated using three vision-language models. Our code and dataset are publicly available. ∗ 4.2 Related Works Question Answering in Images has received extensive attention in part due to its end-user applicability. Key to its success has been the availability of large-scale curated datasets like VQA v2.0 [65] for visual question answering and GQA [89] for relational reasoning. To address the strong language priors, the datasets are balanced by retrieving images which given the same question lead to a different answer. However, these procedures cannot be extended for VidQA since crowd-sourcing to retrieve videos is expensive and there exists no scene-graph annotations for videos. In this work, we perform the retrieval using lemmatized nouns and verbs of the semantic roles labels obtained from video descriptions to balance the dataset. ∗ https://github.com/TheShadow29/Video-QAP 75 Question Answering in Videos: has garnered less attention compared to ImageQA. A major bottleneck is that there is no principled approach to curating a VidQA dataset which reflects the diversity observed in ImageQA datasets. For instance, naively crowd-sourcing video datasets leads to questions about color, number which is same as ImageQA datasets and doesn’t reflect any spatial-temporal structure. To address this issue, TGIF-QA [91] and ActivityNet-QA [281] use a question-template to enforce questions requiring spatio-temporal reasoning but forgo the question diversity. An orthogonal approach is to combine VidQA with movie scripts [240] or subtitles [123]. However, this severely restricts the domain of videos. Moreover, recent works have noted that language-only baselines often outperform vision-language baselines [92, 268, 289]. A separate line of related research has focused on scene-aware dialogue [4]. Instead of a single annotator providing both questions and answers, the annotation procedure follows a two-player game setup with one player asking a question and the other player answering with the roles switching after each turn. However, the evaluation method utilizes recall metrics which require the set of phrases to be known apriori. As a result, it doesn’t strictly measure the performance of free-form generation but rather how well the ground-truth answer is ranked given a competing set of phrases which is analogous to multiple-choice questions. 
Automatic Question Generation: Due to the above limitations, the dominant approach to create large-scale VidQA dataset has been automatic question generation from existing video descriptions which can be easily crowd-sourced. Our proposed formulation of using SRLs to generate query-expressions falls in this category. Prior works include VideoQA [292], MSR-VTT-QA and MSVD-QA [265] which use a rule based question generator [78] to convert descriptions to questions and Movie-Fill-in-the-Blanks [159] which mask outs at most one word which could be a noun, adjective or verb in a sentence. In comparison, our method poses VidQAP as fill-in-blanks but with phrases, explicitly asks questions about actions, and the answer phrases are not constrained to a fixed set. As a result of this increased space of phrases, methods 76 on existing datasets cannot be directly applied to VidQAP. To enable further research, we contribute two datasets ASRL-QA and Charades-SRL-QA. In Table 4.1 we compare these with existing VidQA datasets. SRL in Vision: has been explored in the context of human object interaction [70], situation recognition [273], and multi-media extraction [133]. Most related to ours is the usage of SRLs for grounding [228] in images and videos [207]. Our work builds on [207] in using SRLs on video descriptions, however, our focus is not on grounding. Instead, we use SRLs primarily as a query generation tool and use the argument as a question directive. 4.3 Design Considerations for VidQAP The VidQAP task is conceptually simple: given a video and a query expression with a query-token, a model should output an answer phrase that best replaces the query-token. This leads to three main design considerations: (i) How to generate a query-expression from existing resources (Section 4.3.1) (ii) How to evaluate the answer phrases returned by a model (Section 4.3.2) (iii) What modeling framework choices enable VidQAP (Section 4.3.3). 4.3.1 Using SRLs to Generate Queries for VidQAP We first briefly describe semantic-role labels (SRLs)† . Then we detail how SRLs are used to create VidQAP queries. Query Generation Using SRLs: Semantic Role Labels (SRLs) provide a high-level label to entities extracted from a sentence in the form of who (ARG0), did what (V) to whom (ARG1) [231]. Other roles such as to whom / using what (ARG2) and where (LOC) are also common. As a pre-processing step, we assign SRLs to video descriptions using a state-of-art SRL labeler [224]. A particular description could consist of multiple verbs, in which case, we consider each verb and its associated SRLs independently. †Detailed discussion is provided in supplementary. A demo is available here: https://demo.allennlp.org/semantic-rolelabeling 77 A person moves exercise equipment around in the office ARG0 V ARG1 DIR LOC Query-Expressions Answers moves exercise equipment in the office A person A person exercise equipment in the office moves A person moves in the office exercise equipment A person moves exercise equipment in the office A person climbs down with his hands folded ARG0 V DIR MNR (a) Following SRLs are considered: ARG0,ARG1,ARG2,V,LOC to generate query-expressions and answers. Here, the phrase corresponding to the semanticrole DIR is removed from both query-expressions and answers. (b) Query-expressions would have less than 3 semantic-roles and hence ignored. Figure 4.2: Illustration of our query generation process. In (a) DIR is ignored from both Query and Answers. 
In (b) the question is removed from validation set since at most two arguments from considered set are present. For a particular semantic-role, we substitute the corresponding phrase with a query token to generate the query expression. The replaced phrase is the corresponding answer. Using this method we are able to generate multiple queries from a single description. An added merit of using SRLs is that query phrases are centered around “verb-phrases” which are highly relevant to the video content. Generating queries using every SRL is not beneficial as some SRLs are more concerned with phrasing of the language rather than the video. For instance, in the phrase “Players are running around on the field”, if we mask out the word “around” (DIR), it can be answered without looking at the video. To address the above issue, we confine our description phrases to a fixed set of semantic-roles namely: ARG0, ARG1, V, ARG2, ARGM-LOC. Only those phrases which belong to the above set of SRLs may appear in the queryexpression or as an answer phrase. We further remove phrases which have only two arguments as these are too ambiguous to fill. Figure 4.2 illustrates these steps. While using a slot for each slot could potentially limit the vocabulary used in each slot (for instance, the vocabulary set for could be limited to a small number of objects), empirically we don’t 78 Query Expression: A person exercise equipment. Reference (Ground Truth): A person moves exercise equipment. Hypothesis (Prediction): A person lifts exercise equipment. Baseline (Empty String): A person exercise equipment. ! = B(Ref, Base), β = B(Ref, Hyp), " = B(Ref, Ref) Br (Ref, Hyp) = β !! " !! Relative Metric Score Figure 4.3: Illustration of the Relative Metric Computation. “moves” is the ground-truth answer and “lifts” is a model’s prediction. Relative Metric compares the relative improvement from using the model’s prediction as compared to an empty string. find this to be the case (see Appendix 4.4.2.3 for detailed statistics). As a result, VidQAP is no simpler than VidQA task. We also remark that generating queries need not be strictly limited to masking out a single SRL and one could easily mask multiple SRLs in the same description. However, we find two problems: first, for many cases, the output of masking multiple SRLs becomes exceedingly similar to video description task; second, using contrastive scoring (described in Section 4.3.2) for multiple SRLs becomes considerably more involved. As a result, in this work, we focus on using a single SRL and keep the generalization to include multiple SRL queries for future work. 4.3.2 Evaluating Answer Phrases A key challenge in VidQAP is the lack of any standard protocol to evaluate free-form generated phrases. A simple way is to adopt metrics like BLEU [177], ROUGE [138], METEOR [12], and CIDER [247] which are already used for captioning in images and videos. However, these metrics suffer from limited generalization: BLEU, ROUGE, and CIDER require exact n-gram matches. While this is fine for captioning where longer phrases average out errors, answers phrases are typically much smaller than a complete sentence. This leads to many near-correct answers receiving very low scores. 79 A person holding in their hands Answer: a dog Answer: a hair dryer Figure 4.4: Illustration of Contrastive Sampling Process. For the same query-expression, we retrieve two videos with different answers. The model is required to correctly answer both the original and contrastive sample query. 
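To make this query-generation step concrete, the following minimal sketch masks one semantic role at a time; the SRL parse format and the query-token string <Q-role> are illustrative placeholders rather than the exact data structures used in our pipeline.

```python
# A minimal sketch of the query generation of Section 4.3.1.
# Assumption: an SRL parse is a list of (role, phrase) pairs for one verb;
# the "<Q-{role}>" token string is a placeholder choice.

CONSIDERED_ROLES = ["ARG0", "V", "ARG1", "ARG2", "ARGM-LOC"]

def generate_queries(srl_parse):
    """Turn one verb and its role phrases into (query, answer) pairs."""
    # Keep only phrases whose role is in the considered set.
    roles = [(r, p) for r, p in srl_parse if r in CONSIDERED_ROLES]
    # Descriptions with fewer than three considered roles are too ambiguous to fill.
    if len(roles) < 3:
        return []
    queries = []
    for i, (role, answer) in enumerate(roles):
        # Replace the i-th role phrase with a query token; the rest form the query.
        parts = [f"<Q-{r}>" if j == i else p for j, (r, p) in enumerate(roles)]
        queries.append({"role": role, "query": " ".join(parts), "answer": answer})
    return queries

# Example mirroring Figure 4.2 (a): the DIR phrase is dropped before query generation.
parse = [("ARG0", "A person"), ("V", "moves"),
         ("ARG1", "exercise equipment"), ("ARGM-LOC", "in the office")]
for q in generate_queries(parse):
    print(q["role"], "|", q["query"], "->", q["answer"])
```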
This low-score issue for near-correct answers is resolved to a certain extent for captioning by learned metrics like BERTScore [301] which utilize contextual embeddings obtained from large pre-trained models like BERT [44] and RoBERTa [149]. However, answer phrases are usually short and don’t provide meaningful contextual embeddings. In the extreme case when the answer is a single word, for instance when the query is about a verb, these embeddings turn out to be very noisy, leading to a large number of false positives. Relative Scoring: To enable the usage of contextual embeddings, we propose evaluating the relative improvement of the generated answer phrase compared to the ground-truth phrase. We denote the input query expression as Q, the ground-truth answer as Agt, and the predicted answer as Apred. Let Q(X) denote Q with the query token replaced by X. Then for a given metric B, we compute the relative metric Br as (see Figure 4.3 for illustration)

Ref = Q(Agt), Hyp = Q(Apred), Base = Q(“”)
Br(Agt, Apred) = (B(Ref, Hyp) − B(Ref, Base)) / (B(Ref, Ref) − B(Ref, Base))   (4.1)

Note that B(Ref, Ref) = 1 for BLEU, METEOR, ROUGE, and BERTScore but not for CIDEr. The empty-string baseline in Eqn 4.1 could be replaced with predictions from any model trained for this task. In this work, we restrict ourselves to the empty-string baseline due to two desirable properties: its computational simplicity and its being agnostic to models and datasets. Figure 4.5: Schematic of the various models used to benchmark VidQAP. Input Query: “A person picks up ”. Ground-Truth Answer: “a pair of shoes”. (a) Lang-QAP is a language-only model which encodes the query input and passes it to a decoder. (b) BUTD-QAP uses the pooled feature representation from the language encoder and attends over the visual features. (c) VOG-QAP uses an additional phrase encoder and applies a Transformer over the multi-modal features. (d) MTX-QAP consumes both the language and visual features with a multi-modal transformer. We further observe that Eqn 4.1 is very similar to the re-scaling proposed in BERTScore. However, in BERTScore, re-scaling aims at making the score more readable and doesn’t change the relative ranking of the hypotheses. In our case, Eqn 4.1 plays two roles: first, it allows computing the contextual embeddings because the answers are now embedded inside a complete phrase; second, while the ranking is not affected for a particular query, the score differs across queries and hence affects the overall relative metric. Contrastive Scoring: Visual question answering suffers from heavy language priors, and as a result, it is often difficult to attribute whether the image or video played a role in the success. For images, [65] resolved this by balancing the dataset: they crowd-sourced the task of collecting an image that has a different answer for the same question. However, such a crowd-sourcing method is difficult to extend to videos since searching for videos requires a much longer time. This is further complicated by the answers being phrases rather than single words. We simulate the balancing process using the contrastive sampling method of [207]. Specifically, for a given video-query-answer tuple (V1, Q1, A1), we retrieve another video-query-answer tuple (V2, Q2, A2) which shares the same semantic role structure as well as the lemmatized nouns and verbs of the question, but has a different lemmatized noun for the answer. At test time, the model answers each query separately, but the evaluation function requires both answers to be correct.
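As a concrete illustration of Eqn 4.1, the sketch below computes the relative score for any sentence-level metric B(reference, hypothesis); the toy unigram-F1 scorer and the <Q> token string are placeholders, not the metrics or tokens used in our experiments. The thresholded combination with a contrastive sample is formalized next in Eqn 4.2.

```python
# A minimal sketch of relative scoring (Eqn 4.1), assuming `metric` is any
# sentence-level scorer B(reference, hypothesis) -> float.

def relative_score(metric, query, answer_gt, answer_pred, q_token="<Q>"):
    """Relative improvement of the predicted answer over an empty-string baseline."""
    ref = query.replace(q_token, answer_gt)     # Ref  = Q(A_gt)
    hyp = query.replace(q_token, answer_pred)   # Hyp  = Q(A_pred)
    base = query.replace(q_token, "")           # Base = Q("")
    b_ref = metric(ref, ref)    # equals 1 for BLEU/METEOR/ROUGE/BERTScore, not CIDEr
    b_hyp = metric(ref, hyp)
    b_base = metric(ref, base)
    return (b_hyp - b_base) / (b_ref - b_base)

# Toy metric for illustration only: unigram F1 between reference and hypothesis.
def unigram_f1(ref, hyp):
    r, h = set(ref.lower().split()), set(hyp.lower().split())
    if not r or not h:
        return 0.0
    p, rec = len(r & h) / len(h), len(r & h) / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)

q = "A person <Q> exercise equipment"
print(relative_score(unigram_f1, q, answer_gt="moves", answer_pred="moves"))  # 1.0
print(relative_score(unigram_f1, q, answer_gt="moves", answer_pred="lifts"))  # negative:
# under this toy metric the wrong verb scores below the empty-string baseline;
# such negative values are clamped to zero later by the contrastive step (Eqn 4.2).
```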
Since our answers comprise of phrases, the notion of correctness is not absolute (unlike say accuracy metric). Thus, we put a threshold t below which the answer is deemed incorrect. Mathematically, let Si=Br(Agti , Apredi ) be the relative score for sample i, and we are given sample j is a contrastive example for sample i. Then the contrastive score (CSi ) for sample i at a threshold TCS would be CSi = max(Si1[Sj > TCS ∗ B(Refj , Refj )], 0) (4.2) Here 1[] is the indicator variable which is 1 if the expression within brackets is True, otherwise 0. The max operator ensures the scores don’t become negative. For our experiments, we use TCS=0 which requires that the answer for the contrastive sample should be better than an empty string. We further use the contrastive samples to compute a consistency metric. For sample i, the consistency Consi for a threshold Tcons is given by Consi = 1[(Si − Tcons) ∗ (Sj − Tcons) > 0] (4.3) As such, Consistency requires the model to be either correct or incorrect for both the original and the contrastive sample. Combined Metric at a Glance: Given metric B, for a given sample i and contrastive sample j 1. Compute relative metric (Eqn 4.1) for i, j 2. Compute contrastive score (Eqn 4.2) 3. Optionally compute Consistency (Eqn 4.3) 82 We use the prefix “R-” such as in R-B to denote both relative scoring and contrastive scoring is being computed. We report Consistency for BertScore with Tcons=0.1 We note that, by construction, the relative scoring (Eqn 4.1) is positively correlated with human judgment, as the closer, the hypothesis is to the reference, the higher would the score be. The contrastive scoring is a metric used to prevent the model from guessing the correct answer by exploiting language biases and instead use the video to give a suitable prediction. Since humans don’t have the ability to exploit such biases, it is difficult to relate to human evaluation. 4.3.3 Model Framework Models for VidQAP require a language encoder to encode the question, a visual encoder to extract video features, a multi-modal module to jointly learn over vision-language space and a decoder to generate a sequence of words. Inputs include query expression {w} L i=1 (L is number of words), video segment features for F1 frames and optionally k RCNN features for F2 frames. In either case, frames are sampled uniformly from the video segment time-span. While the models differ in their encoding scheme, our language decoder model (Transformer based) used to generate the output answer phrase is kept same across all models with QAP suffix. Lang-QAP: is a language-only (video-blind) model using only the query input. It uses Transformer based encoder to encode the query into qˆ ∈ R L×d . The decoder subsequently uses the last layer output of the encoder (Figure4.5-(a)). BUTD-QAP: Bottom-up-Top-Down [5] is a popular approach for image question answering as well as captioning. It first computes attention between the question and the RCNN visual features to generate an attended visual feature, which is then used with the question to produce an output answer. Here, we replace the RCNN features with the segment features (vˆ ∈ R F1×d ). We can also include RCNN features by 83 projecting them to same dimension as segment features and then concatenate them along the frame-axis (vˆ ∈ R (F1+F2∗k)×d ). For language features, we use the [CLS] token representation from the last layer of the language encoder used in Lang-QAP. 
The output using the language and visual features is (mˆ ∈ R d ) passed to the decoder (Figure 4.5(b)). VOG-QAP: VOGNet [207] has been proposed for grounding objects in videos given a natural language query. Following the architecture, we first derive phrase encoding which corresponds to a single SRL i.e. qˆ ∈ R S×d (S is number of semantic roles). These phrase features are concatenated with the visual features (same as those used in BUTD-QAP (i.e. vˆ)) to get multi-modal features m[l, i]=[ˆvi ||qˆl ] and then reshaped to get m ∈ R S∗F ×d . These multi-modal features are subsequently passed to decoder to generate the output sequence (Figure 4.5 (c)). MTX-QAP: Recently, transformer models pre-trained on large-scale paired image-text data have become popular. Even in the absence of pre-training, such architectures can achieve competitive performance [153]. In the context of videos, ActBert [309] has been proposed. We create a similar architecture to ActBert but we replace their proposed Tangled-Transformer with a vanilla Transformer ‡ . Specifically, we jointly encode the language and visual features in a single transformer and feed the output to the decoder (Figure 4.5 (d)). LangCL and MTxCL: Apart from QAP models, we also consider their phrase classification counterparts where the decoder is replaced with a N-way classifier (two-layered MLP in our case) across a fixed set of phrases. For our experiments, we used N=1k phrases for LangCL and N∈{1k, 10k} for MTxCL. ‡The code for ActBert is not publicly available. 84 4.4 Experiments We briefly discuss the dataset creation process (Section 4.4.1), followed by experimental setup (Section 4.4.3). We then summarize our results (Section 4.5.3) and discuss key-findings. We provide implementation details, qualitative visualizations of our dataset, metrics and trained models in the appendix. 4.4.1 Dataset Creation We create two datasets ASRL-QA and Charades-SRL-QA derived from ActivityNet-Captions [113] and Charades [227] respectively. There are three key steps to create QA datasets from descriptions: (i) assign semantic-roles to the descriptions (ii) perform co-reference resolution so that the questions are self-contained (iii) obtain lemmatized nouns and verbs to perform contrastive sampling. For semantic-role labeling, we use [224]. For co-reference resolution, we use the co-reference resolution model provided by allennlp library [60] which uses the model by [122] but replaces the GloVe [181] embeddings with SpanBERT embeddings [95] § . Since Charades primarily involves videos with a single person, we discard questions involving ARG0. We limit to using a single description per video to avoid repetitive questions. We re-use the same train split for both datasets. For ASRL-QA, test set of ActivityNet is not public and Charades only has a test set but no official validation set. Thus, we split the existing validation set by video names and create the validation and test sets. For both validation and test splits, we remove those questions for which no contrastive sample was found as it indicates data-biases. § https://demo.allennlp.org/coreference-resolution 85 4.4.2 Extended Dataset Construction Details We first discuss semantic-role labeling used in natural language processing. Then, we detail the dataset construction process used for ASRL-QA and Charades-SRL-QA (Section 4.4.2.2) and then provide the dataset statistics (Section 4.4.2.3). 
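As a rough illustration of the SRL step in this preprocessing pipeline, the sketch below runs an SRL predictor from the allennlp library and collapses its BIO tags into (role, phrase) pairs. The model archive path is a placeholder, and the output keys (“words”, “verbs”, “tags”) follow allennlp’s public SRL predictor, which should be verified against the installed version.

```python
# Sketch of SRL preprocessing, assuming the allennlp and allennlp-models packages
# are installed; SRL_MODEL_PATH is a placeholder for a BERT-based SRL archive [224].
from allennlp.predictors.predictor import Predictor

SRL_MODEL_PATH = "path/to/structured-prediction-srl-bert.tar.gz"  # placeholder

def tags_to_roles(words, tags):
    """Collapse BIO tags (B-ARG0, I-ARG0, B-V, O, ...) into (role, phrase) pairs."""
    roles, cur_role, cur_words = [], None, []
    for w, t in zip(words, tags):
        if t.startswith("B-"):
            if cur_role is not None:
                roles.append((cur_role, " ".join(cur_words)))
            cur_role, cur_words = t[2:], [w]
        elif t.startswith("I-") and cur_role is not None:
            cur_words.append(w)
        else:  # an "O" tag closes any open span
            if cur_role is not None:
                roles.append((cur_role, " ".join(cur_words)))
            cur_role, cur_words = None, []
    if cur_role is not None:
        roles.append((cur_role, " ".join(cur_words)))
    return roles

srl_predictor = Predictor.from_path(SRL_MODEL_PATH)
out = srl_predictor.predict(sentence="A person moves exercise equipment around in the office.")
for verb in out["verbs"]:  # one entry per verb in the description
    print(verb["verb"], tags_to_roles(out["words"], verb["tags"]))
```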
4.4.2.1 Semantic Role Labeling Semantic-Role Labels extract out high-level meanings from a natural language description. Two widely used SRL annotations are PropBank [109] and FrameNet [11]. Here we use SRLs which follow PropBank annotation guidelines (see [20] for complete guideline). Most commonly used argument roles are • V: the verb. All remaining roles are dependent on this verb. While the numbered arguments differ slightly based on the verb used, they share common themes across verbs as listed below (see [20] for full details). For instance, “cut” is a Verb. • ARG0: the agent, or the one causing the verb. For most action verbs, this is usually a human or an animal. For instance, “A person cuts a vegetable”, “A person” is ARG0. • ARG1: the object, on which the action is being performed. In “A person cuts a vegetable”, “a vegetable” is ARG1. • ARG2: the tool being used for the verb, or someone who benefits from the verb. For instance, in “A person is cutting a vegetable with a knife”, “with a knife” denotes the tool and is ARG2. In “A person throws a basketball to the basket”, “to the basket” denotes the benefactor and is ARG2. • ARGM-LOC or simply LOC denotes the place or location where the verb takes place. For instance, in “A person is cutting a vegetable on a plate”, “on a plate” is the LOC. 86 To assign SRLs to language descriptions we use allennlp library [60] which provides an implementation of a BERT [44] based semantic-role labeler [224]. The system achieves 86.49 F1 score on OntoNotes [189] 5.0 dataset. 4.4.2.2 Construction Process Both ASRL-QA and Charades-SRL-QA follow the same process with few subtle differences: 1. Pre-Process Data: • Assign semantic role labels (SRLs) to video descriptions using SRL labeller [224]. • Remove stopword verbs with lemmas: “be”, “start”, “end”, “begin”, “stop”, “lead”, “demonstrate”, “do”. • For the original descriptions spread across multiple video segments, combine the sentences into a document. Use a co-reference resolution model on this model (we use [122] with SpanBERT embeddings [95] provided in allennlp library [60]). • Replace the following pronouns: “they”, “he”, “she”, “his”, “her”, “it” with the relevant nounphrase obtained from the co-reference resolution output. 2. Query-Generation: • For each verb-role set within a description (each description can have multiple verbs), consider the role set ARG0, ARG1, V, ARG2, LOC for ASRL-QA and ARG1, V, ARG2, LOC for Charades-SRL-QA. • If there are at least 3 verb-roles for the given verb, for each SRL replace it with a query token (with where R is the role). This forms one query. Repeat for all SRLs in the considered set. 87 • The minimum of 3 verb-roles is present to avoid ambiguity in the query. Limiting the argument role-set helps in generating queries less likely to have strong language-priors (though as seen in qualitative examples, some priors are still present). • After the queries are generated, create lemmatized verbs, and nouns set for each query, and store the video segment ids in a dictionary. This is similar to the process used in [207], with the difference that we additionally have query-tokens. • For each query, use the dictionary to sample set of video segment ids which share the same semantic role structure, but for the query-token have a different answer. These are used for matching when computing the scores for the validation and testing set using the contrastive score. 3. Creating Train/Test Splits: • Keep the training set for each dataset the same. 
• For validation and testing, we split the dataset based on the video ids (half video ids are set as validation, and half as testing). The queries are then split based on the video ids. • Note that while contrastive sampling is done before validation test split. So validation and test ids are used for computing the other’s score for contrastive sampling. This is similar to the setting used in [207] as the total number of videos available for validation, and testing are insufficient for contrastive sampling. 4.4.2.3 Dataset Statistics Dataset statistics can be found in Table 4.2. Lemma distributions are visualized in Figure 4.6 Overall, we find slightly skewed distribution of Argument roles across the datasets. For instance, ARG0, ARG1 are 88 ASRL-QA Charades-SRL-QA Train Val Test Train Val Test Overall Videos 30337 2729 2739 7733 860 876 Queries 147439 7414 7238 59329 4431 4520 Query Length 8.03 6.03 6 7.11 5.6 5.62 Answer Length 2.2 2.33 2.33 1.83 1.96 1.94 Vocabulary 4597 3261 3261 1479 884 884 ARG0 Videos 24483 1372 1419 Queries 37218 1603 1643 Query Length 7.31 5.73 5.65 Answer Length 2.51 2.37 2.48 Vocabulary 1763 840 840 V Videos 29922 1737 1733 7733 802 811 Queries 52447 2247 2187 27745 1824 1829 Query Length 9.2 7.26 7.18 7.7 6.37 6.44 Answer Length 1 1 1 1 1 1 Vocabulary 1860 1167 1167 678 377 377 ARG1 Videos 24863 1810 1793 7600 808 828 Queries 36787 2250 2179 21557 1857 1874 Query Length 7.4 5.4 5.43 6.43 5.07 5.04 Answer Length 2.8 2.82 2.83 2.31 2.39 2.39 Vocabulary 3560 2124 2124 935 527 527 ARG2 Videos 12048 850 805 5433 490 522 Queries 14321 941 886 8279 651 699 Query Length 7.49 5.45 5.36 6.94 5.13 5.13 Answer Length 3.55 3.69 3.62 3.11 3.22 3.04 Vocabulary 2607 1326 1326 556 365 365 LOC Videos 6025 340 319 1578 87 112 Queries 6666 373 343 1748 99 118 Query Length 7.57 5.17 5.35 6.93 4.75 5.06 Answer Length 3.61 3.87 3.63 3.22 3.19 3.08 Vocabulary 1390 669 669 265 138 138 Table 4.2: Detailed dataset statistics for both ASRL-QA and Charades-SRL-QA with respect to different argument roles. Recall that ARG0 is not present in Charades-SRL-QA, and hence the corresponding rows are kept blank. much more frequent than ARG2 and LOC. Also, since every SRL needs to have a verb (V), the distribution of the videos is the same as the overall. 89 ARG0 ARG1 ARG2 LOC V girl(1860) man(11713) people(3117) person(2224) woman(4818) ball(1318) girl(695) man(3007) people(1079) woman(1266) chair(163) man(443) table(336) water(226) woman(287) field(197) man(170) room(221) side(178) water(235) hold(1638) play(1844) put(1420) sit(1302) stand(2187) (a) Top-5 lemmatized nouns or verbs for the considered semantic roles in ASRL-QA ARG2 ARG1 LOC V ball(1318) girl(695) man(3007) people(1079) woman(1266) chair(163) man(443) table(336) water(226) woman(287) field(197) man(170) room(221) side(178) water(235) hold(1638) play(1844) put(1420) sit(1302) stand(2187) (b) Top-5 lemmatized nouns or verbs for the considered semantic roles in Charades-SRL-QA Figure 4.6: Lemma Distribution for both ASRL-QA and Charades-SRL-QA. The number of instances across the whole dataset are given in the parenthesis of each lemmatized noun or verb. As shown in Table 4.2, vocabularies in both the train and validation/test sets for each argument role (slot) are reasonably large compared (eg. 60% for ARG1) to the total vocabulary and not too limited. This results is further consistent across both datasets. 
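To make the contrastive-sampling step of the construction process above concrete, the following sketch indexes queries by their queried role and the lemmatized words of the query (a simplified stand-in for “same semantic role structure”); the per-query fields (role, query_lemmas, answer_lemma, video_id) are illustrative assumptions about the data layout rather than the released format.

```python
# Sketch of lemma-based contrastive sampling, under the assumed query fields above.
from collections import defaultdict

def build_index(queries):
    """Group queries by queried role and by the lemmatized nouns/verbs of the query."""
    index = defaultdict(list)
    for q in queries:
        key = (q["role"], frozenset(q["query_lemmas"]))
        index[key].append(q)
    return index

def contrastive_partners(q, index):
    """Queries from other videos sharing the same role and query lemmas
    but with a different lemmatized answer."""
    key = (q["role"], frozenset(q["query_lemmas"]))
    return [c for c in index[key]
            if c["video_id"] != q["video_id"] and c["answer_lemma"] != q["answer_lemma"]]

# Toy example: two location (ARGM-LOC) queries that differ only in the answer lemma.
queries = [
    {"role": "ARGM-LOC", "query_lemmas": {"people", "wash", "clothes"},
     "answer_lemma": "bucket", "video_id": "v1"},
    {"role": "ARGM-LOC", "query_lemmas": {"people", "wash", "clothes"},
     "answer_lemma": "river", "video_id": "v2"},
]
idx = build_index(queries)
print(contrastive_partners(queries[0], idx))  # -> the v2 query whose answer lemma is "river"
```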
4.4.3 Experimental Setup Dataset Statistics: ASRL-QA has 35.7k videos and 162k queries split into train, validation and test sets with 30.3k, 2.7k, 2.7k videos and 147k, 7.5k, 7.5k queries. We observe that the size of validation and test sets are proportionately smaller compared to their respective train sets. This is because only queries with corresponding contrastive sample are included while no such filtering is done for the train set (∼95k queries in train set have a contrastive pair). Charades-SRL-QA contains 9.4k videos and 71.7k queries split across train, validation and test sets with 7.7k, 0.8k, 0.8k videos and 59.3k, 6.1k, 6.2k queries. Despite its smaller size, the size of validation, test sets of Charades-SRL-QA is comparable to ASRL-QA as Charades is 90 curated with the goal of diversifying subject, verb, object tuples. Supplementary material provides further details on the dataset statistics and visualizations. Evaluation Metrics: As discussed in Section 4.3.2, we report the combined metric (i.e. metrics prefixed with “R-”) for the commonly used generation metrics: BLEU, METEOR, ROUGE, CIDEr and BertScore (implementations from [33, 301]). For BLEU, we report the sentence level BLEU-2. All reported results are test set results using the model which performs best on validation set. 4.5 Implementation Details We first report the implementation details for the metrics (Section 4.5.1). Then, we detail the model implementation details (Section 4.5.2). 4.5.1 Metric Implementation For Bleu [177], Rouge [138], Meteor [12], and CIDEr [247] we use the implementations provided in cococaptions repository¶ [33]. For BERTScore we use the official implementation ∥ BLEU-2: computes Bleu with n-gram with n=2. We use sentence-bleu score instead of the more commonly used corpus bleu score. This is further used for contrastive sampling. ROUGE: we use ROUGE-L which computes the longest common sub-sequence. METEOR: we use Meteor 1.5 version [43]. CIDEr: we use CIDEr-D implementation which includes idf-weighting. BertScore: we use BertScore with hash “roberta-large_L17_idf_version=0.3.5(hug_trans=3.0.2)-rescaled” We show examples of computing the metrics. ¶ github url: https://github.com/tylin/coco-caption ∥ github url: https://github.com/Tiiiger/bert_score 91 4.5.2 Model Implementation We report all model implementation details. General Settings: Our code is implemented using Pytorch [179]. For Transformer, we use the implementation provided in FairSeq [174]. The vocabulary consists of 5k words for ASRL-QA and 3k words for Charades-SRL-QA. The segment features are of dimension 3072 and 512 for ASRL-QA and CharadesSRL-QA respectively obtained from TSN [252] and S3D [263] trained on HowTo100M [165] using the loss function presented in [163] ∗∗. The proposal features are of dimension 1024 and only used for ASRL-QA extracted using FasterRCNN [199] trained on Visual Genome [114]. For all cases, we report the output dimension of MLP. Unless otherwise stated, MLP is followed by ReLU activation. Decoder: The decoder uses an input of T ×512 (where T refers to the length of the input embedding). Note that for Lang-QAP, T is same as sequence length of the query, for BUTD-QAP T=1, for VOG-QAP, T is number of SRLs ∗ number of segment features. For MTX-QAP, T is sequence length of query + number of segment features. To generate output sequences, we use the usual beam-search with a beam-size of 2, with a temperature of 1.0. Encoder: Encoder differs based on the specific model. 
All encoders are transformer based using 8 attention heads and 3 layers unless otherwise mentioned. Lang-QAP: The language encoder uses 3 encoding layers, with 8 attention heads each. The embedding layer uses a dimension of 512. BUTD-QAP: We use the same language query, with and pre-pend a [CLS] token. The embedding of the [CLS] token serves as the language embedding, and is passed through a MLP of dimension 512. The language encoder is the same as Lang-QAP. The segment features are passed through MLP of dimension 512. If proposal features are used, they are passed through a separate MLP of dimension 512. The language ∗∗https://github.com/antoine77340/S3D_HowTo100M 92 embedding (also of dimension 512) is used to compute attention score with the visual features, and finally obtain an attended visual feature. These attended visual features are concatenated with the language embedding along the last axis, and then passed to the decoder. VOG-QAP: We use the same language encoder, but further use the SRL phrase start and end-points for the phrase encoder. The phrase encoder uses these start and end points to gather the language embeddings corresponding to these start and end points, concatenate them (dimension 512+512=1024) and use MLP with dimension 512. This gives an output of the phrase encoder of size number of SRLs ∗s512. The phrase encoded query is then concatenated with all the segment features and passed through a MLP. Finally a multi-modal transformer encoder is applied over the phrase encoded input, and is passed to the language decoder. MTX-QAP: We collate all the language tokens (passed through embedding layer) as well as segment features passed through MLP, to get all features of dimension 512. A transformer based encoder is applied on these features, and the output is passed to the decoder. Training: We train using standard cross-entropy loss. The decoder is trained using teacher forcing. All models are trained for 10 epochs with batch size of 32. On a TitanX, for ASRL-QA each epoch takes around 30 − 40 mins. Our training infrastructure included a 8 GPU Titan X machine 4.5.3 Results and Discussions Table 4.3 compares performance of the proposed VidQAP models with N-way classification baselines (denoted with suffix “CL”) on ASRL-QA and Charades-SRL-QA. Comparing Metrics: It is evident that compared to other metrics, R-BertScore shows a higher relative improvement. This is because BertScore allows soft-matches by utilizing contextual embeddings obtained from a pre-trained BERT [44] or Roberta [149] model. 93 ASRL-QA Charades-SRL-QA R-BS Cons R-B@2 R-R R-M R-C R-BS Cons R-B@2 R-R R-M R-C LangCL (1k) 0.253 0.889 0.120 0.098 0.071 0.044 0.293 0.697 0.224 0.209 0.114 0.077 MTxCL (1k) 0.255 0.869 0.130 0.114 0.080 0.050 0.288 0.707 0.215 0.208 0.116 0.075 MTxCL (10k) 0.286 0.788 0.157 0.133 0.100 0.061 0.408 0.695 0.286 0.261 0.142 0.108 Lang-QAP 0.402 0.728 0.228 0.182 0.125 0.095 0.406 0.719 0.277 0.253 0.147 0.121 BUTD-QAP 0.413 0.716 0.237 0.203 0.147 0.105 0.399 0.714 0.271 0.231 0.115 0.105 VOG-QAP 0.414 0.717 0.239 0.204 0.142 0.108 0.442 0.739 0.297 0.274 0.165 0.136 MTX-QAP 0.414 0.715 0.247 0.206 0.149 0.113 0.439 0.757 0.294 0.267 0.157 0.139 Table 4.3: Comparison of our extended models for VidQAP and Classification based (CL) models across two datasets on our proposed Metric. Here, “R-" prefix implies it is the final metric computed after relative scoring and contrastive scoring with threshold 0. 
“BS": BertScore, “Cons”: Consistency on BertScore, B@2: Sentence BLEU-2, R: ROUGE, M: METEOR, C: CIDEr. Reported numbers are on the test set. For classification models, the number within the parenthesis denotes the size of fixed vocabulary of phrases. Best result, Second Best result. ASRL-QA Charades-SRL-QA ARG0 V ARG1 ARG2 LOC V ARG1 ARG2 LOC LangCL (1k) 0.598 0.423 0.102 0.125 0.018 0.564 0.291 0.146 0.173 MTxCL (1k) 0.607 0.399 0.106 0.142 0.019 0.549 0.346 0.152 0.106 MTxCL (10k) 0.697 0.379 0.161 0.144 0.049 0.601 0.445 0.315 0.272 Lang-QAP 0.697 0.519 0.325 0.322 0.145 0.631 0.458 0.33 0.206 BUTD-QAP 0.681 0.515 0.372 0.334 0.162 0.568 0.413 0.316 0.299 VOG-QAP 0.671 0.513 0.366 0.332 0.188 0.63 0.467 0.365 0.305 MTX-QAP 0.702 0.478 0.374 0.344 0.17 0.633 0.455 0.364 0.304 Table 4.4: Comparison of our extended models per SRL. All reported scores are R-BS: BertScore computed after relative scoring and contrastive scoring with threshold 0. Comparison Across Datasets: We find that performance on both datasets follow very similar trends across all metrics. Charades-SRL-QA has slightly higher scores compared to ASRL-QA likely because it has lesser data variations (Charades is mostly confined indoor videos) suggesting findings on either dataset would transfer. Comparison within N-way Classification: We notice that when 1k fixed set of phrases are used classification models show very limited performance. Allowing 10k phrases gives a significant improvement in performance on Charades-SRL-QA (12 points on R-BS) however this doesn’t translate to ASRL-QA. This is because ASRL-QA contains many more probable phrases (29K compared to 8K) in their respective training sets. We also notice that increasing the number of phrases vocabulary coincides with decreasing consistency. 94 ARG0 V ARG1 ARG2 LOC Lang-QAP Direct 0.552 0.9268 0.234 0.302 0.216 Rel Score 0.7 0.534 0.332 0.237 0.1 CS@0 0.697 0.519 0.325 0.322 0.145 CS@0.1 0.69 0.492 0.295 0.28 0.132 CS@0.2 0.68 0.459 0.262 0.212 0.106 CS@0.3 0.657 0.423 0.219 0.149 0.085 MTX-QAP Direct 0.566 0.929 0.269 0.321 0.258 Rel Score 0.706 0.488 0.366 0.25 0.14 CS@0 0.702 0.478 0.374 0.344 0.17 CS@0.1 0.693 0.45 0.343 0.305 0.145 CS@0.2 0.681 0.413 0.306 0.239 0.117 CS@0.3 0.659 0.376 0.27 0.17 0.08 Table 4.5: BertScore Metrics computed Directly on answer phrases. Rel Score: After Relative Scoring. CS@T: Contrastive scoring with threshold T. ARG0 V ARG1 ARG2 LOC Overall BUTD-QAP 0.706 0.506 0.388 0.36 0.196 0.431 VOG-QAP 0.704 0.516 0.366 0.352 0.202 0.429 MTX-QAP 0.685 0.465 0.378 0.355 0.19 0.416 Table 4.6: Effect of Adding Region Proposals. All reported scores are R-BS. Best result, Second Best result. Comparing Free-from Answer Generation (QAP) with N-way Classification (CL): We investigate the advantages of using a decoder network to generate phrases compared to an N-way classification over a fixed set of phrases (denoted with the suffix “CL” and number of phrases used in parenthesis). Table 4.3 shows that both Lang-QAP and MTX-QAP outperform their classification counterparts, namely Lang-CL and MTX-CL on both datasets. This implies the free-form generation are not limited to simply generating the most frequently appearing phrases in the training set, thereby showing its effectiveness. Comparison Across Models: We find that multi-modal models outperform language-only baseline. However, the improvement over language baseline is small. The reason for the small gap is elucidate in Table 4.4 where we report R-BertScore for every considered SRL. 
95 We find a large disparity in performance depending on the SRL. Most strikingly, multi-modal models perform worse than language-only model on ARG0 and V. For ARG0, the strong performance of the LangQAP arises because most of the time the agent who causes an action is a human. Therefore answer phrases having simply “A man” or “A woman” or “A person” leads to reasonable performance. This additionally suggests that grounding “who” is performing the action remains non-trivial. The more surprising result is the strong performance of Lang-QAP on V which is consistent across both datasets despite using contrastive sampling. There are two likely causes. First, the distinction between verbs is not as strict as object nouns, i.e. even similar verbs are classified as a separate verb diminishing the returns of contrastive sampling. For instance, “jumping” and “hoping” have different lemma and thus considered distinct verbs but R-BS would treat them as similar even if the specific action would be classified “jumping” rather than ”hoping”. Second, SRLs such as ARG1 confines the set of possible verbs. For instance, if the object is “glass”, only limited verbs such as “drink”, “hold” are probable. On the remaining arguments namely ARG1, ARG2, and LOC, multi-modal models show a steady improvement over language-only baseline ranging from 1−10%. However, the performance in absolute terms remains very low. As such, our proposed task VidQAP remains extremely challenging for current multi-modal models. Evaluation Metric Scores: In Table 4.5 we record the BertScore computation in three parts: directly computing over the answer phrases, performing relative scoring, finally performing contrastive scoring with different thresholds. We observe that for V, naive computation leads to absurdly high scores. This is because verbs consist of a single word which means the embeddings are not contextual. This is remedied by relative scoring and is further controlled by combining with contrastive sampling. Further note that relative scoring operates differently based on the SRLs. For instance, it increases the score for ARG0 and ARG1 where the answers more often paraphrased the ground-truth questions while 96 for ARG2 and LOC, it decreases the score due to incorrect matches. While contrastive scoring is aimed at reducing language-only bias and as such should always reduce the relative score, we observe increased score in ARG2 for both Lang-QAP and MTX-QAP. This is caused by the max function which restricts the lower-limit to be 0. Effect of Region Boxes: As noted earlier, the visual features can also include region features extracted from an object detector like FasterRCNN [199]. In Table 4.6 we record the effect of including regional features. In particular, we use the GT5 setting used in [207] where 5 region proposals are used from 10 frames uniformly sampled from the video segment. Interestingly, MTX-QAP under-performs than both BUTD-QAP and VOG-QAP on ARG0. A possible reason is that the transformer is unable to effectively reason over both language and vision over such a large range of inputs. 4.5.4 Visualization We visualize the model outputs on ASRL-QA in Figure 4.7 (a), (b), Figure 4.8 (a), (b) and Figure 4.9. For each case, we show the considered input in the first row, and the contrastive sample in the second row. Each row contains 5 frames uniformly sampled from the video segment to be representative of the content observed by the model. 
For every query, we show the ground-truth answer and the outputs from LangQAP, BUTD-QAP, VOG-QAP and MTX-QAP. Overall, we often find Lang-QAP suggesting very probable answers, but as expected they are not grounded in the video. As a result, in either of the original sample or the contrastive sample, it performs poorly. Our observations are consistent with our expectations. 97 Query: play the song on the piano Target Answer: A little girl Lang-QAP: The man BUTD-QAP: A young child VOG-QAP: A woman MTX-QAP: The woman Query: playing a song Target Answer: A man wearing a hat Lang-QAP: A woman BUTD-QAP: A man VOG-QAP: A man wearing a hat MTX-QAP: A man (a) Query of type ARG0 Query: A man a skateboard Target Answer: holding Lang-QAP: riding BUTD-QAP: picks VOG-QAP: holding MTX-QAP: holding Query: Men skateboards Target Answer: riding Lang-QAP: riding BUTD-QAP: riding VOG-QAP: riding MTX-QAP: riding (b) Query of type V Figure 4.7: Queries of Type ARG0 and V on ASRL-QA 98 Query: People hit Target Answer: a pinata Lang-QAP: the ball BUTD-QAP: the pinata VOG-QAP: the pinata MTX-QAP: the pinata Query: The people hit Target Answer: the ball Lang-QAP: the ball BUTD-QAP: the ball VOG-QAP: the ball MTX-QAP: the ball (a) Query of type ARG1 Query: A man sitting Target Answer: behind a drum kit Lang-QAP: on a bed BUTD-QAP: on a drum set VOG-QAP: behind a drum set MTX-QAP: in front of a drum set Query: A man sits next to a baby Target Answer: on a playground swing Lang-QAP: on a bed BUTD-QAP: on the ground VOG-QAP: on a swing MTX-QAP: on a swing (b) Query of type ARG2 Figure 4.8: Queries of Type ARG1 and ARG2 on ASRL-QA 99 Query: A lady washing clothes Target Answer: in a bucket Lang-QAP: in a sink BUTD-QAP: in a bowl VOG-QAP: in a bucket MTX-QAP: in the water Query: People washing their clothes Target Answer: in a river Lang-QAP: in a sink BUTD-QAP: in a lake VOG-QAP: on a river MTX-QAP: in the water Figure 4.9: Queries of Type ARGM-LOC on ASRL-QA 100 4.6 Conclusion In this work, we introduce Video Question Answering with Phrases (VidQAP) where we pose VidQA as a fill-in-the-phrase task. Given a video and query expression, a model needs to compose a sequence of words to answer. We then propose a method to leverage semantic roles from video descriptions to generate query expressions and outline a robust evaluation protocol. This involves computing the relative improvement of the prediction answer compared to an empty string followed by a contrastive sampling stage which reduces language-only biases. We then contribute two datasets ASRL-QA and Charades-SRLQA to facilitate further on VidQAP and benchmark them with three vision-language models extended for our proposed task. 101 Chapter 5 VidSitu: Situation Recognition in Videos. 5.1 Introduction Videos record events in our lives with both short and long temporal horizons. These recordings frequently relate multiple events separated geographically and temporally and capture a wide variety of situations involving human beings interacting with other humans, objects and their environment. Extracting such rich and complex information from videos can drive numerous downstream applications such as describing videos [116, 267, 258], answering queries about them [281, 266], retrieving visual content [166], building knowledge graphs [160] and even teaching embodied agents to act and interact with the real world [276]. 
Parsing video content is an active area of research with much of the focus centered around tasks such as action classification [98], localization [77] and spatio-temporal detection [66]. Although parsing human actions is a critical component of understanding videos, actions by themselves paint an incomplete picture, missing critical pieces such as the agent performing the action, the object being acted upon, the tool or instrument used to perform the action, location where the action is performed and more. Expository tasks such as video captioning and story-telling provide a more holistic understanding of the visual content; but akin to their counterparts in the image domain, they lack a clear definition of the type of information being extracted making them notoriously hard to evaluate [104, 250]. 102 Verb: deflect (block, avoid) Arg0 (deflector) woman with shield Arg1 (thing deflected) boulder Scene city park Verb: talk (speak) Arg0 (talker) woman with shield Arg2 (hearer) man with trident ArgM (manner) urgently Scene city park Verb: leap (physically leap) Arg0 (jumper) man with trident Arg1 (obstacle) over stairs ArgM (direction) towards shirtless man ArgM (goal) to attack shirtless man Scene city park Verb: punch (to hit) Arg0 (agent) shirtless man Arg1 (entity punched) man with trident ArgM (direction) far into distance Scene city park Verb: punch (to hit) Arg0 (agent) shirtless man Arg1 (entity punched) woman with shield ArgM (direction) down the stairs Scene city park 2 Seconds Event 1 0s-2s Event 2 2s-4s Event 3 4s-6s Event 4 6s-8s Event 5 8s-10s Ev3 is a reaction to Ev2 Ev4 is a reaction to Ev3 Ev5 is unrelated to Ev3 Ev3 is enabled by Ev1 Figure 5.1: A sample video and annotation from VidSitu. The figure shows a 10-second video annotated with 5 events, one for each 2-second interval. Each event consists of a verb (like “deflect”) and its arguments (like Arg0 (deflector) and Arg1 (thing deflected)). Entities that participate in multiple events within a clip are co-referenced across all such events (marked using the same color). Finally, we relate all events to the central event (Event 3). The video can be viewed at: https://youtu.be/3sP7UMxhGYw?t=20 (from 20s-30s). Recent work in the image domain [273, 190, 70] has attempted to move beyond action classification via the task of visual semantic role labeling - producing not just the primary activity in an image or region, but also the entities participating in that activity via different roles. Building upon this line of research, we propose VidSRL – the task of recognizing spatio-temporal situations in video content. As illustrated in Figure. 5.1, VidSRL involves recognizing and temporally localizing salient events across the video, identifying participating actors, objects, and locations involved within these events, co-referencing these entities across events over the duration of the video, and relating how events affect each other over time. We posit that VidSRL, a considerably more detailed and involved task than action classification with more precise definitions of the extracted information than video captioning, is a step towards obtaining a holistic understanding of complex videos. To study VidSRL, we present VidSitu, a large video understanding dataset of over 29K videos drawn from a diverse set of 3K movies. Videos in VidSitu are exactly 10 seconds long and are annotated with 5 103 verbs, corresponding to the most salient event taking place within the five 2 second intervals in the video. 
Each verb annotation is accompanied with a set of roles whose values ∗ are annotated using free form text. In contrast to verb annotations which are derived from a fixed vocabulary, the free form role annotations allow the use of referring expressions (e.g. boy wearing a blue jacket) to disambiguate entities in the video. An entity that occurs in any of the five clips within a video is consistently referred to using the same expression, allowing us to develop and evaluate models with co-referencing capability. Finally, the dataset also contains event relation annotations capturing causation (Event Y is Caused By/Reaction To Event X) and contingency (Event X is a pre-condition for Event Y). The key highlights of VidSitu include: (i) Diverse Situations: VidSitu enjoys a large vocabulary of verbs (1500 unique verbs curated from PropBank [176] with 200 verbs having at least 100 event annotations) and entities (5600 unique nouns with 350 nouns occurring in at least 100 videos); (ii) Complex Situations: Each video is annotated with 5 inter-related events and has an average of 4.2 unique verbs, 6.5 unique entities and; (iii) Rich Annotations: VidSitu provides structured event representations (3.8 roles per event) with entity co-referencing and event-relation labels. To facilitate further research on VidSRL, we provide a comprehensive benchmark that supports partwise evaluation of various capabilities required for solving VidSRL and create baselines for each capability using state-of-art architectural components to serve as a point of reference for future work. We also carefully choose metrics that provide a meaningful signal of progress towards achieving competency on each capability. Finally, we perform a human-agreement analysis that reveals a significant room for improvement on the VidSitu benchmark. Our main contributions are: (i) the VidSRL task formalism for understanding complex situations in videos; (ii) curating the richly annotated VidSitu dataset that consists of diverse and complex situations for studying VidSRL; (iii) establishing an evaluation methodology for assessing crucial capabilities needed ∗ Following nomenclature introduced in ImSitu[273], every verb (deflect) has a set of roles (Arg0 deflector, Arg1 thing deflected) which are realized by noun values. Here, we use “value” to refer to free-form text used describing the roles (woman with shield, boulder). 104 Task Required Annotations Dataset Action Classification Action Labels Kinetics[98], ActivityNet [77], Moments in Time [168], Something-Something[64], HVU [46] Action Localization Action Labels, Temp. Segments ActivityNet, Thumos[90], HACS [302], Tacos[197], Charades[227], COIN[239] Spatio-Temporal Detection Action Labels, Temp. Segments, BBoxes AVA[66], AVA-Kinetics[126], EPIC-Kitchens [38], JHMDB[93] Video Description Captions, Temp. Segments ActivityNet[77], Vatex[258], YouCook[39], MSR-VTT [267] , LSMDC [201] Video QA Q/A, Subtitle or Script (optional) MSRVTT-QA[266], VideoQA[292], ActivityNetQA[281], TVQA[123], MovieQA[240] Text to Video Retrieval Text Query, ASR output (optional) HowTo100M[166], TVR[124], DiDeMo[79], Charades-STA[56] Video Object Grounding Text Query, Temp. Segments, BBoxes ActivityNet-SRL[207], YouCookII[306], VidSTG [300],VID-sentence[35] VidSRL Verbs, SRLs, Corefs, Event Relations, Temp. Segments VidSitu Table 5.1: A non-exhaustive summary of video understanding tasks, required annotations and benchmarks. for VidSRL and establishing baselines for each using state-of-art components. 
The dataset and code are publicly available at vidsitu.org.

5.2 Related Work

Video Understanding, a fundamental goal of computer vision, is an incredibly active area of research involving a wide variety of tasks such as action classification [25, 52, 252], localization [141, 140] and spatio-temporal detection [61], video description [258, 116], question answering [281], and object grounding [207]. Tasks like detecting atomic actions at 1-second intervals [61, 260, 235] are short-horizon tasks, whereas tasks like summarizing 180-second-long videos [307] are extremely long-horizon tasks. In contrast, our proposed task of VidSRL operates on a 10-second video at 2-second intervals. It entails producing a verb for the salient activity within each 2-second interval, predicting the multiple entities that fulfill various roles related to that event, and finally relating these events across time. In support of these tasks, the community has also proposed datasets [98, 77, 66] over the past few years. While early datasets were small, with several hundred or thousand examples [230, 119], recent datasets are massive [166], enabling researchers to train large neural models and also employ pretraining strategies [164, 309, 131]. Section 5.5, Table 5.4 and Figure 5.4 provide a comparison of our proposed dataset to several relevant datasets in the field. Due to space constraints, we are unable to provide a thorough description of all the relevant work. Instead, we point the reader to relevant surveys on video understanding [1, 112, 293] and also present a holistic overview of tasks and datasets in Table 5.1.

Visual Semantic Role Labeling has been primarily explored in the image domain under situation recognition [273, 190], visual semantic role labeling [70, 133, 228] and human-object interaction [29, 28]. Compared to images, visual semantic role labeling in videos requires not just recognizing actions and arguments at a single time step but aggregating information about interacting entities across frames and co-referencing the entities participating across events.

Movies for Video Understanding: The movie domain serves as a rich data source for spatio-temporal detection [66], movie description [201], movie question answering [240], story-based retrieval [9], social graph generation [248], and shot style classification [88]. In contrast to much of this prior work, we focus only on the visual activity of the various actors and objects in the scene, i.e. no additional modalities like movie scripts, subtitles or audio are provided in our dataset.

5.3 Semantic Roles: A Brief Summary

Semantic Role Labeling attempts to abstract out, at a high level, who does what to whom [231]. It is a popular natural language processing task that aims to obtain such structured outputs from natural language descriptions. There are multiple sources of semantic roles, such as FrameNet [11], PropBank [176] and VerbNet [22]. Prior work on situation recognition in images (ImSitu) [273] curated its list of verbs (situations) from FrameNet, and the Moments in Time action recognition dataset [168] curated its action vocabulary from VerbNet. However, we qualitatively found both vocabularies insufficient to represent actions, and thus chose PropBank, which contains action-oriented verbs. PropBank has been used for video object grounding [207] but not in the context of collecting semantic roles from visual data.
PropBank contains a set of numbered semantic roles for each verb, ranging from Arg0 to Arg4. Each numbered argument has a specific definition for a particular verb, but some themes are similar across verbs (adapted from the PropBank annotation guidelines [19]†). For the verb "throw":
• Arg0: Agent – the object performing the action, e.g. "person"
• Arg1: Patient – the object on which the action is performed, e.g. "ball"
• Arg2: Instrument, Benefactive, Attribute, e.g. "towards a basket"
• Arg3: Starting Point
• Arg4: Ending Point
• ArgM: Modifier – Location (LOC), Manner (MNR), Direction (DIR), Purpose (PRP), Goal (GOL), Temporal (TMP), Adverb (ADV)

In general, we noticed that Arg3 and Arg4 were exceedingly rare for visual verbs, so we restrict our attention to Arg0, Arg1 and Arg2 for numbered arguments. For modifier arguments, we found Location (LOC) to be universally valid for all video segments. Thus, for those verbs where LOC doesn't usually apply, we additionally add a semantic role "Scene" which refers to "where" the event takes place (such as "living room", "near a lake"). Other arguments were chosen based on their appearance in the MPII Movie Description (MP2D) dataset, and we most commonly used Manner (which suggests "how" the action takes place) and Direction (details in Section 5.5.2). For the rest of the paper, we use ALoc, ADir, AMnr, and AScn to denote the location, direction, manner and scene arguments respectively.

† http://clear.colorado.edu/compsem/documents/propbank_guidelines.pdf

5.4 VidSRL: The Task

State-of-the-art video analysis capabilities like video activity recognition and object detection yield a fairly impoverished understanding of videos by reducing complex events involving interactions of multiple actors, objects, and locations to a bag of activity and object labels. While video captioning promises rich descriptions of videos, the open-ended task definition of captioning lends itself poorly to a systematic representation of such events and evaluation thereof. The motivation behind VidSRL is to expand the video analysis toolbox with vision models that produce richer yet structured representations of complex events in videos than currently possible through video activity recognition, object detection, or captioning.

Formal task definition. Given a video $V$, VidSRL requires a model to predict a set of related salient events $\{E_i\}_{i=1}^{k}$ constituting a situation. Each event $E_i$ consists of a verb $v_i$ chosen from a set of verbs $\mathcal{V}$ and values (entities, location, or other details pertaining to the event described in text) assigned to the various roles relevant to the verb. We denote the roles or arguments of a verb $v$ as $\{A^v_j\}_{j=1}^{m}$, and $A^v_j \leftarrow a$ implies that the $j$-th role of verb $v$ is assigned the value $a$. In Fig. 5.1 for instance, event $E_1$ consists of verb $v$ = "deflect (block, avoid)" with Arg0 (deflector) ← "woman with shield". The roles for the verbs are obtained from PropBank [176]. Finally, we denote the relationship between any two events $E$ and $E'$ by $l(E, E') \in \mathcal{L}$, where $\mathcal{L}$ is an event-relations label set. We now discuss simplifying assumptions and trade-offs in designing the task.

Timescale of Salient Events. What constitutes a salient event in a video is often ambiguous and subjective. For instance, given the 10 sec clip in Fig. 5.1, one could define fine-grained events around atomic actions such as "turning" (Event 2, third frame) or take a more holistic view of the sequence as involving a "fight".
This ambiguity, due to the lack of constraints on the timescales of events, makes annotation and evaluation challenging. We resolve this ambiguity by restricting the choice of salient events to one event per fixed time-interval. Previous work on recognizing atomic actions [66] relied upon 1 sec intervals. An appropriate choice of time interval for annotating events is one that enables rich descriptions of complex videos while avoiding incidental atomic actions. We observed qualitatively that a 2 sec interval strikes a good balance between obtaining descriptive events and the objectiveness needed for a systematic evaluation. Therefore, for each 10 sec clip, we annotate 5 events $\{E_i\}_{i=1}^{5}$. Appendix 5.5.2.1 elaborates on this choice.

Describing an Event. We describe an event through a verb and its arguments. For verbs, we follow recent work in action recognition like ActivityNet [77] and Moments in Time [168] that choose a verb label for each video segment from a curated list of verbs. To allow for the description of a wide variety of events, we select a large vocabulary of 2.2K visual verbs from PropBank [176]. Verbs in PropBank are diverse, distinguish between homonyms using verb-senses (e.g. "strike (hit)" vs "strike (a pose)"), and provide a set of roles for each verb. We allow the values of arguments for the verb to be free-form text. This allows disambiguation between different entities in the scene using referring expressions such as "man with trident" or "shirtless man" (Fig. 5.1). Understanding of a video may require consolidating partial information across multiple views or shots. In VidSRL, while the 2 sec clip is sufficient to assign the verb, roles may require information from the whole video, since some entities involved in the event may be occluded or lie outside the camera view for those 2 secs but are visible before or after. For e.g., in Fig. 5.1, Event 2, information about "Arg2 (hearer)" is available only in Event 3.

Co-Referencing Entities Across Events. Within a video, an entity may be involved in more than one event; for instance, "woman with shield" is involved in Events 1, 2, and 5 and "man with trident" is involved in Events 2, 3, and 4. In such cases, we expect VidSRL models to understand co-referencing, i.e. a model must be able to recognize that the entity participating across those events is the same even though the entity may be playing different roles in those events. Ideally, evaluating coreferencing capability requires grounding entities in the video (e.g. using bounding boxes). Since grounding entities in videos is an expensive process, we currently require the phrases referring to the same entity across multiple events within each 10 sec clip to match exactly for coreference assessment. See supp. for details on how coreference is enforced in our annotation pipeline.

Event Relations. Understanding a video requires not only recognizing individual events but also how events affect one another. Since event relations in videos are not yet well explored, we propose a taxonomy of event relations as a first step – inspired by prior work on a schema for event relations in natural language [82] that includes "Causation" and "Contingency".
In particular, if Event B follows (occurs after) Event A, we have the following relations: (i) Event B is caused by Event A (Event B is a direct result of Event A); (ii) Event B is enabled by Event A (Event A does not cause Event B, but Event B would not occur in the absence of Event A); (iii) Event B is a reaction to Event A (Event B is a response to Event A); and (iv) Event B is unrelated to Event A (examples are provided in the supplementary).

5.5 VidSitu Dataset

To study VidSRL, we introduce the VidSitu dataset that offers videos with diverse and complex situations (a collection of related events) and rich annotations with verbs, semantic roles, entity co-references, and event relations. We describe our dataset curation decisions (Section 5.5.1) followed by an analysis of the dataset (Section 5.5.3).

5.5.1 Dataset Curation

We briefly describe the main steps in the data curation process and provide more information in Appendix 5.5.2.

Video Source Selection. Videos from movies are well suited for VidSRL since they are naturally diverse (wide range of movie genres) and often involve multiple interacting entities. Also, scenarios in movies typically play out over multiple shots, which makes movies a challenging testbed for video understanding. We use videos from Condensed-Movies [9], which collates videos from MovieClips, a licensed YouTube channel containing engaging movie scenes.

Video Selection. Within the roughly 1000 hours of MovieClips videos, we select 30K diverse and interesting 10-second videos to annotate while avoiding visually uneventful segments common in movies, such as actors merely engaged in dialogue. This selection is performed using a combination of human detection, object detection and atomic action prediction, followed by a sampling of no more than 3 videos per movieclip after discarding inappropriate content.

Curating Verb Senses. We begin with the entire PropBank [176] vocabulary of ∼6k verb-senses. We manually remove fine-grained and non-visual verb-senses and further discard verbs that do not appear in the MPII-Movie Description (MP2D) dataset [201] (verbs extracted using a semantic-role parser [225]). This gives us a set of 2154 verb-senses.

Curating Argument Roles. We wish to establish a set of argument roles for each verb-sense. We initialize the argument list for each verb-sense using the Arg0, Arg1, Arg2 arguments provided by PropBank and then expand this using frequently used (automatically extracted) arguments present in descriptions provided by the MP2D dataset.

Annotations. Annotations for the verbs, roles and relations are obtained via Amazon Mechanical Turk (AMT). The annotation interface enables efficient annotations while encouraging rich descriptions of entities and enabling reuse of entities through the video (to preserve co-referencing). See Appendix 5.5.2.2 for details.

Dataset splits. VidSitu is split into train, validation and test sets via an 80:5:15 split, ensuring that videos from the same movie end up in exactly one of those sets. Table 5.3 summarizes the statistics of these splits. We emphasize that each of the three tasks, namely Verb Prediction, Semantic Role Prediction and Co-Referencing, and Event Relation Prediction, has a separate test set.

Multiple Annotations for Evaluation Sets. Via controlled trials (see Sec 5.7.1) we measured the annotation disagreement rate for the train set. Based on this data, we obtain multiple annotations for the validation and test sets using a 2-stage annotation process.
In the first stage, we collect 10 verbs for each 2-second clip (1 verb per worker). In the second stage, we get role labels for the verb with the highest agreement from 3 different workers.

5.5.2 Extended Dataset Collection Details

In this section we describe details on dataset collection, including the curation of verbs and arguments, followed by details on the annotation interface, quality control and reward structure.

5.5.2.1 Dataset Curation

We provide more details on dataset curation which were omitted from Section 5.5.1 of the main paper.

Video Source Selection. As suggested in Section 5.5.1, we aimed at a domain meeting two criteria: the videos should by themselves cover diverse situations (the verb "climb" should not just be associated with rocks or mountains, but also things like the top of a car), and each video should contain a complex situation (the video shouldn't depict someone doing the same task over an extended period of time, which would lower the chances of finding meaningful event relations and be repetitive in verbs and arguments over the entire video). After a brief qualitative analysis, we found instruction-domain videos (HowTo100M [166], YouCookII [306], COIN [239]) to have very fine-grained actions with low diversity and low complexity within small segments, and open-domain sources (ActivityNet [77], Moments in Time [168], Kinetics [98], HACS [293]) to be somewhat diverse but of low complexity within a small segment. This led us to the movie domain, which spans multiple genres, leading to appreciable diversity as well as complexity. We converged on using MovieClips [9] rather than other movie sources such as MPII [201], since MovieClips already provides one stage of filtering to yield interesting videos. While using the same movies as used in AVA [66] was an option, we found that the video retention was quite low (around 20% of the movies are removed from YouTube), and the movies contained long contiguous segments with low complexity. We also note that some other datasets like MovieNet [88], Movie Synopsis Dataset [264] and Movie Graphs [248] do not provide movie videos and cannot be used for collecting annotations. One demerit of using the movie domain is that the verb distributions are skewed towards actions like "talk", "walk", "stare". Despite this, we find the videos to be reasonably complex.

Figure 5.2: Bar graph showing the number of unique verbs with respect to the rank of the video segment as computed via our heuristic based on predicted labels from a SlowFast network [52] trained on AVA [66].

Video Selection. MovieClips spans a total of 1k hours, which is far beyond what can be reasonably annotated. To best utilize the available annotation budget, we are primarily interested in identifying video segments depicting complex situations with high precision while avoiding visually uneventful segments common in movies, such as those simply involving actors engaged in dialogue. To avoid such segments, we use the following heuristic, sketched below: a video with more atomic actions per person is likely to be more eventful.
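As a rough illustration of this heuristic, the sketch below scores a clip by the average number of unique, informative atomic actions per detected person. The helper names and the toy inputs are hypothetical; the discarded labels follow the filtering choices described next.

```python
from typing import Dict, List

# Labels that were found not to correlate with situation complexity.
UNINFORMATIVE = {"talk", "listen", "stand", "sit"}

def clip_eventfulness(actions_per_person: Dict[str, List[str]]) -> float:
    """Average number of unique informative atomic actions per detected person."""
    if not actions_per_person:
        return 0.0
    counts = [
        len({a for a in actions if a not in UNINFORMATIVE})
        for actions in actions_per_person.values()
    ]
    return sum(counts) / len(counts)

# Toy candidate clips with per-person atomic-action predictions.
candidate_clips = [
    {"clip_id": "a", "actions": {"p0": ["fight", "jump"], "p1": ["run", "talk"]}},
    {"clip_id": "b", "actions": {"p0": ["talk", "listen"]}},
]
ranked = sorted(candidate_clips, key=lambda c: clip_eventfulness(c["actions"]), reverse=True)
print([c["clip_id"] for c in ranked])  # "a" ranks above "b"
```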
So, we divide all movieclips into 10-second videos with a stride of 5 seconds, obtain human bounding boxes from a MaskRCNN [74] object detector trained on the MSCOCO [144] dataset, predict atomic actions for each detected person using the SlowFast [52] activity recognition model trained on the AVA [66] dataset, and rank all videos by the average number of unique atomic actions per person in the video. In particular, we discard labels such as "talk", "listen", "stand" and "sit", as these atomic actions didn't correlate with the complexity of situations. Since "action" sequences like "fight scenes" are favored by our ranking measure, we use the simple heuristic of removing "martial arts" actions to avoid oversampling such scenes and improve the diversity of situations represented in the selected videos.

Figure 5.3: Illustration of our annotation interface. (a) depicts the initial screen an annotator sees. In the first step, one needs to watch the entire 10-second video. (b) depicts the second step of choosing a verb from a drop-down which contains verb senses obtained from PropBank. After selecting a verb, an example usage is shown along with corresponding argument roles which need to be filled. (c) depicts filling the argument slots for each verb, which can be phrases of arbitrary length. Each filled-in phrase can be re-used in a subsequent slot, to enforce co-reference of the entities. (d) shows the final step of choosing event relations once all the arguments for all events are filled. The event relations should be classified based on causality and contingency for Events 1, 2, 4, 5 with respect to Event 3.

To confirm the usefulness of the proposed heuristic, we conduct an experiment where we annotate 1k videos sampled uniformly across the entire dataset (as shown in Figure 5.2). The decline in the number of unique verbs with rank shows the effectiveness of our heuristic and suggests that at least 80K video segments (which translates to 27K non-overlapping video segments) can be richly annotated. For the final video selection, we randomly choose a set of videos from the top-K ranks, such that the newly chosen videos don't overlap with already chosen videos and that no more than 3 videos are uploaded from the same YouTube video within a particular batch.

Curating Verb Senses. To curate verb senses, we follow a two-step process. From the initial list of ∼6k verb senses in PropBank [176], we first manually filter verb senses which share the same lemmatized verb (as previously stated, "go" has 23 verb senses) to retain only "visual" verb senses (for instance, we remove the verb sense of "run" which refers to running a business). We keep all 3.7K verbs with a single verb sense, and of the remaining 2364 verb-senses (shared across 809 verbs) we retain 629 verb senses (shared across 561 verbs). Second, to further restrict the set of verbs to those useful for describing movies, we discard verbs that do not appear at all in the MPII-Movie Description (MP2D) dataset [201]. To extract verbs from the descriptions we use a semantic-role parser [225]. This results in a final set of 2154 verb-senses, as sketched below.
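A minimal sketch of this last filtering step, under the assumption that the manually filtered visual verb-senses and the lemmatized verbs extracted from MP2D are already available as Python sets (both inputs below are toy examples, not the real vocabularies):

```python
# Hypothetical inputs: PropBank-style sense ids and MP2D verb lemmas.
visual_verb_senses = {"deflect.01", "leap.01", "punch.01", "run.02", "accrue.01"}
mp2d_verbs = {"deflect", "leap", "punch", "run", "talk", "walk"}

def lemma(verb_sense: str) -> str:
    """'deflect.01' -> 'deflect' (sense ids assumed to follow a 'lemma.NN' naming)."""
    return verb_sense.split(".")[0]

# Keep only senses whose lemma occurs in the MP2D descriptions.
curated = {vs for vs in visual_verb_senses if lemma(vs) in mp2d_verbs}
print(sorted(curated))  # 'accrue.01' is dropped since it never appears in MP2D
```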
Curating Argument Roles. Once we have curated the verb-senses from PropBank, we aim to designate a set of argument roles for each verb-sense which will be filled in based on the video. While PropBank provides numbered arguments for each verb-sense, there are two issues with directly using them: first, some arguments are less relevant for visual scenes (for instance, Arg1 (utterance) for "talk" is not visual); second, auxiliary arguments like direction and manner are not provided (for instance, direction and manner for "look" are important to describe a scene). To address this issue, we re-use the MP2D dataset to inform us which arguments are used with the verbs. For each verb, we choose the 5 most frequently used argument role-sets and use their union. We also remove roles such as TMP (usually referring to words like "now", "then") since temporal context is implicit in our annotation structure. We also removed roles like ADV (adverb) which were too infrequent. Finally, we use the following modifier roles: "Manner", "Location", "Direction", "Purpose", "Goal", but note that "Purpose" and "Goal" were restricted to a small number of verbs and hence not considered for evaluation. We further added the modifier role "Scene", which describes "where" the event takes place and only applies to verbs which don't have "Location". For instance, "stand" has the argument role "Location", which refers to "where" the person is standing, and doesn't have "Scene", whereas "run" doesn't contain "Location" and hence contains "Scene". In general, "Scene" refers to the "place" of the event, such as "in an alleyway" or "near a beach".

Event Relations. We started with a set of three event relations, namely: no relation (Events A and B are unrelated), causality (Event B is Caused By Event A, i.e. B happens directly as a result of A) and contingency (Event B is Enabled By Event A, i.e. A doesn't directly cause B but B couldn't have happened without A happening first), based on prior work in cross-document event relations [82]. However, we found adding an additional case of "Reaction To" for causality helpful to distinguish between event relations. For instance, "X punches Y" followed by "Y falls down" would definitely be "B is Caused By A"; however, for the case "X punches Y" followed by "Y crouches" it is unclear if "B is Caused By A", since Y makes a voluntary decision to crouch. As a result, we call this relation "B is a Reaction To A".

5.5.2.2 Annotation Pipeline

With the videos, the list of verb-senses and their roles curated, we are now ready to crowd-source annotations on Amazon Mechanical Turk (AMT).

Annotation Interface. Figure 5.3 shows screenshots depicting our annotation interface. For annotating a given 10-second video, the assigned worker is instructed to first watch the entire 10-second video (Figure 5.3 (a)). Then for every 2-second interval, the annotator selects a verb corresponding to the most salient event from our curated list of verb-senses using a searchable drop-down menu. Once the verb is chosen, slots for the corresponding roles are displayed along with an example usage (Figure 5.3 (b)). The worker fills in the values for each role using free-form text (typically a short phrase). When referring to an entity, we instruct the worker to use phrases that uniquely identify the entity in the full 10-second video. Furthermore, these phrases can be reused in filling semantic roles in other events within the video, which provides the co-reference information about the entities, i.e.
co-referenced entities are maintained via exact string match (Figure 5.3 (c)). Once all verbs and their roles are annotated, we ask the worker to label the relation of Events 1, 2, 4, and 5 with respect to Event 3 (Figure 5.3 (d)). Note that the order of causality and contingency is different for Events 4, 5 compared to Events 1, 2, respecting the temporal order.

Partitioning into 2-second clips: We emphasize that splitting the video into 2-second intervals is strictly a design choice motivated by a reduction in annotation cost and consistent quality of annotations. In an early version of the data collection, we asked annotators to provide "start" and "end" points for events and allowed overlaps (consistent with other datasets such as ActivityNet Captions [116]). A close analysis showed that the noise in annotations was tremendous, annotation took significantly longer (roughly 3x), and this would lead to a much smaller and lower-quality dataset given a budget. We thus simplified the task via 2-sec interval annotations and saw large improvements in consensus and speed. Clearly, using such a scheme leads to imprecise temporal boundaries for the events. Furthermore, it doesn't allow annotating hierarchical actions. However, we argue that the downsides of this design choice are reasonably mitigated since: (a) longer-duration events get annotated via a repeat of the same verb across consecutive clips (we see many occurrences in our dataset) and (b) in the presence of multiple verbs in a clip, the most salient one gets annotated. The 2s duration was chosen after an analysis of ∼50 videos showed that events typically spanned more than 1s, but clips longer than 2s often contained multiple interesting events that we would not want to discard. Finally, we note that the 2-second duration choice may not be suitable for vastly different domains (e.g. fewer actions and more talking) where 2s may be too dense, and relaxing this to longer clips may be more efficient (annotation-cost wise).

Event Relation Annotation w.r.t. Middle Event: We note there are two alternatives to our proposed annotation strategy for event relations, which involves annotating all events only with respect to the middle event. The first is to exhaustively annotate all event-event relations, which would result in 10 annotations per video. Clearly, this is 2.5× the annotation cost (in practice it is even more challenging). As a result, we decided to restrict relation annotation to a single reference event. The second option is to allow choosing one of the 2-second intervals as the main event and annotating event relations with respect to it. In practice, we found the choice of main event to be subjective and inconsistent across annotations. Moreover, choosing the main event could lead to biased event relations (for instance, the "Caused By" relation would be more pronounced). Thus, we simplified the step by choosing Event 3, spanning from 4-6 seconds, as the main event and annotated other events with respect to Event 3.

Worker Qualification and Quality Control. To ensure that annotators have understood the task requirements, we put up a qualification task where a worker has to successfully annotate 3 videos. These annotations are manually verified by the first author, who then provides feedback on their annotations. To filter potential workers, we restrict to those with a more than 95% approval rate who have completed at least 500 tasks. In total we qualified around 120 annotators, with at least 60 workers annotating more than 30 videos in every batch of 2K videos.
In addition to manual qualification, we put automated checks on the average number of unique verbs provided within a video and the average description lengths. We further manually inspect around 3 random samples from every annotator after every 3K-5K videos and provide constant feedback.

Annotating Validation and Test Sets. We ran a controlled experiment using 100 videos and annotated 25 verbs for each event. We report the human agreement in Table 5.2. To compute the human agreement score for any event, we use one human annotation (out of 25) as a prediction and the remaining 10 or 20 annotations as ground-truths (denoted by 10A or 20A). The final score is the average over all possible prediction/ground-truth partitions. Essentially, we find that even when moving from 10 to 20 annotations, the human agreement improves from 62% to 71%, which suggests that even at a higher number of annotations, we receive verbs which are suggested by a single annotator (and hence no agreement). This rules out metrics like accuracy, precision, or F1 scores because they would penalize predictions that may be correct but are not present in a reasonably sized set of ground-truth annotations. This analysis leads us to the metric Recall@5, which measures if the verbs most agreed upon by humans are indeed recalled by the model in its top-5 predictions. Furthermore, this prompts us to collect the annotations for the validation and test sets in two stages: in the first stage we collect 9 additional annotations for the verb, and then in the second stage 3 annotations for argument roles and event relations given the verb (we choose the set of verbs from the annotator with the highest agreement, followed by the highest number of unique verbs within the video). We find this two-stage process to be of similar cost to obtaining 5 independent annotations, but with the added advantage of being comparable across annotations. In total we annotate 3789 videos for the validation and test sets.

| Acc@1 (10A / 20A) | Acc@5 (10A / 20A) | Recall@5 (10A / 20A)
Majority | 0.20 / 0.21 | 0.66 / 0.75 | 0.03 / 0.02
Human | 0.62 / 0.71 | 0.96 / 1.00 | 0.64 / 0.59
Table 5.2: 10A and 20A denote 10 and 20 annotations respectively. Majority denotes choosing the most frequent verbs for the validation set.

Reward. We set the reward for annotating one 10-second video (for training videos) to $0.75 after estimating the average time of completing an annotation to be around 5 minutes. This translates to around $9/hour. Overall, we received generous reviews for the reward on a popular Turk management website. For the validation and test sets, we set the reward to $0.2 for the first stage (collecting only verbs from 9 annotators) and $0.7 for the second stage (collecting arguments and event relations from 3 annotators). As a result, the cost for annotating a single video in the validation and test set turns out to be $3.9 (0.2 × 9 + 0.7 × 3), which is around 5.2× the cost of annotating a single training video. The total cost for the process comes to around $36.7K (note: this doesn't account for pilot experiments, qualifications, and discarded annotations due to human errors).

Collection Timeline. Collecting the entire training set was done over a period of about 1.2 months, with an additional 1 month for collecting the validation and test sets.

| Train | Valid | Test-Vb | Test-SRL | Test-ER | Total
# Movies | 2431 | 151 | 151 | 153 | 151 | 3037
# Videos | 23626 | 1326 | 1353 | 1598 | 1317 | 29220
# Clips | 118130 | 6630 | 6765 | 7990 | 6585 | 146100
# Verbs Ann / Clip | 1 | 10 | 10 | 10 | 10 |
# Verb Ann | 118130 | 66300 | 67650 | 79900 | 65850 | 397830
# Unique Verb Tuples | 23196 | 1317 | 1341 | 1571 | 1299 | 28724
# Values Ann / Role | 1 | 3 | 3 | 3 | 3 |
# Role Ann | 118130 | 19890 | 20295 | 23970 | 19755 | 202040
Table 5.3: Statistics on splits of VidSitu. Note that VidSitu contains multiple verb and role annotations for the val and test sets for accurate evaluation.
Dataset | Domain | SRLs, Coref | EvRel | Videos | Clips | Descr. | Descr./Clip (Train) | Avg. Clip Len. (s) | Uniq Vbs/Vid | Uniq Ents/Vid | Avg. Roles/Event
MSR-VTT | open | Implicit | ✗ | 7k | 10k | 200k | 20 | 14.83 | 1.88 | 2.80 | 1.56
MPII-MD | movie | Implicit | ✗ | 94 | 68k | 68.3k | 1 | 3.90 | 1.87 | 2.99 | 2.24
ActyNet-Cap | open | Implicit | ✗ | 20k | 100k | 100k | 1 | 36.20 | 2.30 | 3.75 | 2.37
Vatex-en | open | Implicit | ✗ | 41.3k | 41.3k | 413k | 10 | 10.00 | 2.69 | 4.04 | 1.96
VidSitu | movie | Explicit | ✓ | 29.2k | 146k | 146k | 1 | 10.00 | 4.21 | 6.58 | 3.83
Table 5.4: Dataset statistics across video description datasets. We highlight key differences from previous datasets such as explicit SRL, co-reference, and event-relation annotations, and greater diversity and density of verbs, entities, and semantic roles. For a fair comparison, for all datasets we use a single description per video segment when more than one is available.

5.5.3 Dataset Analysis and Statistics

We present an extensive analysis of VidSitu focusing on three key elements: (i) diversity of events represented in the dataset; (ii) complexity of the situations; and (iii) richness of annotations. We provide comparisons to four prominent video datasets containing text descriptions – MSR-VTT [267], MPII-Movie Description [201], ActivityNet Captions [116], and Vatex-en [258] (the subset of descriptions in English). Table 5.4 summarizes basic statistics from all datasets. For consistency, we use one description per video segment whenever multiple annotations are available, as is the case for Vatex-en, MSR-VTT, the validation set of ActivityNet-Captions, and both the validation and test sets of VidSitu. For datasets without explicit verb or semantic role labels, we extract these using a semantic role parser [225].

Figure 5.4: Data analysis. An analysis of VidSitu in comparison to other large-scale relevant video datasets. We focus on the diversity of actions and entities in the dataset (a and b), the complexity of the situations measured in terms of the number of unique verbs and entities per video (c and d), and the richness of annotations (e and f). Panels: (a) diversity of actions, (b) diversity of entities, (c) density of actions per video, (d) density of entities per video, (e) density of semantic roles per event, (f) histogram of coreference chain lengths.

Diversity of Events. To assess the diversity of events represented in the dataset, we consider cumulative distributions of verbs‡ and nouns (see Fig. 5.4-a,b). For any point n on the horizontal axis, the curves show the number of verbs or nouns with at least n annotations. VidSitu not only offers greater diversity in verbs and nouns as compared to other datasets, but a large number of verbs and nouns also occur sufficiently frequently to enable learning useful representations. For instance, 224 verbs and 336 nouns have at least 100 annotations.
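The cumulative diversity statistic behind Fig. 5.4 (a, b) can be sketched as follows; the toy label list stands in for the dataset's verb (or noun) annotations.

```python
from collections import Counter

# Toy stand-in for the list of verb labels across all annotated events.
verb_annotations = ["talk", "talk", "walk", "punch", "punch", "punch", "leap"]

def num_labels_with_at_least(labels, n):
    """Number of distinct labels that occur at least n times."""
    counts = Counter(labels)
    return sum(1 for c in counts.values() if c >= n)

print(num_labels_with_at_least(verb_annotations, 2))  # 2 ("talk", "punch")
print(num_labels_with_at_least(verb_annotations, 3))  # 1 ("punch")
```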
In general, since movies inherently intend to engage viewers, movie datasets such as MPII and VidSitu are more diverse than open-domain datasets like ActivityNet-Captions and VATEX-en.

‡ As a fair comparison to datasets which do not have senses associated with verbs, we collapse verb senses into a single unit for this analysis.

Complexity of Situations. We refer to a situation as complex if it consists of inter-related events with multiple entities fulfilling different roles across those events. To evaluate complexity, Figs. 5.4-c,d compare the number of unique verbs and entities per video across datasets. Approximately 80% of videos in VidSitu have at least 4 unique verbs and 70% have 6 or more unique entities, in comparison to 20% and 30% respectively for VATEX-en. Further, Fig. 5.4-e shows that 90% of events in VidSitu have at least 4 semantic roles, in comparison to only 55% in VATEX-en. Thus, situations in VidSitu are considerably more complex than in existing datasets.

Richness of Annotations. While existing video description datasets only have unstructured text descriptions, VidSitu is annotated with rich structured representations of events that include verbs, semantic role labels, entity coreferences, and event relations. Such rich annotations not only allow for a more thorough evaluation of video analysis techniques but also enable researchers to study relatively unexplored problems in video understanding such as entity coreference and relational understanding of events in videos. Fig. 5.4-f shows the fraction of entity coreference chains of various lengths.

5.5.4 Additional Dataset Statistics

In this section we report additional dataset statistics not included in Section 5.5.3 due to space constraints. In Table 5.5 we report the distributions of event relations before and after filtering for the validation and test sets. For filtering, we use the consensus of two workers, i.e. at least two workers agree on the event relation, which we use as the ground-truth. We largely find that the consensus on Caused By and Reaction To is low, while Enabled By and No Relation are higher.

Figure 5.5: Distribution of the 100 most frequent verbs (a), genre tuples (b), and movies (c). Note that for (a), the count represents the number of events belonging to the particular verb, whereas for (b), (c) it represents the number of video segments belonging to a particular genre or movie.

Figure 5.6: 50 most frequent words (after removing stop-words) for Arg0, Arg1, Arg2, ALoc (location), ADir (direction) and AMnr (manner).

Split | Total | Caused By | Reaction To | Enabled By | No Relation
Train Set | 94016 | 16.94 | 24.05 | 33.76 | 25.25
Val Set | 5304 | 20.99 | 20.29 | 33.82 | 24.88
Val Set* | 4089 (77.09%) | 15.3 | 18.95 | 39.05 | 26.66
Test Set | 6392 | 20.19 | 34.88 | 24.44 | 20.4
Test Set* | 4851 (75.89%) | 13.39 | 19.04 | 40.9 | 26.5
Table 5.5: The distribution of event relations before and after filtering by taking the consensus of at least two workers, i.e. we consider only those instances where two workers agree on the event relation when given the verb.

Next, we plot the distributions of the 100 most frequent verbs, genres and chosen movies in Figure 5.5. For verbs and genres we find Zipf's law in action. For verbs, we find that the most common verbs are "talk", "speak", "walk", "look", which are also part of frequent atomic actions despite us explicitly not scoring them. This is an inherent effect of the movie domain, where dialogue is a large focus.
For genres, we find that "Comedy", "Drama", "Action" and "Romance" are the most frequent; these tend to have more movement than "Mystery" or "Thriller", which involve less actor movement and often extended still frames. In Figure 5.6 we plot the 50 most frequent words within each argument (after removing stop-words). We find that "man" and "woman" are the most frequent words in all of Arg0, Arg1 and Arg2, which is not surprising since the movies are human-centric. We note that the over-abundance of "man" compared to "woman" is an amplification of the biases present in the movies. Interestingly, the distribution is less skewed for Location, Direction, and Manner.

5.6 Baselines

For a given video, VidSRL requires predicting verbs and semantic roles for each event as well as event relations. We provide powerful baselines to serve as a point of comparison for these crucial capabilities. These models leverage architectures from state-of-the-art video recognition models.

Figure 5.7: Models. The figure illustrates our baselines for verb, semantic role, and event-relation prediction using state-of-the-art network components such as the SlowFast [52] network for video feature extraction, transformers [245] for encoding events in a video and verb-conditional decoding of roles, and the RoBERTa [149] language encoder for event-relation prediction. The figure depicts: Step 1: verb prediction and video representation learning; Step 2(a): contextualized event representations (reuses the feature extractor from Step 1); Step 2(b): decoding semantic roles for predicted verbs (jointly trained with the encoder in Step 2(a)); Step 3: event relation prediction (trains a separate encoder similar to Step 2(a)).

Verb Prediction. Given a 2 sec clip, we require a model to predict the verb corresponding to the most salient event in the clip. As baselines, we provide state-of-the-art action recognition models such as I3D [25] and SlowFast [52] networks (Step 1 in Fig. 5.7). We consider variants of I3D both with and without Non-Local blocks [256], and for SlowFast networks we consider variants with and without the Fast channel. For each architecture, we train a model from scratch as well as a model finetuned after pretraining on Kinetics [98]. All models are trained with a cross-entropy loss over the set of action labels. For subsequent stages, these verb classification models are frozen and used as feature extractors. A sketch of the classification head is shown below.
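The classification head described in Section 5.6.1.1 (two MLP projections separated by a ReLU, over the 1560-verb label set) can be sketched as follows; the 2304-dimensional SlowFast feature size is taken from Section 5.6.1.2, and the class name and batch size are illustrative rather than the exact training code.

```python
import torch
import torch.nn as nn

class VerbClassifier(nn.Module):
    """Two-layer MLP head over frozen clip features (sketch, not the exact code)."""

    def __init__(self, feat_dim: int = 2304, num_verbs: int = 1560):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),  # project to half the input dimension
            nn.ReLU(),
            nn.Linear(feat_dim // 2, num_verbs),  # project to the verb label set
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        return self.head(clip_feats)  # logits over verb senses

model = VerbClassifier()
logits = model(torch.randn(8, 2304))                       # a batch of 8 two-second clips
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 1560, (8,)))
top5 = logits.topk(5, dim=-1).indices                      # top-5 verbs used at inference
```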
Argument Prediction Given Verbs: Given a 10 sec video and a verb for each of the 5 events, a model is required to infer the entities and their roles involved in each event. To this end, we adapt seq-to-seq models [237] that consist of an encoder and a decoder (Step 2(a,b) in Fig. 5.7). Specifically, independent event features are fed through a transformer [245] encoder (TxEnc) to get contextualized event representations. Then for each event, the corresponding encoded representation and the verb are passed to a transformer decoder (TxDec) to generate the sequence of arguments and roles for that event. As an example, for Event 1 in Fig. 5.1, we expect to generate the following sequence: [Arg0] woman with shield [Arg1] boulder [Scene] city park. The generated sequence is post-processed to obtain an argument-role structure similar to that of the annotations in Figure 5.1. We also provide language-only baselines using our TxDec architecture as well as a GPT2 decoder.

Event Relation Prediction: A model must infer how the various events within a video are related, given the verbs and arguments. For a pair of ordered events $(E_i, E_j)$ with $i < j$, with corresponding verbs and semantic roles, we construct a multimodal representation of each event, denoted by $m_i$ and $m_j$ (Step 3 in Fig. 5.7). Each of these representations is a concatenation of a visual representation from TxEnc and a language representation of the sequence of verbs, arguments, and roles obtained from a pretrained RoBERTa [149]-base language model. $m_i$ and $m_j$ are concatenated and fed through a classifier to predict the event relation.

5.6.1 Additional Baseline and Implementation Details

We detail some of the implementation details for our models. All implementations are coded in PyTorch [180]. Unless otherwise mentioned, we use the Adam [108] optimizer with a learning rate of 1e-4.

5.6.1.1 Verb Prediction Models

All our implementations of verb prediction models such as I3D [25], Slow-only and SlowFast networks [52] are based on the excellent SlowFast repository [51]. We use the checkpoints from the repository for Kinetics-pretrained models. All models are trained with a batch size of 8 for 10 epochs, and the model with the best recall@5 is chosen for testing. For classification over the set of 1560 verbs, we use two MLP projections (the first projects to half the input dimension, the second to the 1560 verbs) separated by a ReLU activation. For inference, we choose the top-5 scoring verbs. Training requires considerable GPU memory; on 8 TITAN GPUs with a batch size of 8, each epoch takes around 1 hour, for a total of about 10 hours.

5.6.1.2 Argument Prediction Models

We extract features from the underlying base networks, whose dimensions are 2048 and 2304 for I3D and SlowFast respectively. For transformers, we use the implementation provided in the Fairseq library [174]§, and for GPT2 (medium) and RoBERTa (base) we use the implementation in the HuggingFace transformers library [259]¶. For tokenization and vocabulary, we utilize Byte-Pair Encoding and add special argument tokens such as [Arg0] to encode the phrases.

§ https://github.com/pytorch/fairseq/
¶ https://github.com/huggingface/transformers

For both the transformer encoder and decoder we use 3 layers with 8 attention heads. The decoder attends over the last encoder layer outputs for subsequent decoding. For training, we use a cross-entropy loss over the predicted sequence. For sequence generation, we use greedy decoding with temperature 1.0, as we didn't find improvements using beam search or different temperatures. For training, we used a batch size of 16 for all models other than GPT2, for which we could only use a batch size of 8 due to memory restrictions. A minimal sketch of this encoder-decoder baseline follows.
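The sketch below follows the described settings (3 encoder/decoder layers, 8 heads, SlowFast features of dimension 2304); it is not the Fairseq-based implementation, and the module name, model width and vocabulary size are placeholders. It assumes a recent PyTorch version with batch-first transformer layers.

```python
import torch
import torch.nn as nn

class VidTxEncDec(nn.Module):
    """Sketch: contextualize event features, then decode a verb-role token sequence."""

    def __init__(self, feat_dim=2304, d_model=512, vocab_size=50_000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # project video features
        self.embed = nn.Embedding(vocab_size, d_model)  # BPE tokens + special [ArgX] tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, event_feats, tgt_tokens):
        # event_feats: (B, 5, feat_dim), one feature per 2-second event
        # tgt_tokens:  (B, T), verb + role sequence (teacher-forced during training)
        memory = self.encoder(self.proj(event_feats))
        tgt = self.embed(tgt_tokens)
        T = tgt.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # no peeking ahead
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)  # token logits for the cross-entropy sequence loss

model = VidTxEncDec()
logits = model(torch.randn(2, 5, 2304), torch.randint(0, 50_000, (2, 12)))
```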
Training time for GPT2 is around 10 hours over 8 GPUs (recall that GPT2-medium has 24 transformer layers and 16 attention heads). All other models take around 15 minutes per epoch with a batch size of 16 on a single TITAN GPU, with a total time of around 3 hours for 10 epochs, which we found sufficient for convergence. For computing natural language generation metrics like ROUGE and CIDEr we use the official MSCOCO Captions implementation [144]∥. For co-reference metrics, we use the implementation provided in coval [169]∗∗.

∥ https://github.com/tylin/coco-caption
∗∗ https://github.com/ns-moosavi/coval

5.7 Experiments

VidSitu allows us to evaluate performance in 3 stages: (i) verb prediction; (ii) prediction of semantic roles with coreferencing given the video and verbs for each event; and (iii) event relation prediction given the video and the verbs and semantic roles for a pair of events.

5.7.1 Evaluation Metrics

In VidSRL, multiple outputs are plausible for the same input video. This is because of inherent ambiguity in the choice of verb used to describe the event (e.g. the same event may be described by "fight", "punch" or "hit") and in the referring expression used to refer to entities in the video (e.g. "boy with black hair" or "boy in the red shirt"). We confirm this ambiguity through a human-agreement analysis on a subset of 100 videos (500 events) with 25 verb annotations and 5 role annotations per event. Importantly, through careful manual inspection we confirm that a majority of differences in annotation for the same video across AMT workers are due to this inherent ambiguity and not due to a lack of annotation quality.

Verb Prediction. The ambiguity in verbs associated with events suggests that commonly used metrics such as Accuracy, Precision, and F1 are ill suited for the verb prediction task, as they would penalize correct predictions that may not be represented in the ground-truth annotations. However, recall-based metrics such as Recall@k are suitable for this task. Since the large verb vocabulary in VidSitu presents a class-imbalance challenge, we use a macro-averaged Recall@k that better reflects performance across all verb-senses instead of focusing on dominant classes. We now describe our macro-averaged Verb Recall@k metric. For any event, we only consider the set of verbs which appear at least twice within the ground-truth annotations (each event in the val and test sets has 10 verb annotations). For event $E_j$ (where $j$ indexes events in our evaluation set), let this set of agreed-upon ground-truth verbs be denoted by $G_j$. We compute recall@k for each verb-sense $v_i \in \mathcal{V}$ (where $i$ indexes verb-senses in the vocabulary $\mathcal{V}$) as

$$R_i^k = \frac{\sum_j \mathbb{1}(v_i \in G_j) \times \mathbb{1}(v_i \in P_j^k)}{\sum_j \mathbb{1}(v_i \in G_j)} \qquad (5.1)$$

where $\mathbb{1}$ is an indicator function and $P_j^k$ denotes the set of top-k verb predictions for $E_j$. Macro-averaged verb recall@k is given by $\frac{1}{|\mathcal{V}|}\sum_i R_i^k$. We report macro-averaged verb recall@5 (R@5) but also report top-1 and top-5 accuracy (Acc@1/5) for completeness.
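A small sketch of the macro-averaged verb Recall@k of Eq. 5.1; the inputs are toy examples, and the average here is taken over verb-senses that appear in at least one ground-truth set.

```python
from collections import defaultdict

def macro_verb_recall_at_k(gts, preds):
    """gts: per-event sets G_j of agreed-upon verbs; preds: per-event top-k sets P_j^k."""
    hits, totals = defaultdict(int), defaultdict(int)
    for G_j, P_j in zip(gts, preds):
        for v in G_j:
            totals[v] += 1
            hits[v] += int(v in P_j)
    per_verb = [hits[v] / totals[v] for v in totals]   # R_i^k per verb-sense
    return sum(per_verb) / len(per_verb)               # macro average over verb-senses

gts = [{"punch", "hit"}, {"leap"}, {"punch"}]
preds = [{"punch", "fight", "hit", "kick", "run"},
         {"walk", "talk", "leap", "sit", "stand"},
         {"talk", "walk", "sit", "stand", "look"}]
print(macro_verb_recall_at_k(gts, preds))  # ~0.83 for this toy example
```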
Semantic Role Prediction and Co-referencing. Given a video and a verb for each event, we wish to measure the semantic role prediction performance. Through a human-agreement analysis we discard arguments such as direction (ADir) and manner (AMnr), which do not have high inter-annotator agreement, and retain Arg0, Arg1, Arg2, ALoc, and AScn for evaluation. This agreement is computed using the CIDEr metric by treating one of the chosen annotations as a hypothesis and the remaining annotations as references for each argument. In addition to reporting a micro-averaged CIDEr score (C), we also compute macro-averaged CIDEr where the macro-averaging is performed across verb-senses (C-Vb) or argument-types (C-Arg). ROUGE-L (R-L) [139] is shown for completeness. Since VidSitu provides entity coreference links across events and roles, we use LEA [169], a link-based co-reference metric, to measure coreferencing capability. Other metrics (MUC [249], BCUBE [8], CEAFE [157]) can be found in the supp. Co-referencing in our case is evaluated via exact string matching over the predicted set of arguments. Thus, even if the predictions are incorrect but the coreference is correct, LEA would give it a high score. To address this, we propose a soft version of LEA termed LEA-soft (denoted Lea-S), which weights cluster matches using their CIDEr scores (defined in the supp.).

Event-Relation Prediction Accuracy. Event-relation prediction is a 4-way classification problem. For the subset of 100 videos, we found event relations conditioned on the verbs to have 60% agreement. For evaluation, we use the subset of event pairs for which 2 out of 3 workers agreed on the relation. We use top-1 accuracy (Acc@1) averaged across the classes as the metric for relation prediction.

5.7.2 Extended Evaluation Metrics Details

In this section, we provide details on LEA as well as our proposed LEA-soft. We further report additional metrics such as BLEU [177] and METEOR [12], and coreference metrics. We also report per-argument scores for the baselines.

5.7.2.1 Co-Reference Metrics

We primarily use the metric LEA [169], which is a link-based metric. We also note that there exist other metrics such as MUC [249], BCUBE [8] and CEAFE [157]. We point the reader to a seminal paper on visualizing these metrics [187] for a brief overview of MUC, BCUBE and CEAFE, and to [169] for a comparison of other metrics with LEA.

LEA and LEA-soft. As noted in [169], LEA computes an importance score and a resolution score for each entity, combined as

$$\frac{\sum_{e_i \in E} imp(e_i) \times res(e_i)}{\sum_{e_i \in E} imp(e_i)} \qquad (5.2)$$

The final score is the F1-measure computed based on recall (entities are ground-truths) and precision (entities are predictions). As noted earlier, LEA doesn't consider whether the proposed entity by itself is correct, and thus even incorrect entity predictions could lead to a high co-reference score as long as the co-referencing is correct. We address this using LEA-soft, which additionally weights the importance of each entity during the precision computation, with the sum of CIDEr scores in the numerator and the number of mentions in the denominator. As a result, we have

$$Prec_{LEA} = \frac{\sum_{e_i \in E} imp(e_i) \times res(e_i)}{\sum_{e_i \in E} imp(e_i)} \qquad (5.3)$$

$$Prec_{LEA\text{-}soft} = \frac{\sum_{e_i \in E} \left(\sum_{e_i} C(e_i)\right) \times imp(e_i) \times res(e_i)}{\sum_{e_i \in E} |e_i| \times imp(e_i)} \qquad (5.4)$$

where $C(e_i)$ denotes the CIDEr score for the $i$-th entity. We keep the recall computation unchanged and use the modified precision to compute the final F1-score for LEA-soft. Since we have multiple ground-truth references, we compute the F1-score for each ground-truth reference individually and average over the 3 ground-truths.
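The LEA-soft precision of Eq. 5.4 can be sketched as below; the per-cluster importance, resolution and mention-level CIDEr scores are assumed to come from existing LEA and CIDEr implementations, and the values here are purely illustrative.

```python
def lea_soft_precision(clusters):
    """Eq. 5.4: CIDEr-weighted LEA precision over predicted entity clusters e_i.

    Each cluster dict holds imp(e_i), res(e_i) and the CIDEr score of each mention,
    so len(mention_ciders) plays the role of |e_i| in the denominator.
    """
    num = sum(sum(c["mention_ciders"]) * c["imp"] * c["res"] for c in clusters)
    den = sum(len(c["mention_ciders"]) * c["imp"] for c in clusters)
    return num / den if den > 0 else 0.0

clusters = [
    {"imp": 3, "res": 0.67, "mention_ciders": [0.8, 0.9, 0.7]},  # mostly correct phrases
    {"imp": 2, "res": 1.00, "mention_ciders": [0.1, 0.2]},       # coreference right, text wrong
]
print(lea_soft_precision(clusters))  # low-CIDEr mentions drag the precision down
```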
5.7.2.2 Evaluation of Arguments

We examine the CIDEr scores for different arguments over a set of 100 videos (the same set used for the verb prediction results). To compare semantic role values, which are free-form text phrases, we compute the CIDEr metric treating one of the chosen annotations as a hypothesis and the remaining annotations as references for each argument. Table 5.6 compares CIDEr scores for all semantic roles and scores by argument type for a GPT2-based language-only baseline that generates the sequence of roles and values given the verb for an event. We find that human agreement is high for all arguments except direction (ADir) and manner (AMnr). For both "direction" (ADir) and "manner" (AMnr), we find that both the language-only baseline and human agreement are poor. On further inspection, we find that the argument "manner", which describes "how" the event took place, is open to subjective interpretation, and the argument "direction" has a wide range of correct values (e.g. for "walk", the directions "forward", "down the path", and "through the trees" may all be correct). For a reliable evaluation, we evaluate argument prediction performance only on arguments that achieved high human agreement, i.e. Arg0, Arg1, Arg2, ALoc, and AScn, and leave the evaluation of Direction and Manner for future work.

Model | CIDEr | Arg0 | Arg1 | Arg2 | ALoc | AScn | ADir | AMnr
GPT2 | 0.39 | 0.40 | 0.39 | 0.45 | 0.43 | 0.22 | 0.37 | 0.15
Human | 0.70 | 0.73 | 0.74 | 0.73 | 0.90 | 0.96 | 0.40 | 0.15
Table 5.6: CIDEr scores for all collected arguments with 5 annotations on 100 videos.

5.7.3 Results

Verb Classification: We report macro-averaged Rec@5 (preferred metric; Sec. 5.7.1) and Acc@1/5 on both validation and test sets in Tab. 5.8. We observe that verb prediction in VidSitu follows similar trends as other action recognition tasks. Specifically, SlowFast architectures outperform I3D, and Kinetics pretraining significantly and consistently improves recall across all models by ≈10 to 16 points.

Model | Kin. | Val Acc@1 | Val Acc@5 | Val Rec@5 | Test Acc@1 | Test Acc@5 | Test Rec@5
I3D | ✗ | 31.18 | 67.00 | 5.24 | 31.91 | 67.36 | 5.33
I3D+NL | ✗ | 30.17 | 66.83 | 4.88 | 31.43 | 67.70 | 5.02
Slow+NL | ✗ | 33.05 | 68.83 | 5.82 | 34.29 | 69.56 | 6.24
SlowFast+NL | ✗ | 32.64 | 69.22 | 6.11 | 33.94 | 70.54 | 6.56
I3D | ✓ | 29.65 | 60.77 | 18.21 | 29.87 | 59.10 | 19.54
I3D+NL | ✓ | 39.40 | 70.82 | 17.12 | 38.42 | 69.27 | 18.46
Slow+NL | ✓ | 29.05 | 58.69 | 19.19 | 29.03 | 58.77 | 21.06
SlowFast+NL | ✓ | 46.79 | 75.90 | 23.38 | 46.37 | 75.28 | 25.78
Table 5.8: Verb classification metrics. Acc@K: Event Accuracy considering 10 ground-truths and K model predictions. Rec@K: Macro-Averaged Verb Recall with K predictions. Kin. denotes whether Kinetics pretraining is used.

Argument Prediction: We report micro- and macro-averaged versions of CIDEr and ROUGE-L in Tab. 5.7 (see supp. for other metrics). First, video-conditioned models significantly outperform video-blind baselines. Next, we observe that using an encoder to contextualize events in a video improves performance across almost all metrics.

Model | Vis | Enc | Val: C | R-L | C-Vb | C-Arg | Lea | Lea-S | Test: C | R-L | C-Vb | C-Arg | Lea | Lea-S
GPT2 | ✗ | ✗ | 34.67 | 40.08 | 42.97 | 34.45 | 48.08 | 28.1 | 36.48 | 41.33 | 44.27 | 36.51 | 49.38 | 30.24
TxDec | ✗ | ✗ | 35.68 | 41.19 | 47.5 | 32.15 | 51.76 | 28.6 | 35.34 | 41.45 | 44.44 | 32.06 | 52.46 | 29.18
Vid TxDec | SlowFast | ✗ | 44.78 | 40.61 | 49.97 | 41.24 | 37.88 | 28.69 | 44.95 | 41.12 | 49.46 | 41.98 | 38.91 | 30.21
Vid TxEncDec | SlowFast | ✓ | 45.52 | 42.66 | 55.47 | 42.82 | 50.48 | 31.99 | 47.25 | 43.46 | 52.92 | 45.48 | 50.88 | 33.5
Vid TxDec | I3D | ✗ | 47.14 | 40.67 | 51.61 | 41.29 | 37.89 | 30.38 | 47.9 | 41.5 | 51.29 | 43.62 | 38.77 | 31.73
Vid TxEncDec | I3D | ✓ | 47.06 | 42.41 | 51.67 | 42.76 | 48.92 | 33.58 | 48.51 | 42.96 | 53.88 | 44.53 | 49.61 | 35.46
Human* | | | 84.85 | 39.77 | 91.7 | 80.15 | 72.1 | 70.33 | 83.68 | 40.04 | 87.78 | 79.29 | 71.77 | 70.6
Table 5.7: Semantic role prediction and co-referencing metrics. Vis denotes the visual features used (✗ if not used), and Enc denotes if video features are contextualized. C: CIDEr, R-L: ROUGE-L, C-Vb: CIDEr scores averaged across verbs, C-Arg: CIDEr scores averaged over arguments. Lea-S: Lea-soft. See Section 5.7.1 for details.
Interestingly, while SlowFast outperformed I3D in verb prediction, the reverse is true for semantic role prediction. Even so, a large gap exists between current methods and human performance. We also evaluate the coreferencing ability demonstrated by models without explicitly enforcing it during training. In Tab. 5.7, we report both the Lea and Lea-S (preferred; Sec. 5.7.1) metrics and find that current techniques are unable to learn coreferencing directly from data. Among all models, only Vid TxEncDec outperformed a language-only baseline (GPT2) on both the val and test sets, leaving lots of room for improvement in future models.

Event Relation Prediction results are provided in Table 5.9. Crucially, we find that video-blind baselines don't train at all and end up predicting the most frequent class, "Enabled By" (hence they get 0.25 macro-accuracy for always predicting the majority class). This suggests there exist no exploitable biases within the dataset and underscores the importance of, and challenge posed by, event relations. In contrast, video encoder models, even when given just the video without any verb description, outperform the video-blind baselines. Adding context in the form of verb senses and arguments yields small gains. In summary, powerful baselines show promise on the three sub-tasks. However, it is clear that VidSitu poses significant new challenges with huge room for improvement.

Model | Verb | Args | Val Macro-Acc | Test Macro-Acc
Roberta | ✓ | ✓ | 25.00 | 25.00
TxEnc | ✓ | ✓ | 25.00 | 25.00
Vid TxEnc | ✗ | ✗ | 34.13 | 30.97
Vid TxEnc | ✗ | ✓ | 34.54 | 32.89
Vid TxEnc | ✓ | ✓ | 34.15 | 32.98
Table 5.9: Event relation classification metrics. Macro-averaged accuracy on the validation and test sets. We evaluate only on the subset of data where two annotators agree.

5.7.3.1 All Metrics

We report BLEU@1, BLEU@2, METEOR, ROUGE, and CIDEr for both the val (Table 5.10) and test (Table 5.11) sets. For each metric we further report macro-averaged scores across verbs and arguments, and report per-argument scores. Note that only CIDEr is able to take advantage of the macro-averaged scores due to its inverse document frequency re-weighting. Finally, we report the co-reference metrics MUC, BCUBE, CEAFE, LEA and our proposed metric LEA-Soft.
135 Model GPT2 TxDec Vid TxDec Vid TxEncDec Vid TxDec Vid TxEncDec Human Vis Feats ✗ ✗ SlowFast SlowFast I3D I3D B@1 40.91 42.79 43.45 44.65 41.69 45.3 43.56 B@1-Vb 38.08 41.02 39.59 41.98 38.96 40.54 39.93 B@1-Arg 40.91 42.62 42.89 44.49 40.18 44.6 41.69 B@1-Arg0 44.67 46.32 48.26 48.14 49.58 49.36 49.71 B@1-Arg1 31.88 31.69 32.81 34.72 34.76 36.17 40.61 B@1-Arg2 34.13 36.3 34.93 35.86 35.17 37.36 39.87 B@1-ALoc 46.88 48.07 48.97 51.39 42.73 49.37 38.7 B@1-AScn 46.99 50.74 49.48 52.33 38.66 50.71 39.56 B@2 27.66 28.8 29.87 30.86 28.47 30.73 29.89 B@2-Vb 23.92 26.52 25.73 27.54 25 25.39 25.14 B@2-Arg 27.63 28.4 29.19 30.61 26.82 30.06 28.37 B@2-Arg0 31.06 32.07 34.09 33.78 35.33 34.03 34.74 B@2-Arg1 19.53 19.87 20.25 22.39 22.3 22.6 26.72 B@2-Arg2 22.1 23.52 22.22 23.46 21.81 24 26.76 B@2-ALoc 32.92 32.24 34.19 35.98 28.58 34.04 27.06 B@2-AScn 32.55 34.29 35.21 37.42 26.06 35.61 26.59 M 16.99 17.51 17.28 18.26 17.68 18.32 22.24 M-Vb 15.33 16.4 15.8 17.14 16.39 16.77 22.08 M-Arg 15.88 16.03 16.2 17.23 15.93 16.95 21.02 M-Arg0 21.12 21.97 20.99 21.46 22.23 22.05 25.21 M-Arg1 15.49 14.81 13.94 16.14 15.93 16.16 22.22 M-Arg2 14.99 16.27 15.21 15.65 14.76 14.85 20.75 M-ALoc 15.21 13 15.03 16.26 12.17 15.19 17.88 M-AScn 12.59 14.11 15.85 16.63 14.54 16.51 19.02 R 40.08 41.19 40.61 42.66 40.67 42.41 39.77 R-Vb 37.07 37.89 36.89 39.18 36.38 38.14 39.16 R-Arg 39.62 40.47 39.58 41.96 38.56 41.39 38.43 R-Arg0 44.77 46.7 46.78 47.36 48.65 47.71 45.84 R-Arg1 34.25 33.24 32.83 35.7 34.66 36.65 40.23 R-Arg2 33.72 36.14 34.12 35.13 34.71 35.85 36.43 R-ALoc 42.87 41.41 39.82 44.6 32.22 41.49 34.38 R-AScn 42.46 44.84 44.33 46.99 42.55 45.26 35.25 C 34.67 35.68 44.78 45.52 47.14 47.06 84.85 C-Vb 42.97 47.5 49.97 55.47 51.61 51.67 91.7 C-Arg 34.45 32.15 41.24 42.82 41.29 42.76 80.15 C-Arg0 28.33 32.1 41.64 34.6 48.99 39.42 88.24 C-Arg1 38.58 38.47 41.42 45.47 45.42 47.06 83.37 C-Arg2 36.82 40.51 42.28 41.02 40.19 44.52 74.82 C-ALoc 47.77 27.05 43.01 46.97 33.75 39.75 76.72 C-AScn 20.73 22.62 37.86 46.05 38.11 43.03 77.62 MUC 59.13 64.54 45.59 65.48 46.01 61.57 80.75 BCUBE 73.53 74.43 69.39 72.97 68.74 73.34 86.32 CEAFE 61.75 63.84 57.26 59.7 56.2 61.16 77.8 LEA 48.08 51.76 37.88 50.48 37.89 48.92 72.1 LEA Soft 28.1 28.6 28.69 31.99 30.38 33.58 70.33 Table 5.10: Semantic Role Prediction on Validation Set. B@1: Bleu-1, B@2: Bleu-2, M: METEOR, R: ROUGE-L, C: CIDEr, Metric-Vb: Macro Averaged over Verbs, Metric-Arg: Macro Averaged over arguments, Metric-Argi: Metric computed only for the particular argument. 
136 Model GPT2 TxDec Vid TxDec Vid TxEncDec Vid TxDec Vid TxEncDec Human Vis Feats ✗ ✗ SlowFast SlowFast I3D I3D B@1 41.89 42.9 43.4 45.36 43.69 45.56 43.46 B@1-Vb 38.41 39.4 39.28 41.03 39.43 40.52 39.73 B@1-Arg 41.9 42.56 42.84 45.25 42.04 44.83 41.47 B@1-Arg0 45.65 46.06 47.56 48.92 48.96 49.75 48.2 B@1-Arg1 32.17 31.53 33.15 34.46 33.93 35.42 41.06 B@1-Arg2 35.02 37.34 34.85 36.69 36.32 38.55 39.69 B@1-ALoc 48.7 46.53 48.74 52.95 43.91 49.18 36.74 B@1-AScn 47.94 51.34 49.88 53.23 47.07 51.25 41.65 B@2 28.43 29.15 30.08 31.64 30.34 31.34 29.43 B@2-Vb 24.25 25.49 25.83 26.9 25.45 26.22 24.37 B@2-Arg 28.41 28.7 29.42 31.56 28.79 30.59 27.95 B@2-Arg0 31.69 31.92 33.56 34.33 34.84 34.76 32.99 B@2-Arg1 19.8 19.88 20.98 22.69 22.3 22.46 26.88 B@2-Arg2 22.43 24.39 22.36 24.15 23.05 24.81 26.27 B@2-ALoc 34.36 31.63 34.18 37.95 30.69 34.32 25.66 B@2-AScn 33.76 35.67 36.03 38.66 33.05 36.62 27.93 M 17.74 17.67 17.45 18.83 18.22 18.7 21.86 M-Vb 15.8 15.84 15.72 17.02 16.92 16.83 22.44 M-Arg 16.63 16.21 16.46 17.9 16.63 17.44 20.55 M-Arg0 21.82 21.83 20.72 21.96 22.2 22.23 24.61 M-Arg1 15.99 14.97 14.39 16.31 16.28 16.53 21.55 M-Arg2 15.39 16.63 15.15 16.22 15.34 15.41 20.11 M-ALoc 16.41 12.96 15.76 17.63 13.59 16.2 16.89 M-AScn 13.55 14.68 16.3 17.36 15.74 16.82 19.58 R 41.33 41.45 41.12 43.46 41.5 42.96 40.04 R-Vb 37.71 36.96 36.66 38.6 36.69 37.72 39.24 R-Arg 40.91 40.65 40.14 42.88 39.68 42.04 38.55 R-Arg0 45.89 46.6 46.75 48.22 48.69 48.3 45.5 R-Arg1 35.13 33.05 33.35 35.67 34.9 36.34 40.03 R-Arg2 34.13 36.83 33.77 35.26 35.58 36.49 37.29 R-ALoc 45.33 40.96 41.53 47.17 35.1 43.06 32.94 R-AScn 44.04 45.82 45.31 48.08 44.14 46.04 36.97 C 36.48 35.34 44.95 47.25 47.9 48.51 83.68 C-Vb 44.27 44.44 49.46 52.92 51.29 53.88 87.78 C-Arg 36.51 32.06 41.98 45.48 43.62 44.53 79.29 C-Arg0 26.17 27.83 36.84 33.51 41.89 38.64 81.62 C-Arg1 39.08 37.99 42.93 43.79 46.53 46.47 81.47 C-Arg2 35.36 41.93 39.16 39.48 41.66 43.84 73.21 C-ALoc 55.05 25.83 48.3 58.38 43.83 45.15 77.38 C-AScn 26.9 26.71 42.65 52.22 44.18 48.57 82.77 MUC 60.51 65.42 47.51 65.91 47.63 62.62 80.8 BCUBE 74.21 74.76 69.84 72.95 69.2 73.6 86.26 CEAFE 62.19 63.85 57.33 59.57 56.65 61.41 77.38 LEA 49.38 52.46 38.91 50.88 38.77 49.61 71.77 LEA Soft 30.24 29.18 30.21 33.5 31.73 35.46 70.6 Table 5.11: Semantic Role Prediction on Test Set. B@1: Bleu-1, B@2: Bleu-2, M: METEOR, R: ROUGE-L, C: CIDEr, Metric-Vb: Macro Averaged over Verbs, Metric-Arg: Macro Averaged over arguments, Metric-Argi: Metric computed only for the particular argument. 137 5.8 Conclusion We introduce visual semantic role labeling in videos in which models are required to identify salient actions, participating entities and their roles within an event, co-reference entities across time, and recognize how actions affect each other. We also present the VidSitu dataset with diverse videos, complex situations, and rich annotations. 138 Chapter 6 Reasoning Across Images and Videos. 6.1 Introduction Vision-Language tasks, i.e., tasks that require understanding and reasoning over vision and text, have gained widespread popularity in recent years. This increase can be primarily attributed to the user-friendly nature of these tasks, which allow for natural language communication with minimal guidance for the end-user. Popular downstream Vision-Language tasks and benchmarks include Image Classification [42], Visual Question Answering (VQA) [3, 65], Image-Text Retrieval and Captioning [33]. However, such tasks focus on reasoning over a single image or video. 
In this work, we aim to broaden the scope and investigate downstream tasks which additionally require reasoning over a set of images and/or videos. The common approach to train models on Vision-Language tasks is to utilize Vision-and-Language Pre-training (VLP), then fine-tune the model on the downstream task. First, the model is trained on large amounts of, potentially noisy, paired vision and language corpora obtained directly from the web. The pretrained model is then fine-tuned over a range of unimodal or multi-modal downstream tasks with a separate head added for each task. During the pre-training stage, models are trained over synthetic tasks generated from the paired text data with the most commonly used tasks being masked language modeling, imagetext matching, and contrastive learning. These pre-training tasks have several advantages, including the ability to be directly applied to any paired image-text corpus, ease of training, and empirical evidence of 139 Two dolphins jumping out of water during sunsets (a) VLP with MLM + ITM Both Images show dolphins jumping out of water. A: True How many dolphins are present in the image? (b) VQA (c) Im-Im VVT task Both image and the video show dolphins jumping out of water. A: False R: Only left image has dolphin jumping out of water. (d) Im-Vid VVT task Both videos show dolphins jumping out of water. A: False R: Neither video shows dolphins jumping out of water. (e) Vid-Vid VVT task Figure 6.1: Existing fine-tuning tasks such as (b) VQA operate on single image which is similar to pretraining objective (a) such as Masked-Language Modeling or Image-Text Matching. Here, we expand the scope to include reasoning (c) across images or (d) across image and a video. (e) across videos. Here “A” denotes the answer (True/False), and “R” denotes reason. 140 large improvements when fine-tuned on downstream tasks such as VQA and image-text retrieval [48, 106, 130]. However, the downstream tasks used as benchmarks are often close to the original pre-training tasks usually reasoning over a single image and text. In this work, we investigate downstream tasks which additionally require reasoning over a set of images and/or videos. The closest work in this space is NLVR2 [234] where given a pair of images and a corresponding statement, the model is required to classify the statement as True or False. NLVR2 has been used as a diagnostic dataset for a number of vision-language pre-training methods [34, 48]. However, NLVR2 suffers from three key deficiencies: First, the dataset is strictly limited to a pair of images and doesn’t include videos; Second, it is not possible to diagnose why the model classified a statement as True or False as there is no reasoning component; Third, the statements are overwhelmingly about either the existence or the cardinality of objects. We extend the NLVR2 task [234] to include both images and video. For brevity, we denote this task as Reasoning Across Images and Video (RAIV). We leverage annotations from existing datasets with semantically rich annotations, namely ImSitu [273] and VidSitu [212] which provide fine-grained information about the activity, and the entities involved in the activity. This allows us to create new datasets that have statements about image-image (Im-Im), image-video (Im-Vid), and video-video (Vid-Vid). These rich datasets allow the creation of statements about actions, objects, and other semantic roles. 
Further, since our statement queries are generated in an automatic fashion and we have access to the ground-truth annotations, we also explore the task of reasoning, i.e., why the model chose a specific answer (true or false), using a multiple-choice answer framework. Finally, for rich image-image comparison, we also utilize Instruct-Pix2Pix [21] where the image pair consists of the original image and an edited image obtained via a generative model (Stable-Diffusion[203]). Figure 6.1 illustrates this with an example. Though the obtained queries are rich and diverse in semantic content, the queries themselves follow a fixed template structure that doesn’t capture human-like natural language. To fix this issue, we utilize the 141 progress in large-language models [171, 241] and provide the reference captions obtained from the source annotations to generate queries. We note that RAIV involves more than one image and video input which is different from the conventional vision-language pre-training setup. To bridge this gap, we introduce a second pre-training step which is task-specific before fine-tuning on the target downstream dataset. For this task-specific pretraining, we leverage the same dataset employed in the initial pre-training, and don’t require access to the downstream dataset. We exploit object detectors as well as provided image and video captions to obtain semantic roles to create synthetic pairs for RAIV task. Specifically, we initialize the weights from a pretrained vision-language model. The model is then trained for the downstream tasks but is confined to the original pre-training datasets. Our experiments show that while pre-training is quintessential to obtaining state-of-art results, taskspecific pre-training leads to significant gains (over 1-3%). The differences are further exacerbated in imagevideo and video-video tasks. We also find task-specific pre-training can achieve competitive performance even with significantly smaller amount of downstream dataset. Our main contributions can be summarized as (i) introducing Reasoning Across Images and Video (RAIV) task with multiple datasets ranging from Im-Im, Im-Vid and Vid-Vid (ii) task-specific pre-training for RAIV and (iii) detailed ablative study and benchmark with multiple baselines. Our code and dataset are available ∗ . 6.2 Related Works Vision-Language Pre-Training (VLP) has effectively become the standard for almost every visionlanguage task. Earlier works replicated the success of language pre-training in GPT [192], BERT[45] to ∗ https://github.com/TheShadow29/raiv-task 142 Verb Cut Arg0 Person Arg1 Bread Arg2 Knife Loc Chopping Board Verb Eat Arg0 Person Arg1 Bread Arg2 Hand Loc Kitchen Common Object <c-obj>: Bread False Template: <c-obj> is present only in the left image. True Templates: In both images, <c-obj> is present. Verb Eat Arg0 Person Arg1 Bread Arg2 Hand Loc Restaurant Image-1 Image-2 Image-3 (1 & 2) In both images, bread is present. (True) In only left image bread is present. (False) (2 & 3) In both images, a person is eating bread. (True) In both images, a person is eating in a restaurant. (False) Figure 6.2: Sentence generation for RAIV tasks. Given images from ImSitu (same process applies for videos from VidSitu) along with their SRLs, we find the common object (in this case bread) and use them along with True/False templates to generate sentences. the image-language domain using pre-extracted object features such as LXMERT[238], ViLBERT[153], VLBERT [233], UNITER[34]. 
Recent works extend the vision-transformer (ViT) architectures [47] to visionlanguage transformers such as ViLT [106], ALBEF [130], METER [48] and learn directly from patches from raw images. Such models can be initialized from strong vision backbones trained via contrastive losses over a very large image-text corpus such CLIP [191] and ALIGN [94]. Here, our aim is not to design a new architecture, but instead to validate the generalization of existing pre-training losses to downstream tasks which differ considerably from the pre-training tasks in their input format. For our experiments we use METER [48] as our base model, but also show comparisons with ALBEF[130], VinVL [296], FROZEN [10]. Fine-Tuning for most common image-language tasks such as VQA, image-text retrieval, and imagecaptioning involves adding a task-specific head and training it over the target dataset. As noted before, downstream tasks often vary based on input type such as in NLVR2 which requires two images instead of one. To accommodate this, previous work [238, 34, 36, 48] create new image token type embedding. Such heuristic has largely been successful in improving results over non-pre-trained models. Different from 143 previous work which performs additional training on the target domain, our focus is to perform training on original pre-training datasets with additional synthetic tasks. Here, we re-use the same idea of new image-type embedding but don’t differentiate between images and videos, essentially treating images as single-frame videos. Visual Semantic Role Labeling (SRLs) for Reasoning has been previously explored under humanobject interaction [28], situation recognition [273, 190, 212, 102]. In this work, we utilize SRL annotations from existing datasets, particularly ImSitu [273] and VidSitu [212] to semi-automatically create new downstream datasets to include reasoning over set of images and videos. Using SRL annotations for constructing datasets has also been used for video grounding [209] and video question answering [208]. We further use existing SRL system [224] to obtain SRLs in pre-training datasets and utilize them in creating synthetic tasks for pre-training. 6.3 Method We first describe Reasoning Across Images and Video (RAIV) tasks in detail (Section 6.3.1) followed by our model framework (Section 6.3.2). 6.3.1 Reasoning Across Images and Video (RAIV) Tasks Given a pair of visual inputs such as pair of images, an image and a video, or a pair of videos along with a corresponding statement about the pair, the model has to correctly classify the statement as true or false. We call this task Reasoning Across Images and Video (RAIV). This extends the well-known NLVR2 task [234] to include both images and videos instead of just images. Though conceptually simple, creating new datasets requires considerable human resources and can still fall victim to dataset biases. For instance, the cost of obtaining a unique sentence in NLVR2 was $0.65. Further, extending the NLVR2 annotation approach for videos is prohibitively expensive due to a 144 People are hoisting a flag in both I1 and I2 Image / Video and Text Pair Patches Temporal Position ID Position 0 0 0 0 0 0 0 0 1 3 1 1 1 1 Vision Transformer Vision Transformer Text Transformer Text Transformer Cross Attention Layers Cross Attention Layers METER CLS CLS Concat BCE Loss Figure 6.3: A schematic of framework for RAIV task. Input is a pair of images, videos or an image and a video (shown here) with a text. 
The visual inputs (denoted by I1 and I2) are first patchified then appended with temporal position embedding denoting the frame number. We note that images are considered to be a single frame video. Then, we add the ID embedding denoting whether it is the first or the second visual input. This is input into two METER model (shared weights) which takes both vision patches and text as input. The appended CLS from both inputs are concatenated and a Binary Cross Entropy Loss is used given the ground-truth. 145 significant increase in annotation time. To circumvent this issue, we instead choose to create new datasets semi-automatically from existing datasets with semantically rich annotations. In creating datasets for RAIV, we have three main considerations: (i) the statement queries should include rich object and activity semantic information (ii) the visual inputs should be similar for finding fine-grained differences (iii) the dataset should support a reasoning component to identify why a statement is classified as true or false. Unfortunately, no single dataset satisfies the above three criteria. Thus, we create individual datasets to test these components. Rich Visual Semantics. To obtain rich semantic data, we utilize semantic role labeling (SRL) which answers the high-level question of “who did what to whom” [232]. To obtain SRLs we can either apply an existing semantic role labeling system [224] or utilize annotations in existing Visual Semantic Role datasets [212, 273]. For the purpose of creating rich downstream datasets, we opt for the latter with humanannotated SRLs. We also utilize an object detector to obtain the unique objects within a given image or video. Recall that our task is to obtain pair of images or videos and a corresponding statement to be classified as true or false. To this end, we design a template-based statement generation method with the templates closely following exemplar statements in NLVR2 dataset. While template-based statements are significantly less rich and diverse compared to human-annotated systems, there are two key advantages. First, it is inexpensive and directly allows us to create balanced training, validation, and test sets. Second, we are able to generate reasoning for the classification of the statement. Since the reasoning for classifying a statement is often a tautology, we instead opt for reasoning classification only for the False statements. The chosen templates test for existence, similarity, or differences about the “object”, “action” and “action+entity” or other semantic roles between the two visual inputs. The former is obtained from an object detector and the latter from SRLs. To generate a statement, we first condition whether the statement would be “True” or “False” and choose a template, for instance “obj-X is present in both images”. Based on the 146 condition and the template we then sample two images with at least one common object and convert it into a statement. In practice, we remove very common objects such as “sky”, and “person” expected to appear in a large number of images. Figure 6.2 illustrates the use of templates with an example. Similar Visual Inputs. Instances of visual input pairs based on semantic inputs often differ significantly. For instance, two images involving “riding a horse” may have different point-of-view, different numbers of horses, varying locations, etc. Unfortunately, obtaining natural images which are visually similar is non-trivial. 
To circumvent this issue, we look into image generation models, in particular, InstructPix2Pix [21] (IP2P) which builds on Stable Diffusion [203] to allow image edits. We use a subset of the dataset from IP2P for RAIV. Reasoning Task. Since we have the ground-truth annotations for both visual inputs, we can further provide a reason. For instance, if the original statement was “False”, the reason could be “obj-X is present in both image-1 but not in image-2”. Even though the reasoning can be posed as a generation task, evaluation metrics for language generation can often be unreliable. To keep the evaluation straightforward, we instead opt for a 3-way multiple-choice over pre-generated reasons. The model is provided the original query along with each multiple-choice option separately and the highest-scoring option is chosen. We evaluate the Reasoning Task separately from the RAIV task. Natural Language Queries. A key issue with using template-based queries is the limited types of variation of the queries. However, human annotation would be very expensive. To address this problem, we utilize the advances in Large Language Models such as LLaMa [241], GPT4 [171]. In particular, we provide captions and/or SRLs for a particular image/video and require the LLM to create a True/False question. Note that the LLM doesn’t have access to the image/video but only the annotations. We utilize Vicuna-13B [303] model to obtain these queries. We discuss dataset creation in more detail in Section 6.4.1, and provide examples of generated queries in supplementary. 147 6.3.2 Framework Model Design. For our experiments, we utilize a patch-based vision-language transformer trained on image-text corpora based on METER [48]. To accommodate RAIV tasks, we use a late-fusion model where given an image-image-text as input, the model processes image1-text and image2-text separately concatenates the output, and passes it to a binary classification head which is trained using binary cross-entropy (BCE) loss. An overview of our model design is provided in Figure 6.3. The images/videos are provided with ID number embedding to denote if it is the first or the second image/video. Since RAIV tasks also include videos, we extend the METER framework to process multiple frames. Specifically, we sample k frames (k=4 in our experiments) from the video, add temporal position embedding ,and concatenate the patches from each of these frames and pass it to the METER module. We don’t differentiate between image and video type; instead, consider images as single-frame videos. For Reasoning Task, the original input text is appended with one of the possible choices at a time and fed to the network. We re-use the same model framework and train with BCE loss. During inference, the model returns the choice with highest score. Task-specific Pre-Training. As noted earlier, the pre-training objectives such as masked-language modeling, image-text matching, and contrastive learning primarily consider single visual input which is characteristically different from RAIV task requiring models to consider two inputs. To this end, we propose a second pre-training step where the objective is the same as that of the downstream task. We denote this as task-specific pre-training. This is different from fine-tuning which requires the downstream dataset; here we only require the objective which is independent of the downstream dataset. 
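As a concrete illustration of the multiple-choice inference used for the Reasoning Task above, the sketch below appends each candidate reason to the query, scores it with the same binary head, and returns the highest-scoring option. The score_true call is a hypothetical wrapper around the framework of Figure 6.3 that returns the output of the BCE-trained head; the exact way each choice is concatenated to the query is also an assumption.

import torch

@torch.no_grad()
def pick_reason(model, visual_pair, query, choices):
    # one forward pass per candidate reason; the highest-scoring choice wins
    scores = [model.score_true(visual_pair, f"{query} {choice}") for choice in choices]
    return int(torch.tensor(scores).argmax())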
For instance, if a model is pre-trained on COCO-Captions [145] and the target task is VQA, the second pre-training step would involve generating QA-pairs from the available image-text pairs in COCO-Captions. For RAIV task, since we don’t have access to detailed SRL information in the pre-training web curated dataset, we utilize a state-of-art SRL system [224] on the paired text samples to obtain visual semantic roles 148 U-Im U-S I-Tr I-Val I-Test I-Tot Im-Im (T) 63k 54k 94.5k 13.5k 27k 135k Im-Vid (T) 169k 65k 109.9k 15.7k 31.4k 157k Vid-Vid (T) 106k 62k 104.3k 14.9k 29.8k 149k IP2P (G) 75k 150k 105k 15k 30k 150k Im-Im (G) 63k 75k 105k 15k 30k 150k Im-Vid (G) 169k 75k 105k 15k 30k 150k Vid-Vid (G) 106k 75k 105k 15k 30k 150k Table 6.1: Dataset Statistics for RAIV datasets. U-Im, and U-S denote unique numbers of images and sentences, respectively. I-{Tr, Val, Test, Tot} denotes the number of instances. Note that in template-based queries some sentences are duplicates. and use object detectors to obtain the entities. Given this information, designing objectives is straightforward: we use similar templates as that for RAIV datasets but mine these during the training step itself. Specifically, for a particular image/video instance, we retrieve another image/video from our set with at least one shared object and then construct a template query from the two annotations. Further, we can perform this retrieval process dynamically at train time. 6.4 Experiments We discuss the dataset creation details (Section 6.4.1) followed by key implementation details (Section 6.4.3) and then results and takeaways (Section 6.4.5). 6.4.1 Datasets We design datasets for RAIV to have (i) rich semantic representation using existing vision-language datasets which contain SRL annotations, namely, ImSitu [273] and VidSitu [212] (ii) visually similar inputs using generative models like Stable Diffusion [203, 21] (iii) provide a reason for the classification. (iv) allow natural queries by passing annotations to an LLM. A summary of the dataset statistics can be found in Table 6.1. 149 For rich semantic representation, we create the following variations: Image-Image (Im-Im), ImageVideo (Im-Vid), and Video-Video (Vid-Vid) with images taken from ImSitu and videos taken from VidSitu. We note that while videos in VidSitu are 10 seconds long, for our experiments we only consider 2 second long clips which correspond to a particular event in the video. We utilize the same splits as in the original datasets to avoid any training dataset leakage into validation splits. For each of the datasets, we create approximately the same number of samples as in NLVR2 around 120k annotations with an even distribution of the verbs and objects but we note that our process allows creating more examples without any additional human effort. We further take care to not introduce any spurious dataset bias. Similar to NLVR2, we create balanced validation and test sets using the same unique statement where it is true for a particular pair and false for another pair in the given dataset to ensure no language-only bias in the dataset. Finally, we split the dataset into Train, Val, and Test in a 7:1:2 ratio making sure no leakage of visual inputs. We use the suffix “T” to denote the statement queries based on templates. To obtain natural language queries for the above dataset, we use LLM in particular Vicuna-13B. We input the semantic roles for the two images and require the LLM to provide a true statement. 
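A minimal sketch of the shared-object conditioning that underlies both the template-based ("T") datasets and the dynamic pair retrieval used during task-specific pre-training. The record format, banned-object list, and template strings are simplified placeholders rather than our exact implementation.

import random
from collections import defaultdict

TRUE_TEMPLATES = ["In both I1 and I2, {p1}.", "In at least one of I1 or I2, {p1}."]

def build_object_index(annotations):
    # annotations: hypothetical records {"id": ..., "objects": set of detected/SRL objects}
    index = defaultdict(list)
    for ann in annotations:
        for obj in ann["objects"]:
            index[obj].append(ann["id"])
    return index

def sample_true_pair(annotations, index, banned=frozenset({"person", "sky"})):
    anchor = random.choice(annotations)
    # condition on an object that is not overly common and also appears in some other input
    shared = [o for o in anchor["objects"] - banned if len(index[o]) > 1]
    if not shared:
        return None
    obj = random.choice(shared)
    partner = random.choice([i for i in index[obj] if i != anchor["id"]])
    statement = random.choice(TRUE_TEMPLATES).format(p1=f"{obj} is present")
    # a False example reuses the same statement with a partner that lacks obj
    return anchor["id"], partner, statement, True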
We use the suffix “G” to denote such statements which are obtained using LLMs. We use the template-based dataset (suffix “T”) for the reasoning task. Since the queries themselves were based on templates, and we have access to the ground-truth information, we create 3-way multiple choice questions and require the model to choose a correct answer. We opt for multiple-choice due to ease of evaluation similar to previous work in common-sense reasoning [289, 123]. We note that only “False” statements are used in the Reasoning Task. For visually similar inputs, we use the dataset provided by InstructPix2Pix (denoted by IP2P) which contains pair of images, both generated via Stable Diffusion but with some key edits to the text. The captions for the original images, as well as the edit caption, are provided. To obtain a true statement, we input both the original and the edit caption to a LLM (Vicuna-13b) to obtain the output caption. To obtain 150 a false statement, we input the original caption but change the edit caption. We provide more details on dataset creation, statistics and visualization in Appendix 6.4.2.1. Pre-Training Datasets. We closely follow previous work [48, 130]. In particular, we use the METER pre-trained model which is pre-trained on CC3M [220], SBU [173], COCO [145] and Visual Genome [118]. Task-specific pre-training We leverage COCO-Captions for images which includes 5 captions per image and VATEX-en [257] for videos which is a subset of Kinetics-400 videos consisting of 25k videos with 10 captions each. To obtain action-object information we utilize SRL labeling system [224] on the provided paired caption for both COCO-Captions and VATEX-en. 6.4.2 Additional Dataset Details 6.4.2.1 Creating Datasets for RAIV We first discuss the creation of Im-Im, Im-Vid and Vid-Vid datasets which are aimed to have semantically rich representations. RAIV tasks involve a pair of images/videos and a given statement to be classified as True or False. We create multiple datasets using existing vision-language datasets which contain SRL annotations, namely, ImSitu [273] and VidSitu [212]. The main reason for choosing datasets with SRL annotations is to obtain high-quality “action+object” information in the image or video. We first summarize these two datasets. Briefly, the ImSitu dataset is created by first obtaining a set of verbs and their corresponding roles from FrameNet [205]. Then top image results are retrieved from the web which includes the particular verb, followed by a strict annotation pipeline to denote the various entities participating in the action. The VidSitu dataset, which serves as an extension of ImSitu to videos, obtains 10-second-long movie clips with multiple actions. Each video is then segmented into five 2-second events, with each segment annotated with a verb obtained from PropBank [110]. Then, a referring expression is used to denote the entities appearing in the videos, which are filled in the various roles. 151 For both ImSitu and VidSitu, we obtain the “object” information from an object detector. We utilize VinVL [296] which involves a FasterRCNN [199] trained on multiple object detection datasets OpenImages [121], COCO[145], Visual Genome [118] and Object365 [219], and then fine-tuned on Visual Genome. We note that both ImSitu and VidSitu use different sets of verbs for annotations. Since our datasets include both images and videos, we simplify our setting by only utilizing verbs that are common to both datasets. 
While this reduces the total amount of available data, it hugely simplifies the dataset creation pipeline. We also prune verbs with less than 20 annotations in either dataset. This results in 243 verbs which are shared in both datasets. Another issue arises in the semantic role labeling formats for the two datasets. ImSitu annotations are based on FrameNet [205] whereas VidSitu annotations are based on PropBank [110]. We use existing heuristics based on the ordering and the use of roles to map the SRLs from FrameNet to Propbank annotations. Since we are mostly concerned about the “action+object” setting and not the individual roles such as instruments or tools, noise in this conversion doesn’t adversely affect the dataset quality. Further, the annotations for the entities in VidSitu have referring expressions or phrases describing the entity which is different from entity annotation in ImSitu containing only a single noun. We circumvent this issue by considering only the lemmatized noun for the referring expressions. We also avoid very common objects such as “person” which is usually associated with the agent performing an action. With both ImSitu and VidSitu datasets in hand, we now create RAIV datasets. We create the following variations: Image-Image (Im-Im), Image-Video (Im-Vid) and Video-Video (Vid-Vid) with images taken from ImSitu and videos taken from VidSitu. We note that while videos in VidSitu are 10 seconds long, for our experiments we only consider 2 second long clips which correspond to a particular event in the video. We further ensure the event is not duplicated in the next segment to avoid annotated entities not appearing within the given segment. After pruning, we are left with 63k images from ImSitu and 106k video segments from VidSitu. We utilize the same splits as in the original datasets to avoid any training dataset leakage 152 (b) Verb: crouch, Arg0: Several People, Loc: In the Street (c) Verb: dance, Arg0: A man in blue shirt, Loc: In the lab (a) Verb: crouch, Arg0: man, Loc: desert Figure 6.4: Example creation of generating template-based queries. into validation splits. For each of the datasets, we create approximately the same number of samples as in NLVR2 around 120k annotations with an even distribution of the verbs and objects but we note that our process allows creating more examples without any additional human effort. We further take care to not introduce any spurious dataset bias. We follow NLVR2 in creating balanced validation and test sets by using the same unique statement where it is true for a particular pair and false for another pair in the given dataset to ensure no language-only bias in the dataset. The resulting datasets are suffixed with “T” to denote the statements are generated using templates resulting in Im-Im (T), Im-Vid (T), and Vid-Vid (T). As our datasets are created semi-automatically, we also provide reasons for the false statements. For ease of evaluation we follow previous work in common-sense reasoning [289, 123] involving multiplechoice question setup where three reasons are provided and only one of the reasons is correct. The options are also generated via templates to prevent any language-only biases. We summarize our pipeline for creating RAIV template datasets, i.e., Im-Im (T), Im-Vid (T), Vid-Vid (T) below. 1. Unify the annotations for ImSitu and VidSitu datasets, in particular the verbs. 2. Create mapping of objects, actions, and action+objects to image/video IDs in the datasets. 153 3. 
Sample a particular template based on object, action, or action + object. Then choose a particular object, action, action+object. 4. Choose a particular image/video satisfying the above criteria. 5. Choose two other image/video, one which satisfies and another which doesn’t satisfy the criteria. This provides us with a True and False statement. 6. In previous step, choosing them at random makes the problem too simple, so we condition it on having at least one shared SRL such as verb, object or location. 7. For the false statement, provide the reason for being false. 8. Repeat the process until enough samples are obtained. We illustrate this with an example in Fig 6.4. Suppose the chosen template was “action”, “In both images, people are doing X" where “X” is the action. Assume the chosen action was “crouch”. Let the first sample chosen be Fig 6.4 (a). Given this image, we choose a “true” video as in Fig 6.4(b) and “false” video (c). Further, for the “false” pair, we know both contain the verb crouch, so we can provide the reason “people crouch in I1 but not in I2.”. We note we restrict to limited possible templates yet covering a wide-variety of possibility based on whether it is “action”, “object” or “action + object”. The possible templates are: 1. "In both I1 and I2, {p1}." 2. "In at least one of I1 or I2, {p1}." 3. "In exactly one of I1 or I2, {p1}." 4. "In neither I1 nor I2, {p1}. Here, {p1} is short for placeholder and {Image} refers to Image1 or Image2. We also note that the clause can be easily modified such as “In both I1 and I2, {p1}” is same as “{p1} in both I1 and I2”. The placeholder 154 {p1} depends on the type of template. For instance, if it is object, it is “Obj is present”, for actions it is “Subj is performing Verb”. These templates can then be used to get the reasoning in the form of: “In both I1 and I2”, “In I1 but not in I2”, “In I2 but not in I1” or “In neither I1 nor I2". Note that for during training, the SRLs are obtained from a pre-trained SRL detection system on the provided captions such as [224]. For validation and test sets, we utilize all the available annotaitons. For instance, VidSitu provides 10 verb annotations for each segment. Thus, when comparing for same verb, we consider all 10 annotations. Similarly, for other SRLs. This makes our validation and test sets more robust to noisy ground-truth data. 6.4.2.2 Creating Natural Language Queries As noted in main paper, we utilize LLMs to create Natural Language Queries. We note that there are both pros and cons of using natural language queries as opposed to template queries. The main advantage of templated queries is that the output sentence has very controlled information and as a result we can create a reasoning question directly from the template. However, such model is of little practical use. On the other hand, natural language queries can be directly used by end-user but obtaining natural language queries via humans is prohibitively expensive. Instead, we opt to use natural language queries using LLMs. However, we note that use of LLMs can cause errors in the generated sentence and there is no easy way to rectify them. Further, the obtained LLM outputs cannot be used for reasoning. To generate the queries, we use Vicuna-13B [303] model which is initialized from LLaMA [241] and trained on outputs from ChatGPT [171] a closed-source model by OpenAI. 
We use the LLM in two ways: (i) to create Im-Im (G), Im-Vid (G) and Vid-Vid (G) which are generated counterparts to the original templated datasets introduced above (ii) to create IP2P dataset which is obtained from InstructPix2Pix. While used in similar ways, there are some key distinctions. 155 For Im-Im, Im-Vid and Vid-Vid datasets, we directly take all the visual input pairs, obtain their annotation information and pass it to the LLM and require it to generate a True statement. The obtained statement is then matched to another input pair for which it is false. Essentially, the “T” and “G” counterparts of the dataset have same visual input pairs but the exact sentences are different. We prompt our LLM based on the original input query in the templated dataset. We use the following input: """ Provide a True statement comparing the two images with the following information: Image 1: {SRL} Image 2: {SRL} The statement should be in the form of "{Template}, ...", only point out about {Image}. """ Here, {SRL} denotes the semantic roles for the given image/video, the {Template} denotes the chosen template as noted in previous section, and {Image} denotes which image was chosen (I1 or I2) for the true statement. For instance, if the original query involved Fig6.4 (a), (c) with the template “In exactly one of”, with action+object, the input would be: """ Provide a True statement comparing the two images with the following information: Image 1: Verb: crouch, Subj: man, Loc: desert Image 2: Verb: dance, Subj: a man in blue shirt, Loc: in the lab The statement should be in the form of "In exactly one of the images, ...", only point out about image 1 """ This returns the output: "In exactly one of the images, a man is crouching in a desert." The same true statement is considered “false” for the other pair taken from corresponding “(T)” dataset. For IP2P, the images are created using Stable-Diffusion. We have access to the image-caption pairs as well as the edit caption. To create a true statement, we provide the LLM with the original caption and the edit caption and ask it compare the images. To create a false statement, we keep the original caption but change the edit caption. Here, for each image pair we have a unique true and false statement. 156 To generate these queries: 1. Choose a given image-pair from IP2P dataset. This has an associated original caption, edit caption, and an output caption. 2. Randomly sample another edit caption different from the given edit caption. Denote this as edit caption 2. 3. Use LLM to compare original caption, and edit-caption for True statement. Similarly, use original caption and edit caption 2 for False statement. For generating output with a new caption we use the following prompt: """ The original image caption: ... The original image is modified with the instruction: ... Provide a true statement comparing original and new image. """ As an example, to generate true statement for given pair : """ The original image caption: Jim Hansel 500 Piece Puzzle (Head Waters) The original image is modified with the instruction: turn it into a stained glass window Provide a true statement comparing original and new image. """ Instead for a false pair: """ The original image caption: Jim Hansel 500 Piece Puzzle (Head Waters) The original image is modified with the instruction: have it be a comedy sketch Provide a true statement comparing original and new image. """ 6.4.3 Baseline and Implementation Details Baselines. As noted in Section 6.3.2, we build on the METER model. 
Specifically, we use the pre-trained checkpoint based on CLIP-VITB/16 [191] with Roberta [148] (named METER-CLIP16-RoBERTa-288) which 157 The original image caption: Jim Hansel 500 Piece Puzzle (Head Waters) The original image is modified with the instruction: turn it into a stained glass window Provide a true statement comparing original and new image. The original image is a 500-piece puzzle featuring a image called "Head Waters" by Jim Hansel. The new image is the same image as the original, but it has been modified to look like a stained glass window. TRUE The original image caption: Jim Hansel 500 Piece Puzzle (Head Waters) The original image is modified with the instruction: have it be a comedy sketch Provide a true statement comparing original and new image. The original image is a 500-piece puzzle featuring the artwork "Head Waters" by Jim Hansel. The new image is a comedy sketch. FALSE Input: Jim Hansel 500 Piece Puzzle (Head Waters) Edit: turn it into a stained glass window Output: Stained Glass Window of Jim Hansel (Head Waters) (a) (b) (c) Figure 6.5: Illustration of generating queries for IP2P dataset. For a given image pair, we have (a) the sentence queries used to generate the image via Stable Diffusion (b) Using LLM to generate a True statement and (c) False statement by changing the edit caption. 158 is trained on multiple image datasets namely, CC3M, SBU, COCO, VG. For convenience, we call this collection of datasets ImgAll. Apart from fine-tuning the pre-trained checkpoint, we also consider a random baseline that simply performs a majority voting, a no pre-training baseline where the model is directly trained on the downstream datasets. Implementation Details Our model and code are implemented in Pytorch. For all fine-tuning experiments, we follow identical settings as METER. For videos, we sample K=4 frames per video where each video is 2 seconds long and sampled at 30 frames per second and use sinusoidal position embeddings [246]. In the task-specific pre-training step, we primarily use the COCO dataset instead of the entire ImgAll dataset in order to limit computation time, similar to the fine-tuning process on the downstream task. We also note that instead of using the object annotations available in COCO, we use the VinVL object detector outputs instead as it detects a larger number of categories outside of COCO. For videos, we use a subset of Kinetics videos from VATEX-en. We note that the videos in Kinetics are 10s long compared to 2s in the downstream dataset. To circumvent this issue, we first obtain an intersection of the videos from AVA-Kinetics [127] which gives us 5.7k videos where the keyframe of the person performing the action is provided. We particularly sample 2s clips around the keyframe. In general, we randomly sample 4 frames from the entire video. We train for 10 additional epochs but reduce batch size to 256 with AdamW optimizer [152] with linear warm-up for initial 10% to 1e − 4 of the training followed by linear decay. We only utilize the last checkpoint and then perform fine-tuning on the target dataset. We provide detailed hyper-parameter settings in supplementary (Appendix 6.4.4). 159 6.4.4 Additional Implementation Details Implementation Details Our model and code are implemented in Pytorch. For all fine-tuning experiments, we follow identical settings as METER. For each dataset, we separately fine-tune the model for 10 epochs with differential learning rates of 1e −5 and 1e −4 for the bottom and top layers respectively. 
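A minimal sketch of this differential learning-rate setup with linear warm-up and decay. The module names for the "bottom" (pre-trained encoders) and "top" (fusion layers and task head) parameter groups, and the weight-decay value, are illustrative assumptions.

import torch

def build_optimizer(model, total_steps, top_lr=1e-4, bottom_lr=1e-5, warmup_frac=0.1):
    bottom = list(model.vision_encoder.parameters()) + list(model.text_encoder.parameters())
    top = list(model.fusion_layers.parameters()) + list(model.head.parameters())
    optimizer = torch.optim.AdamW(
        [{"params": bottom, "lr": bottom_lr},
         {"params": top, "lr": top_lr}],
        weight_decay=0.01,  # assumed value; not specified in the text
    )
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:  # linear warm-up to each group's peak lr
            return step / max(1, warmup_steps)
        # linear decay towards zero for the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler  # call scheduler.step() once per training step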
We use 288 × 288 as the image dimension in all cases. For videos, we sample K = 4 frames per video where each video is 2 seconds long and sampled at 30 frames per second. For images, we simply provide a single temporal position embedding while for videos we have K temporal position embeddings. We use sinusoidal position embeddings following previous work [246]. In the task-specific pre-training step, we primarily use the COCO dataset instead of the entire ImgAll dataset in order to limit computation time, similar to the fine-tuning process on the downstream task. We also note that instead of using the object annotations available in COCO, we use the VinVL object detector outputs instead as it detects a larger number of categories outside of COCO. For videos, we use a subset of Kinetics videos from VATEX-en. We note that the videos in Kinetics are 10s long compared to 2s in the downstream dataset. To circumvent this issue, we first obtain an intersection of the videos from AVA-Kinetics [127] which gives us 5.7k videos where the keyframe of the person performing the action is provided. We particularly sample 2s clips around the keyframe. In general, we randomly sample 4 frames from the entire video. We train for 10 epochs but reduce batch size to 256 with AdamW optimizer [152] with linear warm-up for initial 10% to 1e − 4 of the training followed by linear decay. We only utilize the last checkpoint and then perform fine-tuning on the target dataset. Most of our experiments are carried on 4x 2080Ti and 4x 3090Ti machines. 160 Pre-Training TSP Data NLVR2 Im-Im (T) Im-Vid (T) Vid-Vid (T) IP2P Im-Im (G) Im-Vid (G) Vid-Vid (G) Majority voting 50 50 50 50 50 50 50 50 ✗ ✗ 54.52 57.23 52.53 51.84 51.76 52.66 52.75 51.08 ImgAll ✗ 82.05 70.61 65.64 59.34 68.72 68.16 67.80 64.63 ImgAll COCO 83.43 74.82 66.48 59.4 70.15 71.06 68.24 65.23 ImgAll COCO + VTX 83.57 74.12 68.3 61.82 70.04 71.25 70.77 66.83 Table 6.2: Accuracy@1 of fine-tuned pre-trained models on NLVR2 and RAIV datasets. All models are obtained from METER. Pre-Training refers to data used for pre-training. TSP Data refers to data used for task-specific pre-training which is obtained from COCO and VATEX. (T) and (G) refers to whether the statements are obtained via template or generated via Language Model. NLVR2 refers to NLVR2-dev set. Pre-Training TSP Data Im-Im (T) Im-Vid (T) Vid-Vid (T) Majority Voting 33.33 33.33 33.33 ✗ ✗ 34.29 34.37 34.37 ImgAll ✗ 56.32 49.86 44.73 ImgAll COCO 62.17 52.9 46.62 ImgAll COCO + VTX 64.11 56.32 51.85 Table 6.3: Accuracy@1 of fine-tuned pre-trained models on the Reasoning Task of RAIV datasets. 6.4.5 Results In Table 6.2, we report results on the True/False classification task of various RAIV datasets. In Table 6.3, we report the results for the Reasoning task (Multiple Choice Question) for the same baselines. We note that the reasoning task is treated separately from the classification task. “Accuracy@1” is the metric used everywhere. We make the following observations. Pre-Training is quintessential In both Table 6.2 and Table 6.3 we note that without pre-training the model performs very similar to a simple majority voting. The main reason is the extremely sparse signal in the RAIV task which requires two visual inputs but provides only a singular true/false as output. Thus, there is not enough training signal for the model to learn to perform the task. 
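To make the late-fusion design of Figure 6.3 and the implementation details above concrete, the following is a minimal sketch: patch embeddings of K frames receive sinusoidal temporal position embeddings and an ID embedding marking the first or second visual input, a shared METER backbone fuses each visual input with the text, and the two pooled features are concatenated for a BCE-trained binary head. The meter wrapper, attribute names, and tensor shapes are illustrative assumptions rather than our exact code.

import math
import torch
import torch.nn as nn

def sinusoidal_embedding(num_positions, dim):
    # fixed sinusoidal table with one row per frame index
    pos = torch.arange(num_positions).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    emb = torch.zeros(num_positions, dim)
    emb[:, 0::2] = torch.sin(pos * div)
    emb[:, 1::2] = torch.cos(pos * div)
    return emb

class LateFusionRAIV(nn.Module):
    def __init__(self, meter, dim, num_frames=4):
        # meter: hypothetical wrapper around the METER backbone that takes patch
        # embeddings and text and returns a pooled [CLS] feature of size dim
        super().__init__()
        self.meter = meter
        self.register_buffer("temporal", sinusoidal_embedding(num_frames, dim))
        self.id_embed = nn.Embedding(2, dim)  # first vs. second visual input
        self.head = nn.Linear(2 * dim, 1)     # binary True/False classifier

    def encode(self, patches, text, input_id):
        # patches: (B, T, N, D); images are treated as single-frame videos (T = 1)
        B, T, N, D = patches.shape
        x = patches + self.temporal[:T].view(1, T, 1, D) + self.id_embed.weight[input_id]
        return self.meter(x.view(B, T * N, D), text)  # pooled feature, (B, D)

    def forward(self, patches1, patches2, text):
        cls1 = self.encode(patches1, text, 0)
        cls2 = self.encode(patches2, text, 1)
        return self.head(torch.cat([cls1, cls2], dim=-1)).squeeze(-1)  # logit for BCE

# loss = nn.BCEWithLogitsLoss()(model(p1, p2, text), labels.float())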
Importance of Task-Specific Pre-Training Across all RAIV datasets, we find that Task-Specific Pre-Training is helpful but the relative improvements depend on the specific dataset. On the image-image datasets, the improvements vary from ∼ 1.5 points in NLVR2, IP2P, and Im-Im (G) to ∼ 4 points in Im-Im (T). However, for image-video and video-video datasets, simply using images for task-specific pre-training 161 Model NLVR2 Im-Im (T) Im-Vid (T) Vid-Vid (T) IP2P Im-Im (G) Im-Vid (G) Vid-Vid (G) METER 82.05 70.61 65.6 59.34 68.72 68.16 67.8 64.6 +TSP 83.57 74.12 68.3 61.82 70.15 71.25 70.77 66.83 ALBEF 80.24 67.41 62.35 58.14 67.15 66.5 65.84 61.41 +TSP 81.07 70.76 66.13 60.77 70.13 68.4 68.35 64.77 VinVL 82.05 69.14 64.7 59.83 67.29 68.42 67.72 65.2 + TSP 84.56 74.81 69.84 61.96 71.8 71.45 69.76 67.45 Table 6.4: Accuracy@1 across RAIV datasets using image pre-trained baselines with and without TaskSpecific Pre-Training (TSP) which uses data from COCO + VTX. % IP2P → 1 % 10 % 50 % 100 % Early 53.11 58.91 63.88 67.48 + TSP 63.16 66.17 68.45 71.88 Mid 57.13 61.17 65.95 69.29 + TSP 64.2 67.47 68.15 71.21 Late 56.28 60.91 64.51 68.72 + TSP 59.61 63.71 66.18 70.15 Table 6.5: Accuracy@1 for different fusions (early, mid, late) with varying amounts of data from IP2P. is not effective, leading to only small improvements ∼ 0.5 points. But when videos are added to the task-specific pre-training routine, the improvements are significant in the range of ∼ 2 − 3 points. Image-based RAIV has Lower performance than NLVR2. In Table 6.2 we note that models perform worse on Im-Im (T), Im-Im (G) as well as IP2P compared to NLVR2. For the first two, we attribute this discrepancy to the fact that Im-Im datasets explicitly consider actions that lead to the queries having richer semantics. For IP2P, the visual similarities between the two images are very high since they have Task Dataset Rand Cond RAIV Im-Im (T) 72.67 74.12 Im-Vid (T) 67.1 68.3 Vid-Vid (T) 60.64 61.82 Reasoning Im-Im (T) 59.55 64.11 Im-Vid (T) 52.15 56.32 Vid-Vid (T) 48.87 51.85 Table 6.6: Accuracy@1 for Random vs Conditional Sampling for RAIV and Reasoning tasks. By default, conditional sampling is used for task-specific pre-training. 162 very minor edits. Another possible reason is that IP2P is very diverse in terms of objects which may not be sufficiently covered in the pre-training datasets. Template Queries vs Generative Queries In Table 6.2 we find that compared to generative queries, the template queries are easier for Im-Im but harder for Im-Vid and Vid-Vid. This is likely because, for Im-Im (T) case, the templates used in task-specific pre-training resemble those in the downstream dataset. Conversely, for Im-Im (G) dataset, the model needs to learn from natural language generation and not templates. On the other hand, in the Im-Vid and Vid-Vid datasets, the generated queries often include additional information such as objects present in the video which could be controlled in the templatebased queries. 6.4.6 Ablative Study We perform multiple ablative studies such as the effectiveness of the model architecture, framework and task-specific pre-training against downstream data, using random vs conditional sampling. Task-specific Pre-training with other image pre-trained baselines In Table 6.4 we introduce additional image-pre-trained vision-language models to verify the effectiveness of task-specific pre-training. We compare to METER, ALBEF [130] and VinVL [296]. For ALBEF, we use ALBEF-4M and for VinVL we use Oscar-B w/VinVL. 
ALBEF is similar to METER in that it uses a patch-based vision transformer [47] but additionally includes knowledge distillation during pre-training but notably the vision module is initialized from ViT-B/16 compared to CLIP-ViT-B/16 for METER. VinVL on the other hand uses an object detector (Faster-RCNN[199]) to extract relevant object features. In all cases, we find task-specific pre-training is helpful and provides a consistent improvement in performance (∼ 2 − 3%). Across architectures, METER outperforms ALBEF as its vision transformer is initialized from CLIP. METER and VinVL have similar performance as object features from the strong object detector plays an important role for the latter. 163 Effect of Fusion Strategies In Table 6.5 we compare different strategies for fusing information for IP2P dataset. As noted earlier, by default we use Late-Fusion where information from both visual inputs and text is processed by the model and then the [CLS] feature from both inputs is used for classification. In addition, we also compare Mid-Fusion where instead of using [CLS] feature directly, we add two transformer encoder layers to the output before classification. For Early-Fusion, we directly input the two images and the text. We find that when using 100% of the data, early fusion performs slightly worse than both mid and late fusion but performs slightly better when using task-specific pre-training. We attribute this to early fusion being more data-hungry. We also find Mid-Fusion slightly outperforms Late-Fusion (71.21 compared to 70.15) likely due to additional transformer layers. Using Limited Fine-Tuning Data In Table 6.5, we also compare effect of using limited data for fine-tuning. We note that obtaining high-quality data tailored for downstream tasks is often expensive. Thus, task-specific pre-training which leverages existing pre-training data with different objectives is an attractive alternative. We find this to be the case, especially for Mid-Fusion where using task-specific pretraining and fine-tuning on just 10% of IP2P data leads to similar performance as directly fine-tuning on the entire downstream dataset. Sampling strategy in Task-specific Pre-training During the task-specific pre-training stage, since creating the visual pairs is performed on the go and different sampling strategies can be utilized. For a given image, we could either sample a random image (Rand) or we could condition it on some objective such as having at least one common object (Cond). Comparing the two for both RAIV and the Reasoning task, we find the conditional sampling to be useful likely due to training on harder examples. 6.4.7 Visualization We provide qualitative examples from our dataset and outputs of our model as follows: 1. On Template-based queries and Reasoning, namely, Im-Im (T), Im-Vid(T), Vid-Vid (T) in Figure 6.6 164 Q: A man kneels in exactly one of I1 and I2. PrA: False GtA: False PrR: A man kneels in both I1 and I2 GtR: A man kneels in both I1 and I2. (b) Q: In both I1 and I2, a person is swimming in a pool. PrA: True GtA: False PrR: A person is swimming in a pool in both I1 and I2. GtR: A person is swimming in a pool in I2 but not in I1 (a) Q: In neither I1 nor I2, a person is reading a book (c) PrR: A person is reading a book in both I1 and I2. GtR: A person is reading a book in I2 but not in I1. PrA: False GtA: False Figure 6.6: Model Predictions vs Ground-Truth for template-based (“T”) validation datasets. (a) Im-Im (T), (b) Im-Vid (T), (c) Vid-Vid (T). 
PrA and GtA refer to Predicted and Ground-truth Answers respectively. PrR and GtR refer to predicted and ground-truth reasoning respectively. 165 Q: The original image features a beautiful young woman with curly blond hair on a black leather sofa, while the modified image features the same woman with a cat added to the scene. PrA: True GtA: True. Q: The original image is a photograph of a beautiful young woman with curly blond hair sitting on a black leather sofa, while the modified image is a sculpture of the same woman in the same pose and setting. PrA: False GtA: False. Q: The original image shows a wooden house next to the Iceland sea, while the new image depicts the same wooden house but with a haunted theme. PrA: True GtA: True Q: The modified image has snow, which is not present in the original image. PrA: True GtA: False (a) (b) Figure 6.7: Model Predictions vs Ground-truth for IP2P dataset. For a given pair of images, both the chosen True and False sentences are shown. 2. IP2P Generated queries in Figure 6.7 3. On Generated queries, Im-Im (G), Im-Vid(G), Vid-Vid(G) in Figure 6.8 166 (a) (b) Q: In at least one of the images, a girl with brown hair is depicted as grabbing a CD. PrA: False GtA: True (c) (a) I1: skipping_201.jpg, I2: climbing_132.jpg (b) I1: falling_187.jpg, I2: v_Ihd-NwI030c_seg_85_95_trimmed (c) I1: v_5lqvuMwYODI_seg_35_45, Ev3, I2: v_gNT4N5W81Hc_seg_30_40, Ev5 Q: In both images, a woman is performing an action with a rope in a gymnasium. The action being performed is skipping in the first image and climbing in the second image. PrA: True GtA: True Q: In exactly one of the images, a man in white pants is depicted as falling. PrA: False GtA: True Figure 6.8: Model Predictions vs Ground-Truth for generated queries (“G”) validation datasets. (a) Im-Im (G), (b) Im-Vid (G), (c) Vid-Vid (G). PrA and GtA refer to Predicted and Ground-truth Answers respectively. 167 6.5 Conclusion In this work, we explore Reasoning Across Images and Video (RAIV) task which involves classifying a statement about a pair of visual inputs (images, videos or a mixed combination) as true or false. We introduce multiple datasets to study RAIV with semantically rich queries, and visually similar inputs as well as allow reasoning for the provided answers. We investigate the potential for task-specific pre-training which involves additional pre-training on objectives similar to the downstream task but confined to the original pre-training dataset. Our experiments validate the effectiveness of including task-specific pretraining for improved downstream performance. 168 Chapter 7 Detect Anything: An Open-Vocabulary Object Detection Model with Weak Supervision 7.1 Introduction Object detection, the process of localizing objects and classifying them into distinct categories to differentiate and identify different parts of an image, is a fundamental computer vision task. It is a building block for understanding images at a deeper, more granular level and has seen several applications in tasks such as phrase grounding [211], reasoning [234], question answering [3] and other vision-language tasks. However, the manual annotation effort to localize bounding boxes and classifying them into pre-defined categories presents a significant challenge in both time and money. As a result, existing object-detection datasets often have limited number of classes typically in the range of 100s and occasionally in range of 1000 but a far cry from the size of human vocabulary. 
To address this issue, research has moved towards exploring open-vocabulary object detection [287], where the set of classes is not pre-defined. Instead, the set of object categories is provided directly at test time and the model should generalize even to novel classes not seen during training. Existing approaches in open-vocabulary object detection (OVD) such as ViLD [67] and RegionCLIP [304] have largely relied on using extensively annotated datasets such as COCO [145] or LVIS [69]. Figure 7.1 illustrates the task setups with example base categories, training annotations, and expected test-time outputs. Figure 7.1: (a) The standard OVD task. Given an annotated training set with base classes of “Dog” and “Person”, the goal is to detect novel objects not seen during training such as “Suitcase” and “Conveyor” in addition to the seen classes. (b) The standard OVD task, dubbed OVD-BT. A model is trained on an annotated detection dataset with “base” classes and at test time we are provided the set of “novel” classes. (c) Our proposed Weakly-Supervised OVD (WS-OVD) task. The training set contains only raw images, without any notion of base and/or novel classes, and instead uses a large vocabulary. At test time, the goal is to find all objects given a class set provided by the user. These methods are trained on a set of designated “base classes” within these datasets. The object features of these “base” classes are then projected to a shared image-text multi-modal embedding space such as that of CLIP [191]. Once trained, the model is evaluated over held-out classes (novel classes), with the expectation that the object features for novel classes will also be aligned to the corresponding text in the shared feature space. For brevity, we call this setting OVD-BT, where “BT” stands for Base Trained. This transfer from base to novel classes carries the implicit assumption that a detection head (such as a Region Proposal Network (RPN) [199]) trained on “base classes” will also generalize to “novel classes”. In practice, the resulting performance of a model depends on the considered base classes, and there exists a large gap between detection results on the base and novel classes. In this work, we opt to remove the dependence on a detection head trained on “base classes” and instead utilize foundation models such as the “Segment-Anything” model (SAM) [111], which provides class-agnostic segmentation results. In this new paradigm, instead of restricting to a set of “base classes” for which detection annotations are provided, we directly train on a large set of classes (an open vocabulary) without associated detection annotation data. By mapping the segmented outputs obtained from SAM to textual class descriptions and aligning these with the multi-modal embeddings provided by CLIP, the model is capable of classifying image segments across a wide spectrum of categories. Such a strategy allows the model to be trained directly on raw image data without the need for annotated large-scale detection datasets. Once trained, this model could be used for various downstream tasks such as open-vocabulary detection and referring expression [185, 278]. We dub this task Weakly Supervised Open-Vocabulary Object Detection (WS-OVD).
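Concretely, both OVD-BT and our proposed WS-OVD ultimately classify region or segment features by nearest-neighbor search against CLIP text embeddings of the class names. The following is a minimal sketch of this shared-embedding classification step, assuming the Hugging Face CLIP checkpoint and prompt template shown; the region feature is a random placeholder standing in for a detector or SAM feature already projected into CLIP space.

```python
# Minimal sketch (not the exact models used in this work) of classification
# in a shared image-text embedding space: class names are embedded with
# CLIP's text encoder and a region feature is labeled by nearest neighbor.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "person", "suitcase", "conveyor"]  # base + novel class names
tokens = tokenizer([f"A photo of a {c}" for c in classes],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**tokens)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Placeholder for a region/segment feature living in the same space
# (in practice it comes from the detection head or from SAM features).
region_feat = torch.randn(1, text_emb.shape[-1])
region_feat = region_feat / region_feat.norm(dim=-1, keepdim=True)

scores = region_feat @ text_emb.T        # cosine similarities to each class
print(classes[scores.argmax().item()])   # nearest-neighbor class name
```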
We illustrate this key idea in Figure 7.1 (details in Section 7.3.1). The key challenge in training directly on image-data then is obtaining high-quality mapping for the segments obtained via SAM to a class description. A straightforward, albeit naive, strategy is to crop out the image segment or the enclosing bounding box and then use image-text models such as CLIP to classify these segments. However, such an approach leads to sub-optimal matches. There are multiple reasons ranging from mismatched aspect ratios, viewing angles, class information without descriptions, and generally the cropped out image being out of domain with respect to the CLIP training data. Given an a large vocabulary set, such a naive matching method leads to numerous false-positives. This observation is also consistent with observation in previous works [137]. To address this challenge, we propose Detect-Anything-Model (DAM). To mitigate the issue of finding high-quality segment-text matching, we contribute three innovative variants: (1) DAM-IT: We employ image-tagging models such as Recognize-Anything Model (RAM) [298] that generates an extensive list of image tags (IT) for a given image. These tags are usually objects or scenes visible in the image. Crucially it trims the possible set of object category to match using CLIP, which significantly improves the precision of the matching process. (2) DAM-IC: Instead of relying on image-tags we utilize datasets with image-caption (IC) pairs and incorporate the noun-phrases that describe the objects in the image. This enhances the correlation between image segments and textual descriptions. (3) DAM-G: We further adopt 171 a grounded image generation (G) model such as GLIGEN [136] which generates images with objects at specified locations. Since, we specify the particular object categories at different positions, we have a pre-existing map between the bounding boxes and text modalities. This approach holds the potential for more precise and contextually relevant mapping between image segments and descriptive text (framework details in Section 7.3.2 and segment-text matching in Section 7.3.3). Our primary experiments are conducted on open-vocabulary object detection task with weak supervision. We find that on multiple OVD benchmarks and splits such as COCO [145] and LVIS [69], our DAM variants perform competitively or outperforms existing BT baselines on the novel classes. We also show our trained model can be readily used for phrase grounding on RefCOCO+ [280]. We further perform ablative studies on the effect of combining the three variants. To summarize, our contributions are: (i) we explore the task of open-vocabulary object detection by leveraging existing large-scale foundational models without training on additional detection data (ii) remove the dependence on a detection heads which are biased towards particular datasets and show that models such as SAM can be utilized for open-vocabulary detection tasks (iii) show detailed ablative studies highlighting the contribution of each mapping technique. Our code will be released upon publication. 7.2 Related Works Foundational Models i.e., models pre-trained on large quantity of web-sourced data have set a new benchmark for versatility and performance across diverse tasks. 
Notable models include CLIP [191], which establishes a bridge between vision and language; SAM [111], which generates a segmentation map for any given image; BLIP [129], which generates captions for images, GLIP [132], which links image regions to text, Stable-Diffusion [202], which generates images from given text; and GLIGEN [136], which generates grounded images. Here, our aim is tackle the task of weakly-supervised open-vocabulary object detection 172 and we utilize existing foundational models to forgo training on an explicit “base class” set and instead align the object features to a shared multi-modal image-text embedding space. Zero-Shot/Open-vocabulary object detection. aims to classify, detect or segment objects in images without being restricted to a predefined set of classes. Early work in Zero-shot Detection [14, 193] and Zero-Shot Grounding [211] aimed at aligning image and region features to word-embeddings. OVR-CNN [287] leveraged BERT [45] to pre-train a Faster-RCNN [199] on image-caption pairs before fine-tuning to downstream tasks. Following the success of CLIP, the field has diverged along multiple fronts. ViLD [67] and RegionCLIP [304] proposed distilling the rich features from CLIP into a detector. Another direction is to utilize additional data in the form of pseudo-boxes (PBBL [59]), imagenet-21k [42] and conceptual captions [220] (Detic [308]) or exploiting frozen VLM models (F-VLM [120]). More recent works such as OV-DETR [286], CORA [261] have focused on non-RPN architectures such as DETR [24]. However, in all of the above approaches, a detection head is needed to be trained on a set of base classes, and this constraint cannot be avoided. In contrast, our work is focused on learning without any annotation i.e., training with weak-supervision. Weakly-Supervised Object Detection (WSOD) is not new and has been previously explored in the context of training object-detectors with only image-level labels [18, 134, 222, 223, 251, 271]. However, these methods were limited by absence of proposals and had to rely on hand-crafted proposals [186, 243]. In contrast, we leverage the recent success of class-agnostic segmentation models, namely SAM [111], which has shown ability to segment even novel objects. A closely related field Semi-Supervised WSOD involves additionally using bounding box with image-labels. Detic [308] is a seminal work in this regard. However, different from Detic, our model doesn’t use any annotated detection data. 173 MaskViT Mask Head Step 1: SAM applied to the Image Regions Text A photo of a dolphin A photo of a trees A photo of a person Step 2: Segment-Text Matching MaskViT Features CLIP Text Features Step 3: Feature Alignment dolphin person water Step 4: Inference: Segment To Text A photo of a sky Figure 7.2: A schematic framework our Detect-Anything-Model (DAM). First SAM [111] is applied to obtain all the image segments. Then, a segment-text matching step is performed via CLIP [191] (details in Section 7.3.3). Then the features of the SAM model are aligned with those of CLIP text features. Finally, at inference time, one can use the best matching text features to obtain the class labels for all image segments. 7.3 Method We first describe our task WS-OVD i.e., open-vocabulary detection without annotated detection data (Section 7.3.1). Then we describe our proposed framework DAM (Section 7.3.2) and its variants for segmenttext matching (Section 7.3.3). 
7.3.1 Task: Weakly Supervised Open-Vocabulary Detection (WS-OVD) The Open-Vocabulary Object Detection (OVD) task involves localizing objects not necessarily seen during the training phase. Typically, a dataset such as COCO [145] or LVIS [69] is partitioned into “base” and “novel” categories. The model is trained on the “base” classes, with the feature space aligned to text embeddings of these classes, such as embeddings obtained from CLIP [191]. At inference time, the model uses its detection head to obtain a set of candidate proposals and extracts the corresponding visual features. Then, given the list of both “base” and “novel” categories, the nearest neighbor of these visual features in the text space is used as the object category. We term this formulation of OVD as OVD-BT. A key issue in OVD-BT is the dependence on an explicit detection head which is trained on a set of base classes and fails to generalize to other novel classes. For instance, a Region Proposal Network (RPN) [199] trained for objectness on a particular dataset would incorrectly classify “novel” object proposals as background. To this end, we leverage the Segment-Anything model (SAM) [111], which has shown the ability to deliver class-agnostic segmentation that effortlessly extends even to novel objects without explicit prior class-specific training. Given an image as input, SAM prompts the image with a strided grid of points and outputs three segmentation maps for each point. Finally, non-maximum suppression is applied to obtain the final segmentation map. Our task is to integrate semantic information into this segmentation output. One naive option is to use the same detection data as previous work and simply replace the RPN with SAM. In this scenario, we still depend on the annotations from a detection dataset for the “base” classes, which require meticulous labeling and extensive resources in both time and money. As such, the ability to train models on just raw images or image-text pairs becomes paramount. Such a capability makes the approach more adaptable to real-world scenarios where annotations might be scarce or unavailable. We denote this task as WS-OVD. More precisely, in the usual open-vocabulary detection task we are given a detection dataset D = {I, BB, C} where I denotes images, BB denotes boxes, and C denotes their corresponding categories. At train time we only have bounding box annotations for the base classes, but we are provided the additional novel classes at test time. A model M is trained only over these base classes, and at inference time the model should output objects in both base and novel classes. In our proposed task WS-OVD, we do away with the notion of base and novel classes; instead, we are only provided with a dataset of images D = {I} or paired image-text data D = {I, T}, and the task is to train a model M over these images, along with existing foundational models such as SAM and CLIP, to detect every object in the image.
7.3.2 Method Framework A crucial question for WS-OVD is how to obtain training data in the absence of expensive annotated datasets. We look to other foundational models such as CLIP [191] to enable SAM to not just segment but also understand and describe its segments. We are not concerned with the data SAM has been pre-trained on; instead, we only limit what additional training data we have access to. Suppose we have a diverse and large dataset with segment-text alignment; then we can train our SAM model for open-vocabulary tasks.
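As a concrete illustration of the class-agnostic first stage of this pipeline (Step 1 in Figure 7.2), the sketch below uses the publicly released segment-anything package; the checkpoint filename and grid density are illustrative assumptions rather than the exact settings used in this work.

```python
# Class-agnostic segment generation with SAM's automatic mask generator.
# The checkpoint path and points_per_side value are assumptions for
# illustration; they are not necessarily the settings used in this thesis.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)

# Each mask is a dict with a binary 'segmentation' map, an XYWH 'bbox',
# and a 'predicted_iou' score that is later combined with text similarity.
for m in masks[:3]:
    print(m["bbox"], m["predicted_iou"])
```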
This setup is similar to previous works, but with a few key differences. First, we don't train an RPN. Instead, we obtain the best ground-truth matches against the SAM outputs before the non-maximum suppression stage. Note that an image may contain regions which are simply not annotated. Thus, we specifically train the annotated patches to be foreground for their description phrase (referring phrase), which is treated as background for every other region. Since our aim is to enable open-vocabulary tasks, we align the SAM features with CLIP text features. At inference time, we can directly obtain the best match from a set of description phrases. We denote this model as DAM and illustrate it in Figure 7.2. More specifically, let I denote a given image, with the set of N1 segments generated by SAM denoted S = {S_i}_{i=1}^{N1}. We assume we have a set of text descriptions T = {T_i}_{i=1}^{N2} with N2 ≤ N1. These descriptions can be in the form of noun-phrases, image-tags, object names, or object descriptions. In particular, we are guaranteed to find at least one segment which corresponds to each text description. Let H : (I, T, S) → P(S) denote the mapping from a particular text T_i and a given image I to the set of segments it corresponds to, which we denote H(I, T_i, S) = Ŝ_i. For each non-empty Ŝ_i, i.e. |Ŝ_i| ≥ 1, we choose the segment with the highest matching score, denoted Ŝ_{i0}. Given this information, we can train our model M to align each pair (T_i, Ŝ_{i0}), treating them as positives, and each non-aligned pair (T_i, Ŝ_{j0}) with i ≠ j as a negative, which can be trained similarly to contrastive learning in CLIP.
Training. We note that segmentation is by itself a difficult task and as such we don't wish to “forget” such knowledge. Thus, we freeze the SAM architecture and only train an additional projection network on top of the segment features obtained from the MaskViT backbone. Since SAM uses a grid of points as a prompt to output a particular segmentation map, we take the feature of the cell the point corresponds to as the feature for that segment. We wish to align these features to the corresponding text features. Since our aim is to perform open-vocabulary detection, we align these segment features to the text features from CLIP at training time. At inference time, the model can utilize the user-provided classes and use the segment-text alignment to extract the corresponding class for a given segment. More precisely, let SF_i be the feature corresponding to the segment S_i and let TF_i be the CLIP text feature for the text description T_i. Then our task is to align (TF_i, SF_{i0}) while ensuring that TF_i does not align to the other segment features. Thus, we utilize a contrastive learning objective similar to that in CLIP [304]. If S_c denotes cosine similarity, and writing V_i for the matched segment feature SF_{i0} and T_i for the text feature TF_i, then
L_c = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log \frac{\exp(S_c(V_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(S_c(V_i, T_j)/\tau)} + \log \frac{\exp(S_c(V_i, T_i)/\tau)}{\sum_{k=1}^{N} \exp(S_c(V_k, T_i)/\tau)} \right]   (7.1)
General Inference. The general inference pipeline is relatively straightforward, with a few key differences from the training pipeline. A given image is passed through the SAM model to obtain features for each queried point, resulting in V = {V_j}_{j=1}^{P} where V_j ∈ R^d is the corresponding feature. Note that this step occurs before any filtering based on predicted IoU or non-maximum suppression; thus, all strided points are considered. During training, our loss function minimized the distance between these point features and the corresponding object text features.
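A minimal PyTorch sketch of the loss in Eq. 7.1 is given below, assuming a batch in which the i-th segment feature is matched to the i-th text feature; the default temperature is an illustrative choice. Note that F.cross_entropy averages over the batch, which supplies the 1/N factor in Eq. 7.1.

```python
import torch
import torch.nn.functional as F

def segment_text_contrastive_loss(seg_feats, txt_feats, tau=0.07):
    """Eq. 7.1: matched (segment, text) pairs share the same batch index;
    every other pair in the batch acts as a negative."""
    v = F.normalize(seg_feats, dim=-1)             # (N, d) segment features V_i
    t = F.normalize(txt_feats, dim=-1)             # (N, d) CLIP text features T_i
    logits = (v @ t.T) / tau                       # cosine similarities / temperature
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # segment-to-text term (sum over j)
    loss_t2v = F.cross_entropy(logits.T, targets)  # text-to-segment term (sum over k)
    return loss_v2t + loss_t2v                     # both terms, averaged over the batch
```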
Thus, at inference, we can directly perform a nearest-neighbor search via cosine similarity against the textual features of the possible object list. In the general case, we can use a user-provided list of classes and decide whether the segment feature should be classified as one of those classes based on some threshold τ (we choose τ = 0.2); otherwise it is classified as background. Mathematically, C_i = argmax_j S_c(V_i, T_j) s.t. S_c(V_i, T_j) > τ, where S_c denotes the cosine similarity of two vectors. Recall that each query point can have more than one corresponding segment, with each segment having its own predicted IoU score. We multiply this predicted IoU score with the cosine similarity and use the result as the new class probability score. This new class probability score is further filtered via non-maximum suppression (NMS). All operations are performed directly at the pixel level. Finally, each segment is converted to a bounding box for evaluation.
Inference for a particular dataset. We use the term “dataset” to denote the set of classes that are provided by the user at test time. Since each dataset has its own set of classes to be evaluated on, this information is needed before the NMS filtering is performed. Thus, if a given image has already been inferred for one particular set of classes, we can re-use the segment features V and save computational resources. However, the rest of the process, including computing the nearest neighbor, must be repeated.
7.3.3 Segment-Text Matching As noted in Figure 7.2, the second stage of Segment-Text matching plays an important role in the pipeline. Crucially, the better the Segment-Text matching we can perform, the better the results we obtain. Naive Matching via CLIP. One obvious method would be to first curate a large list of object categories and get the corresponding CLIP text features. Then, apply SAM on the original image to obtain all the segments of the image S = {S_i}_{i=1}^{N}. Crop each segment from the image, obtain CLIP image features, and use nearest-neighbor search to find the best matching text. In practice, we find this method yields very poor matching, with the best match often being text with no relation to the segment. We posit this occurs for a variety of reasons, ranging from changes in aspect ratio for stuff classes, to a distribution shift of the cropped inputs relative to CLIP's training data, to the heavy-tailed distribution of the classes. This observation is also consistent with previous work [137]. To address this limitation, we opt for the following three methods: (i) trimming the possible classes to a smaller set obtained from an image tagger, (ii) restricting to noun-phrases from a paired caption, or (iii) using a grounded image generator such as GLIGEN [136] to create a new image. We describe these methods in detail below and summarize them in Table 7.1.
Table 7.1: Comparison of different Segment-Text Matching approaches.
Approach   | Img Dataset | Img Quality | Phrases | Precision | Recall
Image-Tags | Existing    | Natural     | ✗       | ✓         | ✓✓
Captions   | Existing    | Natural     | ✓       | ✓         | ✓
GLIGEN     | Generated   | Synthetic   | ✓       | ✓✓        | ✓✓
DAM-IT: Using Image Tags. Two key issues with the naive CLIP matching are that the number of possible classes is large and that a large number of segments have no good matches. To address this, we utilize image-taggers such as the Recognize-Anything Model [298], which, given an image, provides a list of object categories present in the image. Thus the number of candidate categories reduces from the order of 5k to the order of 10 − 30. Now, instead of assigning each segment a class, we assign each image-tag a particular segment.
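A hypothetical sketch of this tag-to-segment assignment is shown below; the CLIP checkpoint and prompt template are illustrative assumptions, and the helper simply crops each SAM box and picks, for every tag, the segment with the highest CLIP similarity.

```python
# Sketch of the DAM-IT segment-text matching step: each image tag (e.g. from
# RAM) is assigned its single best-matching SAM segment, scored by CLIP
# similarity between the cropped box and "A photo of a {tag}".
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_tags_to_segments(image, masks, tags):
    # image: PIL.Image; masks: output of SamAutomaticMaskGenerator; tags: list[str]
    crops = []
    for m in masks:
        x, y, w, h = [int(v) for v in m["bbox"]]        # SAM boxes are XYWH
        crops.append(image.crop((x, y, x + max(w, 1), y + max(h, 1))))
    prompts = [f"A photo of a {t}" for t in tags]
    inputs = proc(text=prompts, images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    sims = out.image_embeds @ out.text_embeds.T         # (num_segments, num_tags)
    best = sims.argmax(dim=0)                           # best segment index per tag
    return {t: int(best[j]) for j, t in enumerate(tags)}
```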
Even though each image-tag could refer to multiple segments, in practice we find that using a hard threshold to admit multiple segments leads to poor precision. Thus, we only choose the best match and prefer precision over recall. DAM-IC: Using Image Captions. Instead of using an image-tagger, we can also utilize the large-scale paired image-text datasets which have been in use for vision-language models. Crucially, we extract the noun-phrases from the captions and use them instead of image-tags. Even though our recall for the objects in the image is compromised, the captions provide additional information such as attributes and other contextual information missing from image-taggers. Everything else in the model remains unchanged. DAM-G: Generating Images using GLIGEN. A key issue in both of the above approaches is that they require an existing image dataset to operate on and further require the matching to be precise. Here, we propose another method which circumvents this issue entirely. Instead of using existing image datasets, we rely on grounded diffusion models to generate images. In particular, we utilize GLIGEN [136], which takes as input a caption along with bounding boxes and corresponding category names. The output image contains the objects at the provided bounding boxes. Thus, we have a direct ground-truth matching between the bounding-box segment and the text. A key issue arises when the generated image is not natural looking; to circumvent this, we use CLIP-Score [80] to filter out such images.
7.4 Experiments We first describe the datasets used and the required data curation to train our models in Section 7.4.1, followed by baselines and implementation details in Section 7.4.3. Finally, we describe our main results and discuss further ablation studies in Section 7.4.5.
7.4.1 Datasets Training datasets. As noted in Section 7.3.3, we don't require datasets with bounding box annotations. For training, we primarily use the COCO train 2017 dataset but discard the bounding box and class annotations. To train with image-tags, we run the Recognize-Anything [298] model on the entirety of the COCO train 2017 dataset, which provides image-tags across 4585 classes. To train with image-captions, we extract the noun-phrases from the 5 associated paired captions in the COCO-Captions [145] dataset, collate them, and remove duplicates. Finally, to train with images generated via a grounded image generator such as GLIGEN [136], we consider two methods. Recall that GLIGEN takes as input a caption and object names with bounding boxes, and produces the corresponding image. To generate such images from scratch, we sample two objects from a distribution derived from existing object detection datasets and create a caption using pre-defined templates. While this approach limits the variety of the images, we can obtain unlimited amounts of images with different object pairs that are often difficult to find in natural images. In the second case, we use the image-generation method in tandem with image-tags and image-captions. In this case, we use the objects from the image-tags and the captions from the image-captions instead. This process can be considered a form of image augmentation. Since image generation often results in poor quality outputs, we apply a CLIP-based filter on the resulting generated images, similar to previous work [21]. We create a dataset with a fixed size approximately equal to 2× COCO and name the results GLG1 and GLG2 when using the first and second method respectively.
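Below is a minimal sketch of the GLG1 creation loop under stated assumptions: generate_image, clip_score, and get_tags are hypothetical stand-ins, supplied by the caller, for the GLIGEN “Box+Text” model, the CLIP-Score computation [80], and the RAM tagger respectively; the caption template and the 0.1 filtering threshold follow the description given later in Section 7.4.2.

```python
# Sketch of the GLG1 data-creation loop. The three callables are hypothetical
# stand-ins injected by the caller (GLIGEN "Box+Text" generator, CLIP-Score,
# and RAM tagger); only the control flow mirrors the procedure described here.
import random

def random_box():
    # A random XYXY box in normalized [0, 1] coordinates (illustrative only).
    x0, y0 = random.uniform(0.0, 0.6), random.uniform(0.0, 0.6)
    return [x0, y0, x0 + random.uniform(0.2, 0.4), y0 + random.uniform(0.2, 0.4)]

def build_glg1(tag_pool, generate_image, clip_score, get_tags, target_size=250_000):
    dataset = []
    while len(dataset) < target_size:
        tag_a, tag_b = random.sample(tag_pool, 2)         # sample two object tags
        caption = f"A photo of a {tag_a} and a {tag_b}"    # template caption
        boxes = [random_box(), random_box()]
        image = generate_image(caption, boxes, [tag_a, tag_b])
        if clip_score(image, caption) < 0.1:               # discard unnatural images
            continue
        tags = get_tags(image)                             # both objects must be recovered
        if tag_a not in tags or tag_b not in tags:
            continue
        dataset.append({"image": image, "boxes": boxes, "labels": [tag_a, tag_b]})
    return dataset
```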
The 2× criterion is needed because unlike COCO images, the image-generation method leads to significantly smaller number of objects per image. We provide more details on the dataset creation process as well as statistics in the supplementary. Downstream Tasks and Datasets. For the downstream task of open-vocabulary object detection we consider the following datasets: (i) COCO Generalized Detection dataset [145] (OVD-COCO) has 80 categories. Following previous work, we consider 48 base classes and 17 novel classes. We emphasize that our method makes no distinction between novel and base classes. We use the same dataset as previous methods using 107k images for training. (ii) LVIS [69] v1.0 which has 1203 categories. We follow previous work and consider 866 frequent and common categories as base classes and 337 rare categories as novel. We denote this split as OVD-LVIS. (iii) LVIS-FBase: uses the same dataset as LVIS v1.0 but a different split as introduced in EZAD [151]. Here, the 405 frequent categories are considered base, and 461 common and 337 rare categories as novel (total of 798 novel categories). 181 For the downstream task of referring expression we consider RefCoCo+ [280] which focuses on attribute object referring expression and omits spatial relations. Metrics. Following previous work, we report box-AP50 on COCO, and box-AP and mask-AP for LVIS. Since we don’t perform training on base-classes, for fair comparison we focus only novel-class AP50/AP but also report base and All AP for completeness. For RefCoCo+, the metric is simply Accuracy@1. 7.4.2 Additional Dataset Details Table 7.2: Dataset Statistics. COCO and LVIS have the same training image set so their statistics are identical. U-NP refers to Unique Noun-Phrases. # Imgs # Tags/Img # Cap/Img # U-NP/Img COCO/LVIS 113k 14.11 5 12.85 GLG1 250k - - 2 GLG2 250k 3 1 3 As noted in Section 7.3.3, our method relies on segment matches provided by CLIP to the tags from Recognize-Anything Model [298] (for DAM-IT), noun-phrases extracted from COCO-Captions (for DAMIC) and grounded image generation (for DAM-G). We note the dataset statistics in Table 7.2. Obtaining Tags. We use Recognize-Anything Model (RAM) to obtain tags. We use RAM with SwinLarge [150] trained on 14M images namely from COCO [145], VG [118], SBU [172], CC3M [220], CC3Mval, CC12M [26]. It lists 4585 tags. We use the default parameter settings i.e., an image-size of 384 with a threshold of 0.68 for the label output. We find that on average, we get 14.11 tags for each image. We assign the best matched image-segment. Noun-Phrases from Captions. We use spacy [84] to extract Noun-Phrases from the paired images. We use the model ‘en_core_web_lg‘ and use ‘noun-chunks’ to extract out the noun-phrases from each sentence. COCO Captions [145] contains 5 captions for each image. We apply the noun-phrase extractor on each caption separately and then collate them and remove duplicates. 182 (a) Passed (b) Failed Tags dog; apple Caption: A photo of a dog and apple. Boxes: Randomly generated Figure 7.3: An example from GLG1 dataset. We sample the tags “dog” and “apple” which is used to create a caption using a fixed template. The boxes are generated at random and used as input and overlayed on the images for visualization purposes only. The same parameters are used for both (a) and (b), with the only difference being the random seed. Only (a) passes both CLIPScore Filter and the obtained Tag list. Grounded-Image Generation. We also propose using synthetic images to train WS-OVD models. 
To this end, we consider two settings: (i) where we train the model only on synthetically generated data and no other data. (ii) where we train the model on regular image data and additionally synthetically generated data. These result in two datasets namely GLG1 and GLG2. The images in these dataset are created using GLIGEN [136]. We use “Box+Text” diffusion model. The model takes in “boxes” and its “labels” and a corresponding caption. We use default parameters, with αsample = 0.3 and guidance scale of 7.5, and set image to be of size 512 × 512. We now describe their creation. GLG1. Here, we assume we don’t have access to any image dataset. Instead, we have access to a distribution of the object tags and categories. Thus, we sample two object-tags from a unimodal distribution of objects. We obtain this distribution from the unique noun-phrases in COCO-Captions and Conceptual Captions and restrict our set to the 4585 tags provided in RAM-14M. To create a particular image, we sample two object-tags. We restrict to only two object-tags as we found GLIGEN often failed to generate 183 Caption: A smiling woman in winter outerwear holding a bowl. Tags Used: bowl; woman Boxes: Randomly generated (a) (b) (c) (d) Figure 7.4: Example from GLG2 dataset. Given the original image in (a), we use its caption and sample two tags also found in the caption to generate new images (b), (c), and (d). Bounding boxes are randomly sampled (kept same for (c) and (d)). Even though the resulting images don’t exactly match the captions, these are still valuable for detection. all provided object-tags and further the random selection of more than two tags resulted in very unnatural images. We then create the caption using a template such as “A photo of X and Y ”. If the two sampled tags are “dog” and “apple”, the caption would be “A photo with a dog and apple.” We generate the boxes at random. Once generated, we perform two checks. First, we discard any images with CLIP-Score [80] less than 0.1. Second, we use the RAM-14M model to check if it detected the two objects. If either of the criteria are not met, we discard the output. We continue the process till we collect 250k images. Figure 7.3 illustrates this with an example. GLG2 Here, we assume we have access to an existing image dataset (with paired caption) on which we would train our WS-OVD model. Thus, instead of sampling random object-tags, we can instead sample the same tags and captions used in the original image-dataset. For each image, we have the tag-list as obtained from RAM as well as 5 corresponding coco-caption. We choose the caption with the most tags matched and randomly sample 3 tags common in the tag-list as well in the noun-phrases. We use the caption as is, use the provided tags, but still generate the boxes at random. We use the same filtering process as GLG1 and continue till we obtain 250k images. Figure 7.4 illustrates this with an example 184 7.4.3 Baselines and Implementation Details Baselines. We consider multiple competitive baselines and provide system-level details such as detection backbone and CLIP model in the tables (see Table 7.3) We provide brief summary of the compared models in supplementary. Since, none of the previous baselines use SAM [111] as a detection backbone, we design a new baseline dubbed DAM-ViLD which replicates the ViLD design but replaces the proposals generated by the RPN with SAM model. Implementation Details. For each of the DAM models, we follow the same training schedule and parameters. 
We use the MaskViT-H [72] model from SAM for the initial segmentation but when training we utilize MaskViT-B model due to constrained training resources. The initial segmentation map is used only for the purpose of Segment-Text matching. We note that the backbone is frozen so that the model is not over-fitted to our training set. We use AdamW [152] with learning rate of 1e −5 with a 10% warmup to 1e −4 followed by a linear decay. We train models for 10 epochs with batch size of 8. The images are scaled to have smaller side 256. For CLIP, we use ViT-B/32 model to be consistent with other methods but also compare with ViT-B/16 and ViT-L/14 models. To generate image-tags, we use RAM-14M model [298] with a tag-list of 4585 tags. For image-captions, to extract noun-phrases we use Spacy with “en _core_web_lg” model and only store the lemmatized words and remove duplicates across the 5 captions. Finally, for image-generation, we use GLIGEN [136] model with “Box+Text” modality built on Stable-Diffusion v1.4 [202]. Each generated image is passed through a filtering process using CLIP ViT-L/14 and an image with a CLIP-Score [80] of less than 0.2 is discarded (threshold decided heuristically). We note that the above are offline process and the same outputs on COCO and can be re-used for LVIS. 185 7.4.4 Additional Baseline Details We briefly describe the baselines we compared in Table 7.3 and Table 7.4: • OVR-CNN [287] popularized the task of OVD. The authors proposed training a Faster-RCNN [199] using image-caption data and then fine-tuning it for box prediction. The initial pre-training is similar to language pre-training losses such as mask-language modeling [45] using region features. • ViLD. [67] trains a Region Proposal Network (RPN) [199] such that the RoI-Aligned features of the “base” categories are aligned with the image features of the same region when it is cropped and passed via CLIP. Since image features from CLIP are already aligned with the CLIP text features, the model which produces CLIP image-like features for the region can use cosine distance to find the best match. Our DAM-ViLD uses the same idea with the RPN replaced with SAM [111]. • OV-DETR [286] follows similar idea as ViLD but proposes to replace the RPN architecture with that of a transformer based architecture such as DETR [24]. The key insight is that the RPN are not conditioned on the classes and therefore would miss the novel classes since it treated these novel classes as negatives during training. In the DETR architecture, the model takes additional “class” names embedding as conditional query inputs. • OADP [254] suggests that a simple center crop of the object to train the alignment of the objectdetection head with CLIP features is noisy and throws away informative parts. Thus, they suggest combining losses from global (whole image), block (patchwise) and object-centric (cropped image). This can be seen as an augmentation strategy. • CORA [261] notes that while DETR architecture as used in OV-DETR is useful, at test-time it requires per-class inference since queries are conditioned on the class-names which makes the inference very slow. To circumvent this issue, the authors suggest an anchor pre-matching step using the Region-features from CLIP. To account for the difference in CLIP image-level encoder features to 186 region-pooled features, the authors propose Region-Prompting and they find these region prompts generalize to novel classes. 
• Detic [308] suggests training existing object detectors which are typically trained on limited number of classes to a large number of image-classification datasets. The key idea is to simply assign the image-level labels to the largest region-proposal. • RegionCLIP [304] proposes training an RPN similar to ViLD but crucially instead of aligning the region-features to region-cropped CLIP image features, the authors suggest aligning it to the CLIP text embedding. • Cond-Head [255] proposes to additionally condition the object detection and mask segmentation head on the semantic embedding of the detected classes. • F-VLM [120] proposes using frozen pre-trained vision-language models (VLM) primarily for region classification. The authors suggest training just the detection or mask head on VLM on base classes for box regression and using the VLM (frozen) only for region classification at inference. • PBBL [59] proposes generating and then training over pseudo boxes obtained using image-text datasets. 7.4.5 Results on Open-Vocabulary Object Detection Results on OVD-COCO. are reported in Table 7.3. We observe (i) The three models DAM-IT, DAM-IC and DAM-G have very similar performance on each of the novel, base and all classes. This is expected since we make no distinction between the three at train time. (ii) DAM-ViLD has similar performance on base classes and significantly higher on novel class compared to methods such as ViLD and RegionCLIP trained specifically on the base-classes suggesting detection head gets overfitted. Thuse DAM-ViLD has the advantage over methods that employ pseudo-labels 187 Table 7.3: Results on OVD-COCO. Here, M-R50 refers to Faster-RCNN [199] or Mask-RCNN [73] with Res50 [75] backbone. Detection CLIP Annotation AP50 Boxes Extra Novel Base All ViLD [67] M-R50 ViT-B/32 ✓ ✗ 27.6 59.5 51.3 OV-DETR [286] Deform DETR ViT-B/32 ✓ ✗ 29.4 61 52.7 OADP [254] M-R50 ViT-B/32 ✓ ✗ 30 53.3 47.2 CORA [261] DAB-DETR RN50 ✓ ✗ 35.1 35.5 35.4 CORA [261] DAB-DETR RN50x4 ✓ ✗ 41.7 44.5 43.8 Detic [308] M-R50 ViT-B/32 ✓ IN-21K 27.8 47.1 45 OVR-CNN [287] M-R50 BERT ✓ Captions 22.8 46 39.9 F-VLM [120] M-R50 RN50 ✓ Captions 28 - 39.6 PBBL [59] M-R50 ALBEF ✓ PseudoBox 30.8 46.1 42.1 RegionCLIP [304] M-R50 RN50 ✓ CC3M 31.4 57.1 50.4 CondHead [255] M-R50 RN50 ✓ CC3M 33.7 58 51.7 CORA+ [261] DAB-DETR RN50x4 ✓ Captions 43.1 60.9 56.2 DAM-ViLD SAM-B/16 ViT-B/32 ✓ ✗ 39.3 60.4 53.2 DAM-IT SAM-B/16 ViT-B/32 ✗ Img-Tags 40.4 39.1 39.5 DAM-IC SAM-B/16 ViT-B/32 ✗ Captions 38.2 38.8 38.5 DAM-G SAM-B/16 ViT-B/32 ✗ ✗ 35.8 37.7 36.9 which may introduce noise, such as PBBL, and OADP, inevitably hurting their performance on base categories. (iii) DAM-IT variant performs better than DAM-ViLD likely due to RAM model’s precise image-tag outputs facilitating effective alignment of image segments with their textual descriptions. (iv) DAM-IC performs slightly worse compared to DAM-IT likely because DAM-IC is effective in accurately detecting a range of objects provided they are present in the caption which is a narrower set compared to predicted image-tags. (v) While DAM-G underperforms other variants, it should be noted that the training is performed on synthetically generated data without using any particular dataset. Moreover, the only information used is the unimodal distribution of the objects which suggests a naive augmentation of the data is still very valuable for training. However, we further note that the base performance is relatively higher than for 188 Table 7.4: LVIS object detection results. 
* denotes using CLIP RN50, ** denotes using CLIP RN50x4. r, c, f refer to rare, common, frequent respectively. Detection Annotation Box AP Mask AP Boxes Extra APr APc APf AP APr APc APf AP ViLD M-R50 ✓ ✗ 16.7 26.5 34.2 27.8 16.6 24.6 30.3 25.5 OV-DETR Deform DETR ✓ ✗ - - - - 17.4 25 32.5 26.6 OADP M-R50 ✓ ✗ 21.9 28.4 32 28.7 21.7 26.3 29 26.6 CORA** DAB-DETR ✓ ✗ 22.2 - - - - - - - Detic M-R50 ✓ IN-21K 26.2 31.3 36.6 32.5 19.5 - - 30.9 RegionCLIP* M-R50 ✓ CC3M 17 27.2 34.3 28.2 17.4 26 31.6 26.7 CondHead* M-R50 ✓ CC3M 19.9 28.6 35.2 29.7 20 27.3 32.2 27.9 DAM-ViLD SAM-B/16 ✓ ✗ 23.1 28.6 35.1 29.4 21.6 27.8 34.5 30.4 DAM-IT SAM-B/16 ✗ Img-Tags 23.5 25.2 25.1 24.6 18.9 19.3 20.4 19.7 DAM-IC SAM-B/16 ✗ Captions 22.8 24.9 24.7 24.1 17.5 18.4 18.9 18.7 DAM-G SAM-B/16 ✗ ✗ 19.5 21.2 22.7 21.3 15.1 17.6 18.1 17.9 novel class suggesting possible oversampling of the base classes or a better image generation quality by GLIGEN for those classes. (vi) Our DAM variants negates the need for the re-training required by methods such as ViLD, and OVDETR that could utilize novel class names during training. Instead, the exact same model can be re-used for other datasets such as on LVIS. Results on OVD-LVIS. We provide the box AP and mask AP results in Table 7.4. We first note that except for DAM-ViLD variant, our other DAM are exactly the same as those used in Table 7.3. Since, COCO and LVIS share the same images, our training pipeline is identical with the only difference in the classes provided by the user during evaluation. This is an exciting use case where a user needs to train the model only once on a collection of images. We further find most of the trends observed for COCO hold for LVIS as well for both box and mask AP. However, we do find that the performance on mask AP is somewhat poor relative to box AP. We attribute this to instance segmentation being more difficult task, and thus access to ground-truth for training is very helpful. Results on LVIS-Fbase In Table 7.5, we note box AP results with the same model but with a different split for Base and Novel categories as proposed in EZAD [151] where the rare and common are clubbed 189 Table 7.5: Box AP on LVIS with FBase Split. Numbers for ViLD and EZAD on FBase split are taken from [151]. APr APc APf AP ViLD 11.3 13.2 26.4 18.5 EZAD 13.1 14.3 31.9 21.3 DAM-IT 23.5 25.2 25.1 24.6 DAM-IC 22.8 24.9 24.7 24.1 DAM-G 19.5 21.2 22.7 21.3 Table 7.6: Acc@1 on RefCoCo+. “FT”: Finetuned, “ZS”: Zero-shot. val testA testB MDETR FT 81.13 85.52 72.96 CLIP-ZS ZS 78.12 81.68 71.54 DAM-IC ZS 75.14 77.11 67.15 DAM-IC FT 83.37 84.36 79.23 into the novel split. This ensures the evaluation on the novel classes is more representative. Since DAM isn’t trained with base novel classification, we can directly apply our previously trained model without any modifications. We find that Referring Expressions task involves localizing a given query object and has a similar formulation to the OVD. Instead of providing the possible object classes, the user provides exactly one query. The model then chooses the best possible segment for that query. To test our model’s capability to generalize to such tasks, we use DAM-IC which is trained using noun-phrases obtained from image-captions in Table 7.6. We don’t compare DAM-IT or DAM-G since they are primarily trained on object names without any attribute information whereas DAM-IC is trained with noun-phrases which often contains attribute-noun pairs. 
We also provide a separate baseline CLIP-ZS where each segment obtained from the SAM model is individually processed and then ranked with respect to the query. This is obviously a very inefficient method but it provides a robust baseline and in some sense an upper bound. Interestingly, DAM-IC performs competitively in the zero-shot setting, i.e., the same model used in open-vocabulary object detection is used. Moreover, when fine-tuned on the target task, we find the model outperforms MDETR [96] which is a strong fine-tuned baseline on val and testB set. On testA set which involves “person” class, the model performs worse. Effect of CLIP Model for Matching Typically, better segment-text matching should result in stronger model. To test this hypothesis, in Table 7.7 we compare use of different backbones for CLIP. We compare 190 Table 7.7: Ablation for different CLIP model Novel Base All ViT-B/32 DAM-IT 40.4 39.1 39.5 DAM-IC 38.2 38.8 38.5 ViT-B/16 DAM-IT 43.8 42.1 42.7 DAM-IC 41.4 40.9 41.1 ViT-L/14 DAM-IT 44.7 42.9 43.4 DAM-IC 43.3 41.6 42.4 Table 7.8: Fusion Ablation. E: Ensemble, LF: Late Fusion, EF: Early Fusion DAM-IT+IC DAM-IT+IC+G Novel Base All Novel Base All ViT-B/32 E 41.1 39.8 40.2 40.8 39.4 39.9 LF 41.5 40.1 40.5 40.9 40.2 40.5 EF 42.3 40.5 41.3 43.5 44.1 44 ViT-B/16 E 43.5 40.3 41.3 43.7 40.6 41.9 LF 44 42.9 43.5 44.4 43.6 44.1 EF 44.3 42.1 42.9 45.7 45.1 45.3 the two models DAM-IT and DAM-IC. We don’t compare DAM-G since GLIGEN provides the segmenttext matching on its own and we don’t have to rely on an external CLIP model. As expected, we find significant improvement on COCO Detection when using ViT-B/16 or ViT-L/14 by 2-3 AP50 points. This is encouraging as the matching process is only performed as a pre-processing step. Therefore, using a large ViT model for CLIP matching is not an issue. Effect of Fusing DAM variants. We note the DAM variants which have different segment-text matching techniques are in fact complementary to each other. In Table 7.8 we report results when fusing these different variants with both ViT-B/32 and ViT-B/16 CLIP models. We consider three ways to fuse them: (i) Ensemble: is a test-time method where we simply average the cosine similarity score from the different trained models requiring no extra training. (ii) Late Fusion: a linear layer takes the concatenation of the different features to produce a new vector. Crucially, the individual models are frozen except for the last linear layer. (iii) Early Fusion: we combine the different matching techniques to obtain better matches. That is, we end up training exactly one model but with better matches. While Ensemble and Late Fusion methods work the same when combining image-tags and captions (DAM-IT+IC) or combining all three (DAM-IT+IC+G), the Early Fusion has some crucial difference. For DAM-IT+IC, we collate the tags as well as the noun-phrases as the first step and treat this new set similar 191 Table 7.9: Results on ODinW. All numbers are AP50. wo/P and w/P refers to with and without prompting. GLIP-T (w/P) DAM-IT+IC+G (wo/P) DAM-IT+IC+G (w/P) Pascal-VOC 56.2 38.7 38.7 AerialDrone 12.5 4.9 9 Aquarium 18.4 5.86 14.8 Rabbits 70.2 63.9 66.7 EgoHands 50 41.9 46.1 Mushrooms 73.8 51.4 65.1 Packages 72.3 59.6 68.1 Raccoon 57.8 40.3 54.9 Shellfish 26.3 8.8 22.2 Vehicles 56 38.5 52.6 Pistols 49.6 32.1 46.6 Pothole 17.7 10.2 14.5 Thermal 44.1 26.6 40.8 Avg 46.53 32.52 41.55 to tags and train similar way to DAM-IT. 
For DAM-IT+IC+G, we perform same process as above tags and captions but crucially instead of using GLG1 dataset which was obtained by sampling object names, we create a new GLG2 dataset where the tags are the object names and captions from the image are used as input to GLIGEN. We note that GLG2 dataset can be created in a pre-processing step. The resulting images can be considered to be an augmentation on the input image while preserving the semantics. From Table 7.8, while a simple ensemble improves over individual models, adding DAM-G naively leads to worse performance. This holds true even for Late Fusion likely because the distribution of GLG1 is very different from that of DAM-IT and DAM-IC. However, in the early fusion case, this issue is alleviated since we use GLG2 similar to COCO distribution. The early fusion of the three models DAM-IT+IC+G gives us the best performing model on OVD-COCO. 7.5 Results on ODinW We report zero-shot results on ODinW in Table 7.9. We follow the Object-Detection in the Wild (ODinW) on 13 datasets as used in [132] and follow the same settings including the prompts. We only evaluate in 192 the zero-shot setting. We use the DAM-IT+IC+G early fusion model for comparison which is trained with image-tags, image-captions and image-generations. In the main paper it is in Table 7.8. We compare it with GLIP-T, however, we note that GLIP-T has been trained on large dataset with labeled bounding boxes and classes. We find that having additional prompts as introduced in previous work [132] indeed helps. In fact, it increased our AP50 by over 10 points. This in contrast to previous work where the improvement was in the order of 1−3 points. 7.6 Visualizations Figure 7.5: Top row: First is the original image, Second is the COCO-Ground-Truth. Bottom row: First is the raw segmentation output from SAM. Second is the output provided in open-space. Third is the output provided in COCO-vocabulary. We visualize our model outputs in Figure 7.5 and Figure 7.6 for COCO and LVIS respectively. We provide the original and ground-truth images in the top row followed by the SAM output, DAM-IT output when no target dataset is provided, and DAM-IT output when restricted to COCO-categories in Figure 7.5. Similar for LVIS-categories in Figure 7.6. When no target dataset is provided, we use the entire list of 4585 tags as used by RAM. 193 Figure 7.6: Top row: First is the original image, Second is the COCO-Ground-Truth. First is the raw segmentation output from SAM. Second is the output provided in open-space. Third is the output provided in LVIS-vocabulary. We note that when no output space is provided, the model is able to find objects such as radiator which are not present in the 80 object category set. However, it disappears when we restrict to the 80 categories of COCO. We also note that there are multiple unclassified segments obtained from SAM. This is because we have a high threshold for inference at 0.8 to avoid false positives. We also note that some of the class names changed such as “television” to “tv”. For LVIS, interestingly, we find that when we restrict to LVIS classes only, we are additionally able to classify the “mirror” and we ignore “window” which is not in LVIS vocabulary. 194 7.7 Conclusion In this work, we analyze the task of WS-OVD where the aim is to train a open-vocabulary model without annotated detection data. Our proposed DAM framework leverages multi-modal models to match class-agnostic segments obtained from SAM with textual descriptions. 
We contribute three variants utilizing image-tags, image-captions and image-generators for effective segment matching. Experiments on multiple OVD datasets validate the effectiveness of the three variants and their combination. 195 Chapter 8 Conclusions and Future Direction In this thesis, we have pushed the boundaries of image and video understanding by developing sophisticated models that actively interpret and analyze the complex interactions between visual and linguistic elements. Our research has methodically tackled the pressing need for artificial intelligence systems to go beyond traditional object recognition, venturing into a nuanced understanding of visual content through semantic roles. We first aimed at enhancing object entity localization and object relation understanding. For that we looked at zero-shot grounding in Chapter 2 to allow localizing novel objects from a given natural language query phrase. Next, in Chapter 3 we analyzed the problem of video object grounding to localize referred objects in a video. Then, we focused on the understanding actions and human-object interactions. To that end, we contribute Video Question Answering with Phrases in Chapter 4 where we pose Video-QA as a fill-in-theblank task and treat it as a generative task. Then, we introduce the task of Visual Semantic Role Labeling in Videos in Chapter 5 which require identifying salient actions, the corresponding semantic role coreferenced across videos and describing event relations. Finally, we considered two tasks which require finer understanding of images and videos. We propose the task of reasoning across images and videos in Chapter 6 where the model is provided a statement to 196 be classified as true or false statement and further provide a reasoning. Lastly, we propose the task of open-vocabulary object detection in a weakly supervised manner in Chapter 7 where the model has to detect all objects for a given vocabulary provided at test time with no existing annotations. At this point, it is imperative to get a birds-eye view in how our research can affect the real world. In my humble opinion, the practical implications are vast. Our datasets, models, evaluation techniques can directly enhance fine-grained content retrieval where a user ask for specific content, help in automated structured captions to generate sequence graphs which can be easily queried, and finally producing common-sense graphs to build world models to have future AI models be grounded in the real-world. Future Work. The question to consider for future work is what research gaps exist currently and how they could potentially be mitigated. I believe even with the latest advancements in Vision-Language research such as the prevalence of large-scale image-language models, video understanding remains an ongoing challenge. A key challenge in existing work is handling diverse media content using the same model. Currently, each video domain has its own specific models. For instance, instructional videos, movie-clips, user-generated videos, and sports-videos all have specific needs. The video-audio alignment is stronger in instructional videos but it also has a significant portion of it with a person just talking to the audience. Movie-Clips have shot-changes and the dialogue often has no correspondence to what is actually shown in the video. 
Similarly, user-generated videos may have exaggerated skits which require common-sense understanding and sports videos requires specific tracking of visually similar looking players which may not be needed in other domains. Future research could expand our proposed frameworks to include more diverse media content so that the models are highly robust and generalizable, and potentially incorporate real-time content analysis. The other key area is exploiting unaligned video text data. Compared to the paired video-image-text data, there exists orders of magnitude larger unaligned video-image-text content on the internet such 197 as videos from YouTube and texts from books and images from galleries. Exploiting such large data to improve existing supervised or weakly supervised models remains an under-explored topic with lot of potential benefits. The contributions in this thesis are complementary to fundamental advancement in large-scale pre-training and thus the potential to use unsupervised learning techniques could enhance the efficiency of training a video-language model. 198 Bibliography [1] Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. “Video description: A survey of methods, datasets, and evaluation metrics”. In: ACM Computing Surveys (CSUR) 52.6 (2019), pp. 1–37. [2] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Apostol Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. “YouTube-8M: A Large-Scale Video Classification Benchmark”. In: ArXiv abs/1609.08675 (2016). [3] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. “VQA: Visual Question Answering”. In: International Journal of Computer Vision 123 (2015), pp. 4–31. [4] Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. “Audio visual scene-aware dialog”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 7558–7567. [5] Peter Anderson, X. He, C. Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering”. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 6077–6086. [6] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. “Bottom-up and top-down attention for image captioning and visual question answering”. In: CVPR. 2018. [7] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. “Vqa: Visual question answering”. In: ICCV. 2015, pp. 2425–2433. [8] A. Bagga and B. Baldwin. “Entity-Based Cross-Document Coreferencing Using the Vector Space Model”. In: COLING-ACL. 1998. [9] M. Bain, Arsha Nagrani, A. Brown, and Andrew Zisserman. “Condensed Movies: Story Based Retrieval with Contextual Embeddings”. In: ACCV. 2020. 199 [10] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. “Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval”. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), pp. 1708–1718. [11] Collin F Baker, Charles J Fillmore, and John B Lowe. “The berkeley framenet project”. In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. 1998, pp. 86–90. [12] Satanjeev Banerjee and Alon Lavie. 
Abstract
While machine learning research has traditionally explored image, video, and text understanding as separate fields, the surge in multi-modal content in today's digital landscape underscores the importance of computational models that adeptly navigate the complex interactions between text, images, and videos. This dissertation addresses the challenge of grounding language in visual media: the task of associating linguistic symbols with perceptual experiences and actions. The overarching goal of this dissertation is to bridge the gap between language and vision as a means to a "deeper understanding" of images and videos, enabling models capable of reasoning over longer time horizons such as hour-long movies, a collection of images, or even multiple videos.
A pivotal contribution of my work is the use of Semantic Roles for images, videos, and text. Unlike previous works that primarily focused on recognizing single entities or generating holistic captions, Semantic Roles facilitate a fine-grained understanding of "who did what to whom" in a structured format. They retain the flexibility of free-form language phrases while being as comprehensive and complete as entity recognition, thus enriching the model's interpretive capabilities.
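To make the representation concrete, below is a minimal sketch (in Python) of what a semantic-role style annotation for a short clip might look like; the schema used here (a verb label, PropBank-style role names, per-frame bounding boxes) is an illustrative assumption rather than the exact format used in the dissertation's datasets.

# A minimal sketch of a semantic-role style annotation for a short video clip.
# The field names (verb, role labels, per-frame boxes) are illustrative assumptions,
# not the exact schema used in the dissertation's datasets.
from dataclasses import dataclass, field
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class Role:
    phrase: str                                            # free-form phrase filling the role
    boxes: Dict[int, Box] = field(default_factory=dict)    # frame index -> bounding box

@dataclass
class SRLAnnotation:
    verb: str                                              # the action, e.g. "throw"
    roles: Dict[str, Role] = field(default_factory=dict)

# "Who did what to whom": Arg0 = agent, Arg1 = thing thrown, Arg2 = recipient
example = SRLAnnotation(
    verb="throw",
    roles={
        "Arg0": Role("a man in a red shirt", {0: (12.0, 30.0, 180.0, 420.0)}),
        "Arg1": Role("a frisbee", {0: (200.0, 80.0, 260.0, 130.0)}),
        "Arg2": Role("a brown dog", {5: (300.0, 220.0, 460.0, 410.0)}),
    },
)

Keeping free-form phrases as role fillers while requiring every role to be localized is what lets such a representation combine the flexibility of captions with the completeness of entity recognition.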
In this thesis, we introduce the various vision-language tasks developed during my Ph.D. These include grounding unseen words, spatio-temporal localization of entities in a video, video question answering, visual semantic role labeling in videos, reasoning across more than one image or video, and finally, weakly-supervised open-vocabulary object detection. Each task isolates a particular phenomenon of image or video understanding and is accompanied by dedicated datasets, model frameworks, and evaluation protocols designed to be robust to data priors.
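As an illustration of how grounding predictions are commonly scored, the following is a minimal sketch of a box-overlap (IoU) accuracy metric; the (x1, y1, x2, y2) box format and the 0.5 threshold mirror common practice and are assumptions here, not necessarily the exact protocol adopted for each task in this thesis.

# A minimal sketch of an IoU-based grounding metric, assuming axis-aligned boxes
# in (x1, y1, x2, y2) format; the 0.5 threshold mirrors common practice.
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """Fraction of predicted boxes that overlap their ground-truth box above `thresh`."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0

print(grounding_accuracy([(0, 0, 10, 10)], [(1, 1, 10, 10)]))  # IoU ~0.81 -> counted as a hit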
The resulting models can be used for downstream tasks such as obtaining common-sense knowledge graphs from instructional videos, or to drive end-user applications like retrieval, question answering, and captioning. By facilitating a deeper integration of language and vision, this dissertation represents a step forward toward machine learning models capable of a finer understanding of the world around us.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Bridging the visual reasoning gaps in multi-modal models
Common ground reasoning for communicative agents
Incorporating large-scale vision-language corpora in visual understanding
Building generalizable language models for code processing
Integrating annotator biases into modeling subjective language classification tasks
Identifying and mitigating safety risks in language models
Multimodal reasoning of visual information and natural language
Adapting pre-trained representation towards downstream tasks
Balancing prediction and explanation in the study of language usage and speaker attributes
Countering problematic content in digital space: bias reduction and dynamic content adaptation
Learning shared subspaces across multiple views and modalities
Modeling, learning, and leveraging similarity
Externalized reasoning in language models for scalable and trustworthy AI
Speech and language understanding in the Sigma cognitive architecture
Semantic-based visual information retrieval using natural language queries
Emphasizing the importance of data and evaluation in the era of large language models
Syntax-aware natural language processing techniques and their applications
Towards generalized event understanding in text via generative models
Event detection and recounting from large-scale consumer videos
Annotating FrameNet via structure-conditioned language generation
Asset Metadata
Creator: Sadhu, Arka (author)
Core Title: Grounding language in images and videos
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2024-05
Publication Date: 05/17/2024
Defense Date: 04/23/2024
Publisher: Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tag: computer vision, image understanding, machine learning, natural language processing, OAI-PMH Harvest, video understanding
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Nevatia, Ramakant (committee chair), Mintz, Toby (committee member), Ren, Xiang (committee member)
Creator Email: ark.sadhu2904@gmail.com, asadhu@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113940023
Unique identifier: UC113940023
Identifier: etd-SadhuArka-12939.pdf (filename)
Legacy Identifier: etd-SadhuArka-12939
Document Type: Dissertation
Rights: Sadhu, Arka
Internet Media Type: application/pdf
Type: texts
Source: 20240517-usctheses-batch-1153 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu