Semantic-based Visual Information Retrieval using Natural Language Queries

by

Rama Kovvuri

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2019

Copyright 2019 Rama Kovvuri

Dedication

To Sushma
My wife, companion and best friend. She always understood.

Acknowledgments

I would like to acknowledge all the people whose continuous support helped me traverse the tough waters of graduate studies. First and foremost, I would like to thank my advisor, Prof. Ramakant Nevatia. His guidance, broad knowledge of the fields of Computer Vision and Artificial Intelligence, and insights were valuable for my learning experience. I have learnt a lot from our discussions about research, life and other things, and consider myself lucky to be part of the IRIS lab. I would also like to thank Prof. Jyotirmoy Deshmukh, Prof. Panayiotis Georgiou, Prof. Antonio Ortega and Prof. Ulrich Neumann for dedicating their invaluable time to serve on my thesis defense and qualification committees; Prof. Anoop Namboodiri for introducing me to the wonderful field of computer vision; and Kan Chen and Arnav Agharwal for the collaboration during my graduate studies.

I am also thankful to all the friends that I have made along this path. Thanks to Kiran Matam, Matthias Hernandez, Arnav Agharwal, Kan Chen, Pramod Sharma, Zhenheng Yang, Chen Sun, Krishna Narra, Chuang Gan and so many others for being there, be it for research advice or for a stress-relief getaway.

Lastly, and most importantly, I would like to thank my mom, who gave me all the support and understanding to choose my career path; my grandfather, for always encouraging my academic pursuits; my in-laws, for being patient during the long years; and my family, for all the encouragement. A very special thanks to my dear wife for being my listening ear when frustrated, for sharing my joy during success and, most importantly, for always being there.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Problem Statement
  1.2 Challenges
  1.3 Existing Frameworks
  1.4 Contributions overview
    1.4.1 Segment-based models for multimedia event retrieval
    1.4.2 Video Embeddings for semantic-based video retrieval
    1.4.3 Better proposals and Query-based regression for phrase grounding
    1.4.4 Phrase interrelationships and context for phrase grounding
    1.4.5 Modular phrase attention Network for phrase grounding
    1.4.6 Knowledge transfer for weakly-supervised phrase grounding
  1.5 Thesis Outline
2 Related Work
  2.1 Event Retrieval in Videos
  2.2 Phrase Grounding in Images
3 Semantic models for Event Detection and Recounting in videos
  3.1 Introduction
  3.2 Segment-based Models
    3.2.1 Seed Initialization
    3.2.2 Iterative positive segment mining
    3.2.3 Model Selection
    3.2.4 Model Testing
    3.2.5 Model Recounting
  3.3 Zero-shot Event Retrieval
    3.3.1 Continuous Word Space (CWS) Embedding
      3.3.1.1 Mid-level semantic video representation
      3.3.1.2 Concepts and Queries to Continuous Word Space
      3.3.1.3 Embedding Function
      3.3.1.4 Scoring videos
    3.3.2 Concept Space (CoS) Embedding
    3.3.3 Dictionary Space (DiS) Embedding
  3.4 Experiments
    3.4.1 Dataset
    3.4.2 Object Bank
    3.4.3 Evaluation
      3.4.3.1 Training parameters
      3.4.3.2 Multimedia Event Detection
      3.4.3.3 Multimedia Event Recounting
    3.4.4 Zero-shot Event Retrieval
  3.5 Conclusion
4 Query-guided Regression with Context Policy for Phrase Grounding
  4.1 Introduction
  4.2 QRC Network
    4.2.1 Framework
    4.2.2 Proposal Generation Network (PGN)
    4.2.3 Query-guided Regression Network (QRN)
    4.2.4 Context Policy Network (CPN)
    4.2.5 Training and Inference
  4.3 Experiment
    4.3.1 Datasets
    4.3.2 Experiment Setup
    4.3.3 Performance on Flickr30K Entities
    4.3.4 Performance on Referit Game
    4.3.5 Qualitative Results
  4.4 Conclusion
5 Exploring Phrase Interrelationships and Context for Phrase Grounding
  5.1 Introduction
  5.2 Our network
    5.2.1 Framework Overview
    5.2.2 Proposal Indexing Network (PIN)
    5.2.3 Inter-phrase Regression Network (IRN)
    5.2.4 Proposal Ranking Network (PRN)
    5.2.5 Training and Inference
  5.3 Experiments and Results
    5.3.1 Datasets
    5.3.2 Experimental Setup
    5.3.3 Results on Flickr30k Entities
    5.3.4 Results on ReferIt Game
  5.4 Grounding visualization on ReferIt dataset
    5.4.1 Qualitative Results
  5.5 Conclusion
6 Modular Phrase Attention Network for bottom-up Phrase Grounding
  6.1 Introduction
  6.2 Our network
    6.2.1 Framework Overview
    6.2.2 Visual model
    6.2.3 Language Model
    6.2.4 Multimodal Subspace
    6.2.5 Training and Inference
  6.3 Experiments and Results
    6.3.1 Datasets
    6.3.2 Experimental Setup
    6.3.3 Results on Flickr30k Entities
    6.3.4 Results on ReferIt Game
    6.3.5 Qualitative Results
  6.4 Conclusion
7 Knowledge Transfer for Weakly-supervised Phrase Grounding
  7.1 Introduction
  7.2 Weakly Supervised Grounding
    7.2.1 Framework
    7.2.2 Data-driven Knowledge Transfer
    7.2.3 Appearance-based Knowledge Transfer
    7.2.4 Language Consistency
    7.2.5 Training and Inference
  7.3 Experiments and Results
    7.3.1 Datasets
    7.3.2 Experimental Setup
    7.3.3 Results on Flickr30k Entities
    7.3.4 Results on ReferIt Game
    7.3.5 Qualitative Results
  7.4 Conclusion
8 Conclusions and Future Work
  8.0.1 Conclusion
  8.0.2 Future Work
Reference List

List of Tables

3.1 Comparison of MED performance (AP metric) on the NIST MEDTEST 2014 dataset using Video-level Models (VM), Segment-based Models (SM) and Late Fusion (VM+SM).
3.2 Comparison of Average Precision (AP) of the ranked segments in test videos for Video-level Models (VM) and Segment-based Models (SM) for various thresholds.
3.3 Retrieval performance (AP metric) using a single query tag on the NIST MEDTEST 2014 dataset using our "continuous word space" (CWS), our "concept space" (CoS) extension of [15], and "dictionary space" (DiS) [87] embedding approaches.
3.4 Retrieval performance for multiple query tags on the MEDTEST 2014 dataset using our CWS, our CoS [15], and DiS [87] embedding approaches.
4.1 Different models' performance on Flickr30K Entities. Our framework is evaluated by combining with various proposal generation systems.
4.2 Comparison of different proposal generation systems on Flickr30k Entities.
4.3 QRC Net's performance on Flickr30K Entities for different weights of L_reg.
4.4 QRC Net's performance on Flickr30K Entities for different dimensions m of v_i^q.
4.5 QRC Net's performance on Flickr30K Entities for different reward values of CPN.
4.6 Phrase grounding performance for phrase types defined in Flickr30K Entities. Accuracy is in percentage.
4.7 Different models' performance on the Referit Game dataset.
4.8 QRC Net's performance on Referit Game for different weights of L_reg.
4.9 QRC Net's performance on Referit Game for different dimensions of v_i^q.
5.1 Relative performance of our approach on the Flickr30k dataset. Different combinations of our modules are evaluated.
5.2 Performance evaluation of efficiency of PIN on Flickr30k Entities.
5.3 Relative performance of our approach on the Referit Game dataset.
5.4 PIRC Net's performance on Flickr30K Entities for different values of K for indexing.
5.5 IRN's performance for various values of N_k.
5.6 PIRC Net's performance in proposal indexing with and without various components of IRN.
5.7 PIRC Net's performance on ReferIt Game for different values of K.
5.8 Performance evaluation of efficiency of the Phrase Indexing Network (PIN) on Referit Game.
6.1 Relative performance of our approach on the Flickr30k dataset. Different combinations of our modules are evaluated.
6.2 Relative performance of our approach on the Referit Game dataset.
7.1 Relative performance of our approach for Weakly-supervised Grounding on the Flickr30k dataset.
7.2 Relative performance of our approach for Weakly-supervised Grounding on the Referit Game dataset.

List of Figures

1.1 Some applications of Semantic-based Visual Information Retrieval (SBVIR).
1.2 An overview of the general approach.
1.3 Illustration of the intuition behind our models. Top-left: Labels of latent models for retrieval of the event 'Beekeeping'. Top-right: Continuous word space representation of videos for reducing the semantic gap between queries and concepts. Bottom-left: Advantages of region proposal generation and regression over object proposals. Bottom-right: Benefits of employing visual context for relative queries.
3.1 Exemplars from the event "Marriage Proposal" in the TRECVID MED Dataset [56] showcasing the variance in backgrounds, actions and their order of occurrence in unconstrained videos.
3.2 Outline of the training, iterative positive mining and testing approaches.
3.3 Video and Query Embedding framework.
3.4 Captions/labels generated by segment-based models for the events Bike Trick, Dog Show, Marriage Proposal, Rock Climbing, Winning a race without a vehicle and Beekeeping (from top to bottom). The first ten of the twenty positive test videos of the event are chosen and the middle frame of the segment is chosen for illustration. It can be seen that the captions are relevant to the segments.
3.5 Visualization of the frequency of tags generated for events. (Tags are generated using labels of Segment-based Models.)
3.6 Top 5 video screenshots for single query tags in row 1) "forest", 2) "bicycle", 3) "climber", 4) "vehicle". Ranks decrease from left to right.
4.1 QRC Net first regresses each proposal based on the query's semantics and visual features, and then utilizes context information as rewards to refine grounding results.
4.2 Query-guided Regression network with Context policy (QRC Net) consists of a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). PGN generates proposals and extracts their CNN features via a RoI pooling layer [64]. QRN encodes the input query's semantics with an LSTM [25] model and regresses proposals conditioned on the query. CPN samples the top-ranked proposals and assigns rewards considering whether they are foreground (FG), background (BG) or context. These rewards are back-propagated as policy gradients to guide QRC Net to select more discriminative proposals.
4.3 Some phrase grounding results on Flickr30K Entities [60] (first two rows) and Referit Game [30] (third row). We visualize the ground truth bounding box, selected proposal box and regressed bounding box in blue, green and red respectively. When the query is not clear without further context information, QRC Net may ground wrong objects (e.g., image in row three, column four).
5.1 Overview of the PIRC Net framework.
5.2 Architecture of the PIRC Net network showing its three constituent modules: Proposal Indexing Network (PIN), Inter-phrase Regression Network (IRN) and Proposal Ranking Network (PRN). PIN generates and classifies proposals into coarse phrase categories and indexes them into candidate proposals per query phrase. IRN augments the proposals by estimating the locations of neighboring phrases through relation-guided regression. PRN ranks the augmented proposals and assigns scores considering their location with respect to neighbors, contrast with respect to other candidates, and semantic context.
5.3 Grounding performance per phrase category for different methods on Flickr30K Entities.
5.4 Qualitative results on the test sets of Flickr30K Entities (top row) and ReferIt Game (middle row). The last row shows failure cases from both datasets. Green: Correct Prediction, Red: Wrong Prediction, Blue: Ground truth.
5.5 Visualization of some of the queries and corresponding bounding boxes chosen by PIRC Net for the Flickr30K dataset. "Blue" is the ground truth bounding box, "Green" is a positive result and "Red" is a negative result.
5.6 Visualization of some of the queries and corresponding bounding boxes chosen by PIRC Net for the ReferIt dataset. "Blue" is the ground truth bounding box, "Green" is a positive result and "Red" is an incorrect result.
6.1 Overview of the MPA Net framework.
6.2 Architecture of the MPA network showing its constituent modules. First, the region proposals generated by the RPN are classified as relevant or non-relevant to the head noun of the query phrase. Next, a discriminative embedding is generated for the query phrase using a weighted average of word attention. Finally, the correlation between each relevant proposal and the embedding is computed to rank the proposals.
6.3 Qualitative results on the test sets for the MPA Net architecture. The intensity of the color indicates the attention weight for the word. Green: Correct Prediction, Red: Wrong Prediction, Blue: Ground truth.
7.1 Comparison of supervised vs. weakly-supervised grounding training flow.
7.2 An overview of the key ideas of the WPIN framework.
7.3 An overview of the WPIN Net architecture.
7.4 Qualitative results on the test sets of Flickr30K Entities and ReferIt Game. Green: Correct Prediction, Red: Wrong Prediction, Blue: Ground truth.

Abstract

Semantic-based Visual Information Retrieval (SBVIR) is an important problem that involves research in areas such as Computer Vision, Machine Learning, Natural Language Processing and other related areas of computer science. SBVIR involves the retrieval of desired visual entities (images/videos) from a large database based on an input query by analyzing their semantics. An input query can be a fixed or an open-ended noun phrase, and the database consists of visual entities that are either videos or images. SBVIR is a challenging problem because of the inter-class and intra-class variance in the appearance of objects, the difficulties in recognizing a large variety of objects and capturing the complex relations among them, and the wide variance among open-ended queries. This thesis concentrates on solutions offered for two sub-problems in SBVIR: Multimedia Event Retrieval and Recounting for complex videos, and Phrase Grounding in images.

The first part of the thesis focuses on the Event Retrieval and Recounting problem in complex videos. Given a fixed input query, Multimedia Event Retrieval involves ranking the videos in a database based on their correlation with the query.
Multimedia Event Recounting further provides positive evidence for the highly ranked videos, i.e., the segments that contain the event described in the query. A framework is presented in the first part of the thesis for joint Multimedia Event Retrieval and Recounting. The framework describes a novel method for retrieval and recounting using multiple latent models for each event trained using positive segments. Next, zero-shot training methods are proposed to rank the videos when no exemplar videos are available during training. For this setting, skip-gram vectors are used to project the videos into the query space using intermediate concepts.

The second part of the thesis concentrates on Phrase Grounding in images. Given an open-ended input query phrase and an input image, phrase grounding methods ground (localize) the query phrase in the input image using a bounding box. Three methods are described for Phrase Grounding, each addressing shortcomings of the earlier methods. Earlier methods formulate Phrase Grounding as a ranking problem, ranking the candidate bounding boxes generated for an input image based on their correlation to the query phrase in a multimodal subspace. The first method, QRC Net, describes a framework to improve the accuracy of the generated candidate bounding boxes and then improve their localization through regression based on the query phrase. The second method, PIRC Net, describes a framework to coarsely categorize the bounding boxes into phrase categories to improve generalization. Further, the method describes a framework to incorporate inter-phrase relationships and visual context by accounting for multiple query phrases for an input image and multiple objects in an image, respectively. The final method, MPA Net, describes a framework for bottom-up grounding by improving the semantic interpretation of the language model. To achieve this, the framework uses per-word attention to concentrate on the discriminative words of the query phrase during proposal generation and ranking. Finally, a weakly-supervised training framework is described that uses knowledge transfer from object detection systems to tackle the lack of ground truth annotations for query phrases during training. This thesis presents results for the above-mentioned methods on publicly available image and video datasets to demonstrate their effectiveness.

Chapter 1
Introduction

With the advent of multimedia applications that simplify the capture and sharing of visual information, the volume of multimedia data has increased manifold, and there is a greater need to organize the data for efficient retrieval. While captions and tags are employed to annotate online content, a large portion of it contains noisy captions or incomplete tags. Hence, a more reliable method to encode and retrieve the content in multimedia data is required. Semantic-based visual information retrieval (SBVIR) involves the retrieval of desired visual entities (images/videos) from a large database by analyzing their semantics. SBVIR is an alternative to concept-based visual information indexing, where metadata such as keywords, tags and captions are employed to generate concepts for retrieval. SBVIR is advantageous because it relies on intrinsic semantics rather than noisy and incomplete metadata for retrieval. However, it is also more challenging due to the difficulties in recognizing a large variety of objects, capturing the complex relationships among the objects, and handling the open-ended vocabulary of the queries.

SBVIR finds applications in many diverse fields such as visual search, visual dialog, and scene interaction in areas such as robotics and autonomous driving (Fig. 1.1).

Figure 1.1: Some applications of Semantic-based Visual Information Retrieval (SBVIR)

Depending on the application, queries for SBVIR can be from a fixed set or open-ended. For fixed-set queries, training is typically done in a fully-supervised setting using exemplars for each query. For open-ended queries, training is done in a zero-shot setting or by generalization from limited training data. For zero-shot training, knowledge is generally provided in the form of a text description or concepts related to the query to compensate for the lack of training data. Alternatively, one can leverage linguistic knowledge to generalize from limited training data to accommodate open-ended queries.

1.1 Problem Statement

SBVIR databases can be divided into two types depending on the nature of the visual entities. Spatial retrieval is limited to instances obtained by query localization of rigid entities; this is typically seen in static images. Temporal retrieval spans multiple frames, accommodating the temporal interactions of semi-rigid entities; this is typically seen in videos. In this thesis, we focus on several sub-topics in SBVIR: Exemplar-based Multimedia Event Retrieval and Recounting and Zero-shot Multimedia Event Retrieval in videos; Phrase Grounding and Weakly-supervised Phrase Grounding in images. A brief overview of each of the topics is provided below:

Multimedia Event Retrieval and Recounting (temporal): Retrieving predefined event categories and recounting their informative segments from videos in the wild.

Zero-shot Event Retrieval (temporal): Retrieving ad hoc event categories given a detailed description of the event from videos in the wild.

Phrase Grounding (spatial): Localization and retrieval of image regions described by an ad hoc natural language query from an input image.

Weakly-supervised Phrase Grounding (spatial): Localization and retrieval of image regions described by an ad hoc natural language query, trained in a weakly-supervised setting with no ground truth annotations.

Event retrieval in unconstrained videos involves understanding the complex interactions of humans/agents with an "in the wild" environment. An "Event" itself consists of multiple actions that occur in non-deterministic order in various environments. For example, an event like "Marriage Proposal" can occur either indoors or outdoors and may or may not contain different actions like "Getting on a knee", "Hugging", "Wearing a ring", etc. To encapsulate the wide variety of possible events, a natural language phrase is used to describe the query event. In the supervised scenario, a few video exemplars are provided for the query event to model its structure for retrieval. Recounting involves providing the intervals and possible names of the actions in the video that have led to the final decision for the event. In the zero-shot scenario, the video exemplars are replaced by a small text description of the event. This problem is closer to real-world applications, where it is difficult for the user to provide exemplars for each query. Zero-shot retrieval requires learning a semantic space to understand the related action/object categories for each event and building a representation for that event by mining visual exemplars either from the web or from an annotated dataset.
Hence, it is important to generalize the visual and semantic models to tackle the wide variety of possible queries.

Phrase localization in images involves identifying the visual region in the image referred to by a natural language phrase. Unlike localization and detection from a predefined set of nouns/verbs, phrase localization is more amenable to real-world scenarios where the query vocabulary is open-ended. Such a system requires generalization from the few training examples to learn the visual representation and image-text subspace for a wide variety of input queries. For example, for an input query phrase like "a man in red sweater next to pillar", the localization system must not only understand the visual representation of nouns like "man", "sweater" and "pillar" but also understand the nature of the relations "in" and "next to" and their corresponding visual definition. A robust phrase grounding framework must work for multiple possible forms of a given query phrase, work for complex query phrases, and generalize from the limited visual categories provided during training.

1.2 Challenges

Identifying the salient regions in a large subspace is critical and challenging, as is separating the positive regions from background regions for a given query. Lastly, generalizing the appearance models, i.e., learning the visual appearance from similar entities seen in training, is challenging and requires linguistic knowledge. A detailed description of the challenges is provided below:

Large search space: The referred entity can appear at multiple scales and varied locations in the image/video. Thus the search space is too large for brute-force, e.g. sliding-window, search. An efficient indexing strategy, such as positive proposal generation, is needed.

Proposal ranking: For a given query, there is a high number of unrelated background proposals for each positive proposal. Efficient modeling of the multimodal subspace is needed to identify the relevant proposals.

Generalized appearance models: While only a limited number of object categories appear during training, many novel object categories can appear at test time. Thus, appearance models that use semantic relevance to the training categories to generalize are needed.

1.3 Existing Frameworks

Traditional vision models have used template matching, dictionary-based models, etc., for vision-language tasks. Deep learning models have proven successful in identifying labels of visual data in tasks such as classification and detection. Various architectures such as AlexNet, VGG and ResNet have been proposed for robust visual representation, improving the accuracy of these core tasks. However, these models cannot be directly applied to understanding the higher-level semantics of the visual data. Various approaches have been proposed to align the visual and language modalities, such as common attribute/concept feature subspaces, latent alignment using RNNs, two-stream networks, and Canonical Correlation Analysis (CCA). A general overview of these approaches is presented in Figure 1.2. To generate the candidate instances, these approaches reduce the search space by generating object proposals for images and action proposals/action concept scores for videos.

Figure 1.2: An overview of the general approach (a natural language query and a visual information database are mapped to query and visual representations, candidate instances are generated, and similarity matching against knowledge representations/exemplars yields the retrieved visual instances).
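As a rough illustration of the generic pipeline in Figure 1.2, the following sketch ranks candidate instances by their similarity to a query in a shared representation space. The encoder outputs are hypothetical placeholders for any of the text and visual representations discussed here (word embeddings, LSTM encodings, concept scores, CNN features); this is not the implementation of any specific system from the literature.

```python
# Minimal sketch of the generic retrieval pipeline of Figure 1.2.
# `query_vec` and `candidate_vecs` are assumed to come from hypothetical
# text/visual encoders; only the similarity-matching step is shown.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve(query_vec: np.ndarray, candidate_vecs: np.ndarray, top_k: int = 5):
    """Rank candidate instances (region proposals or video segments) by similarity
    to the query representation and return the indices of the top_k matches."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    order = np.argsort(scores)[::-1]
    return order[:top_k].tolist()
```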
For textual representation, various methodologies such as LSTM feature encodings, word embeddings and intermediate concept features have been employed. Their drawbacks for images and videos are as follows. For images, object proposals are not generic enough to represent the complex scenes in the images, and little work has been done to leverage the context provided by other visual entities in the image while retrieving the query. For videos, a predefined concept space creates a semantic gap for queries that do not relate to the concepts, and, unlike images, the actions in videos are difficult to generalize due to their high variance.

This thesis describes methods that attempt to address the above problems. For videos, latent action models are described that learn concepts from the exemplars in a data-driven fashion. Further, methods are described to reduce the semantic gap in the concept space by projecting both the videos and the queries into a continuous subspace for alignment. For images, methods are described that learn region proposals through a combination of classification and regression: a higher-level category label is learned for each query phrase, and offset parameters for the proposals are then produced based on the query. Further, methods are described that employ the multiple cues provided by the surrounding visual entities and by multiple queries on the same image, using this context as additional information for localization. Lastly, a method is described that learns the discriminative information in the query phrase and then uses it for bottom-up phrase grounding.

The theme of this thesis is to leverage the robust visual representation provided by deep networks to retrieve visual entities using natural language phrases as queries. A brief overview of the intuition behind some of the contributions is shown in Figure 1.3. Different architectures are described to model the semantics: latent segment models and video vector embeddings for video retrieval, and latent attention models and two-stream embedding models for phrase grounding and retrieval in images.

1.4 Contributions overview

Figure 1.3: Illustration of the intuition behind our models. Top-left: Labels of latent models for retrieval of the event 'Beekeeping'. Top-right: Continuous word space representation of videos for reducing the semantic gap between queries and concepts. Bottom-left: Advantages of region proposal generation and regression over object proposals. Bottom-right: Benefits of employing visual context for relative queries.

A brief overview of the thesis contributions is presented in the following paragraphs.

1.4.1 Segment-based models for multimedia event retrieval

Event retrieval in untrimmed videos is challenging due to large intra-class variances in structure, low-quality videos and the possible presence of long segments not directly related to the event. To overcome these challenges, we formulate Segment-based models that latently identify the informative segments of untrimmed videos. Segment-based models have the following salient properties: (1) the ability to predict the semantics of video segments using knowledge transfer from an external corpus; (2) efficient suppression of background segments by training and testing using ensemble models built from positive foreground segments. On the challenging TRECVID 2014 dataset, this approach shows improvements in both retrieval and recounting tasks.

1.4.2 Video Embeddings for semantic-based video retrieval

Zero-shot retrieval extends event retrieval to open-ended queries. In this setting, the exemplars provided for each query are replaced by event description text, which is a more practical formulation. In this work, we leverage concept detectors and reduce the semantic gap between the query and the concepts using continuous word space embeddings. This not only allows fast query-video similarity computation with implicit query expansion, but also leads to a compact video representation, which allows the implementation of a real-time retrieval system that can fit several thousand videos in a few hundred megabytes of memory.

1.4.3 Better proposals and Query-based regression for phrase grounding

Phrase Grounding aims at localizing the visual entities described by a natural language phrase in an image. State-of-the-art approaches use independent object proposals to reduce the search space of possible localizations. Although the object proposals perform reasonably well, their performance is limited and bounds the achievable grounding accuracy. We propose a novel Query-guided Regression network with Context policy (QRC Net) which jointly learns a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). PGN increases the quality of proposals by adapting to the grounding task; QRN eliminates the upper bound imposed by proposal generators by regressing each proposal based on the input query phrase; and CPN localizes multiple phrases together to resolve conflicting predictions. We evaluate this approach on two popular grounding datasets: Flickr30K Entities and Referit Game. Experiments show QRC Net provides a significant improvement in accuracy, with 14.25% and 17.14% increases over the state-of-the-art, respectively. (A schematic sketch of the query-guided regression idea follows.)
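The following is a minimal, schematic sketch of query-guided proposal scoring and regression in the spirit of QRN, not the actual QRC Net implementation: the layer sizes, feature dimensions and the simple concatenation-based fusion are illustrative assumptions.

```python
# Schematic of query-guided proposal scoring and regression (QRN-style).
# Dimensions and the fusion scheme are illustrative assumptions, not the
# configuration used in the thesis.
import torch
import torch.nn as nn

class QueryGuidedRegression(nn.Module):
    def __init__(self, vocab_size: int, word_dim: int = 300,
                 query_dim: int = 512, visual_dim: int = 4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, query_dim, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(query_dim + visual_dim, 512), nn.ReLU())
        self.score_head = nn.Linear(512, 1)  # relevance of each proposal to the query
        self.reg_head = nn.Linear(512, 4)    # (dx, dy, dw, dh) offsets per proposal

    def forward(self, query_tokens, proposal_feats):
        # query_tokens: (B, T) word indices; proposal_feats: (B, N, visual_dim)
        _, (h, _) = self.lstm(self.embed(query_tokens))
        q = h[-1].unsqueeze(1).expand(-1, proposal_feats.size(1), -1)
        fused = self.fuse(torch.cat([q, proposal_feats], dim=-1))
        # Scores rank the proposals; offsets shift each proposal toward the queried region.
        return self.score_head(fused).squeeze(-1), self.reg_head(fused)
```

At inference, the highest-scoring proposal is shifted by its predicted offsets to produce the final grounding box.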
1.4.4 Phrase interrelationships and context for phrase grounding

Existing methods for phrase grounding do not fully leverage the visual and textual context provided by the multiple phrases present in an image, or the relationships among those phrases. We propose a network that explores Phrase Interrelationships and Context for Grounding (PIRC Net) that contains three modules: a Proposal Indexing Network (PIN), an extension of PGN that indexes region proposals relevant to the query phrase category; an Inter-phrase Regression Network (IRN) that augments the region proposals by regressing from the locations of neighboring phrases in the sentence; and a Proposal Ranking Network (PRN) that learns an objective for ranking the region proposals while incorporating the context from neighboring phrases. This network addresses some of the limitations of QRC Net and improves the state-of-the-art further by 8% and 15% on the Flickr30K and ReferIt datasets, respectively.

1.4.5 Modular phrase attention Network for phrase grounding

Unlike the above two approaches, we describe a framework for bottom-up grounding that improves the semantic interpretation of the language model. To achieve this, the framework uses per-word attention to concentrate on the discriminative words of the query phrase during proposal generation and proposal ranking. To generate candidate proposals, the head noun of the query phrase is identified and related proposals are chosen. In the next stage, the most likely proposal for the query phrase is chosen by attending to the most discriminative words in the query phrase. This network improves the state-of-the-art on both the Flickr30K and ReferIt datasets using a simpler architecture.

1.4.6 Knowledge transfer for weakly-supervised phrase grounding

While the earlier works address supervised phrase grounding, this setting is not scalable to huge volumes of data because obtaining the ground-truth locations for image-phrase query pairs is expensive and non-trivial. In the absence of ground-truth spatial locations of the phrases during training (weakly-supervised training), the WPIN Network employs knowledge transfer mechanisms from object detection systems to train the network. We demonstrate the effectiveness of the WPIN Network on the Flickr30K Entities and ReferIt Game datasets, for which we achieve improvements over state-of-the-art approaches.

1.5 Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 contains a brief overview of the literature. Chapter 3 presents our framework for event retrieval and recounting in videos using segment-based models, together with its extension to the zero-shot setting. Chapter 4 describes the network architecture of our phrase grounding system with natural language queries. In Chapter 5, we describe the formulation that incorporates relationships and context to improve the phrase grounding system. In Chapter 6, we describe the framework that uses attention on query phrases to improve the interpretability and performance of the phrase grounding system. In Chapter 7, we describe a network architecture that uses knowledge transfer from object detection systems to train a weakly-supervised phrase grounding system. Finally, in Chapter 8, we conclude and discuss possible future directions of the research.

Chapter 2
Related Work

2.1 Event Retrieval in Videos

Multimedia Event Retrieval. State-of-the-art approaches for multimedia event retrieval encode low-level temporal and spatial features using feature aggregation and train classification models for each event. For example, [74] and [55] achieved good performance by training discriminative classification models over low-level features. Different modalities of features such as SIFT [48] and Dense Trajectories [83] were employed. Feature encodings such as Fisher Vectors [57] and VLAD [3] were used for aggregation. Discriminative classifiers such as linear SVMs are trained on top of these features to model an event. Different types of pooling [41] and post-processing [40] approaches have been suggested for further improvements in performance and for identification of positive segments. While they achieve reasonable performance, these models do not have any semantic interpretability and generalize poorly to higher-level semantic tasks.

Mid-level to high-level representations have been employed for semantic interpretation of the classification results. A classifier is trained for each semantic concept on low-level features and the confidence scores are joined in a vector representation.
The concepts used involved low-level actions for building an action bank [76] and broad object categories for an object bank [28]. Early efforts in this direction [45] [73] employed these features for event retrieval. Post-processing was done to identify positive segments in the retrieved videos. The scores of the relevant concepts for each event were used to identify positive evidence in the multimedia event recounting task [45]. Along with individual concept scores, concept co-occurrence [81] has also been exploited for more robust identification of positive evidence. These approaches suffer from at least one of the following problems: (1) acquiring appropriate concepts beforehand to model all the events from a large concept space is highly challenging; (2) individual concept responses are themselves highly noisy and need not perform efficiently across datasets; (3) recounting and classification are treated as separate problems, while a joint formulation is a better option given the inter-dependency of the tasks. Recently, [75] proposed a joint framework for detection and recounting in which the segment locations are treated as latent variables. One of its disadvantages is that this formulation is difficult to model with a small number of video exemplars, which is typical for event retrieval.

Since each event can have large variance in its appearance, it is common to represent an event using an ensemble of models. In the object and video domains, [54] [80] [94] attempt to represent each class using multiple discriminative models. [94] models the intra-class variations in an object class through a set of representative models. [54] represents videos using an augmented temporal composition of motion features and learns models for different temporal scales. The temporal structure of human activities is learned using motion-based discriminative classifiers, and matching is done using the correlation between the motion templates. [80] utilizes latent models to infer the temporal composition of a video. To capture semantics, temporal segments associated with the same latent state are assigned to the same concepts, and the concepts themselves are unlabeled.

Zero-shot Event Retrieval. For supervised event retrieval, each event is provided with exemplars to train a representative model for it. However, this is not a practical scenario for a real-world system given the wide variety of possible events. An alternative is to use external data, such as publicly available datasets or weakly supervised web corpora such as captioned images and YouTube videos, to mine either relevant concepts or exemplars for event retrieval. However, such data is expected to be noisy and likely of low quality, which raises additional challenges. Apart from using fully annotated concepts to develop a concept bank, [20] uses filters based on visual features to eliminate videos with noisy descriptions and builds a robust predictor from the filtered videos. [88] uses various captioned image and video datasets to map data into a textual space and uses it efficiently to build concepts that can be used for semantic understanding of videos. [77] uses a common representation space for images and videos to obtain web images similar to video frames and trains detectors using both the web and video images.

2.2 Phrase Grounding in Images

Phrase grounding methodologies. Phrase grounding requires learning the correlation between the visual and language modalities. Early phrase grounding works [36] [31] parsed knowledge from the sentence using dependency trees with a limited vocabulary. Karpathy et al. [1] replace the dependency tree with a bidirectional RNN and align sentence fragments to image regions in a common embedding space. Hu et al. [27] propose SCRC, which ranks proposals using knowledge transfer from image captioning. Rohrbach et al. [66] employ attention to rank proposals in a latent subspace. Chen et al. [4] extend this approach to account for regression based on query semantics. Alternatively, Plummer et al. [60] suggest using Canonical Correlation Analysis (CCA) [21] and Wang et al. [39] suggest Deep CCA to learn similarity between the visual and language modalities. Wang et al. [84] employ structured matching and boost performance using partial matching of phrase pairs. Plummer et al. [59] further augment the CCA model to take advantage of extensive linguistic cues in the phrases.

Proposal generation and spatial regression in images. Proposal generation systems are widely used in object detection and phrase grounding tasks. Two popular methods, Selective Search [82] and EdgeBoxes [95], employ efficient low-level features to produce proposals for possible object locations. Based on such proposals, spatial regression has been successfully applied in object detection. Fast R-CNN [16] first employs a regression network to regress proposals generated by Selective Search [82]. Building on this, Ren et al. [64] incorporate the proposal generation system by introducing a Region Proposal Network (RPN), which improves both accuracy and speed in object detection. Redmon et al. [63] apply regression at the grid level and use non-maximal suppression to improve detection speed. Liu et al. [46] integrate proposal generation into a single network and use outputs discretized over different ratios and scales of feature maps to further increase performance.

Reinforcement learning for image-language systems. Reinforcement learning was first introduced to deep neural networks in Deep Q-learning (DQN) [52], which teaches an agent to play ATARI games. Lillicrap et al. [43] modify DQN by introducing deep deterministic policy gradients, which enables the reinforcement learning framework to be optimized in a continuous space. Recently, Yu et al. [93] adopt a reinforcer to guide a speaker-listener network to sample more discriminative expressions in referring tasks. Liang et al. [42] introduce reinforcement learning to traverse a directed semantic action graph to learn visual relationships and attributes of objects in images. Inspired by these successful applications of reinforcement learning, this thesis proposes a CPN network that assigns rewards as policy gradients to leverage context information during the training stage for phrase grounding.

Visual and semantic context for phrase grounding. Context provides broader information that can be leveraged to resolve semantic ambiguities and rank proposals. Hu et al. [27] used global image context and a bounding box encoding as context to augment visual features. Yu et al. [91] further encode size information and jointly predict all query regions to boost performance in the referring task. Plummer et al. [59] jointly optimize neighboring phrases by encoding their relations for better grounding performance. Chen et al. [5] [4] employ semantic context to jointly ground multiple phrases and filter conflicting predictions among neighboring phrases. [5] [4] [59] use semantic integrity among neighboring phrases to re-rank the proposals.
Weakly-supervised phrase grounding. While supervised phrase grounding achieves significantly higher performance and is easy to train, it is expensive to obtain large amounts of annotated data for the ground truth locations of query phrases. As an alternative setting, weakly-supervised phrase grounding systems are trained using only input image and query phrase pairs, without explicit information about the ground truth bounding box for the query phrase. Rohrbach et al. [66] employ a phrase reconstruction loss to train these systems: using the predicted score for each bounding box, this loss computes the similarity between the phrase reconstructed from the captioning of the bounding boxes and the query phrase. Recently, Chen et al. [7] incorporate an additional visual reconstruction loss along with the phrase reconstruction loss to attest the uniqueness of the bounding boxes.

Knowledge transfer for weakly-supervised learning. Knowledge transfer involves solving a target task by learning from a different but related source task. Li [13] employs knowledge transfer algorithms to predict novel object classes. Rohrbach et al. [68] use linguistic knowledge bases to automatically transfer information from source to target classes. Deselaers et al. [9] and Rochan [65] employ knowledge transfer in weakly-supervised object detection and localization, respectively. [65] learns representations by transferring appearances from familiar classes to unknown objects using skip-gram vectors [51] as a knowledge base. In this thesis, we propose to use knowledge transfer from the object detection task to index region proposals for weakly-supervised phrase grounding.

Chapter 3
Semantic models for Event Detection and Recounting in videos

3.1 Introduction

User-generated videos have been growing at a rapid rate. These videos typically do not come with extensive annotations and metadata; even category-level labels may be missing or noisy. For efficient retrieval and indexing of such videos, it would be useful to have automated methods that not only classify a video into one of the known categories but also identify key segments and provide semantic labels for them, to enable rapid perusal and other analyses. Given an input video, our framework provides a user-defined event label (detection) and positive evidence for it, with locations and labels (recounting).

The tasks of detection and recounting are challenging due to large intra-class variances in structure and imaging conditions, and the possible presence of long segments not directly related to the event. As shown in Figure 3.1, a video with the caption "Marriage Proposal" can contain various backgrounds such as a "restaurant", a "basketball court" and "outdoors". However, effectively identifying instances such as "Getting down on one knee", "Proposal speech" and "Wearing a ring" can help in identifying the event despite the variations.

Popular approaches to model events can be divided into holistic and part-based. Holistic approaches (e.g., [74] [55]) model an event using distributions of low-level features from various modalities, such as appearance, scene, text and motion, of its constituent videos. It is common to encode features using Fisher Vectors [57], which are aggregated using different types of pooling [41] such as max and average pooling. While these approaches achieve reasonable performance for detection, they do not identify positive segments or provide the semantic interpretation of the results needed for tasks like recounting. Also, they work well for videos that are trimmed, i.e., where almost the entire video corresponds to a single event category. Other methods [28] [45] have tried to use semantic features by computing concept scores using a dictionary of object [28] and/or action detectors [76] applied to each frame and aggregating the scores. While these methods can provide some semantic interpretation of the video by emphasizing the high-scoring concepts [45] [73], their utility for localization and recounting is still limited. There are also difficulties such as the concept dictionaries not being well matched to the concepts in the video, and concept detectors not performing uniformly across datasets [32], which may be considered problems of "transfer learning".

Figure 3.1: Exemplars from the event "Marriage Proposal" in the TRECVID MED Dataset [56] showcasing the variance in backgrounds, actions and their order of occurrence in unconstrained videos.

Part-based approaches use video segments instead of entire videos for event models. For example, [54] represents an event using a set of models from various temporal scales for human activity classification. The temporal structure of classifiers is embedded as a template, and the models are learnt by iteratively mining for positive segments of motion features. While this approach captures the intra-class variance, it cannot be applied to web videos, which lack the temporal structure of human activity. [80] splits a video into fixed-length temporal segments and employs a variable-duration HMM to model the state variations in the segments. Latent models are used to infer the temporal composition of a video. This method performs well on action datasets but is unlikely to handle the variations in web videos due to its rigid temporal constraints. [75] proposed a joint framework for detection and recounting where the positive segment locations are treated as latent variables. Their method uses a global video model and part-segment models based on a concept dictionary in conjunction to optimize for event classification and recounting. While this approach gives a significant improvement over methods using only semantic features, it is still limited by the concept dictionary.

In our approach for MED, we employ video segments instead of entire videos for training event models. While it is impractical to provide large numbers of positive video exemplars to model each event, each video exemplar provides tens to thousands of video segments whose positive instances can be utilized to model an event. If the positive segments are identified and clustered, they can be used to discard the significant number of "outlier" or "non-informative" segments found in unconstrained videos and to structurally highlight the semantically meaningful parts of the video. If the positive segments are labeled, they can also be used for tasks such as recounting of the videos. We train ensembles of models to depict the sub-categories of an event using the positive segments from video exemplars. We allow for data sharing while training our models, enabling them to use segments not just from the training data but also from background segments, which helps to overcome the limited data that is common in long-tail distributions. We use knowledge transfer from detectors of an external concept dictionary only for initialization, and the concepts (sub-categories) of an event are trained by mining groups of positive segments from the exemplar videos themselves in a weakly supervised fashion.
Unlike [54], which uses augmented initial models from various scales, this form of initialization gives the models more semantic interpretability and a higher incidence on positive segments. We also do not attempt to assign a label to each segment of the video or to model the temporal composition of the constituent events, unlike [54] [80], which makes the approach more amenable to unstructured videos.

For the zero-shot event retrieval setting, where no visual exemplars are provided for modeling the event, we embed the knowledge transferred from the detectors of an external concept dictionary into a continuous word space. The appearance of the concepts is generalized based on the semantic similarity among the associated concepts, and the video with the highest affinity to the concept tags of an event is ranked higher.

Our overall framework for MED is represented in Figure 3.2. Given a set of exemplar videos for each event, we first divide each video into fixed-length, non-overlapping segments and use the responses of concept detectors to sample possible positive segments. We use the sampled segments as initial seeds and use iterative positive segment mining to group similar segments. From the resulting groups of segments, we train an SVM ("candidate" segment-based model) for each group. The contribution of each of the "candidate" segment-based models towards the event is evaluated in the next step using a greedy strategy. The top contributing "candidate" segment-based models ("representative" segment-based models) are chosen to represent the event. For testing, we score all the segments of a test video using the "representative" segment-based models of an event and aggregate the scores. The final video-level score is obtained by averaging the scores of the segments with the top responses.

The pipeline of our zero-shot retrieval approach is presented in Figure 3.3. In our approach, since query tags map to the video embedding space, query-video similarity can be computed as a dot product in a low-dimensional space (a schematic sketch of this scoring is given at the end of this section). The mapping offers multiple advantages. Embedding videos in a continuous word space:

1. results in implicit query expansion (mapping query tags to multiple, semantically related tags);
2. leads to a compact video representation due to the low dimensionality of the space, which is a significant gain, as the detector bank may contain a large number of concepts;
3. allows fast scoring for each video.

The following sections contain a detailed description of our methods. We show both qualitative and quantitative results for the classification, recounting and zero-shot tasks on the challenging MEDTEST 2014 [56] dataset provided by NIST.
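To make the embedding-based scoring concrete, here is a minimal sketch of the idea, assuming a video is represented by its concept-detector responses and that each concept name and query tag has a skip-gram word vector. The response-weighted averaging and normalization below are simplifying assumptions; the actual embedding functions are described in Section 3.3.

```python
# Sketch of zero-shot scoring in a continuous word space (cf. Section 3.3).
# concept_vecs / tag_vecs are word vectors (e.g., skip-gram) for concept names
# and query tags; the weighting scheme here is a simplifying assumption.
import numpy as np

def embed_video(concept_scores: np.ndarray, concept_vecs: np.ndarray) -> np.ndarray:
    """concept_scores: (K,) detector responses; concept_vecs: (K, d) word vectors.
    Returns a compact d-dimensional video embedding."""
    w = concept_scores / (concept_scores.sum() + 1e-8)
    v = w @ concept_vecs
    return v / (np.linalg.norm(v) + 1e-8)

def embed_query(tag_vecs: np.ndarray) -> np.ndarray:
    """tag_vecs: (T, d) word vectors of the query tags."""
    q = tag_vecs.mean(axis=0)
    return q / (np.linalg.norm(q) + 1e-8)

def rank_videos(query_vec: np.ndarray, video_vecs: np.ndarray) -> np.ndarray:
    """video_vecs: (N, d). Query-video similarity is a dot product; returns a ranking."""
    return np.argsort(-(video_vecs @ query_vec))
```

Because semantically related concepts lie close together in the word space, scoring in this space yields the implicit query expansion noted in point 1 above.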
For this, we take advantage of the observation that the highest responses of the top contributing concepts of an event are highly relevant to the event [28]. We use a mid-level feature representation to select concepts that are relevant to the event and then choose the segments that have high scores for these concepts. Note that some of the initial seed segments can be noisy; these are pruned in the later stages of training.

Given a set of videos $\mathcal{V} = \{V_1 \ldots V_N\}$ belonging to an event $\mathcal{C}_\epsilon$, let each video $V_i$ contain $F_i$ frames. Given $K$ concept detectors $\{D_1 \ldots D_K\}$, each concept detector is applied to $\mathcal{V}$. For each video segment $V_i(f_i)$, a $K$-dimensional response vector $V_i^s(f_i)$ is obtained. We select the top $L$ relevant concepts per event $\mathcal{C}_\epsilon$ by computing the sum of the $\ell_1$-normalized feature response vectors as follows: $\Phi = \sum_{i,\, f_i \in \mathcal{V}(\epsilon)} V_i^s(f_i)$. The top $L$ concepts are then selected as $\max^{L} \Phi(k),\ k \in \{1, 2, \ldots, K\}$. For each concept $c_k$, we choose positive segments $\{V_{t_1}(f_{t_1}), \ldots, V_{t_P}(f_{t_P})\} \in \mathcal{V}(\epsilon)$ that have the $P$ highest responses for $c_k$ in $\mathcal{V}(\epsilon)$. To obtain genuine maximal responses without redundancy, we apply non-maximal suppression. We choose $L$ and $P$ values to be relatively high, to ensure that the initial seeds are oversampled. This way, the representation of an event can be exhaustive when the models are pruned in the later stages.

From the $L \cdot P$ initial segment seeds chosen, we build candidate segment models $M_i$, $i \in \mathcal{C}_\epsilon$. Each seed is used as a positive example and hard-mined negative segments from the background set are used as negative examples to train exemplar SVMs [50].

3.2.2 Iterative positive segment mining

The candidate models trained in the previous step tend to over-fit to the single exemplar they are trained on. To generalize the models further, we need to retrain the models with positive segments similar to the exemplar. To avoid the problems faced by classical clustering models, we choose to mine the positive segments iteratively in a discriminative space. In each iteration, we group $N_p$ positive examples that show high response to the current model. We then retrain the current model by including the mined positive examples in the exemplar set. This form of mining helps in learning more reliable templates by using the mined samples as a form of "regularization" to prevent overfitting, and it models long-tail distributions naturally [94]. It is also advantageous in discarding outliers, since there is no requirement for a sample to be bound to a cluster. The choice of each group is independent of the other groups' choices, so the groups can be trained in parallel for efficiency. The algorithm alternates over the following two steps until it reaches convergence: (i) each candidate model $M_i$ scores all the positive segments for an event and mines the top-scoring $N_p$ segments; (ii) each candidate model is re-trained with the $N_p$ mined positives added to the existing positive set, to improve the generalization of the candidate model.
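To make the mining loop concrete, a minimal sketch is given below. It is an illustration rather than the exact implementation used in this work: it assumes pre-computed concept-response features stored as NumPy arrays, substitutes scikit-learn's LinearSVC for the exemplar-SVM training of [50], and abbreviates the AP-based convergence test to a fixed iteration cap. All names are illustrative.

import numpy as np
from sklearn.svm import LinearSVC

def mine_candidate_model(seed, pos_pool, negatives, n_p=10, max_iters=20):
    # seed: (1, D) feature of the initial seed segment
    # pos_pool: (P, D) features of the event's remaining positive segments
    # negatives: (B, D) hard-mined background segments
    exemplars = [seed]
    model = None
    for _ in range(max_iters):
        X = np.vstack([np.vstack(exemplars), negatives])
        y = np.hstack([np.ones(sum(e.shape[0] for e in exemplars)),
                       np.zeros(negatives.shape[0])])
        model = LinearSVC(C=1.0, class_weight='balanced').fit(X, y)
        if pos_pool.shape[0] == 0:
            break
        # Step (i): score the positive pool and mine the top-N_p segments
        top = np.argsort(-model.decision_function(pos_pool))[:n_p]
        # Step (ii): add the mined positives to the exemplar set and retrain
        exemplars.append(pos_pool[top])
        pos_pool = np.delete(pos_pool, top, axis=0)
    return model

In the full system the loop is additionally terminated early once the AP of the candidate model on a held-out validation set stops improving, as described next.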
ID | Event Name | VM [28] | ELM [75] | SM | VM+SM
21 | Bike trick | 0.0653 | 0.0912 | 0.0778 | 0.0696
22 | Cleaning an appliance | 0.0856 | 0.0910 | 0.1028 | 0.1272
23 | Dog show | 0.7729 | 0.6853 | 0.7840 | 0.8194
24 | Giving direction | 0.1093 | 0.1296 | 0.1313 | 0.1244
25 | Marriage proposal | 0.0208 | 0.0459 | 0.0266 | 0.0371
26 | Renovating a home | 0.0690 | 0.0673 | 0.0593 | 0.0802
27 | Rock climbing | 0.0812 | 0.0889 | 0.0850 | 0.0850
28 | Town Hall meeting | 0.3855 | 0.3674 | 0.4447 | 0.4840
29 | Winning a race without a vehicle | 0.2543 | 0.2978 | 0.2989 | 0.3041
30 | Working on a metal crafts project | 0.1032 | 0.2186 | 0.1238 | 0.1237
31 | Beekeeping | 0.7367 | 0.7532 | 0.73855 | 0.7565
32 | Wedding shower | 0.2545 | 0.2790 | 0.2793 | 0.2894
33 | Non-motorized vehicle repair | 0.2712 | 0.3070 | 0.2774 | 0.2841
34 | Fixing musical instrument | 0.4575 | 0.4067 | 0.4124 | 0.4686
35 | Horse-riding competition | 0.3534 | 0.3323 | 0.2782 | 0.3842
36 | Felling a tree | 0.1774 | 0.1952 | 0.2141 | 0.2238
37 | Parking a vehicle | 0.1719 | 0.1802 | 0.2791 | 0.2678
38 | Playing fetch | 0.0906 | 0.0984 | 0.0749 | 0.0842
39 | Tailgating | 0.2066 | 0.2132 | 0.1889 | 0.2001
40 | Tuning a musical instrument | 0.0781 | 0.1484 | 0.2026 | 0.1938
mAP | | 0.2373 | 0.2498 | 0.2540 | 0.2704

Table 3.1: Comparison of MED performance (AP metric) on the NIST MEDTEST 2014 dataset using Video-level Models (VM), Segment-based Models (SM) and Late Fusion (VM+SM).

Convergence of the algorithm is judged based on the Average Precision (AP) value of the candidate model on a held-out validation set. The iteration is terminated when there is only a marginal improvement in AP or when enough positive examples have been mined, whichever happens earlier. Many of the candidate segment models $M_i$ trained in this step are either noisy or redundant and need to be further pruned to build a representative set for each event.

3.2.3 Model Selection

From the pool of $|L| \cdot |P|$ candidate models for each event, we need to select a subset $S$ of the candidates that is representative of the event. Many of the candidate models are redundant due to over-sampling. So, the subset $S$ is chosen to maximize the mean Average Precision (mAP) on the training set, excluding the positive segments used for training and their neighbors. Since a search over the entire subset space has high computational complexity, we opt for a greedy algorithm to choose the final representative models $\tilde{M}_i$, which works quite well in our experiments. We tried both greedy model selection and greedy model elimination strategies to select the subset. We observe that greedy selection gives similar performance to greedy elimination while being computationally faster. At each step, we add the model $\tilde{m}_i$ that maximizes the AP of the existing subset $S$. We use early stopping to prevent over-fitting.
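The greedy selection just described can be sketched as follows; the eval_map callable, which returns the mAP of a candidate-model subset on the held-out training segments, is a hypothetical stand-in for the evaluation used here.

def greedy_select(candidates, eval_map, max_models=10, min_gain=1e-3):
    # candidates: list of candidate segment models; eval_map: subset -> mAP
    selected, best_map, remaining = [], 0.0, list(candidates)
    while remaining and len(selected) < max_models:
        new_map, best_model = max(((eval_map(selected + [m]), m) for m in remaining),
                                  key=lambda pair: pair[0])
        if new_map - best_map < min_gain:   # early stopping to prevent over-fitting
            break
        selected.append(best_model)
        remaining.remove(best_model)
        best_map = new_map
    return selected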
3.2.4 Model Testing

Testing the segment-based models differs from testing video-level models in two main aspects. Firstly, since segment-based models are trained on the discriminative segments, they are expected to have low responses for non-discriminative and outlier segments. This results in sparse high detection scores across the video segments; averaging across all segments would therefore produce a very low and noisy final score. Secondly, since each event $\mathcal{C}_\epsilon$ can be represented by more than one segment-based model $\tilde{M}_i$ from the representative set, the detection scores of the various models for a segment have to be aggregated to obtain a detection score for that segment. However, the detection scores of different models are not directly comparable and need to be calibrated against each other in a probabilistic space.

To calibrate the detection scores of the representative segment-based models $\tilde{M}_i$ of an event $\mathcal{C}_\epsilon$, we use a held-out validation set (VS) to mine non-redundant top $P_s$ scores from positive segments $V_j^s$, $V_j \in \mathcal{C}_\epsilon$, and a background set to mine $N_s$ hard-negative scores. A learned sigmoid $(\alpha_{\tilde{M}_i}, \beta_{\tilde{M}_i})$ is then fit to each model $\tilde{M}_i$ and the detection scores $x_j = Sc(V_j^s, \tilde{M}_i)$ are rescaled to be comparable to each other as follows:

$f(x_j \mid w(\tilde{M}_i), \alpha_{\tilde{M}_i}, \beta_{\tilde{M}_i}) = \dfrac{1}{1 + e^{-\alpha_{\tilde{M}_i}\left(w(\tilde{M}_i)^T x_j + \beta_{\tilde{M}_i}\right)}}$

This calibration step also suppresses the responses of models that do not have a high distinction between positive and negative scores, by shifting the decision boundary towards the exemplars [50]. The final detection score $X_j$ for each segment $V_j^s$ is then obtained by max-pooling the calibrated scores $f(x_j)$ of all the representative segment-based models $\tilde{M}_i$ of the event $\mathcal{C}_\epsilon$:

$X_j = \max_i\big(f(x_j \mid \tilde{M}_i)\big), \quad x_j = Sc(V_j^s, \tilde{M}_i)$

Once the detection score for each segment is calculated, a video-level score is obtained by averaging the scores at the local maxima of the video:

$Sc(V_j) = \mathrm{avg}\big(\max_k (X_j)\big), \quad X_j = Sc(V_j^s)$

Non-redundancy of the scores is achieved through non-maximal suppression, while the averaging suppresses noisy responses.

3.2.5 Model Recounting

For identifying the positive evidence, we take the segments with local maxima scores:

$Ev(V_j) = \{\max_k (X_j)\}, \quad X_j = Sc(V_j^s)$

The corresponding labels of the positive evidence are identified by choosing the labels of the representative models that have the maximum scores for the segments with local maxima.

3.3 Zero-shot Event Retrieval

In this section, we discuss zero-shot event retrieval, where only the name of the event and associated tags are provided, without any visual exemplars for the event. We detail below our "continuous word space" embedding method and an extension of the "concept space" embedding approach of [15] for zero-shot retrieval. We also include an overview of the "dictionary space" embedding method of [87].

3.3.1 Continuous Word Space (CWS) Embedding

We begin by mapping feature bank concepts to their respective continuous word space vectors. Each video is then mapped to the continuous word space by computing a sum of these concept bank vectors, weighted by the corresponding detector responses. User query tags are mapped to the continuous word space, and aggregated to form a single query vector. Finally, videos are scored using a similarity measure in the common space. The framework is illustrated in Figure 3.3.

Figure 3.3: Video and Query Embedding framework

3.3.1.1 Mid-level semantic video representation

We represent a video $V_i$ as a vector of concept bank confidence scores, $f_i = (w_{i1}, w_{i2}, \ldots, w_{iN})^T$, corresponding to semantic concepts $\{c_1, c_2, \ldots, c_N\}$. Owing to its semantic interpretation, such a representation is more useful for event-level inference than a standard bag-of-words. The video feature is assumed to be an $L_1$-normalized histogram $f_i$ of these concept scores, which are scaled confidence outputs of detectors trained on the corresponding concepts.

3.3.1.2 Concepts and Queries to Continuous Word Space

We map each semantic feature concept $c_i$ to its corresponding continuous word representation vector $v(c_i)$. Since it is possible that no (case-insensitive) exact match exists in the vocabulary, matches may be computed by heuristic matching techniques. A user query can contain one or more tags. Each tag is mapped to its corresponding continuous word space representation, if it exists.
The query vector $q$ is an $L_2$-normalized sum of these vectors, which is equivalent to the Boolean AND operator over the input tags. More sophisticated schemes may be designed for query representation and combination, but this is not explored in this work.

3.3.1.3 Embedding Function

We interpret a video as a text document, with the feature response vector being a histogram of word occurrence frequencies. The video feature vector is mapped to the continuous word space by computing a sum of the concept vectors (mapped in Section 3.3.1.2), weighted by their corresponding feature responses. To avoid including noisy responses in the representation, only a thresholded set of top responses is used. If we denote by $f_i^c$ the embedding for video $v_i$, then

$f_i^c = \sum_{c_k \in \mathcal{C}_i} w'_{ik}\, v(c_k)$

where $\mathcal{C}_i$ is the set of top responses for that video. The weights $w'_{ik}$ are equivalent to $w_{ik}$ up to scale, chosen to ensure $\|f_i^c\|_2 = 1$. Since the number of concepts in a concept bank can be large, and if the detectors are trained with reasonable accuracy, it is expected that the bulk of the feature histogram will be distributed among a few concepts. Aggregating high-confidence detector responses helps in suppressing the remaining noisy detector responses.

3.3.1.4 Scoring videos

Videos are scored using the dot product similarity measure. This measure has the advantage of implicitly performing query expansion, i.e., responses of feature concepts semantically related to the query tags will be automatically aggregated. For example, if a user queries for "vehicle", then responses in video $V_i$ for related tags (assuming they are among the thresholded concepts), such as "car" or "bus", will be aggregated as follows:

$q^T f_i^c = v(\text{vehicle})^T \sum_{c_k \in \mathcal{C}_i} w'_{ik}\, v(c_k)$
$= v(\text{vehicle})^T \big[w'_{i,\text{car}}\, v(\text{car}) + w'_{i,\text{bus}}\, v(\text{bus})\big] + v(\text{vehicle})^T \sum_{c_k \in \mathcal{C}_i \setminus \{\text{car},\, \text{bus}\}} w'_{ik}\, v(c_k)$
$\approx v(\text{vehicle})^T \big[w'_{i,\text{car}}\, v(\text{car}) + w'_{i,\text{bus}}\, v(\text{bus})\big] + 0$

since unrelated concepts are expected to be nearly orthogonal.

3.3.2 Concept Space (CoS) Embedding

The authors in [15] applied their retrieval method to the UCF-101 [72] dataset. The goal of their method was to transfer knowledge from detectors trained on a known set of classes to create a detector for an unseen class. This was achieved by aggregating detector scores of the known classes, weighted by their semantic similarity to the unseen class, using a continuous word space. We briefly describe the key components of their approach, and our modifications that extend its application to the domain of query tag-based retrieval for unconstrained web videos:

Mid-level representation: In [15], classifiers were trained for a held-out set of classes. These classifiers were then applied to the target class, and their confidence scores were concatenated to obtain a video representation. We replace these classifiers by our detector bank, as described in Section 3.3.1.1. As in their work, we use this as the final video representation, yielding a concept space video embedding.

Unseen class representation: In [15], a class name is represented by its continuous word representation, trained on Wikipedia articles. Each detector concept is similarly mapped to this space. The representations are used to rank concepts with respect to the query by computing the dot product similarity score. We extend this idea and replace the unseen class with the set of query tags.

Scoring videos: The query-video similarity score is computed as a sum of the top K detector responses, weighted by the class-concept similarity score obtained in the previous step.
We leave this unmodied in our extension. 3.3.3 Dictionary Space (DiS) Embedding For completeness, we provide a high-level overview of the method presented in [87], which is evaluated in this work. The authors rst trained multiple concept detector banks and applied them to each video, to obtain a set of mid-level concept space representations. In a given detector bank, for each concept, they obtain the top K similar words in a corpus, using a similarity score derived from the corpus. The condence score of a concept detector for each video is distributed to the top K similar words, weighted by the corresponding similarity score. Thus, each detector bank response is mapped to a sparse vector in the dictionary space. 20 The tags in a query (which are a subset of the word corpus) are used to create a sparse vector in the dictionary space. The dimensions corresponding to each query tag are set to 1.0, while the rest are set to 0.0. For each detector bank, a similarity score with the query vector is computed as a dot product in this space. These similarity scores are fused to obtain the nal query-video similarity score. 3.4 Experiments In this section, we provide details about the dataset we used, various choices of parameters and evaluate the performance of our segment-based models. 3.4.1 Dataset In our experiments, we use TRECVID MED14 [56] test video corpus and MED 14 event kit data for evaluation. The dataset contains unconstrained, Youtube-like web videos from the Internet consisting of high-level events. The MEDTest 14 has around 27,000 videos and the event kit consists of a 100Ex setting, providing approximately 100 exemplars per event. The "event kit" consists of 20 complex high-level events diering in various aspects such as background : outdoor ( bike trick ) vs indoor ( town hall meeting ); frequency : daily ( parking a vehicle ) vs uncommon ( beekeeping ); sedentary ( tuning a musical instrument ) vs mobile ( horse-riding competition ). A complete list of events is provided in table 3.1. 3.4.2 Object Bank For mid-level features, we choose an Object Bank [28] containing 15k categories of ImageNet. Each category is trained using a convolution network with eight layers and error back propagation. The responses for each category are obtained for each frame and the 15k dimensional vector is simply averaged across frames to obtain segment level and video level representations. The 15k objects are noun phrases that encapsulate a high diversity of concepts such as scenes, objects, people and activities. 3.4.3 Evaluation 3.4.3.1 Training parameters For training the segment-based models, the rst parameter choice is the number of initial seed models(KM ). For the value of (K ), a performance plateau was reached forK = 50. For M , lower values led to poor performance due to noisy estimates of the object bank, while higher values led to high redundancy in the initial seeds. M = 5 was chosen for our experiments. For discriminative clustering, N p = 10 was used for collecting positives in each iteration and at this rate most of the models stabilize in 3-4 iterations (30-40 exemplars). A maximum iteration limit of 20 (200 exemplars) is set for the clustering, with most of the models reaching convergence far before except the highly noisy ones. For training and validation, we use a split of 67%-33% on the training videos. 
ID | Event Name | Thresh 1 = 0.5 (VM / SM) | Thresh 2 = 0.9 (VM / SM) | Avg (VM / SM)
23 | Dog show | 0.9612 / 0.9668 | 0.7619 / 0.7778 | 0.8771 / 0.8806
25 | Marriage proposal | 0.2801 / 0.3118 | 0.2686 / 0.2913 | 0.2737 / 0.2991
27 | Rock climbing | 0.7322 / 0.7506 | 0.7153 / 0.7351 | 0.7258 / 0.7381
29 | Winning a race | 0.6684 / 0.6841 | 0.6004 / 0.6580 | 0.6449 / 0.6746
mAP | | 0.6604 / 0.6783 | 0.5865 / 0.6155 | 0.6304 / 0.6462

Table 3.2: Comparison of Average Precision (AP) of the ranked segments in test videos for Video-level Models (VM) and Segment-based Models (SM) at various thresholds.

We use less than 1% of the available training segments to train all the events, showing the efficiency of our training procedure. Some events such as "dog show" were efficiently represented with a single model. This indicates that if an event has low intra-class variance, representation is possible with very few models.

3.4.3.2 Multimedia Event Detection

We compare the performance of our segment-based models with a standard video-level model using the object bank features [28] and with the evidence localization model (ELM) [75]. For [28], we use a histogram intersection kernel SVM [49] to model the event and logistic-regression-based fusion when combining the two modalities. For [75], a latent SVM is used on the object bank that models both global and part-based models. A summary of the results per event is provided in Table 3.1. For the majority of events, the AP of the segment-based models is better than the AP of the other methods, while late fusion with video-level models improves the performance significantly, indicating some complementarity of "modeling segments" to "modeling context". Note that the AP of segment-based models is similar to that of ELM, which uses both global and part-based models. Hence, a better comparison is with the fusion results, which are better than those of ELM. Also, ELM is relatively slow as it uses a latent SVM for inference. Events such as "Winning a race without a vehicle" (running, swimming, potato race) and "Tuning a musical instrument" (guitar, keyboard, snare drum) improve considerably, indicating that events that contain natural subcategories are modeled more accurately using segment-based models. Sometimes, lack of sufficient data to model events leads to a drop in performance, as in the case of the event "horse-riding competition": the segment-based models produce high scores on test videos that have a strong incidence of horse, race track or jockey, but they perform poorly when the race occurs on a grassy surface and horses appear at a very low resolution, where the incidence falls on a poorly trained "paddock" model.

3.4.3.3 Multimedia Event Recounting

Multimedia Event Recounting (MER) generates a summary of the key evidence for the event of a video, by providing when (event interval), what (evidence label) and the confidence of the evidence segments. To evaluate the performance of segment-based models for MER, we use the annotations provided by NIST for positive MEDTest videos of 4 events.

Figure 3.4: Captions/labels generated by segment-based models for events Bike Trick, Dog Show, Marriage Proposal, Rock Climbing, Winning a race without a vehicle and Beekeeping (from top to bottom). The first ten out of the twenty positive test videos of the event are chosen and the middle frame of the segment is chosen for illustration. It can be seen that the captions are relevant to the segments.

The annotations provide the probability that a video segment belongs to the event.
We use various thresholds to categorize the test segments into positive/negative for the event and report the Average Precision of the retrieved scores for the segments. We consider any overlap > 50% as positive. The average precision (AP) for each event at various thresh- olds, based on the rank of each segment are shown in Table 3.2. The AP is consistently better for segment-based models, indicating that they are able to better discriminate the positive segments from the outliers. Segment-based models can also be used to provide labels to the informative segments without any post-processing due to the label assigned to each model. Figure 3.4 contains examples of labels produced by segment-based models for sample videos of some events. For events like "marriage proposal" and "rock climbing", single models like "sweetheart" and "rockclimbing" are able to encapsulate majority of videos with precision. In the absence of specic labels from object bank, as in the case of "swimming" and "potato race" from event "Winning a race without a vehicle", it can be seen that semantically closer labels like "sport" and "broad jumping" have been assigned. This can be attributed 23 to the inter-model dependencies in the object bank which are eciently utilized by the discriminative clustering algorithm. Figure 3.5 shows the frequency distribution of tags that were generated using the labels for positive MEDTest videos of each category. It can be seen that the tags are highly relevant to the event categories. Figure 3.5: Visualization of frequency of tags generated for events. (Tags are generated using labels of Segment-based Models.) 3.4.4 Zero-shot Event Retrieval Zero-shot Event Retrieval uses description of each event provided by EventKit for mod- eling the respective events. For each event, we pick nouns from the supplied description in the EventKit, and use them as query tags to evaluate retrieval performance. Since re- trieval performance depends on availability and quality of relevant detectors, rankings will vary with dierent query tags for the same event. Thus, we report the best performing tag (highest mean average precision or mAP) using our pipeline. Using the tags selected above, for each event, we compare our method's performance with our extension of the CoS technique presented in [15]. We use heuristics for mapping detector concepts to continuous word space vectors. As per the authors' recommendation, we test performance usingK = 3; 5, and present results for the best performing parameter value, K = 3. Here, K indicates the number of top related concept detectors used for scoring. For the DiS baseline of [87], since the value ofK (denotes the number of nearest words in the dictionary for each detector concept) used in their experiments is not mentioned in the paper, we repeat the experiment with values of K = 3; 5; 7. 
We present results for K = 5, which results in the highest AP. In our case, there is only one detector bank (which corresponds to the image modality), so the multi-modal fusion techniques presented in [87] are not applicable.

We also test several tag combinations for a query, and present the best performing combinations for the CWS embedding, per event. Tags are combined using AND boolean semantics: query vectors for individual tags are averaged, and then $L_2$-normalized, to obtain the final query vector. This is equivalent to late fusion by averaging video scores using individual tags. For comparison, the same tags are used as query inputs to the CoS and DiS methods.

For each event in the MED2014 dataset, the Average Precision (AP) score using the selected single tag queries is presented in Table 3.3, and using multiple tag queries in Table 3.4.

ID | Event Name | Tags | DiS | CoS | CWS
21 | Bike trick | bicycle | 0.0417 | 0.0253 | 0.0475
22 | Cleaning an appliance | cooler | 0.0479 | 0.0492 | 0.0521
23 | Dog show | rink | 0.0082 | 0.0066 | 0.0103
24 | Giving direction | cop | 0.0500 | 0.0551 | 0.0477
25 | Marriage proposal | marriage | 0.0015 | 0.0029 | 0.0039
26 | Renovating a home | furniture | 0.0086 | 0.0110 | 0.0236
27 | Rock climbing | climber | 0.1038 | 0.0649 | 0.0823
28 | Town Hall meeting | speaker | 0.0842 | 0.1025 | 0.1145
29 | Winning a race without a vehicle | track | 0.1217 | 0.1374 | 0.1233
30 | Working on a metal crafts project | repair | 0.0008 | 0.0079 | 0.0564
31 | Beekeeping | apiary | 0.5525 | 0.5697 | 0.5801
32 | Wedding shower | wedding | 0.0120 | 0.0120 | 0.0248
33 | Non-motorized vehicle repair | bicycle | 0.0247 | 0.1559 | 0.0191
34 | Fixing musical instrument | instrument | 0.0131 | 0.0179 | 0.1393
35 | Horse-riding competition | showjumping | 0.2711 | 0.2832 | 0.2940
36 | Felling a tree | forest | 0.1593 | 0.1468 | 0.1303
37 | Parking a vehicle | vehicle | 0.0813 | 0.0882 | 0.0768
38 | Playing fetch | dog | 0.0073 | 0.0079 | 0.0054
39 | Tailgating | jersey | 0.0022 | 0.0028 | 0.0031
40 | Tuning a musical instrument | piano | 0.0687 | 0.0363 | 0.0795
mAP | | | 0.0830 | 0.0892 | 0.0957

Table 3.3: Retrieval performance (AP metric) using a single query tag on the NIST MEDTEST 2014 dataset using our "continuous word space" (CWS), our "concept space" (CoS) extension of [15], and "dictionary space" (DiS) [87] embedding approaches.

While the mean Average Precision (mAP) for a single tag query using our CWS method is comparable with the CoS and DiS methods, our multiple-tag queries show a clear improvement. For the case of multiple tag queries (Table 3.4), most events show improved performance over the corresponding single tag result, using the CWS embedding. In particular, large bumps in performance are observed for the events 26, 29, 34, 39, and 40. This can be explained by implicit query expansion aggregating related high quality concept detectors, whose responses combine favorably during late fusion.

In contrast, both the CoS and DiS methods exhibit a performance drop for several events when using multiple query tags. This can be explained by sensitivity to the fixed parameter K. In the case of CoS, the value of K that was aggregating sufficient detector responses for a single tag may not be enough for multiple tags. The DiS approach will select the K nearest dictionary words for every query tag, and may end up selecting concepts that have aggregated scores of noisy detectors. While it is not evident how to select K in the case of multiple tags, our CWS embedding avoids this issue by aggregating all relevant detector responses (equivalent to adaptively setting K equal to the number of concepts), as shown in Section 3.3.1.4.
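For reference, the CWS embedding and scoring of Sections 3.3.1.3-3.3.1.4 reduce to a few lines of linear algebra. The sketch below assumes pre-computed, L1-normalized detector response histograms and a word-vector matrix already restricted to the concept vocabulary; the function names and the top-response threshold are illustrative, not the exact implementation.

import numpy as np

def embed_video(responses, concept_vecs, top_k=10):
    # responses: (K,) L1-normalized detector scores; concept_vecs: (K, d) word vectors
    top = np.argsort(-responses)[:top_k]              # keep only high-confidence concepts
    f = (responses[top, None] * concept_vecs[top]).sum(axis=0)
    return f / (np.linalg.norm(f) + 1e-12)            # unit-normalize the video embedding

def embed_query(tag_vecs):
    # tag_vecs: (T, d) word vectors of the query tags; averaging acts as Boolean AND
    q = np.asarray(tag_vecs).sum(axis=0)
    return q / (np.linalg.norm(q) + 1e-12)

def rank_videos(video_embs, query_emb):
    # video_embs: (M, d) stacked video embeddings; the dot product performs the
    # implicit query expansion discussed above
    return np.argsort(-(video_embs @ query_emb))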
For visualization, we selected a few tags from the Single tag column in Table 3.3 and show screenshots from the top 5 videos retrieved by our system in Figure 3.6. It may be noted that the tags "forest" and "vehicle" do not have matching detectors in the concept bank. Hence, the rankings are based on implicit query expansion resulting from the continuous word space video representation.

Figure 3.6: Top 5 video screenshots for single query tags in row 1) "forest", 2) "bicycle", 3) "climber", 4) "vehicle". Ranks decrease from left to right.

Using our CWS representation, assuming 8-byte double precision floating point numbers, we can fit 200,000 videos (the entire TRECVid [71] database) in 460 MB of memory, versus the concept bank representation, which requires over 22 GB. While the CWS embedding is inherently slightly more computationally intensive relative to the CoS and DiS schemes, it tends to capture diverse detection responses. This is critical in case one or more query tags do not have any corresponding concepts in the detector bank. In such a case, strong responses belonging to semantically related concept detectors may not be captured among the top K responses. This becomes apparent in the AP scores for events 30 and 34: the query tags "repair" and "instrument" do not have corresponding concepts in the detector bank, but our CWS embedding captures relevant responses, vastly improving results.

ID | Event Name | Tags | DiS | CoS | CWS
21 | Bike trick | bicycle chain | 0.0307 | 0.0208 | 0.0747
22 | Cleaning an appliance | cooler refrigerator | 0.0512 | 0.0493 | 0.0551
23 | Dog show | hall dog competition | 0.0060 | 0.0142 | 0.0489
24 | Giving direction | cop map | 0.0068 | 0.0615 | 0.0351
25 | Marriage proposal | woman ring | 0.0031 | 0.0031 | 0.0029
26 | Renovating a home | furniture ladder | 0.0092 | 0.0110 | 0.0884
27 | Rock climbing | mountaineer climber | 0.1038 | 0.0674 | 0.1192
28 | Town Hall meeting | speaker town hall | 0.0842 | 0.0991 | 0.1381
29 | Winning a race without vehicle | swimming track | 0.1241 | 0.1887 | 0.2185
30 | Working on metal crafts project | metal solder | 0.0024 | 0.0028 | 0.0264
31 | Beekeeping | honeybee apiary | 0.0929 | 0.5530 | 0.5635
32 | Wedding shower | gifts wedding | 0.0120 | 0.0110 | 0.0191
33 | Non-motorized vehicle repair | bicycle tools | 0.0090 | 0.1538 | 0.0414
34 | Fixing musical instrument | instrument tuning | 0.0249 | 0.0179 | 0.2692
35 | Horse-riding competition | showjumping horse | 0.2765 | 0.2863 | 0.2985
36 | Felling a tree | tree chainsaw | 0.0409 | 0.0440 | 0.0521
37 | Parking a vehicle | parking vehicle | 0.0621 | 0.0882 | 0.0504
38 | Playing fetch | dog beach | 0.0077 | 0.0079 | 0.1006
39 | Tailgating | jersey car | 0.0034 | 0.0028 | 0.0036
40 | Tuning a musical instrument | piano repair | 0.1004 | 0.0320 | 0.1458
mAP | | | 0.0526 | 0.0857 | 0.1176

Table 3.4: Retrieval performance for multiple query tags on the MEDTEST 2014 dataset using our CWS, our CoS extension of [15], and DiS [87] embedding approaches.

3.5 Conclusion

In this chapter, we formulated a novel approach using segment-based models that can tackle the event classification and recounting tasks simultaneously. Using the noisy pre-trained concepts, we trained discriminative models that can diversely represent an event with semantic interpretation. Further, we used a semantic embedding space to transfer knowledge from these external concepts for zero-shot event retrieval. The proposed methods have achieved promising results on classification, recounting and zero-shot retrieval tasks on the challenging TRECVID dataset.
Chapter 4
Query-guided Regression with Context Policy for Phrase Grounding

4.1 Introduction

Given an image and a related textual description, phrase grounding attempts to localize the objects that are referred to by the corresponding phrases in the description. It is an important building block in computer vision with natural language interaction, which can be utilized in high-level tasks such as image retrieval [18, 62], image captioning [1, 12, 29] and visual question answering [2, 6, 14].

Phrase grounding is a challenging problem that involves parsing language queries and relating that knowledge to localize objects in the visual domain. To address this problem, typically a proposal generation system is first applied to produce a set of proposals as grounding candidates. The main difficulties lie in how to learn the correlation between the language (query) and visual (proposals) modalities, and how to localize objects based on this multimodal correlation. State-of-the-art methods address the first difficulty by learning a subspace to measure the similarities between proposals and queries. With the learned subspace, they treat the second difficulty as a retrieval problem, where proposals are ranked based on their relevance to the input query. Among these, the Phrase-Region CCA [60] and SCRC [27] models learn a multimodal subspace via Canonical Correlation Analysis (CCA) and a Recurrent Neural Network (RNN) respectively. Varun et al. [53] learn multimodal correlation aided by context objects in the visual content. GroundeR [66] introduces an attention mechanism that learns to attend to related proposals given different queries through phrase reconstruction.

These approaches have two important limitations. First, proposals generated by independent systems may not always cover all referred objects given various queries; since retrieval-based methods localize objects by choosing one of these proposals, they are bounded by the performance limits of the proposal generation systems. Second, even though query phrases are often selected from image descriptions, context from these descriptions is not utilized to reduce semantic ambiguity for grounding systems. Consider the example in Fig. 4.1. Given the query "a man", the phrases "a guitar" and "a little girl" can be considered to provide context: proposals overlapping with "a guitar" or "a little girl" are less likely to be the ones containing "a man".

Figure 4.1: QRC Net first regresses each proposal based on the query's semantics and visual features, and then utilizes context information as rewards to refine grounding results.

To address the aforementioned issues, we propose to predict the referred object's location rather than selecting candidates from limited proposals. For this, we adopt a regression-based method guided by the input query's semantics. To reduce semantic ambiguity, we assume that different phrases in one sentence refer to different visual objects.
Given one query phrase, we evaluate the predicted proposals and down-weight those which cover objects mentioned by other phrases (i.e., context). For example, we assign lower rewards to proposals containing "a guitar" and "a little girl" in Fig. 4.1 to guide the system to select the more discriminative proposal containing "a man". Since this procedure depends on prediction results and is non-differentiable, we utilize reinforcement learning [78] to adaptively estimate these rewards conditioned on context information and jointly optimize the framework.

For implementation, we propose a novel Query-guided Regression network with Context policy (QRC Net) which consists of a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). PGN is a proposal generator which provides candidate proposals given an input image (red boxes in Fig. 4.1). To overcome the performance limit from PGN, QRN not only estimates each proposal's relevance to the input query, but also predicts its regression parameters to the referred object conditioned on the query's intent (yellow boxes in Fig. 4.1). CPN samples QRN's prediction results and evaluates them by leveraging context information as a reward function. The estimated reward is then back propagated as policy gradients (Step 3 in Fig. 4.1) to assist QRC Net's optimization. In the training stage, we jointly optimize PGN, QRN and CPN using an alternating method as in [64]. In the test stage, we fix CPN and apply the trained PGN and QRN to ground objects for different queries.

We evaluate QRC Net on two popular grounding datasets: Flickr30K Entities [60] and Referit Game [30]. Flickr30K Entities contains more than 30K images and 170K query phrases, while Referit Game has 19K images referred to by 130K query phrases. Experiments show QRC Net outperforms state-of-the-art methods by a large margin on both datasets, with more than a 14% increase in accuracy on Flickr30K Entities and a 17% increase on Referit Game.

Our contributions are twofold: First, we propose a query-guided regression network to overcome the performance limits of independent proposal generation systems. Second, we introduce reinforcement learning to leverage context information to reduce semantic ambiguity. More details of QRC Net are provided in Sec. 4.2. We then analyze and compare QRC Net with other approaches in Sec. 4.3.

Figure 4.2: Query-guided Regression network with Context policy (QRC Net) consists of a Proposal Generation Network (PGN), a Query-guided Regression Network (QRN) and a Context Policy Network (CPN). PGN generates proposals and extracts their CNN features via a RoI pooling layer [64]. QRN encodes the input query's semantics by an LSTM [25] model and regresses proposals conditioned on the query. CPN samples the top ranked proposals, and assigns rewards considering whether they are foreground (FG), background (BG) or context. These rewards are back propagated as policy gradients to guide QRC Net to select more discriminative proposals.
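Before detailing the networks, the core fusion-and-prediction step of QRN (Figure 4.2, Sec. 4.2.3) can be sketched in PyTorch-style code. The layer sizes follow the settings reported later in this chapter (d_q = 1000, d_v = 4101, m = 512), but the module and variable names are assumptions for illustration, not the authors' released implementation.

import torch
import torch.nn as nn

class QRNHead(nn.Module):
    # Fuses the LSTM query embedding with each proposal's visual feature and
    # predicts a relevance score plus four regression parameters per proposal.
    def __init__(self, d_q=1000, d_v=4101, m=512):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(d_q + d_v, m), nn.BatchNorm1d(m), nn.ReLU())
        self.pred = nn.Linear(m, 5)            # [relevance, tx, ty, tw, th]

    def forward(self, q, v):
        # q: (d_q,) query embedding; v: (N, d_v) proposal visual features
        q = q.unsqueeze(0).expand(v.size(0), -1)
        s = self.pred(self.fuse(torch.cat([q, v], dim=1)))
        probs = torch.softmax(s[:, 0], dim=0)  # distribution over the N proposals
        return probs, s[:, 1:]                 # relevance and regression parameters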
4.2 QRC Network

QRC Net is composed of three parts: a Proposal Generation Network (PGN) to generate candidate proposals, a Query-guided Regression Network (QRN) to regress and rank these candidates, and a Context Policy Network (CPN) to further leverage context information to refine the ranking results. In many instances, an image is described by a sentence which contains multiple noun phrases that are used as grounding queries, one at a time. We consider the phrases that are not in the query to provide context; specifically, we infer that they refer to objects not referred to by the query. This helps rank proposals; we use CPN to optimize this with a reinforcement learning policy gradient algorithm. We first present the framework of QRC Net, followed by the details of PGN, QRN and CPN respectively. Finally, we illustrate how to jointly optimize QRC Net and employ it in the phrase grounding task.

4.2.1 Framework

The goal of QRC Net is to localize the mentioned object's location $y$ given an image $x$ and a query phrase $q$. To achieve this, PGN generates a set of $N$ proposals $\{r_i\}$ as candidates. Given the query $q$, QRN predicts their regression parameters $\{t_i\}$ and probabilities $\{p_i\}$ of being relevant to the input query. To reduce semantic ambiguity, CPN evaluates the prediction results of QRN based on the locations of objects mentioned by context phrases, and adopts a reward function $F$ to adaptively penalize high-ranked proposals containing context-referred objects. Reward calculation depends on the predicted proposals, and this procedure is non-differentiable. To overcome this, we deploy a reinforcement learning procedure in CPN where this reward is back propagated as policy gradients [79] to optimize QRN's parameters, which guides QRN to generate more discriminative proposals. The objective for QRC Net is:

$\arg\min_\theta \sum_q \left[\mathcal{L}_{gen}(\{r_i\}) + \mathcal{L}_{cls}(\{r_i\}, \{p_i\}, y) + \lambda\, \mathcal{L}_{reg}(\{r_i\}, \{t_i\}, y) + J(\theta)\right] \quad (4.1)$

where $\theta$ denotes the QRC Net's parameters to be optimized and $\lambda$ is a hyperparameter. $\mathcal{L}_{gen}$ is the loss for generating the proposals produced by PGN. $\mathcal{L}_{cls}$ is a multi-class classification loss generated by QRN in predicting the probability $p_i$ of each proposal $r_i$. $\mathcal{L}_{reg}$ is a regression loss generated by QRN to regress each proposal $r_i$ to the mentioned object's location $y$. $J(\theta)$ is the reward calculated by CPN.

4.2.2 Proposal Generation Network (PGN)

We build PGN with a similar structure to that of the RPN in [64]. PGN adopts a fully convolutional neural network (FCN) to encode the input image $x$ as an image feature map. For each location (i.e., anchor) in the image feature map, PGN uses different scales and aspect ratios to generate proposals $\{r_i\}$. Each anchor is fed into a multi-layer perceptron (MLP) which predicts a probability $p_i^o$ estimating the objectness of the anchor, and 4D regression parameters $t_i = [(x - x_a)/w_a,\ (y - y_a)/h_a,\ \log(w/w_a),\ \log(h/h_a)]$ as defined in [64]. The regression parameters $t_i$ estimate the offset from the anchor to the referred object's bounding box. Given all the referred objects' locations $\{y_l\}$, we consider a proposal to be positive when it covers some object $y_l$ with Intersection over Union (IoU) > 0.7, and negative when IoU < 0.3. The generation loss is:

$\mathcal{L}_{gen} = -\frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} \mathbb{1}(i \in S_y \cup \bar{S}_y) \log(p_i^o) + \frac{\lambda_g}{N_{reg}} \sum_{i=1}^{N_{reg}} \mathbb{1}(i \in S_y) \sum_{j=0}^{3} f\big(|t_i[j] - t_i^*[j]|\big) \quad (4.2)$

where $\mathbb{1}(\cdot)$ is an indicator function, $S_y$ is the set of positive proposals' indexes and $\bar{S}_y$ is the set of negative proposals' indexes. $N_{reg}$ is the total number of anchors and $N_{cls}$ is the batch size as defined in [64].
$t_i^*$ represents the regression parameters from anchor $i$ to the corresponding object's location $y_l$. $f(\cdot)$ is the smooth L1 loss function: $f(x) = 0.5x^2$ if $x < 1$, and $f(x) = |x| - 0.5$ if $x \geq 1$.

We sample the top $N$ anchors based on $\{p_i^o\}$ and regress them into proposals $\{r_i\}$ with the predicted regression parameters $t_i$. Through a RoI pooling layer [64] and MLP layers, we extract a visual feature $v_i \in \mathbb{R}^{d_v}$ for each proposal $r_i$. $\{r_i\}$ and $\{v_i\}$ are fed into QRN as visual inputs.

4.2.3 Query-guided Regression Network (QRN)

For an input query $q$, QRN encodes its semantics as an embedding vector $\mathbf{q} \in \mathbb{R}^{d_q}$ via a bi-directional LSTM [47]. Given the visual inputs $\{v_i\}$, QRN concatenates the embedding vector $\mathbf{q}$ with each proposal's visual feature $v_i$. It then applies a fully-connected (fc) layer to generate multimodal features $\{v_i^q\} \in \mathbb{R}^m$ for each $\langle q, r_i \rangle$ pair in an $m$-dimensional subspace. The multimodal feature $v_i^q$ is calculated as:

$v_i^q = \varphi(W_m(\mathbf{q} \,\|\, v_i) + b_m) \quad (4.3)$

where $W_m \in \mathbb{R}^{(d_q + d_v) \times m}$ and $b_m \in \mathbb{R}^m$ are projection parameters, $\varphi(\cdot)$ is a non-linear activation function, and "$\|$" denotes a concatenation operator.

Based on the multimodal feature $v_i^q$, QRN predicts a 5D vector $s_i^p \in \mathbb{R}^5$ via an fc layer for each proposal $r_i$ (the superscript "p" denotes prediction):

$s_i^p = W_s v_i^q + b_s \quad (4.4)$

where $W_s \in \mathbb{R}^{m \times 5}$ and $b_s \in \mathbb{R}^5$ are the projection weight and bias to be optimized. The first element of $s_i^p$ estimates the confidence of $r_i$ being related to the input query $q$'s semantics. The next four elements are regression parameters in the same form as $t_i$ defined in Sec. 4.2.2, where $x, y, w, h$ are replaced by the regressed values and $x_a, y_a, w_a, h_a$ are the proposal's parameters. We denote by $\{p_i\}$ the probability distribution over $\{r_i\}$ after we feed $\{s_i^p[0]\}$ to a softmax function. During training, we consider as positive the proposal which overlaps most with the ground truth and has IoU > 0.5. Thus, the classification loss is calculated as:

$\mathcal{L}_{cls}(\{r_i\}, \{p_i\}, y) = -\log(p_{i^*}) \quad (4.5)$

where $i^*$ is the positive proposal's index in the proposal set.

Given the object's location $y_l$ mentioned by query $q$, each proposal's ground truth regression data $s_i^q \in \mathbb{R}^4$ is calculated in the same way as the last four elements of $s_i^p$, by replacing $[x, y, w, h]$ with the ground truth bounding box's location information. The regression loss for QRN is:

$\mathcal{L}_{reg}(\{t_i\}, \{r_i\}, y) = \frac{1}{4N} \sum_{i=1}^{N} \sum_{j=0}^{3} f\big(|s_i^p[j+1] - s_i^q[j]|\big) \quad (4.6)$

where $f(\cdot)$ is the smooth L1 loss function defined in Sec. 4.2.2.

4.2.4 Context Policy Network (CPN)

Besides using QRN to predict and regress proposals, we also apply reinforcement learning to guide QRN to avoid selecting proposals which cover the objects referred to by query $q$'s context in the same description. CPN evaluates and assigns rewards for the top ranked proposals produced by QRN, and performs a non-differentiable policy gradient [79] update of QRN's parameters.

Specifically, the proposals $\{r_i\}$ from QRN are first ranked based on their probability distribution $\{p_i\}$. Given the ranked proposals, CPN samples the top $K$ proposals $\{r'_i\}$ and evaluates them by assigning rewards. This procedure is non-differentiable, since we do not know the proposals' qualities until they are ranked based on QRN's probabilities. Therefore, we use policy gradient reinforcement learning to update QRN's parameters. The goal is to maximize the reward expectation $F(\{r'_i\})$ under the distribution of $\{r'_i\}$ parameterized by the QRN, i.e., $J = \mathbb{E}_{\{p_i\}}[F]$.
According to the algorithm in [85], the policy gradient is

$\nabla_{\theta_r} J = \mathbb{E}_{\{p_i\}}\big[F(\{r'_i\})\, \nabla_{\theta_r} \log p'_i(\theta_r)\big] \quad (4.7)$

where $\theta_r$ are QRN's parameters and $\nabla_{\theta_r} \log p'_i(\theta_r)$ is the gradient produced by QRN for the top ranked proposal $r'_i$.

To predict the reward value $F(\{r'_i\})$, CPN averages the top ranked proposals' visual features $\{v'_i\}$ into $v^c$. The reward is computed as:

$F(\{r'_i\}) = \sigma(W_c(v^c \,\|\, \mathbf{q}) + b_c) \quad (4.8)$

where "$\|$" denotes the concatenation operation and $\sigma(\cdot)$ is a sigmoid function. $W_c$ and $b_c$ are projection parameters which produce a scalar value as the reward.

To train CPN, we design a reward function to guide CPN's prediction. The reward function acts as feedback from the environment and guides CPN to produce meaningful policy gradients. Intuitively, to help QRN select more discriminative proposals related to the query $q$ rather than the context, we assign a lower reward if a top ranked proposal overlaps an object mentioned by the context and a higher reward if it overlaps the object mentioned by the query. Therefore, we design the reward function as

$R(\{r'_i\}) = \frac{1}{K} \sum_{i=1}^{K} \big[\mathbb{1}(r'_i \in S_q) + \rho\, \mathbb{1}(r'_i \notin (S_q \cup S_{bg}))\big] \quad (4.9)$

where $S_q$ is the set of proposals with IoU > 0.5 with the objects mentioned by query $q$, and $S_{bg}$ is the set of background proposals with IoU < 0.5 with the objects mentioned by all queries in the description. $\mathbb{1}(\cdot)$ is an indicator function and $\rho \in (0, 1)$ is the reward for proposals overlapping with objects mentioned by the context. The reward prediction loss is:

$\mathcal{L}_{rwd}(\{r'_i\}) = \big\| F(\{r'_i\}) - R(\{r'_i\}) \big\|^2 \quad (4.10)$

During training, $\mathcal{L}_{rwd}$ is backpropagated only to CPN for optimization, while CPN backpropagates the policy gradients (Eq. 4.7) to optimize QRN.

Approach | Accuracy (%)
Compared approaches
SCRC [27] | 27.80
Structured Matching [84] | 42.08
SS+GroundeR (VGG_cls) [66] | 41.56
RPN+GroundeR (VGG_det) [66] | 39.13
SS+GroundeR (VGG_det) [66] | 47.81
MCB [14] | 48.69
CCA embedding [60] | 50.89
Our approaches
RPN+QRN (VGG_det) | 53.48
SS+QRN (VGG_det) | 55.99
PGN+QRN (VGG_pgn) | 60.21
QRC Net (VGG_pgn) | 65.14

Table 4.1: Different models' performance on Flickr30K Entities. Our framework is evaluated in combination with various proposal generation systems.

4.2.5 Training and Inference

We train PGN based on an RPN pre-trained on the PASCAL VOC 2007 [11] dataset, and adopt the alternating training method in [64] to optimize PGN. We first train PGN and use its proposals to train QRN and CPN, then initialize PGN as tuned by QRN and CPN's training. Same as [66], we select 100 proposals produced by PGN (N = 100) and select the top 10 proposals (K = 10) predicted by QRN to assign rewards in Eq. 4.9. After calculating the policy gradient in Eq. 4.7, we jointly optimize QRC Net's objective (Eq. 4.1) using the Adam algorithm [33]. We choose the rectified linear unit (ReLU) as the non-linear activation function. During the testing stage, CPN is fixed and we stop its reward calculation. Given an image, PGN is first applied to generate proposals and their visual features. QRN regresses these proposals and predicts the relevance of each proposal to the query. The regressed proposal with the highest relevance is selected as the prediction result.

4.3 Experiment

We evaluate QRC Net on the Flickr30K Entities [60] and Referit Game [30] datasets for the phrase grounding task.

4.3.1 Datasets

Flickr30K Entities [60]: The numbers of training, validation and testing images are 29783, 1000 and 1000 respectively. Each image is associated with 5 captions, with 3.52 query phrases in each caption on average.
There are 276K manually annotated bounding boxes referred to by 360K query phrases in the images. The vocabulary size for all these queries is 17150.

Referit Game [30] consists of 19,894 images of natural scenes. There are 96,654 distinct objects in these images. Each object is referred to by 1-3 query phrases (130,525 in total). There are 8800 unique words among all the phrases, with a maximum length of 19 words.

Proposal generation | RPN [64] | SS [82] | PGN
UBP (%) | 71.25 | 77.90 | 89.61
BPG | 7.29 | 3.62 | 7.53

Table 4.2: Comparison of different proposal generation systems on Flickr30K Entities.

4.3.2 Experiment Setup

Proposal generation. We adopt a PGN (Sec. 4.2.2) to generate proposals. During training, we optimize PGN based on an RPN pre-trained on the PASCAL VOC 2007 dataset [11], which does not overlap with Flickr30K Entities [60] or Referit Game [30]. We also evaluate QRC Net based on Selective Search [82] (denoted as "SS") and EdgeBoxes [95] (denoted as "EB"), and an RPN [64] pre-trained on PASCAL VOC 2007 [58], which are all independent of QRN and CPN.

Visual feature representation. For QRN, the visual features are directly generated from PGN through a RoI pooling layer. Since PGN contains a VGG network [70] to process images, we denote these features as "VGG_pgn". To predict regression parameters, we need to include spatial information for each proposal. For Flickr30K Entities, we augment each proposal's visual feature with its spatial information $[x_{tl}/W,\ y_{tl}/H,\ x_{br}/H,\ y_{br}/W,\ wh/WH]$ as defined in [91]. These augmented features are 4101D vectors (d_v = 4101). For Referit Game, we augment VGG_pgn with each proposal's spatial information $[x_{min}, y_{min}, x_{max}, y_{max}, x_{center}, y_{center}, w_{box}, h_{box}]$, which is the same as [66] for fair comparison. We denote these features as "VGG_pgn-SPAT", which are 4104D vectors (d_v = 4104). To compare with other approaches, we replace PGN with a Selective Search and an EdgeBoxes proposal generator. Same as [66], we choose a VGG network finetuned using Fast-RCNN [16] on PASCAL VOC 2007 [11] to extract visual features for Flickr30K Entities. We denote these features as "VGG_det". Besides, we follow [66] and apply a VGG network pre-trained on ImageNet [8] to extract proposals' features for Flickr30K Entities and Referit Game, which are denoted as "VGG_cls". We augment VGG_det and VGG_cls with spatial information for the Flickr30K Entities and Referit Game datasets following the method mentioned above.

Model initialization. Following the same settings as in [66], we encode queries via a bi-directional LSTM [47], and choose the last hidden state of the LSTM as $\mathbf{q}$ (dimension d_q = 1000). All convolutional layers are initialized by the MSRA method [22] and all fc layers are initialized by the Xavier method [17]. We introduce batch normalization layers after projecting the visual and language features (Eq. 4.3). During training, the batch size is 40. We set the weight $\lambda$ for the regression loss L_reg to 1.0 (Eq. 4.1), and the reward value $\rho = 0.2$ (Eq. 4.9). The dimension of the multimodal feature vector $v_i^q$ is set to m = 512 (Eq. 4.3). An analysis of the hyperparameters is provided in Sec. 4.3.3 and 4.3.4.

Weight λ | 0.5 | 1.0 | 2.0 | 4.0 | 10.0
Accuracy (%) | 64.15 | 65.14 | 64.40 | 64.29 | 63.27

Table 4.3: QRC Net's performance on Flickr30K Entities for different weights λ of L_reg.

Dimension m | 128 | 256 | 512 | 1024
Accuracy (%) | 64.08 | 64.59 | 65.14 | 61.52

Table 4.4: QRC Net's performance on Flickr30K Entities for different dimensions m of v_i^q.

Metric. Same as [66], we adopt accuracy as the evaluation metric, defined as the ratio of phrases for which the regressed box overlaps the mentioned object by more than 50% IoU.

Compared approaches. We choose GroundeR [66], CCA embedding [60], MCB [14], Structured Matching [84] and SCRC [27] for comparison, all of which achieve leading performances in image grounding. For GroundeR [66], we compare with its supervised learning scenario, which achieves the best performance among its different scenarios.

4.3.3 Performance on Flickr30K Entities

Comparison in accuracy. We first evaluate QRN's performance based on different independent proposal generation systems. As shown in Table 4.1, by adopting QRN, RPN+QRN achieves a 14.35% increase compared to RPN+GroundeR. We further improve QRN's performance by adopting the Selective Search (SS) proposal generator. Compared to SS+GroundeR, we achieve an 8.18% increase in accuracy. We then incorporate our own PGN into the framework, which is jointly optimized to generate proposals as well as features (VGG_pgn). By adopting PGN, PGN+QRN achieves a 4.22% increase in accuracy compared to the independent proposal generation system (SS+QRN). Finally, we include CPN to guide QRN in selecting more discriminative proposals during training. The full model (QRC Net) achieves a 4.93% increase compared to PGN+QRN, and a 14.25% increase over the state-of-the-art CCA embedding [60].

Detailed comparison. Table 4.6 provides the detailed phrase localization results based on the phrase type information for each query in Flickr30K Entities. We can observe that QRC Net provides consistently superior results. The CCA embedding [60] model is good at localizing "instruments" while GroundeR [66] is strong in localizing "scenes". By using QRN, we observe that the regression network achieves a consistent increase in accuracy compared to the GroundeR model (VGG_det) for all phrase types except the class "instruments". Notably, there is a large increase in the performance of localizing "animals" (an increase of 11.39%). By using PGN, we observe that PGN+QRN surpasses the state-of-the-art method in all classes, with the largest increase in the class "instruments". Finally, by applying CPN, QRC Net achieves more than 8.03%, 9.37% and 8.94% increases in accuracy across all categories compared to CCA embedding [60], Structured Matching [84] and GroundeR [66] respectively. QRC Net achieves its maximum increases in performance of 15.73% over CCA embedding [60] ("scene"), 32.90% over Structured Matching [84] ("scene") and 21.46% over GroundeR [66] ("clothing").

Reward ρ | 0.1 | 0.2 | 0.4 | 0.8
Accuracy (%) | 64.10 | 65.14 | 63.88 | 62.77

Table 4.5: QRC Net's performance on Flickr30K Entities for different reward values ρ of CPN.

Phrase Type | people | clothing | body parts | animals | vehicles | instruments | scene | other
GrdeR (V_cls) [66] | 53.80 | 34.04 | 7.27 | 49.23 | 58.75 | 22.84 | 52.07 | 24.13
GrdeR (V_det) [66] | 61.00 | 38.12 | 10.33 | 62.55 | 68.75 | 36.42 | 58.18 | 29.08
Struct Match [84] | 57.89 | 34.61 | 15.87 | 55.98 | 52.25 | 23.46 | 34.22 | 26.23
CCA [60] | 64.73 | 46.88 | 17.21 | 65.83 | 68.75 | 37.65 | 51.39 | 31.77
SS+QRN | 68.24 | 47.98 | 20.11 | 73.94 | 73.66 | 29.34 | 66.00 | 38.32
PGN+QRN | 75.08 | 55.90 | 20.27 | 73.36 | 68.95 | 45.68 | 65.27 | 38.80
QRC Net | 76.32 | 59.58 | 25.24 | 80.50 | 78.25 | 50.62 | 67.12 | 43.60

Table 4.6: Phrase grounding performance for the phrase types defined in Flickr30K Entities. Accuracy is in percentage.

Proposal generation comparison. We observe that the proposals' quality plays an important role in the final grounding performance. The influence has two aspects.
First is the Upper Bound Performance (UBP), which is defined as the ratio of ground truth objects covered by the generated proposals. Without a regression mechanism, UBP directly determines the performance limit of grounding systems. Another aspect is the average number of surrounding Bounding boxes Per Ground truth object (BPG). Generally, when BPG increases, more candidates are considered as positive, which reduces the difficulty for the subsequent grounding system. To evaluate UBP and BPG, we consider that a proposal covers the ground truth object when its IoU > 0.5. The statistics for RPN, SS and PGN in these two aspects are provided in Table 4.2. We observe that PGN achieves an increase in both UBP and BPG, which indicates that PGN provides high quality proposals for QRN and CPN. Moreover, since QRN adopts a regression-based method, it can surpass the UBP of PGN, which further relieves the influence of the proposal generation system's UBP.

Hyperparameters. We evaluate QRC Net for different sets of hyperparameters. To evaluate one hyperparameter, we fix the other hyperparameters to the default values in Sec. 4.3.2. We first evaluate QRC Net's performance for different regression loss weights λ. The results are shown in Table 4.3. We observe that the performance of QRC Net fluctuates when λ is small and decreases when λ becomes large. We then evaluate QRC Net's performance for different dimensions m of the multimodal features in Eq. 4.3. The performances are presented in Table 4.4. We observe that QRC Net's performance fluctuates when m < 1000. When m becomes large, the performance of QRC Net decreases. These changes are on a small scale, which shows the insensitivity of QRC Net to these hyperparameters. Finally, we evaluate different reward values ρ for proposals covering objects mentioned by context. We observe that QRC Net's performance fluctuates when ρ < 0.5. When ρ is close to 1.0, the CPN assigns almost the same rewards to proposals covering ground truth objects and to those covering context-mentioned objects, which confuses the QRN. As a result, the performance of QRC Net decreases.

Approach | Accuracy (%)
Compared approaches
SCRC [27] | 17.93
EB+GroundeR (VGG_cls-SPAT) [66] | 26.93
Our approaches
EB+QRN (VGG_cls-SPAT) | 32.21
PGN+QRN (VGG_pgn-SPAT) | 43.57
QRC Net (VGG_pgn-SPAT) | 44.07

Table 4.7: Different models' performance on the Referit Game dataset.

Weight λ | 0.5 | 1.0 | 2.0 | 4.0 | 10.0
Accuracy (%) | 43.71 | 44.07 | 43.61 | 43.60 | 42.75

Table 4.8: QRC Net's performance on Referit Game for different weights λ of L_reg.

4.3.4 Performance on Referit Game

Comparison in accuracy. To evaluate QRN's effectiveness, we first adopt an independent EdgeBoxes [95] (EB) proposal generator, the same as [66]. As shown in Table 4.7, by applying QRN, we achieve a 5.28% improvement compared to the EB+GroundeR model. We further incorporate PGN into the framework. The PGN+QRN model brings an 11.36% increase in accuracy, which shows the high quality of the proposals produced by PGN. Finally, we evaluate the full QRC Net model. Since the Referit Game dataset only contains independent query phrases, there is no context information available. In this case, only the first term in Eq. 4.9 guides the learning. Thus, CPN does not contribute much to performance (a 0.50% increase in accuracy).

Hyperparameters. We evaluate QRC Net's performance for different hyperparameters on the Referit Game dataset. First, we evaluate QRC Net's performance for different weights λ of the regression loss L_reg. As shown in Table 4.8, the performance of QRC Net fluctuates when λ is small.
When the weight becomes large, the regression loss outweighs the classification loss, so a wrong seed proposal may be selected, which produces wrong grounding results. Thus, the performance decreases. We then evaluate QRC Net's performance for different multimodal dimensions m of v_i^q in Eq. 4.3. In Table 4.9, we observe that the performance changes only on a small scale when m < 1000, and decreases when m > 1000.

Dimension m  | 128   | 256   | 512   | 1024
Accuracy (%) | 42.95 | 43.80 | 44.07 | 43.51
Table 4.9: QRC Net's performance on Referit Game for different dimensions of v_i^q.

Figure 4.3: Some phrase grounding results in Flickr30K Entities [60] (first two rows) and Referit Game [30] (third row). We visualize the ground truth bounding box, the selected proposal box and the regressed bounding box in blue, green and red respectively. When a query is not clear without further context information, QRC Net may ground wrong objects (e.g., image in row three, column four).

4.3.5 Qualitative Results

We visualize some phrase grounding results on Flickr30K Entities and Referit Game for qualitative evaluation (Fig. 4.3). For Flickr30K Entities, we show an image with its associated caption and highlight the query phrases in it. For each query, we visualize the ground truth box, the proposal box selected by QRC Net and the regressed bounding box based on the regression parameters predicted by QRN. Since there is no context information in Referit Game, we visualize the query and ground truth box, with the selected proposal and regressed box predicted by QRN. As shown in Fig. 4.3, QRC Net is strong in recognizing different people ("A young tennis player" in the first row) and clothes ("purple beanie" in the second row), which is also validated in Table 4.6. However, when a query is ambiguous without further context description, QRC Net may be confused and produce reasonably incorrect grounding results (e.g., "hat on the right" in the third row of Fig. 4.3).

4.4 Conclusion

We proposed a novel Query-guided Regression network with Context policy (QRC Net) to address the phrase grounding task. QRC Net not only relieves the performance limitation brought by the proposal generation system, but also leverages context information to further boost performance. Experiments show that QRC Net provides a significant improvement in performance compared to the state-of-the-art, with 14.25% and 17.14% increases in accuracy on the Flickr30K Entities [60] and Referit Game [30] datasets respectively.

Chapter 5
Exploring Phrase Interrelationships and Context for Phrase Grounding

5.1 Introduction

Traditional image classification and detection tasks involve assigning labels to images or bounding boxes from a predefined set of nouns. Phrase Grounding (or Phrase Localization), which localizes visual entities in an image given a natural language phrase as a query, is the logical next step.
With the advent of deep learning architectures that reliably encode the visual [23] [70] and natural language [25] modalities, vision-language systems can handle complex natural language phrases for tasks such as Image Captioning [90], image-text co-reference resolution [67] and human-computer interaction [91]. Phrase Grounding can be an important intermediate step for these tasks. However, grounding faces several challenges, such as generalizing the visual models from limited data, resolving semantic ambiguities from an open-ended vocabulary and localizing small, hard-to-detect visual entities.

Figure 5.1: Overview of the PIRC Net framework, illustrated on the caption "A baby cries while another baby holds a pacifier" and showing the Proposal Indexing Network (PIN), the Inter-phrase Regression Network (IRN) and the Proposal Ranking Network (PRN).

Current approaches ([66], [27], [60], [5]) frame grounding as a retrieval problem in which candidate region proposals are ranked based on their similarity to the query phrase. These approaches learn a multimodal subspace between the visual and language modalities using techniques such as knowledge transfer from image captioning [66], Canonical Correlation Analysis [60] and latent attention [66] [5]. Further, visual [53] [91] and semantic [59] [4] context has been employed to resolve ambiguity and conflicting predictions. Recently, impressive results have been obtained by [5] by modeling conflict resolution with reinforcement learning techniques.

The performance of existing approaches is typically bound by the success of an independent proposal generator [82] [95]. To address this issue, [5] suggest a query-guided regression network that regresses the proposals based on the query phrase. However, regression is not effective when there is no proposal in the local neighborhood of the visual entity. These approaches are also agnostic to the underlying semantics of the region proposals. This results in a higher number of proposals needed per query to account for variance in the query phrase and, hence, in inter-class errors among visually similar entities. Further, the rich information provided by the interrelations of the query phrases for the same image and by the relative locations of different visual entities in the image has not been fully explored. The visual and semantic context provided by the interrelationships of the neighboring phrases can be utilized to reduce ambiguity, predict visual entity locations and rank visual entities during prediction.

Our objective is to incorporate various semantic and contextual cues for localizing the visual entity described by the query phrase. To this end, we propose a framework that prunes, augments and finally selects the most correlated region proposal for the query phrase. An overview of our framework is presented in Figure 5.1. The Proposal Indexing Network (PIN) aims to reduce inter-class errors using a coarse-to-fine indexing of region proposals. Based on the query phrase category, the proposals are pruned and shortlisted for more detailed analysis.
The Inter-phrase Regression Network (IRN) accounts for cases where there is no corresponding positive region proposal and predicts the location of a query phrase based on its inter-relationships with its neighboring phrases. These sets of proposals are then ranked by a Proposal Ranking Network (PRN) that uses a discriminative correlation objective to score the proposals by incorporating both visual and semantic context. We present techniques to train our framework where the location of the ground truth bounding box of a query phrase is given during training; PIN is trained to predict region proposals closer to the ground truth. We evaluate our framework on two popular phrase grounding datasets: Flickr30K Entities [60] and Refer-it Game [30]. Experiments show that our framework outperforms the existing state-of-the-art on both datasets, achieving 6%/8% and 10%/15% improvements using VGGNet/ResNet architectures on the Flickr30K and Referit datasets respectively, in the supervised setting.

Our contributions are: (a) We propose a query-guided Proposal Indexing Network that reduces inter-class errors for grounding. (b) We introduce Inter-phrase Regression and Proposal Ranking Networks that leverage the context provided by neighboring phrases in a caption.

Figure 5.2: Architecture of the PIRC Net framework showing its three constituent modules: Proposal Indexing Network (PIN), Inter-phrase Regression Network (IRN) and Proposal Ranking Network (PRN), illustrated on the caption "A standing man is playing guitar and woman is playing drums with a child next to keyboard". PIN generates and classifies proposals into coarse phrase categories and indexes them into candidate proposals per query phrase. IRN augments the proposals by estimating the locations of neighboring phrases through relation-guided regression. PRN ranks the augmented proposals and assigns scores considering their location with respect to neighbors, their contrast with respect to other candidates and the semantic context.

5.2 Our network

In this section, we present the architecture of our PIRC Net framework (Figure 5.2). First, we provide an overview of the entire framework, followed by detailed descriptions of each of the three subnetworks: the Proposal Indexing Network (PIN), the Inter-phrase Regression Network (IRN) and the Proposal Ranking Network (PRN).

5.2.1 Framework Overview

Given an image I and query phrases p_i ∈ {C}, the goal of our system is to predict the locations of the visual entities specified by the queries. To this end, PIN is trained to generate a set of region proposals v_k ∈ {V_T} and select a subset of these region proposals {V_S^i} ⊂ {V_T} using coarse-to-fine indexing based on the semantics of query phrase p_i. To account for missing positive proposals, IRN augments the region proposals, {V_Sa^i} ⊇ {V_S^i}, for each query phrase by predicting its location using its relationship r_i with its neighboring phrases Np_i. Finally, PRN uses context-incorporated contrastive features to rank (s_i) the augmented proposals {V_Sa^i} and choose the proposal v_max that is most relevant to the query phrase.
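The following is a minimal, high-level sketch of this pipeline in code form; it is meant only to make the data flow concrete, and the module interfaces (pin_index, irn_augment, prn_rank) are illustrative placeholders rather than our actual implementation.

def ground_phrase(image, phrase, neighbor_phrases, pin_index, irn_augment, prn_rank):
    # PIN: generate proposals and keep candidates matching the phrase category.
    candidates = pin_index(image, phrase)
    # IRN: augment candidates with boxes regressed from neighboring phrases.
    augmented = candidates + irn_augment(image, phrase, neighbor_phrases)
    # PRN: score every box with contextual, contrastive features; return the best one.
    scores = prn_rank(image, phrase, neighbor_phrases, augmented)
    return max(zip(augmented, scores), key=lambda pair: pair[1])[0]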
5.2.2 Proposal Indexing Network (PIN)

The goal of PIN is to generate region proposals and then retrieve a subset of those region proposals {V_S} using coarse-to-fine indexing based on the semantics of the query phrase. The retrieved set should contain region proposals that belong to the query phrase category and have high correlation with the query phrase.

We fine-tune an RPN [64] pretrained on an object detection task to generate region proposals {V_T} instead of object proposals, similar to [5]. Further, we coarsely classify the region proposals into a fixed number of phrase categories C_j, j ∈ [1, N]. For this, we encode the query phrases of the training images using skip-thought vectors [34] and cluster them into N fixed categories with cluster centers mu_j, j ∈ [1, N]. Skip-thought vectors [34] employ a data-driven encoder-decoder framework to embed semantically similar sentences into similar vector representations. Given a query phrase p_i, it is embedded into a skip-thought vector ep_i and then categorized as follows:

p_i ∈ C_j  if  ep_i · mu_j > ep_i · mu_J,  for all J ≠ j, J ∈ [1, N]

The visual features {F_{V_T}} of the region proposals are generated through an RoI pooling operation [16] followed by a fully connected network. A region classification layer followed by a softmax operation generates the probability of a region proposal belonging to each phrase category. For each query phrase, we retrieve only the top K region proposals with high correlation to its phrase category C_j for further processing. This indexing reduces the number of region proposals {V_I^i, i ∈ [1, K]} ⊂ {V_T} needed per query phrase by retaining only relevant proposals.

To reduce the search space for IRN and PRN, these proposals are further processed for finer indexing. We use a finer indexing strategy similar to [5] to obtain an estimate of the candidate proposals {V_S^i, i ∈ [1, L]} from the K region proposals for a query phrase. For a query phrase p_i, we employ an LSTM to encode it as an embedding vector lp_i. The visual features {F_{V_I^i}} of each indexed region proposal {V_I} are concatenated with the query phrase embedding lp_i, and these multimodal features are projected through an FC network to obtain a 5-dimensional prediction vector {s_i}. The first element of the vector is the similarity score {s_i[0]} between the proposal and the query embedding, and the next four elements are the regression parameters {s_i[1:4]}. During training, given the ground truth bounding box g_{p_i}, indexed region proposals {V_I^i} are considered positive (v_p^i) if they have IoU > 0.5 with the ground truth. The loss function for this indexing network is a combination of the classification and regression losses below:

L_cls({V_I}, g_{p_i}) = -log(s_p^i[0])

L_reg({V_I}, g_{p_i}) = (1 / 4I) Σ_{j=1}^{I} Σ_{k=1}^{4} f( ‖s_i^j[k]‖ - ‖s_g^j[k]‖ )

where s_g^i are the regression parameters of the indexed proposals relative to the ground truth and f is a smooth-L1 loss function. The top L region proposals with the highest similarity scores {s_i[0]} to the query phrase are chosen as candidate region proposals {V_S^i} for further inspection.

While the attention-based classification objective described above achieves reasonable performance for identifying the related proposals, it does not incorporate any visual or semantic context from the neighboring phrases. IRN and PRN further analyze these relationships to improve the ranking among the candidate region proposals.
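Before moving on, here is a minimal PyTorch-style sketch of the finer indexing head described above, which fuses a proposal feature with the LSTM phrase embedding and predicts a similarity score plus four regression offsets; the layer sizes and names are assumptions for illustration, not the exact configuration used in PIRC Net.

import torch
import torch.nn as nn

class FineIndexingHead(nn.Module):
    # Scores each proposal against one query phrase and predicts box offsets.
    def __init__(self, visual_dim=4096, phrase_dim=1024, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + phrase_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 5),  # [similarity logit, dx, dy, dw, dh]
        )

    def forward(self, proposal_feats, phrase_emb):
        # proposal_feats: (K, visual_dim); phrase_emb: (phrase_dim,)
        phrase = phrase_emb.unsqueeze(0).expand(proposal_feats.size(0), -1)
        out = self.mlp(torch.cat([proposal_feats, phrase], dim=1))
        similarity, box_deltas = out[:, 0], out[:, 1:]
        return similarity, box_deltas

# Training combines a log-loss on the positive proposal's similarity score with a
# smooth-L1 loss between the predicted and ground-truth regression parameters.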
5.2.3 Inter-phrase Regression Network (IRN)

The Inter-phrase Regression Network takes advantage of the relationships among neighboring phrases to estimate the relative location of a target phrase from a source phrase. Given a phrase tuple {p_s, r_so, p_o}, IRN estimates the regression parameters to predict the location of the object phrase p_o given the location of the subject phrase p_s, and vice versa. To model such a relation, the feature representation of the source phrase Np_i must encode not only its visual appearance {F_{V_S^{Np_i}}} but also its spatial configuration and its relationship dynamics with the target phrase p_i. For example, the relationship dynamics of 'person-in-clothes' are different from those of 'person-in-vehicle' and depend on where the 'person' is. To encode the spatial configuration, we employ a 5-D vector {lV_S^{Np_i}} defined as

[ x_min / W,  y_min / H,  x_max / W,  y_max / H,  (box area) / (W · H) ]

where (W, H) are the width and height of the image respectively. To encode the relationship dynamics between the phrases, we use the phrase categories of the source and target phrases embedded as a one-hot vector {rV_S^{Np_i}} ∈ R^{N×N}. The relation r_i between the two phrases is encoded using an LSTM as er_i, and the concatenated feature is projected using a fully connected layer to obtain the regression parameters rp_i ∈ R^4 for the target phrase location:

rp_i = phi( W_rl (fv_k || lv_k || rv_k || er_i) + b_rl ),  v_k ∈ {V_S^{Np_i}}

where '||' denotes the concatenation operator, p_i is the query phrase whose regression parameters are predicted, {V_S^{Np_i}} is the set of indexed visual proposals of the neighboring phrases Np_i of query phrase p_i, W_rl and b_rl are projection parameters, and phi is a non-linear activation function.

During training, the regression loss L_rlReg for the predicted regression parameters rp_i is calculated as

L_rlReg(rp_i, g_{p_i}) = Σ_{k=1}^{4} f( ‖rp_i[k]‖ - ‖g_{p_i}[k]‖ )

given the ground truth regression parameters g_{p_i} calculated from the ground truth location g of the target phrase p_i. The estimated proposals {rV_S^i} from both neighboring proposal candidates (subject and object) are added to the indexed proposal set, {V_Sa^i} = {V_S^i} ∪ {rV_S^i}. IRN is useful when the query phrase does not have a positive region proposal in the indexed proposal set. It is also helpful for generating proposals for smaller objects, which are often missed by proposal generators. The augmented region proposal set {V_Sa^i} is next passed to the Proposal Ranking Network.
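A minimal sketch of this relation-guided regression, assuming PyTorch and illustrative feature dimensions (the concatenation of appearance, spatial, category and relation features follows the description above; the exact sizes are not taken from our implementation):

import torch
import torch.nn as nn

class InterPhraseRegressor(nn.Module):
    # Predicts 4 box-regression parameters for a target phrase from a neighbor proposal.
    def __init__(self, visual_dim=4096, spatial_dim=5, category_dim=8 * 8, relation_dim=512):
        super().__init__()
        in_dim = visual_dim + spatial_dim + category_dim + relation_dim
        self.fc = nn.Linear(in_dim, 4)

    def forward(self, neighbor_feat, spatial_feat, category_onehot, relation_emb):
        fused = torch.cat([neighbor_feat, spatial_feat, category_onehot, relation_emb], dim=-1)
        return torch.tanh(self.fc(fused))  # (dx, dy, dw, dh) for the target phrase

# The predicted offsets are trained with a smooth-L1 loss against the ground-truth
# regression parameters, and the decoded boxes are added to the candidate set.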
5.2.4 Proposal Ranking Network (PRN)

The Proposal Ranking Network (PRN) is designed to incorporate visual and semantic context while ranking the region proposals. PRN employs a bimodal network to generate a confidence score for each region proposal using a discriminative correlation metric. For the visual modality, to differentiate among region proposals from the same phrase category, we employ contrastive visual features {cfV_Sa^i}. Following [69], we encode the visual disparities among the candidate region proposals {V_Sa^i} by average pooling the difference between the visual features of the target proposal and those of the rest of the proposals in the augmented proposal set:

cfv_k = Σ_{l ≠ k} (fv_k - fv_l) / ‖fv_k - fv_l‖,  v_k ∈ {V_Sa^i}

For the location and size feature representation {lV_Sa^i}, we encode the relative position and size of the target proposal with respect to the closest proposal among the candidate proposals of the neighboring phrases {V_Sa^{N_i}}. This representation helps capture the dynamics of the proposals with respect to their neighboring phrases and is especially helpful for relative references. The feature is encoded as a 5-D vector

[ (xc_k - xc_{Nk}) / w_k,  (yc_k - yc_{Nk}) / h_k,  (w_k - w_{Nk}) / w_k,  (h_k - h_{Nk}) / h_k,  (w_{Nk} · h_{Nk}) / (w_k · h_k) ]

using the relative distance between centers, relative width, relative height and relative size of the target proposal v_k ∈ {V_Sa^i} and its closest candidate proposal among the neighboring phrases, v_{Nk} ∈ {V_Sa^{N_i}}, respectively. The final visual representation is the concatenation of all the above representations. To encode the text modality, we use an HGLMM Fisher vector encoding [35] fp_i of the embeddings of all the words in the query phrase p_i. For context, we concatenate this feature with the Fisher vector encoding fC of the entire caption C.

Rv_k = phi( W_V (fv_k || cfv_k || lv_k) + b_V )
Rp_i = phi( W_T (fp_i || fC) + b_T )

To compute the cross-modal similarity, the textual representation is first projected into the same dimensionality as the visual representation. Then, the discriminative confidence score psi(v_k, p_i) is computed by accounting for the bias between the two modalities as follows:

Vp_i = phi( W_P (Rp_i) ),  bp_i = phi( b_P (Rp_i) )
psi(v_k, p_i) = Vp_i · Rv_k + bp_i

To learn the projection weights and biases of both modalities during training, we employ a max-margin ranking loss L_rnk that assigns higher scores to positive proposals. To account for multiple positive proposals in the indexed set, we experiment with both maximum and average pooling to obtain a representative positive score from the candidate proposals; in our experiments, the maximum pooling operator performed better. The ranking loss is formulated as

L_rnk = Σ_{N_k} max[ 0, gamma + psi(v_k^N, p_i) - max( psi(v_k^P, p_i) ) ]

The loss implies that the score of the highest-scoring positive proposal v_k^P ∈ {V_Sa^i} should be greater than that of each negative proposal v_k^N ∈ {V_Sa^i} by a margin gamma.

5.2.5 Training and Inference

The proposal generator of PIN is pre-trained using the RPN [64] architecture on the PASCAL VOC [11] dataset. The fully connected network of PIN is alternately optimized with the RPN to index the proposals. The finer-indexing network of PIN is trained independently for 30 epochs with a learning rate of 1e-3 and then trained jointly with the IRN and PRN for a further 30 epochs, with the learning rate reduced by a factor of 10 every 10 epochs. During testing, the region proposal with the highest score from PRN is chosen as the prediction for a query phrase.
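To make the bimodal scoring of Sec. 5.2.4 concrete, here is a minimal PyTorch-style sketch of the correlation score and the max-margin ranking loss; the dimensions, margin value and module names are illustrative assumptions rather than the exact PIRC Net configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BimodalScorer(nn.Module):
    def __init__(self, visual_dim=4096 + 4096 + 5, text_dim=12000, joint_dim=4096):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, joint_dim)   # R_v
        self.text_proj = nn.Linear(text_dim, joint_dim)       # R_p
        self.weight_head = nn.Linear(joint_dim, joint_dim)    # query-conditioned weights
        self.bias_head = nn.Linear(joint_dim, 1)              # query-conditioned bias

    def forward(self, visual_feats, text_feat):
        rv = F.relu(self.visual_proj(visual_feats))           # (num_proposals, joint_dim)
        rp = F.relu(self.text_proj(text_feat))                # (joint_dim,)
        vp, bp = self.weight_head(rp), self.bias_head(rp)
        return rv @ vp + bp                                   # one score per proposal

def ranking_loss(scores, is_positive, margin=0.1):
    # Hinge loss: the best positive proposal should beat every negative by `margin`.
    best_pos = scores[is_positive].max()
    neg = scores[~is_positive]
    return torch.clamp(margin + neg - best_pos, min=0).sum()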
5.3 Experiments and Results

We evaluate our framework on the Flickr30K Entities [61] and Referit Game [30] datasets for the phrase grounding task.

5.3.1 Datasets

Flickr30k Entities. We use a standard split of 30,783 (training + validation) images for training and 1,000 images for testing. Each image is annotated with 5 captions. A total of 360K query phrases are extracted from these captions, and each phrase is assigned one of eight pre-defined phrase categories, referring to 276K manually annotated bounding boxes in the images. We treat the connecting words between two phrases in a caption as a 'relation' and only use relations occurring more than 5 times for training IRN.

ReferIt Game. We use a standard split of 9,977 (training + validation) images for training and 9,974 images for testing. A total of 130K query phrases are annotated, referring to 96K distinct objects. Unlike Flickr30K, the query phrases are not extracted from a caption and do not come with an associated phrase category. Hence, we skip training IRN for ReferIt, and the relative location features of PRN are extracted from the rest of the candidate proposals of the query phrase instead of from neighboring phrases.

5.3.2 Experimental Setup

Proposal Indexing Network (PIN). An RPN pre-trained on PASCAL VOC 2007 [11] is fine-tuned on the respective datasets for proposal generation. The fully connected network for classification consists of 3 successive fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with dropout probability 0.5. The predefined phrase categories are used as target classes for Flickr30k [61]. For ReferIt Game [30], 20 cluster centers obtained by clustering the training query phrases are used as target classes. Vectors from the last FC layer are used as the visual representation of each proposal. For finer indexing, the hidden size and dimension of the bi-LSTM are set to 1024.

Inter-phrase Regression Network (IRN). Since the query phrases of ReferIt Game are annotated individually, IRN is only applicable to the Flickr30k dataset. The visual features from PIN are concatenated with the 5-D spatial representation (for regression) and an 8×8 one-hot embedded vector for the source and target phrase categories to generate the representation vector. Both the left and right neighboring phrases, if available, are used for regression prediction.

Proposal Ranking Network (PRN). For the visual stream, the visual features from PIN are augmented with contrastive visual features and 5-D relative location features, generating an augmented feature vector. For the text stream, 6000-D features are generated for both the query phrase and the corresponding caption by HGLMM Fisher vector encoding [35] of word2vec embeddings, generating a 12000-D feature vector. Each stream has 2 fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with probability 0.5. The intermediate and output dimensions of the visual and text streams are [8192, 4096].

Network Parameters. All convolutional and fully connected layers are initialized using MSRA and Xavier initialization respectively. All features are L2 normalized, and batch normalization is employed before similarity computation. The training batch size is set to 40 and the learning rate is 1e-3 for both Flickr30k and ReferIt. The VGG architecture [70] is used for PIN to be comparable with existing approaches. Further, results are provided with the ResNet [23] architecture to establish a new state-of-the-art with improved visual representation. In the experiments, we use VGG Net for comparison with other methods and ResNet for ablation studies. We set L to 10 and 20 for VGG Net and ResNet respectively.

Evaluation Metrics. Accuracy is adopted as the evaluation metric; a predicted proposal is considered positive if it has an overlap of > 0.5 IoU with the ground-truth location. To evaluate the efficiency of the indexing network, Top 3 and Top 5 accuracy are also presented.

Approach                 | Accuracy (%)
Compared approaches:
SCRC [27]                | 27.80
Structured Matching [84] | 42.08
GroundeR [66]            | 47.81
MCB [14]                 | 48.69
CCA embedding [60]       | 50.89
SPPC [59]                | 55.85
MSRC [4]                 | 57.53
QRC [5]                  | 65.14
Our approaches:
PIN (VGG Net)            | 66.27
PIN + IRN (VGG Net)      | 70.17
PIN + PRN (VGG Net)      | 70.97
PIRC Net (VGG Net)       | 71.16
PIN (Res Net)            | 69.37
PIN + IRN (Res Net)      | 71.42
PIN + PRN (Res Net)      | 72.27
PIRC Net (Res Net)       | 72.83
Table 5.1: Relative performance of our approach on the Flickr30k dataset. Different combinations of our modules are evaluated.

Retrieval Rate | Top 1 | Top 3 | Top 5
QRC* [5]       | 65.14 | 73.70 | 78.47
PIRC Net       | 66.27 | 79.80 | 83.46
Proposal Limit |       |       | 83.12
Table 5.2: Performance evaluation of the efficiency of PIN on Flickr30k Entities.
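For ReferIt Game, the phrase categories used as PIN's target classes are obtained by clustering embeddings of the training query phrases. A minimal sketch of this step is given below, assuming a generic sentence-embedding function in place of the skip-thought encoder and using scikit-learn's KMeans; this is illustrative, not our exact pipeline.

import numpy as np
from sklearn.cluster import KMeans

def build_phrase_categories(train_phrases, embed_fn, num_categories=20):
    # Cluster query-phrase embeddings into coarse phrase categories.
    embeddings = np.stack([embed_fn(p) for p in train_phrases])
    return KMeans(n_clusters=num_categories, random_state=0).fit(embeddings)

def categorize(phrase, embed_fn, kmeans):
    # Assign a query phrase to its nearest cluster (phrase category).
    return int(kmeans.predict(embed_fn(phrase)[None, :])[0])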
Figure 5.3: Grounding performance per phrase category for different methods on Flickr30K Entities.

5.3.3 Results on Flickr30k Entities

Performance. We compare the performance of PIRC Net to other existing approaches on the Flickr30k dataset. As shown in Table 5.1, PIN alone achieves a 1.13% improvement while using a fraction of the proposals used by existing approaches. Adding IRN with the ranking objective, we achieve a 5.05% increase in accuracy. Adding PRN, which incorporates context to rank the proposals, along with PIN, we achieve a 5.83% improvement. Our overall framework achieves a 6.02% improvement over QRC, the existing state-of-the-art approach. Further, employing the ResNet architecture for PIN gives an additional 1.67% improvement.

Ablation Studies: PIN and PRN.

Proposal Indexing Network (PIN): The effectiveness of PIN is measured by its ability to shortlist the candidate proposals for a query phrase. For this, we measure the accuracy of the indexed proposals at Top 1, Top 3 and Top 5 ranks. We compare the results to QRC, the current state-of-the-art, in Table 5.2. The upper bound, i.e., the maximum accuracy possible with the RPN, is given in the last row. PIN consistently outperforms QRC [5] in Top 3 and Top 5 retrieval, showcasing its effectiveness in indexing the proposals. Our indexed proposals also perform slightly better than the upper bound at Top 5 by regressing proposals to positive locations.

Proposal Ranking Network (PRN): Per-category performance for the eight different phrase types of Flickr30K Entities is presented in Figure 5.3.

Hyperparameters. Apart from the standard parameters of deep networks such as learning rate and initialization, PIRC Net has 3 important hyperparameters, listed below:
(a) PIN (stage 1): the number of proposals K retained for a phrase category after PIN, stage 1.
(b) PIN (stage 2): the number of proposals L retained per query after PIN, stage 2.
(c) IRN: the number of neighboring proposals Nk used to generate proposals for IRN.
We discuss the effect of each of these hyperparameters in this section.

Figure 5.4: Qualitative results on the test sets of Flickr30K Entities (top row) and ReferIt Game (middle row). The last row shows failure cases from both datasets. Green: correct prediction, Red: wrong prediction, Blue: ground truth.

Proposals per phrase category (K): We evaluate the performance of PIRC Net for different values of K, i.e., the cardinality of the coarsely indexed region proposal set. The results are shown in Table 5.4. We observe that the performance of PIRC Net peaks at around K = 50 and then decreases as K becomes large.
This is in line with our observation that a higher number of proposals can increase the inter-class errors among classes.

Proposals per query (L): As seen in Table 5.2, PIN indexing retrieves most of the positive proposals within the top 5 indexing results. We observe that during testing, adding proposals beyond L = 5 makes no significant difference in performance. During training, we choose three times the number of negative proposals for each positive proposal when training PRN; for example, if we retain L = 20 during training, of which 5 proposals are positive, we use the remaining proposals as negative samples. We also observed that ResNet retrieves, on average, a higher number of positive proposals per query than VGG Net and hence requires a higher L during training.

Neighbor proposals per query (Nk): The IRN module employs the neighboring phrases and their relation to the query phrase to generate proposals. We evaluate the effect of the number of neighbor proposals used to generate proposals. Each query phrase can have two neighbor phrases: the phrase that precedes it (subject phrase) and the phrase that succeeds it (object phrase). We evaluate the performance of either, if applicable, i.e., estimating the location from the subject phrase (IRN-subject) and from the object phrase (IRN-object), in Table 5.5. IRN's decision is considered accurate if one of the Nk proposals contains a positive proposal for the query phrase. In our experiments, we found that Nk = 3 gives reasonable performance while not increasing the complexity of the PRN module. Further, we evaluate the effect of adding the proposals from IRN to the PIN proposals. The results are shown in Table 5.6. We observe that adding proposals generated from both subject and object phrases not only increases the upper-bound accuracy for PRN but also significantly increases the number of positive proposals per query (PPQ).

Approach                        | Accuracy (%)
Compared approaches:
SCRC [27]                       | 17.93
GroundeR [66]                   | 26.93
MSRC [4]                        | 32.21
Context-aware RL [86]           | 36.18
QRC [5]                         | 44.07
Our approaches:
PIN (VGG Net)                   | 51.67
PIRC Net (PIN+PRN) (VGG Net)    | 54.32
PIN (Res Net)                   | 56.67
PIRC Net (PIN+PRN) (Res Net)    | 59.13
Table 5.3: Relative performance of our approach on the Referit Game dataset.

Proposals    | 10    | 20    | 30    | 50    | 100   | No Indexing
PIN (ResNet) | 67.55 | 68.47 | 68.15 | 69.37 | 68.09 | 65.10
Table 5.4: PIRC Net's performance on Flickr30K Entities for different values of K for indexing.

Grounding visualization on the Flickr30K dataset. We present a visualization of some of the results of PIRC Net on the Flickr30K dataset in Figure 5.5. We can see that PIRC Net is able to successfully detect objects from diverse categories (People: 5.5(a), Animal: 5.5(g)(l), Vehicle: 5.5(d), Clothes: 5.5(e), etc.) and varied types of queries (5.5(c)). Also, PIRC Net has improved performance in detecting smaller objects, owing to the efficiency of the IRN module in inferring their location (queries 5.5(b), 5.5(i)). Traditional proposals have lower accuracy in detecting smaller objects, and IRN is able to estimate their location using the locations of neighbors and their relations.
No. of Neighbor Proposals (Nk) | 1     | 5     | 10    | 20    | 30
Accuracy (IRN-subject)         | 23.55 | 36.63 | 42.52 | 47.75 | 51.05
Accuracy (IRN-object)          | 37.59 | 54.60 | 60.85 | 65.88 | 69.20
Table 5.5: IRN's performance for various values of Nk.

Figure 5.5: Visualization of some queries and the corresponding bounding boxes chosen by PIRC Net on the Flickr30K dataset: (a) An american footballer, (b) the ball, (c) the ref, (d) a bike, (e) flannel shirt, (f) a blue backpack, (g) black and white dog, (h) a stick, (i) its mouth, (j) a man, (k) cowboy hat, (l) brown horse, (m) wicker chair, (n) classes, (o) one arm, (p) A man, (q) drink, (r) red sweater. "Blue" is the ground-truth bounding box, "Green" is a positive result and "Red" is a negative result.

Method      | PIN   | PIN + IRN (subj) | PIN + IRN (obj) | PIN + IRN
Upper bound | 83.63 | 84.61            | 84.74           | 85.02
PPQ         | 2.35  | 2.92             | 3.17            | 3.4
Table 5.6: PIRC Net's performance in proposal indexing with and without various components of IRN.

Proposals    | 10    | 20    | 30    | 50    | 100   | No Indexing
PIN (ResNet) | 53.22 | 54.04 | 54.96 | 56.67 | 56.36 | 48.29
Table 5.7: PIRC Net's performance on ReferIt Game for different values of K.

5.3.4 Results on ReferIt Game

Performance. Table 5.3 shows the performance of PIRC Net compared to existing approaches. The PIN network alone gives a 7.6% improvement over state-of-the-art approaches. The diversity of objects in ReferIt Game is higher than that in the Flickr30k Entities dataset; hence, PIN's indexing operations reduce the inter-class errors among visually similar classes. Employing PRN in addition to PIN gives a 10.25% improvement over QRC. Using the ResNet architecture gives an additional 4.81% improvement, leading to a 15.06% improvement over the state-of-the-art.

Ablation Studies: PIN. Similar to Flickr30k, we perform ablation studies on the effectiveness of PIN. The results are presented in Table 5.8. PIN consistently performs better for Top 3 and Top 5 retrieval.

Hyperparameters. Proposals per phrase category (K): Similar to Flickr30K, we experiment with different values of K. The results are shown in Table 5.7. We observe that the performance of PIRC Net peaks at around K = 50 and then decreases as K becomes large. However, the decrease beyond the peak is less pronounced as the number of proposals grows; this is because the ReferIt dataset generally has a higher density of objects (e.g., crowded streets, dense scenery) than Flickr30K.

Proposals per query (L): As seen in Table 5.8, PIN indexing retrieves most of the positive proposals within the top 5 indexing results. Similar to Flickr30K, we observe that during testing, adding proposals beyond L = 5 makes no significant difference in performance. During training, we choose three times the number of negative proposals for each positive proposal when training PRN.

5.4 Grounding visualization on the ReferIt dataset

We present a visualization of some of the results of PIRC Net on the ReferIt dataset in Figure 5.6. Similar to Flickr30K, we can see that PIRC Net is able to successfully detect objects from diverse categories (People: 5.6(g)(h), Animal: 5.6(a), Scene: 5.5(d), Clothes: 5.5(e), etc.) and varied types of queries (5.6(c)(d)(e)).
Typically, the queries in the ReferIt dataset are longer than those in the Flickr30K dataset and often provide spatial references. While problems such as ambiguity and user-specific annotations still persist in this dataset (5.6(m)(n)(p)), some of the results can be improved by paying more attention to the important nouns and attributes in a phrase.

Retrieval Rate | Top 1 | Top 3 | Top 5
QRC* [5]       | 44.07 | 54.96 | 59.45
PIRC Net       | 51.67 | 68.49 | 73.69
Proposal Limit |       |       | 77.79
Table 5.8: Performance evaluation of the efficiency of the Proposal Indexing Network (PIN) on Referit Game.

5.4.1 Qualitative Results

We present qualitative results for a few samples from the Flickr30K Entities and ReferIt Game datasets (Fig. 5.4). For Flickr30K (top row), we show an image with its caption and five associated query phrases; we can see the importance of context in localizing the queries. For ReferIt (middle row), we show the query phrase and the associated results. The bottom row shows failure cases, which can be attributed to different reasons such as ambiguous queries, multiple possible answers, little margin of error for smaller objects and difficulties in query interpretation. We notice that the negative results mostly owe to the following factors.
(a) Localization of small objects is highly sensitive to the bounding box location. For example, in queries 5.5(i), 5.5(n) and 5.5(q), we can notice that while the results are not visually inconsistent with the ground-truth location, the small bounding box yields a low IoU score for the result.
(b) Bounding box annotations are user-specific and sometimes incorrect. A sample of the incorrect annotations is shown in 5.5(p). For 5.5(m), with the query 'wicker chair', we can see that while the ground-truth annotation points to the arm of the chair, the algorithm outputs the original location of the chair, which is occluded.
(c) Ambiguity, both visual and semantic. Sometimes the queries themselves are not self-contained and have a degree of ambiguity associated with them. While the PRN module tries to address this by encoding multiple queries together, the problem still persists in some cases, as shown in 5.5(o).

5.5 Conclusion

In this chapter, we addressed the problem of phrase grounding using PIRC Net, a framework that incorporates semantic and contextual cues to rank visual proposals. We show that incorporating these cues at different stages improves the performance of phrase grounding. Our framework outperforms other baselines for phrase grounding. Possible future work includes the integration of grounding modules into higher-level vision-language tasks.

Figure 5.6: Visualization of some queries and the corresponding bounding boxes chosen by PIRC Net on the ReferIt dataset: (a) llama, (b) cloud, (c) patch of blue on sky in left, (d) building, (e) mountains on the horizon, (f) brown grass bottom left, (g) group of people right side, (h) woman in stripes, (i) sleeping bag front left on ground, (j) building on the right, (k) road, (l) couple on left, (m) the plant on the left, (n) door to the right, (o) middle window, (p) bottom chair, (q) chair on left, (r) white hat on the kid turned backwards. "Blue" is the ground-truth bounding box, "Green" is a positive result and "Red" is an incorrect result.

Chapter 6
Modular Phrase Attention Network for bottom-up Phrase Grounding

6.1 Introduction

The previous two chapters presented state-of-the-art techniques for Phrase Grounding, i.e., the localization of regions in an image with bounding boxes given a natural language phrase as a query.
To achieve this, a grounding framework usually consists of two models, visual and language, to encode the input image regions and the query phrase respectively. For the visual model, object proposals are employed to encode the salient regions in the input image that could possibly be referred to by a query phrase. For the language model, an encoder such as an RNN or a Fisher vector is employed to generate a representation of the query phrase. Finally, the correlation between the object proposals and the query phrase is computed in a trained multimodal subspace. The proposal that has the highest correlation with the query phrase is chosen as the grounding for the query phrase.

For the visual model, some existing approaches employ an object detector pre-trained on object detection datasets such as PASCAL VOC [11] or MSCOCO [44]. Recent approaches improved the performance of the visual model by fine-tuning the detector on grounding datasets [37] [38]. Among these approaches, PIRC Net [38] proposes to improve the generalization of the visual appearance of objects from limited samples using coarse phrase categories. In all these cases, the discriminative power of the visual features is limited by the categories the object detection system is trained on. For the language model, the query phrases are composite noun phrases with the subject, or referred object, as the head noun of the phrase. The rest of the phrase describes one of the following: attributes, spatial reference or relative relationships. Existing phrase grounding approaches treat the query phrase as a singular expression without any emphasis on the distinctive information provided by the query.

For the visual model, PIRC Net employs the phrase categories associated with a phrase to train the categorization of proposals. This helps with generalization, since phrases like "a football player" and "an athlete" both fall into the same phrase category, "Person". However, this generalization is limited to only a few coarse categories and also causes intra-class errors. To avoid such errors, proposals need to be endowed with a higher level of semantic information. This work proposes to associate each proposal with the semantic label of its referred object, i.e., the head noun. For example, in Figure 6.1(a), the query phrase "football player in white jersey" refers to the head noun "player". Associating the proposals with a continuous semantic representation of the head nouns increases the generalization of the proposals. For a given head noun of the query phrase, the proposal generator categorizes proposals as either relevant or non-relevant to the head noun.

Figure 6.1: Overview of the MPA Net framework: (a) framework to retrieve relevant proposals for a query phrase; (b) framework for discriminative ranking of proposals for a query phrase.

For the language model, the lack of emphasis on distinctive words while encoding the query phrase can confuse the grounding system during inference. For example, in Figure 6.1(b), the query phrase "football player in white jersey" uses "white jersey" to disambiguate the referred entity from the other football player in the input image.
If the grounding system can identify such distinctive information (attributes, spatial location or relationships to other objects) from the query phrase, it can be leveraged to rank proposals that match this information higher than the others. To this end, we propose to learn per-word attention during query encoding to generate a distinctive query embedding that can aid in ranking the proposals.

Our overall objective for MPA Net is two-fold: (a) improve the semantic interpretability of proposals for better generalization of grounding, and (b) identify the distinctive information in each query phrase to rank the proposals more effectively. To this end, we propose a framework that generates proposals, retrieves the proposals relevant to the query phrase and ranks them based on their correlation to the distinctive information of the query. An overview of our framework is presented in Figure 6.1. For a given query phrase and an input image, an object detection system is trained to generate proposals for the input image. These proposals are then classified as relevant or non-relevant to the head noun of the query phrase. Attention weights are learned for each word in the query phrase, and a discriminative embedding is generated by computing a weighted mean of the words over these attention weights. Finally, the correlation between the relevant proposals and the discriminative embedding is computed to identify the proposal closest to the query phrase.

We evaluate our framework on two popular phrase grounding datasets: Flickr30K Entities [60] and Refer-it Game [30]. Experiments show our framework outperforms the existing state-of-the-art on both datasets, achieving 1% and 1% improvements using the ResNet [24] architecture on the Flickr30K and Referit datasets respectively, with a much simpler and more interpretable architecture. Our contributions are: (a) we propose an efficient method to improve the semantic interpretation of proposals for grounding; (b) we propose an efficient method to improve the distinctive ranking of the generated proposals.

Figure 6.2: Architecture of the MPA network showing its constituent modules. First, the region proposals generated by the RPN are classified as relevant or non-relevant to the head noun of the query phrase. Next, a discriminative embedding is generated for the query phrase using a weighted average over word attention. Finally, the correlation between each relevant proposal and the embedding is computed to rank the proposals.

6.2 Our network

In this section, we present the architecture of our MPA network (Figure 6.2). First, we provide an overview of the entire framework, followed by detailed descriptions of the different parts of the network.

6.2.1 Framework Overview

Given an input image I and a query phrase q, the goal of our system is to predict the location of the visual entity specified by the query. Our system has 3 components: a vision model, a language model and a multimodal subspace. The vision model takes the image I and the head noun h_q as input and generates the relevant proposals pR_i and their visual features fR_i for the query phrase. The language model encodes the query phrase q to generate per-word discriminative attention weights aw_k and a phrase encoding fq. A weighted mean over the word embeddings and their attention weights yields a discriminative embedding dq.
Finally, the score for each proposal is computed as a combination of a discriminative score and a global score, computed using the discriminative embedding and the phrase encoding respectively in the multimodal subspace.

6.2.2 Visual model

The visual model of MPA Net is similar to the Faster RCNN architecture [64]. The final classification layer is replaced with a multimodal binary classification layer that predicts the relevance score r_i of a proposal for a head noun. Given an input image I and a query phrase q, the visual model first identifies the head noun of the query phrase. We propose a rule-based approach to achieve this. First, we compute a dependency tree using the spaCy tool [26]. Then, the head noun is determined as follows:
(i) A word is identified as the head noun if it is the subject noun of the phrase (nsubj).
(ii) If the phrase has no subject, the root of the tree is considered the head noun if it is a noun.
(iii) If neither of these is satisfied, all the nouns in the phrase are considered possible head nouns.
In practice, we have noticed that only 10% of query phrases encounter the third case. Once the head noun h_q is identified, it is represented using its word embedding eh_q. For case (iii), the average embedding of all possible head nouns is used as the representation.

Proposals p_i for the input image are generated using the RPN, and their corresponding visual features f_i are generated through RoI pooling using the RCNN. The visual feature and the word embedding are concatenated, and this multimodal feature is projected using a fully connected layer to predict a binary relevance score. A proposal pR_i that contains the visual entity described by the head noun is expected to generate a high relevance score rs_i:

rs_i = softmax( phi( W_r (f_i || eh_q) + b_r ) )    (6.1)
p_i ∈ {relevant}  for all rs_i > T

where '||' denotes the concatenation operator. During training, the visual model is fine-tuned using the query phrases and the ground truth information of the grounding dataset. For negative mining, the ground-truth bounding boxes of the other query phrases of the image are used. All the relevant proposals pR_i, i.e., those with a relevance score greater than a threshold T, are used for ranking in the multimodal subspace.
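A minimal sketch of the head-noun rules above using spaCy's dependency parse; the model name and the exact rule implementation are illustrative, and any English pipeline with a dependency parser would serve.

import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a parser

def head_noun(phrase):
    # Returns candidate head noun(s) of a query phrase via rules (i)-(iii).
    doc = nlp(phrase)
    # (i) the subject noun of the phrase, if present
    subjects = [tok for tok in doc if tok.dep_ == "nsubj"]
    if subjects:
        return [subjects[0].text]
    # (ii) otherwise the root of the dependency tree, if it is a noun
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    if root.pos_ in ("NOUN", "PROPN"):
        return [root.text]
    # (iii) otherwise all nouns in the phrase are possible head nouns
    return [tok.text for tok in doc if tok.pos_ in ("NOUN", "PROPN")]

print(head_noun("football player in white jersey"))  # e.g. ['player']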
6.2.3 Language Model

The relevant proposals for a query phrase are not unique, since an input image may contain more than one instance of the head noun. To identify the distinctive information in the query phrase, the language model is trained to produce discriminative attention weights aw_k for each word of the query phrase. Given a query phrase, we propose to learn these attention weights automatically, similar to [92]. For a given query phrase q = {w_k}_{k=1}^{K}, a bi-LSTM is used to encode the query phrase. A vector representation e_k of each word is first computed using a one-hot word embedding matrix. The bidirectional LSTM is then employed to encode a feature representation for each word; it simply concatenates the hidden states of each word in the forward and backward directions to form the final hidden state h_k. Using the hidden state h_k of each word of the query phrase, the discriminative attention weights aw_k are computed as follows:

aw_k = softmax( phi( W_h (h_k) + b_h ) ),  for all k ∈ [1, K]    (6.2)

The attention weights are then used to compute the discriminative embedding ed of the query phrase as a weighted mean over the word embeddings:

ed = Σ_{k=1}^{K} aw_k e_k    (6.3)

Complementary to the discriminative embedding ed, the global representation of the query phrase, the hidden state of the final word h_K, is used to compute a global correlation score.

6.2.4 Multimodal Subspace

The final score of each proposal pR_i is computed as a combination of a discriminative score ds_i and a global score gs_i. The two scores are complementary: one measures how correlated the proposal is to the distinctive information of the query phrase, while the other measures its global correlation. Two disparate multimodal subspaces are learned using a different training metric for each score. The proposal with the highest combined score is chosen as the grounding for the query phrase. The training process for the two multimodal subspaces is described below.

For the discriminative multimodal subspace, we employ a ranking objective for training that latently identifies the discriminative words in a query phrase. The language model is trained to produce a discriminative embedding that ranks the proposals with high overlap with the ground truth higher than the rest of the proposals. The ranking objective for training is defined as:

L_rnk = Σ_{j ≠ i} max[ 0, gamma + psi(fp_j, ed) - max( psi(fp_i, ed) ) ]    (6.4)

where fp_i are positive proposals, fp_j are negative proposals and gamma is the margin. The training objective encourages the score of the highest-scoring positive proposal to exceed the score of each negative proposal by the margin. All proposals that either do not have high overlap with the positive proposal or correspond to positives of other query phrases are treated as negative proposals during training.

For the global multimodal subspace, we employ a multi-output sigmoid classification objective for training. Since multiple proposals can be positive (and vice versa), this form of classification, where each prediction is independent of the rest, is well suited. The final loss is computed using a cross-entropy loss, with the prediction for each proposal treated as a binary classification result. The classification objective for training is defined as:

L_cls = - Σ_i [ y_i log( sigma( psi(fp_i, h_K) ) ) + (1 - y_i) log( 1 - sigma( psi(fp_i, h_K) ) ) ]    (6.5)

This classification objective maximizes the probabilities of all the positive proposals being correlated with the query phrase. The final score is computed as the average of the global score and the discriminative score.

6.2.5 Training and Inference

The visual model is pre-trained using the Faster RCNN [64] architecture on the PASCAL VOC [11] dataset. The RPN module is then fine-tuned on the grounding dataset, and the final RCNN layer is replaced for binary prediction. The network is trained end-to-end using the combined max-margin and classification loss. Training is run for 30 epochs with a learning rate of 1e-3. During testing, the region proposal with the highest score is chosen as the prediction for a query phrase.
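A minimal PyTorch-style sketch of the two training objectives above (margin ranking for the discriminative subspace and per-proposal binary cross-entropy for the global subspace); the margin value and tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def discriminative_ranking_loss(scores, is_positive, margin=0.1):
    # Eq. (6.4)-style hinge: the best positive score should beat each negative by `margin`.
    best_pos = scores[is_positive].max()
    neg_scores = scores[~is_positive]
    return torch.clamp(margin + neg_scores - best_pos, min=0).sum()

def global_classification_loss(scores, labels):
    # Eq. (6.5)-style multi-output sigmoid cross-entropy over proposals.
    return F.binary_cross_entropy_with_logits(scores, labels.float(), reduction="sum")

# Example with 8 candidate proposals for one query phrase.
scores_disc = torch.randn(8)                 # scores against the discriminative embedding
scores_glob = torch.randn(8)                 # scores against the global phrase encoding
labels = torch.tensor([1, 0, 0, 1, 0, 0, 0, 0])
loss = discriminative_ranking_loss(scores_disc, labels.bool()) + \
       global_classification_loss(scores_glob, labels)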
6.3 Experiments and Results

We evaluate our framework on the Flickr30K Entities [61] and Referit Game [30] datasets for the phrase grounding task.

6.3.1 Datasets

Flickr30k Entities. We use a standard split of 30,783 (training + validation) images for training and 1,000 images for testing. Each image is annotated with 5 captions. A total of 360K query phrases are extracted from these captions, and each phrase is assigned one of eight pre-defined phrase categories, referring to 276K manually annotated bounding boxes in the images.

ReferIt Game. We use a standard split of 9,977 (training + validation) images for training and 9,974 images for testing. A total of 130K query phrases are annotated, referring to 96K distinct objects. Unlike Flickr30K, the query phrases are not extracted from a caption and do not come with an associated phrase category.

Approach              | Accuracy (%)
Compared approaches:
QRC [5]               | 65.14
Our approaches:
PIRC Net (VGG Net)    | 71.16
PIN (Res Net)         | 69.37
PIN + IRN (Res Net)   | 71.42
PIN + PRN (Res Net)   | 72.27
PIRC Net (Res Net)    | 72.83
MPA Net (Res Net)     | 73.74
Table 6.1: Relative performance of our approach on the Flickr30k dataset. Different combinations of our modules are evaluated.

6.3.2 Experimental Setup

Visual model. An RPN pre-trained on PASCAL VOC 2007 [11] is fine-tuned on the respective datasets for proposal generation. The fully connected network for classification consists of 3 successive fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with dropout probability 0.5. Vectors from the last FC layer are used as the visual representation of each proposal.

Language model. The one-hot embedding matrix is initialized with pre-trained GloVe embeddings. The embedding size is set to 512, while the hidden size and dimension of the bi-LSTM are set to 1024 (512 + 512).

Multimodal Subspace. Each stream has 2 fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with probability 0.5. The intermediate and output dimensions of the visual and text streams are [2048, 1024].

Network Parameters. All convolutional and fully connected layers are initialized using MSRA and Xavier initialization respectively. All features are L2 normalized, and batch normalization is employed before similarity computation. The training methodology is similar to Faster RCNN and the learning rate is 1e-3 for both Flickr30k and ReferIt. The ResNet architecture is employed to compare the results to PIRC Net.

Evaluation Metrics. Accuracy is adopted as the evaluation metric; a predicted proposal is considered positive if it has an overlap of > 0.5 IoU with the ground-truth location.

6.3.3 Results on Flickr30k Entities

Performance. We compare the performance of MPA Net to other existing approaches on the Flickr30k dataset. As shown in Table 6.1, MPA Net achieves a > 1% improvement over PIRC Net with a much simpler architecture. When comparing with the PIN module of PIRC Net, which is a fairer comparison due to no use of context, MPA Net achieves a > 4% improvement over the state-of-the-art.

Approach                        | Accuracy (%)
Compared approaches:
QRC [5]                         | 44.07
Our approaches:
PIRC Net (PIN) (VGG Net)        | 51.67
PIRC Net (PIN+PRN) (VGG Net)    | 54.32
PIRC Net (PIN) (Res Net)        | 56.67
PIRC Net (PIN+PRN) (Res Net)    | 59.13
MPA Net (ResNet)                | 60.32
Table 6.2: Relative performance of our approach on the Referit Game dataset.

6.3.4 Results on ReferIt Game

Performance. Table 6.2 shows the performance of MPA Net compared to existing approaches. As shown in Table 6.2, MPA Net achieves a > 1% improvement over PIRC Net with a much simpler architecture. When comparing with the PIN module of PIRC Net, which is a fairer comparison due to no use of context, MPA Net achieves a > 4% improvement over the state-of-the-art.

6.3.5 Qualitative Results

We present qualitative results for a few samples from the Flickr30K Entities and ReferIt Game datasets (Fig. 6.3). The intensity of the color indicates the attention given to each word while computing the embedding.
The failure cases can be attributed to different reasons such as ambiguous queries, multiple possible answers, little margin of error for smaller objects and difficulties in query interpretation, similar to PIRC Net. We notice that the negative results mostly owe to the following factors.
(a) Localization of small objects is highly sensitive to the bounding box location. For example, in query 6.3(b), we can notice that while the result is not visually inconsistent with the ground-truth location, the small bounding box yields a low IoU score.
(b) Bounding box annotations are user-specific and sometimes incorrect. A sample of the incorrect annotations is shown in 6.3(b): for the query 'not the rock', while the ground-truth annotation points to the whole picture, the algorithm outputs the region excluding the rock, which is deemed incorrect.
(c) Ambiguity, both visual and semantic. Sometimes the queries themselves are not self-contained and have a degree of ambiguity associated with them.

Figure 6.3: Qualitative results on the test sets for the MPA Net architecture. The intensity of the color indicates the attention weight for each word. Green: correct prediction, Red: wrong prediction, Blue: ground truth.

6.4 Conclusion

In this chapter, we addressed the problem of phrase grounding using MPA Net, a framework that incorporates modular phrase attention to rank visual proposals. We show that incorporating phrase attention at different stages improves the performance of phrase grounding. Our framework outperforms other baselines for phrase grounding. Possible future work includes the integration of grounding modules into higher-level vision-language tasks.

Chapter 7
Knowledge Transfer for Weakly-supervised Phrase Grounding

7.1 Introduction

For QRC Net and PIRC Net, we assume that ground truth annotations are available for each query phrase during training. In reality, however, ground truth annotations are expensive to obtain, do not scale easily to large volumes of data and are susceptible to human error. On the contrary, image-phrase pair data with no annotations is much easier to obtain: the web has a large amount of image-caption data that can be readily leveraged to obtain image-phrase pairs. To this end, we address the problem of weakly-supervised grounding, i.e., training a grounding system where the image-phrase pairs are not annotated with object locations. A comparison of the two settings is shown in Figure 7.1.

Figure 7.1: Comparison of the supervised vs. weakly-supervised grounding training flow.

Open-ended queries and visual diversity make phrase grounding a challenging problem. To reduce the visual search space in an image for a query phrase, a proposal generator is typically employed to obtain a set of candidate regions. The task of the grounding system is then reduced to ranking these proposals in order of their correlation to the query phrase. To rank the proposals, a multimodal subspace [89, 37, 66, 39] is learned to compute the correlation between the visual (proposals) and language (query phrase) modalities. Training this subspace in the weakly-supervised setting is more challenging, as there is no ground truth annotation available against which to compare the predicted output. To overcome this, [66] proposes an encoder-decoder style reconstruction module. In the first phase, an encoder LSTM is used to obtain attention scores for all the proposals.
In the second stage, the decoder LSTM tries to reconstruct the original query phrase using a weighted average of the proposals under the attention weights. [89], along with reconstructing the query phrase, also checks the neighborhood context for weakly-supervised grounding.

In this chapter, we propose two methods to improve the performance of existing weakly-supervised phrase grounding systems. Firstly, we posit that knowledge transfer from pre-trained object detection classes could be used to train the phrase categories of query phrases in a data-driven fashion. Secondly, we posit that a pre-trained language model could be used to identify semantically similar classes and transfer appearance knowledge from seen pre-trained classes to unseen training classes. For example, if we have a pre-trained object detection system, it could be used to obtain proposals and corresponding visual features; it also produces a probability distribution over the source object categories for each proposal. We employ the former to fine-tune an object detection network that produces a phrase category distribution for each proposal. We use the probability distribution to transfer appearance knowledge from source object classes to target phrase categories. An overview of our method is presented in Figure 7.2.

Figure 7.2: An overview of the key ideas of the WPIN framework.

In spirit, the approach is a weakly-supervised variant of the PIN module of [38], where proposals are categorized to reduce the inter-class errors among proposals of different categories. WPIN (Weakly-supervised PIN Network) uses both data-driven and appearance-driven knowledge transfer to categorize the proposals. The WPIN module then uses a reconstruction objective to predict the query phrase from the proposals, similar to [66].

WPIN Net is evaluated on the two standard grounding datasets: Flickr30K Entities [60] and ReferIt Game [30]. Flickr30K Entities contains more than 30K images and 170K query phrases, while ReferIt Game has 19K images referred to by 130K query phrases. To train in the weakly-supervised setting, we do not consider the bounding box annotations that are available for these datasets. Experiments show that in the weakly-supervised setting, our framework achieves 5% and 4% improvements over the state-of-the-art on the two datasets, respectively.

Our contributions: (a) We propose data-driven knowledge transfer to learn the phrase categories from a pre-trained object detection network. (b) We propose appearance-driven knowledge transfer to learn about the proposals from the source object categories. Further details of the framework are provided in the next section. Experimental results of our approach are provided in Sec. 7.3.

7.2 Weakly Supervised Grounding

The Weakly-supervised Proposal Indexing (WPIN) Network consists of two branches for proposal generation: data-driven knowledge transfer and appearance-based knowledge transfer. A combination of the scores from both branches is used to make the final prediction. The proposals are then passed to a language consistency module that attempts to reconstruct the query phrase from each proposal. In the following subsections, we first introduce the framework of WPIN Net and then describe the knowledge transfer mechanisms. Lastly, we provide more details about the training and inference methods.

Figure 7.3: An overview of the WPIN Net architecture.

7.2.1 Framework

Given a query phrase q_p and query image q_i, WPIN Net chooses the object proposal v_max that is likely to contain the object referred to by the query phrase.
Similar to supervised phrase grounding, WPIN Net generates the proposals using an independent proposal generator. These proposals are then categorized into phrase categories C_j, j ∈ [1, N], using an object detection framework. The object detection framework is trained using a combination of data-driven knowledge transfer and appearance-based knowledge transfer. Data-driven knowledge transfer uses a multi-label sigmoid classification loss, L_dkt, to identify the phrase categories of the proposals in the image. Only the proposals v_i belonging to the query phrase category C_j are chosen for further analysis. A language consistency module is then employed on these proposals to reconstruct the query phrase from their weighted average under the attention weight of each proposal. The reconstruction loss L_r is then generated by comparing the similarity between the reconstructed and query phrases. The objective of WPIN Net is to minimize the training loss:

\arg\min \; \left( \mathcal{L}_{dkt} + \lambda \, \mathcal{L}_{r} \right)    (7.1)

where L_dkt is the knowledge transfer loss, L_r is the reconstruction loss and λ is a hyperparameter. A detailed overview of WPIN Net is shown in Figure 7.3.

7.2.2 Data-driven Knowledge Transfer

Data-driven knowledge transfer finetunes the representations learned on the source object classes {sc_o} to predict the probability {pc_j} of a target phrase category C_j, j ∈ [1, N], for a region proposal v_k ∈ {V_T}. The final scoring layer of the object detection network is replaced to predict the probabilities of the N target phrase categories. Each region proposal v_k is represented as the distribution of the target category probabilities [pc^k_1, ..., pc^k_N]. For training the weakly-supervised network, the representations of the K region proposals given by the object detection system for an image I are added, and the loss function L_dkt is calculated as a multi-label sigmoid classification loss as follows:

\mathcal{L}_{dkt} = -\frac{1}{N} \sum_{j=1}^{N} \left[ y_j \log(PC_j) + (1 - y_j) \log(1 - PC_j) \right]

where PC_j = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} pc^k_j \right) and σ denotes the sigmoid function. y_j = 1 if the image contains the target phrase category and 0 otherwise. At test time, the region proposals with the highest scores {pc^max_j} for a query phrase p ∈ C_j are chosen as the indexed query proposals {V_I}.

7.2.3 Appearance-based Knowledge Transfer

Appearance-based knowledge transfer is based on the principle that semantically related object classes have visually similar appearances. While this does not hold universally, it provides a strong generalization among classes that follow the principle. Given the probability scores {pc_o} of a set of source classes {sc_o} for a region proposal v_k ∈ {V_T}, the goal of the knowledge transfer is to learn the correlation score S^j_pk of a query phrase p for that region proposal v_k. To measure the correlation among different classes, we employ skip-gram vectors [51], which embed semantically related words in similar vector representations. For a query phrase p, we employ its constituent nouns NP, extracted using the Stanford POS tagger, along with the phrase category {C_j}, j ∈ [1, N], for its semantic representation. For a set of K proposals {v_k} given by an object detection system with source class probability scores {pc^k_o}, we measure their correlation S^j_pk to the target phrase class C_j and the phrase's constituent nouns NP as follows:

S^j_{pk} = \frac{\sum_{o=1}^{O} pc_o \, V(sc_o)}{\left\| \sum_{o=1}^{O} pc_o \, V(sc_o) \right\|} \cdot \frac{V(C_j) + V(NP)}{\left\| V(C_j) + V(NP) \right\|}

A fusion of the appearance-based correlation S^j_pk and the data-driven probability pc^k_j is employed as the final score for the correlation of proposal v_k with query phrase p.
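To make the appearance-based score concrete, the following is a minimal NumPy sketch of S^j_pk under the stated shapes. The function and variable names are our own illustration (not the thesis code), and the sum over noun vectors stands in for V(NP):

    import numpy as np

    def appearance_transfer_score(source_probs, source_vecs, category_vec, noun_vecs):
        """Cosine-style correlation between a proposal's source-class appearance
        and the query phrase representation (cf. Sec. 7.2.3).

        source_probs : (O,)   detector probabilities pc_o for the O source classes
        source_vecs  : (O, d) skip-gram vectors V(sc_o) of the source class names
        category_vec : (d,)   skip-gram vector V(C_j) of the target phrase category
        noun_vecs    : (k, d) skip-gram vectors of the phrase's constituent nouns NP
        """
        # Probability-weighted appearance representation of the proposal
        vis = source_probs @ source_vecs                 # sum_o pc_o * V(sc_o)
        vis = vis / (np.linalg.norm(vis) + 1e-8)

        # Semantic representation of the query phrase: category plus nouns
        phrase = category_vec + noun_vecs.sum(axis=0)    # V(C_j) + V(NP)
        phrase = phrase / (np.linalg.norm(phrase) + 1e-8)

        return float(vis @ phrase)                       # S^j_pk, in [-1, 1]

    # Toy usage with random 300-d embeddings for O = 80 source classes
    rng = np.random.default_rng(0)
    s = appearance_transfer_score(rng.dirichlet(np.ones(80)),
                                  rng.normal(size=(80, 300)),
                                  rng.normal(size=300),
                                  rng.normal(size=(2, 300)))

In WPIN Net, such an appearance score is fused with the data-driven probability pc^k_j to obtain the final correlation of a proposal with the query phrase.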
7.2.4 Language Consistency

The language consistency module optimizes the attention weights for each proposal so that the combined representation can learn to reconstruct the input query phrase q_p. This attention network is similar to [66]. However, unlike [66], we introduce an additional weight S_j, a combination of the knowledge transfer weights S^j_pc and S^j_pk, along with the attention weight a_j for each proposal V_j. Thus, the weighted visual feature is reconstructed as:

Vf_r = W_a \left( \sum_{j=1}^{N} \left( S^j_{pk} + S^j_{pc} \right) a_j \, Vf_j \right) + b_a    (7.2)

The reconstructed visual feature Vf_r is then given as input to a decoder LSTM, which predicts a fixed-length reconstruction query q_r using probabilities for each word p^t_r, t ∈ [1, T]. During training, the reconstruction loss between the input query q_p and the reconstructed query q_r is computed as follows:

\mathcal{L}_{r} = -\frac{1}{T} \sum_{t=1}^{T} \log p^t_r[w_{gt}]    (7.3)

where the cross entropy is computed for each word in the sequence and w_gt is the index of the word in the query phrase.

7.2.5 Training and Inference

A Faster RCNN system [64] pretrained on the MSCOCO [44] dataset using the VGG [70] architecture is employed for knowledge transfer. We use 100 proposals produced by the RPN for each query. The network is trained using Adam optimization. The learning rate of the network is set to 1e-4.

7.3 Experiments and Results

We evaluate our framework on the Flickr30K Entities [61] and ReferIt Game [30] datasets for the weakly-supervised phrase grounding task.

Table 7.1: Relative performance of our approach for weakly-supervised grounding on the Flickr30k dataset.
    Approach                Accuracy (%)
    Compared approaches
    Deep Fragments [31]     21.78
    GroundeR [66]           28.94
    Our approach
    WPIN Net                34.28

7.3.1 Datasets

Flickr30k Entities: We use the standard split of 30,783 (training + validation) images for training and 1,000 images for testing, similar to the supervised setting. Each image is annotated with 5 captions. Flickr30K also assigns each phrase a phrase category from one of eight pre-defined phrase categories.

ReferIt Game: We use the standard split of 9,977 (training + validation) images for training and 9,974 images for testing, similar to the supervised setting. Unlike Flickr30K, the query phrases are not extracted from a caption and do not come with an associated phrase category. Hence, we use the skip-gram representation of each phrase to assign it to a phrase category.

7.3.2 Experimental Setup

An RPN pre-trained on MS COCO [44] is finetuned on the respective datasets for proposal generation. The fully connected network for classification consists of 3 successive fully connected layers, each followed by a ReLU non-linearity and a Dropout layer with dropout probability 0.5. The pre-defined phrase categories are used as target classes for Flickr30k [61]; 20 cluster centers obtained by clustering the training query phrases are used as target classes for ReferIt Game [30]. Vectors from the last fc layer are used as the visual representation of each proposal. The hidden size and dimension of the bi-LSTM are set to 1024.

Network Parameters: All convolutional and fully connected layers are initialized using MSRA and Xavier initialization, respectively. All features are L2-normalized, and batch normalization is employed before the similarity computation. The training batch size is set to 40 and the learning rate is 1e-3 for both Flickr30k and ReferIt. The VGG architecture [70] is used to be comparable to existing approaches.

Evaluation Metrics: Accuracy is adopted as the evaluation metric; a predicted proposal is considered positive if it has an overlap (IoU) of > 0.5 with the ground-truth location (a minimal sketch of this check is given below).
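The accuracy criterion is the standard IoU > 0.5 test. The following is a minimal sketch of this check, written as our own helper functions rather than code from the thesis:

    def iou(box_a, box_b):
        """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def is_correct(pred_box, gt_box, threshold=0.5):
        """A predicted proposal counts as positive if its IoU with the ground truth exceeds 0.5."""
        return iou(pred_box, gt_box) > threshold

    # e.g. is_correct([10, 10, 60, 60], [20, 15, 70, 65]) evaluates the overlap test for one query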
For evaluating the efficiency of the indexing network, Top-3 and Top-5 accuracies are also presented.

Reference Approach: We choose GroundeR [66] as the reference approach. GroundeR achieves state-of-the-art performance on both the Flickr30K Entities and ReferIt Game datasets.

Table 7.2: Relative performance of our approach for weakly-supervised grounding on the ReferIt Game dataset.
    Approach                             Accuracy (%)
    Compared approaches
    LRCN [10] (reported in [27])         8.59
    CAFFE-7K [19] (reported in [27])     10.38
    GroundeR [66]                        10.69
    Our approach
    WPIN Net                             14.39

7.3.3 Results on Flickr30k Entities

Performance: We compare the performance of the WPIN Network to two other existing approaches for weakly-supervised grounding. As shown in Table 7.1, we achieve a 5.34% improvement over the existing state-of-the-art.

Figure 7.4: Qualitative results on the test sets of Flickr30K Entities and ReferIt Game (example queries: "Tree in top right corner", "Sky", "Tree", "Sky Bottom", "Hill on left"). Green: Correct Prediction, Red: Wrong Prediction, Blue: Groundtruth.

7.3.4 Results on ReferIt Game

Performance: We compare the performance of WPIN Net to two other existing approaches for weakly-supervised grounding on ReferIt. As shown in Table 7.2, we achieve a 3.70% improvement over the existing state-of-the-art.

7.3.5 Qualitative Results

We present qualitative results for a few samples on the Flickr30K Entities and ReferIt Game datasets (Fig. 7.4). The correct cases are highlighted using green bounding boxes, the wrong ones using red, and the ground truth is indicated using blue.

7.4 Conclusion

In this chapter, we addressed the problem of weakly-supervised phrase grounding using knowledge transfer. The proposed architecture, WPIN Net, is a framework that incorporates both data-driven and appearance-driven knowledge transfer to rank visual proposals. We show that incorporating this knowledge improves the performance of weakly-supervised phrase grounding. Our framework outperforms other baselines for weakly-supervised phrase grounding. Possible future work includes exploration of grounding modules that use a combination of supervised and weakly-supervised training data.

Chapter 8
Conclusions and Future Work

In the previous chapters, we have described our contributions towards semantic-based retrieval problems in both images and videos. In this chapter, we conclude the thesis and discuss possible future directions that can be explored to extend these works.

8.0.1 Conclusion

Chapter 3 addresses the problem of event detection and zero-shot event recounting. For event detection, it presents a technique that models the event classes using positive segment models learned latently. It gives a 2% improvement over ELM, which was the state-of-the-art method. The improvement is consistent across the event classes, indicating the robustness of the method. Further, we address the problem of zero-shot event recounting using a mid-level semantic video representation. To learn the representation of a class without any prior examples, we generate the concepts for each class. Skip-gram vectors were employed to project the event class and the query video into a common subspace. This gives a 3% improvement over existing state-of-the-art methods.

Chapter 4 introduces a simple but effective QRC Network for addressing the problem of phrase grounding. It aims at improving the performance of phrase grounding by improving the diversity of visual proposals. We presented the REINFORCE training algorithm to utilize the context while grounding multiple phrases for an image.
Experimental results show that the QRC Network significantly outperforms existing state-of-the-art approaches.

In Chapter 5, a new architecture, PIRC Net, further extends QRC Net to address some of its limitations. PIRC Net uses 3 novel modules to improve the discriminative power of the visual features, to employ relationships to generate proposals, and finally to encode the visual context of each proposal. This algorithm outperforms QRC Net by a significant margin.

In Chapter 6, another novel architecture, MPA Net, proposes a bottom-up inference framework for phrase grounding. MPA Net divides each phrase into the head noun and the discriminative information to generate and rank the visual proposals for the query. This simple algorithm outperforms both QRC and PIRC Nets.

Finally, Chapter 7 studies the problem of weakly-supervised phrase grounding. We propose a knowledge transfer algorithm that employs both data-driven and appearance-based modalities. The whole network is trained with a query reconstruction loss to facilitate weakly-supervised learning. Evaluation against state-of-the-art approaches shows improvements on both the Flickr30K Entities and ReferIt Game datasets.

8.0.2 Future Work

Some of the remaining related questions that would be interesting to pursue in the future are:

Visual Relationship Detection: While we were able to achieve impressive results on the supervised phrase grounding task in images, the performance does not scale as impressively to complex queries that are not encountered during training. One of the main reasons for this is the lack of understanding of the nature of the relationships in the query phrase, and thus the inability to generalize to unseen scenarios. To address this problem, there needs to be further work on the visual relationship detection (VRD) problem. VRD involves two sub-tasks: (a) identification of the concept (noun) pairs that might share a meaningful relation, and (b) identification of the labels of the concepts, of their relationship, and of their locations.

Zero-shot Video Retrieval: A related open problem is zero-shot retrieval in videos based on complex queries. One of the main drawbacks of the earlier approaches to the zero-shot retrieval problem was the lack of generalization from a limited set of concepts. One could use domain adaptation techniques to transfer the knowledge learnt from image-language systems to videos in order to understand the underlying semantics of the videos. This knowledge can be employed to identify meaningful action concepts in an unlabeled visual database. How to model the knowledge of complex events as simple, ordered action concepts is still an open problem.

Knowledge Transfer Among Different Domains: Besides the above two problems, there are some interesting avenues in language and vision such as omni-learning, which uses web knowledge to transcend the performance obtained with limited annotated data. Learning from multiple sources of web data, which have different modalities and biases, would be an interesting challenge and could yield significant improvements.

Reference List

[1] A. Karpathy and F.-F. Li. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual question answering. In ICCV, 2015.
[3] R. Arandjelovic and A. Zisserman. All about VLAD. In CVPR, 2013.
[4] K. Chen, R. Kovvuri, J. Gao, and R. Nevatia.
MSRC: Multimodal spatial regression with semantic context for phrase grounding. In ICMR, 2017.
[5] K. Chen, R. Kovvuri, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[6] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[7] K. Chen, J. Gao, and R. Nevatia. Knowledge aided consistency for weakly supervised phrase grounding. In CVPR, 2018.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. In IJCV, 2012.
[10] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. In IJCV, 2010.
[12] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[13] F.-F. Li. Knowledge transfer in learning to recognize visual objects classes. In ICDL, 2006.
[14] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[15] C. Gan, M. Lin, Y. Yang, Y. Zhuang, and A. G. Hauptmann. Exploring semantic inter-class relationships (SIR) for zero-shot action recognition. In AAAI, 2015.
[16] R. Girshick. Fast R-CNN. In ICCV, 2015.
[17] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[18] A. Gordo, J. Almazán, J. Revaud, and D. Larlus. Deep image retrieval: Learning global representations for image search. In ECCV, 2016.
[19] S. Guadarrama, E. Rodner, K. Saenko, and T. Darrell. Understanding object descriptions in robotics by open-vocabulary object retrieval and detection. In IJRR, 2016.
[20] A. Habibian, T. Mensink, and C. G. M. Snoek. VideoStory: A new multimedia embedding for few-example recognition and translation of events. In ACM MM, pages 17-26, 2014.
[21] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 2004.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In CVPR, 2015.
[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. 2016.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[26] M. Honnibal and I. Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
[27] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell. Natural language object retrieval. In CVPR, 2016.
[28] M. Jain, J. C. van Gemert, and C. G. M. Snoek.
What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, 2015.
[29] J. Johnson, A. Karpathy, and F.-F. Li. DenseCap: Fully convolutional localization networks for dense captioning. In CVPR, 2016.
[30] S. Kazemzadeh, V. Ordonez, M. Matten, and T. L. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[31] A. Karpathy, A. Joulin, and F.-F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, 2014.
[32] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In ECCV, 2012.
[33] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[34] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought vectors. In NIPS, 2015.
[35] B. Klein, G. Lev, G. Sadeh, and L. Wolf. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015.
[36] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In CVPR, 2014.
[37] R. Kovvuri*, K. Chen*, and R. Nevatia. Query-guided regression network with context policy for phrase grounding. In ICCV, 2017.
[38] R. Kovvuri and R. Nevatia. PIRC Net: Exploring phrase interrelationships and context for grounding. In ACCV, 2018.
[39] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
[40] K.-T. Lai, D. Liu, M.-S. Chen, and S.-F. Chang. Recognizing complex events in videos by learning key static-dynamic evidences. In ECCV, 2014.
[41] W. Li, Q. Yu, A. Divakaran, and N. Vasconcelos. Dynamic pooling for complex event recognition. In ICCV, 2013.
[42] X. Liang, L. Lee, and E. P. Xing. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In CVPR, 2017.
[43] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
[44] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. C. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[45] J. Liu, Q. Yu, O. Javed, S. Ali, A. Tamrakar, A. Divakaran, H. Cheng, and H. Sawhney. Video event recognition using concept attributes. In WACV, 2013.
[46] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[47] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[48] D. G. Lowe. Distinctive image features from scale-invariant keypoints. In IJCV, 2004.
[49] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[50] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[51] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[52] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
[53] V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
[54] J. C. Niebles, C.-W. Chen, and F.-F. Li. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[55] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[56] P. Over, J. Fiscus, G. Sanders, D. Joy, M. Michel, G. Awad, A. Smeaton, W. Kraaij, and G. Quénot. TRECVID 2014 - an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2014.
[57] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[58] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In ACM SIGKDD, 2013.
[59] B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
[60] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In IJCV, 2016.
[61] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[62] F. Radenović, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV, 2016.
[63] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[64] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[65] M. Rochan and Y. Wang. Weakly supervised localization of novel objects using appearance transfer. In CVPR, 2015.
[66] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In ECCV, 2016.
[67] A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele. Generating descriptions with grounded and co-referenced people. In CVPR, 2017.
[68] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2011.
[69] F. Sadeghi, C. L. Zitnick, and A. Farhadi. Visalogy: Answering visual analogy questions. In NIPS, 2015.
[70] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[71] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In ACM MIR, 2006.
[72] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01.
[73] C. Sun, B. Burns, R. Nevatia, C. Snoek, B. Bolles, G. Myers, W. Wang, and E. Yeh. ISOMER: Informative segment observations for multimedia event recounting. In ICMR, 2014.
[74] C. Sun and R. Nevatia. Large-scale web video event classification by use of Fisher vectors. In WACV, 2013.
[75] C. Sun and R. Nevatia.
DISCOVER: Discovering important segments for classification of video events and recounting. In CVPR, 2014.
[76] C. Sun and R. Nevatia. ACTIVE: Activity concept transitions in video event classification. In ICCV, 2013.
[77] C. Sun, S. Shetty, R. Sukthankar, and R. Nevatia. Temporal localization of fine-grained actions in videos by domain transfer from web images. In ACM MM, 2015.
[78] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[79] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 1999.
[80] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[81] C.-Y. Tsai, M. L. Alexander, N. Okwara, and J. R. Kender. Highly efficient multimedia event recounting from user semantic preferences. In ICMR, 2014.
[82] J. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. In IJCV, 2013.
[83] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[84] M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. Structured matching for phrase localization. In ECCV, 2016.
[85] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[86] F. Wu, Z. Xu, and Y. Yang. An end-to-end approach to natural language object retrieval via context-aware deep reinforcement learning. arXiv preprint, 2017.
[87] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In CVPR, 2014.
[88] S. Wu, S. Bondugula, F. Luisier, X. Zhuang, and P. Natarajan. Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In CVPR, pages 2665-2672, 2014.
[89] F. Xiao, L. Sigal, and Y. J. Lee. Weakly-supervised visual grounding of phrases with linguistic structures. In CVPR, 2017.
[90] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[91] L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.
[92] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, 2018.
[93] L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In CVPR, 2017.
[94] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In CVPR, 2014.
[95] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
Abstract
Semantic-based Visual Information Retrieval (SBVIR) is an important problem that involves research in areas such as Computer Vision, Machine Learning, Natural Language Processing and other related areas of computer science. SBVIR involves the retrieval of desired visual entities (images/videos) from a large database, based on an input query, by analyzing their semantics. An input query could be a fixed or an open-ended noun phrase, and the database consists of visual entities that are either videos or images. SBVIR is a challenging problem because of the inter-class and intra-class variance in the appearance of objects