Multimodal perception guided computational media understanding

by

Digbalay Bose

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL AND COMPUTER ENGINEERING)

December 2024

Copyright 2025 Digbalay Bose

Dedication

I dedicate this thesis to my family.

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. Shrikanth Narayanan, for his guidance, support, and encouragement throughout my Ph.D. Your mentorship has shaped the direction of my research and fostered an interest in critical thinking that I will carry forward in my career. I am immensely grateful to the members of my dissertation committee, Prof. Keith Jenkins and Prof. Jesse Thomason, for their feedback and valuable suggestions that greatly improved the quality of my work.

To my colleagues at SAIL, thank you for creating such a collaborative environment that made the research process truly enjoyable. The countless hours of brainstorming, discussions, and shared laughs have been a significant part of my journey. I would especially like to thank Rajat, Naveen, and Krishna for their constant support throughout the Ph.D. journey. The numerous discussions, both technical and non-technical, have been an invaluable source of encouragement and insight. I am grateful to my friends for their encouragement and understanding, and for being a constant source of positivity and joy throughout this journey. Special mention goes to Agni, Rajrup, Akash, Anik, and Souvik for being a family away from home. Finally, I would like to thank my family, especially my parents, for their unconditional love, patience, and constant support throughout this journey.

Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Publications
Chapter 1: Introduction
  1.1 Multi-modality
    1.1.1 Modalities - Definition and Introduction
    1.1.2 Rise of multi-modal content
  1.2 Multimodality and media content: A salient example
  1.3 Multimodal perception: Computational perspective
  1.4 Challenges in media understanding
  1.5 Organization and contributions
Chapter 2: Visual scene tagging through vision-language semantic connection
  2.1 Broad questions and contributions
  2.2 Role of scene as contextual signal
  2.3 Visual scenes in movies: Challenges
  2.4 Contributions
  2.5 Related work
  2.6 Language driven taxonomy curation
    2.6.1 Sources of visual scene information
    2.6.2 Visual scene taxonomy curation
  2.7 Role of vision-language semantic connection
  2.8 MovieCLIP dataset
    2.8.1 Shot detection from movie clips
    2.8.2 CLIP-based visual scene labeling of movie shots
    2.8.3 Prompt design choices
    2.8.4 Analysis of CLIP tagging
      2.8.4.1 Qualitative examples
      2.8.4.2 Shot type distribution
    2.8.5 Quality estimation through human verification
  2.9 Experiments and Results
    2.9.1 Experimental Setup
    2.9.2 Visual scene recognition - Movies
    2.9.3 Downstream tasks
      2.9.3.1 Visual scene recognition - web videos
      2.9.3.2 Macro-level content understanding - genre classification
      2.9.3.3 Impact of MovieCLIP pretraining
  2.10 Ethical implications
  2.11 Takeaways
Chapter 3: Narrative understanding and message decoding in advertisements: Multimodal perception guided approach
  3.1 Broader questions and contributions
  3.2 Advertisements as medium
  3.3 Structure of ads: Task taxonomy
  3.4 Advertisements - Narrative driven tasks and challenges
  3.5 Related work
  3.6 MM-AU benchmark
    3.6.1 Data sources
    3.6.2 Annotations through crowd-sourcing
    3.6.3 Statistics
  3.7 Task definition
  3.8 Advertisements: Modality information
  3.9 Two-stage multimodal fusion and reasoning
    3.9.1 Stage 1: Foreground and temporal context modality fusion
    3.9.2 Stage 2: Reasoning mechanism
  3.10 Experiments
    3.10.1 Experimental Setup
    3.10.2 Language-based reasoning
    3.10.3 Unimodal and Multimodal context fusion
  3.11 Results
    3.11.1 Language-only reasoning
    3.11.2 Unimodal vs Multimodal fusion
  3.12 A closer look at unified understanding of advertisements
    3.12.1 Ads reasoning benchmark tasks
      3.12.1.1 Topics: Broad theme understanding
      3.12.1.2 Atypicality detection
      3.12.1.3 Persuasion strategies
    3.12.2 LMMs: Reasoning outline
      3.12.2.1 Zero-shot reasoning
      3.12.2.2 Supervised finetuning
    3.12.3 LMM Models outline
    3.12.4 Experiments
      3.12.4.1 Topics: Broad theme understanding
      3.12.4.2 Atypicality detection
      3.12.4.3 Persuasion modeling
  3.13 Takeaways
Chapter 4: Characterizing the effect of data quality variations in multimedia understanding: A parameter efficient approach
  4.1 Broader questions and contributions
  4.2 Multimodal perception and quality variations in data
  4.3 Related work
  4.4 Multimodal fusion architectures
  4.5 Datasets
    4.5.1 MM-IMDb
    4.5.2 MVSA-Single
  4.6 Quality variations: Modality corruptions
  4.7 Behavior of multimodal fusion models under corruptions
    4.7.1 MM-IMDb
    4.7.2 MVSA-Single
  4.8 Low-resource robust multimodal fusion
    4.8.1 Parameter-efficient finetuning
      4.8.1.1 LoRA-Multimodal setup
    4.8.2 Information Bottleneck (IB)
      4.8.2.1 Fusion-IB: Multimodal setup
      4.8.2.2 Input-IB: Multimodal setup
  4.9 Low resource multimodal fusion and corruptions
    4.9.1 LoRA: Multimodal setup
      4.9.1.1 MM-IMDb
      4.9.1.2 MVSA-Single
    4.9.2 Fusion-IB and Input-IB: Multimodal setup
      4.9.2.1 MM-IMDb
      4.9.2.2 MVSA-Single
    4.9.3 Optimal LoRA placement
      4.9.3.1 MM-IMDb
      4.9.3.2 MVSA-Single
    4.9.4 Corrupted vs Clean training
      4.9.4.1 MM-IMDb
      4.9.4.2 MVSA-Single
  4.10 Major takeaways
  4.11 Future directions and ongoing scope of work
Chapter 5: Conclusion
  5.1 Fine-grained narrative understanding
  5.2 Characterizing modality heterogeneity in multimodal learning
References
Appendices
  C MovieCLIP dataset
    C.1 Scene classes distribution wrt sources
  D MM-AU Benchmark
    D.1 Topic categories
  E MM-AU Experiments
    E.1 Language based reasoning
      E.1.1 Flan-T5
      E.1.2 OPT
      E.1.3 Alpaca
  F Atypicality labeling
  G Persuasion strategies

List of Figures

1.1 (a) Sensory modalities associated with real-world human perception (b) Specialized regions in the brain where sensory information comes together and is processed by different cortices
1.2 Media value
1.3 Outline diagram showing the presence of various contexts and associated modalities in a sample movie clip.
1.4 Role of multimodal perception in media understanding: Connection between different components of multimodal perception and how it can help media understanding at scale.
2.1 Role of multimodal perception in enabling visual scene tagging with key elements as modality connection and modality representation and learning
2.2 Difference between visual scenes - train station and kitchen in terms of object placements.
2.3 Overview diagram highlighting the challenges associated with visual scene recognition in movies (a) Domain mismatch between natural scene images (Source: http://places2.csail.mit.edu/explore.html) vs frames from movies for living room (b) Movie-centric visual scene classes like prison, control room etc. that are absent from existing taxonomies (c) Change in visual scene between shots in the same movie clip.
2.4 Sources of visual scene information: (a) Sample slugline from a movie script that contains visual scene, time of day, and location (b) Word cloud showing the distribution of visual scene labels in the HVU [57] dataset.
2.5 The share of different sources (HVU, Movie Sluglines, Common labels and Human expert) in curating the label taxonomy. Example labels from different sources are shown in boxes with the pie chart.
2.6 TSNE plot of the 179 scene classes in the taxonomy. Certain representative groups of visual scene labels that are semantically close to each other are enclosed by circular/oval shapes.
2.7 1. Contrastive pretraining framework associated with CLIP. 2. Usage of CLIP's pretrained vision and text encoders for zero-shot classification. Image source: [58]
2.8 Overview schematic of the prompt-based visual scene labeling of a movie shot using CLIP's visual and text encoders. S is the similarity matrix where entry S_lt refers to the similarity value between textual embedding e_l and visual embedding v_t.
2.9 Examples of various prompt templates and associated CLIPSceneScores for top-k visual scene labels. Here k=5
2.10 (a) Sample frames from the movie shots labeled by CLIP with high confidence (CLIPSceneScore ≥ 0.6; labels shown in yellow) (b) Sample frames from the movie shots tagged by CLIP with low confidence (labels shown in yellow).
2.11 Genre-wise distribution of different scene labels. For each genre, the top-5 scene labels are shown in terms of the number of occurrences in the top-1 label provided by CLIP [58] for shots in the MovieCLIP dataset. Threshold for confidence score of top-1 label = 0.4
2.12 (a) Distribution of predicted scale labels for shots (N=745059) in the MovieCLIP dataset having top-1 CLIPSceneScore <= 0.2 (b) Distribution of predicted scale labels for shots (N=107420) in the MovieCLIP dataset having top-1 CLIPSceneScore >= 0.4 and top-k CLIPSceneScore >= 0.1 (k=2,3,4,5). Scale labels include: ECS: Extreme Close-up shot, CS: Close-up shot, MS: Medium shot, LS: Long shot, FS: Full shot.
2.13 Schematic design of the MTurk experiment used for human verification of visual scenes.
2.14 Distribution of shot scale predictions among the shots having no agreements between human annotators and CLIP's labeling scheme. ECS: Extreme Close-up shot, CS: Close-up shot, MS: Medium shot, LS: Long shot, FS: Full shot
2.15 Sample scene classes in HVU where Mscene performs better in comparison to Mscene(Kin)
3.1 Role of multimodal perception in the narrative understanding of advertisement videos with key elements as modality connections, representations, and reasoning.
3.2 Increase in demand for advertisements over the last 4 years
3.3 Structure of advertisement video with multiple short-term context changes and overall long-term linkage
3.4 Examples of diverse forms of messaging in advertisement print images/videos including direct/indirect complex messages and non-photorealistic atypical objects. Image sources: [16], [107], [109]
3.5 Task taxonomy for advertisement understanding
3.6 Schematic diagram showing illustrative examples of various tasks in the MM-AU (Multi-modal ads understanding) dataset. Multimodal understanding of ads along the lines of (a) Topic categorization (18 classes), (b) Tone transition, (c) Social message detection, i.e. absence/presence of social message
3.7 Share of different ad sources in the MM-AU dataset. Ads of the World (6304 videos), Cannes Lions (1135), Video-Ads dataset (960)
3.8 Outline of the annotation framework for the tone transition and social message detection problems
3.9 Example provided to the annotators showing the tone transition and associated explanations
3.10 Example videos associated with absence/presence of social message provided to the annotators. (a) Frame transition associated with an example video urging people to vote (b) Frame transition associated with an example video emphasizing equal opportunities for everyone in sports.
3.11 Distribution of the social message absence (No) and presence (Yes) labels in MM-AU
3.12 Distribution of topics in the MM-AU dataset
3.13 Distribution of majority perceived tone labels (among 3 annotators) across (a) start, (b) middle, and (c) ending segments in the MM-AU dataset (8399 videos)
3.14 Distribution of topics and perceived tone transition across videos having the presence of a social message (739 videos) in the MM-AU dataset
3.15 Distribution of perceived tone labels across the (a) start, (b) middle, and (c) ending segments in the MM-AU dataset for videos having a social message (739 videos)
3.16 Proposed architecture for the fusion of different modalities associated with temporal and foreground contexts.
3.17 PerceiverIO architecture [156]
3.18 Comparisons between Flan-T5-XXL and GPT-4 in terms of class-wise F1-score for the social message (Yes: presence of social message, No: absence of social message) and tone transition detection tasks.
3.19 Comparisons between Flan-T5-XXL and GPT-4 in terms of class-wise F1-score for the topic categorization task
3.20 Comparisons between TxAV (AV-Tx), TxTV (TV-Tx), A-Max, and D-Max in terms of class-wise F1-score for the social message (Yes: presence of social message, No: absence of social message) and tone transition detection tasks. An average of 5 runs is considered for the F1 score.
3.21 Comparisons between TxAV (AV-Tx), TxTV (TV-Tx), A-Max, and D-Max in terms of class-wise F1-score for the topic categorization task. An average of 5 runs is considered for the F1 score.
3.22 Multimodal foundation models landscape: From large scale VL pretraining of unified or modality-specific models to pretraining and alignment of vision modalities with the large language model encoder.
3.23 (a) Image examples associated with the broad theme (topics) along with labels (b) Image examples associated with the fine-grained atypicality labels (c) Image examples showcasing the persuasion strategies
3.24 ZS (zero-shot) and SFT results for topics prediction (38 class category)
3.25 ZS (zero-shot) and SFT results for binary atypicality prediction (2 class category)
3.26 ZS (zero-shot) and SFT results for fine-grained atypicality prediction (10 class category)
3.27 Multi-label persuasion prediction (10 class category)
4.1 Variations in data quality require handling of modality heterogeneity, semantic modality connections, and proposal of novel training strategies to learn robust representations
4.2 Characteristics of modality-specific factors associated with low-quality multimedia data. Image source: [196]
4.3 Dual stream architectures (DualEncoder) denoted by Concat-MLP fusion operations.
4.4 Sample multimodal (image-text) pairs from the MM-IMDb and MVSA-Single datasets
4.5 Multimodal dual encoder architecture
4.6 MM-IMDb training and testing setting under text modality corruption (testing)
4.7 (a) MM-IMDb: Performance of the dual encoder multimodal model when tested with [MASK] corruptions. k% per sample corruption: k% of words in text replaced by "[MASK]" (b) MM-IMDb: Performance of the dual encoder multimodal model when tested with word removal corruptions. k% per sample corruption: k% of words in text removed. An average of 5 runs is shown for every test data corruption fraction.
4.8 MVSA-Single training and testing setting under image modality corruption (testing)
4.9 (a) MVSA-Single: Performance of the dual encoder multimodal model when tested with random patch (16×16) corruptions. k% per sample corruption: k% of 16x16 patches in the image masked by black patches (b) MVSA-Single: Performance of the dual encoder multimodal model when tested with per-pixel Gaussian noise addition with varying amplitude. An average of 5 runs is shown for every test data corruption fraction.
4.10 Outline of the low resource multimodal fusion approach that focuses on a re-parameterization method, i.e., LoRA, and information flow control through bottleneck guided objectives.
4.11 Outline of LoRA-based weight update using learnable matrices A and B
4.12 Outline of the LoRA-multimodal setup where individual modality branches are updated by LoRA-based weight adaptation schemes.
4.13 Outline of the Fusion-level IB multimodal setup: (a) Mutual information (MI) based IB formulation (b) Variational approximation of the MI-based formulation
4.14 Outline of the Input-IB multimodal setup: (a) Mutual information (MI) based IB formulation (b) Variational approximation of the MI-based formulation
4.15 LoRA-Multimodal setting for the MM-IMDb dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: "[MASK]" corruptions are applied to the BERT encoder (text branch), and image input is kept clean.
4.16 MM-IMDb (LoRA: Multimodal setup): Performance of the dual-encoder multimodal model under LoRA and no-LoRA model configurations for k% per sample corruption (a) k=30 (b) k=50 (c) k=70
4.17 LoRA-Multimodal setting for the MVSA-Single dataset. LoRA: LoRA was applied to the CLIP visual encoder (visual branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: The text input (tweets) is kept clean, and the image input is corrupted by random patch-based masking (16 × 16 patches).
4.18 MVSA-Single (LoRA: Multimodal setup): Performance of the dual-encoder multimodal model under LoRA and no-LoRA model configurations for k% per sample corruption (a) k=30 (b) k=50 (c) k=70
4.19 Fusion-IB: Multimodal setup for the MM-IMDb dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: "[MASK]" corruptions are applied to the BERT encoder (text branch), and image input is kept clean.
4.20 Input-IB: Multimodal setup for the MM-IMDb dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. "[MASK]" corruptions are applied to the BERT encoder (text branch), and image input is kept clean.
4.21 MM-IMDb: Performance of Input-IB (no-LoRA) and Fusion-IB (no-LoRA) in comparison to the normal cross-entropy fine-tuning setting (no-LoRA) for k% per sample corruption (a) k=30 (b) k=50 (c) k=70
4.22 MM-IMDb: (a) Performance of Input-IB (no-LoRA), Fusion-IB (no-LoRA) and the fine-tuning setting (no-LoRA) for the 70% per sample corruption setting (b) Performance of Input-IB (LoRA), Fusion-IB (LoRA) and the fine-tuning setting (LoRA) for the 70% per sample corruption setting in comparison to the no-LoRA setting
4.23 Fusion-IB: Multimodal setup for the MVSA-Single dataset. LoRA: LoRA was applied to CLIP's vision encoder (image branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: Random patch (16 × 16) corruptions are applied to the image branch, and text input is kept clean.
4.24 Input-IB: Multimodal setup for the MVSA-Single dataset. LoRA: LoRA was applied to CLIP's vision encoder (image branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: Random patch (16 × 16) corruptions are applied to the image branch, and text input is kept clean.
4.25 MVSA-Single: Performance of Input-IB (no-LoRA) and Fusion-IB (no-LoRA) in comparison to the normal cross-entropy fine-tuning setting (no-LoRA) for k% per sample corruption (a) k=30 (b) k=50 (c) k=70
4.26 MVSA-Single: (a) Performance of Input-IB (no-LoRA), Fusion-IB (no-LoRA) and the fine-tuning setting (no-LoRA) for the 70% per sample corruption setting (b) Performance of Input-IB (LoRA), Fusion-IB (LoRA) and the fine-tuning setting (LoRA) for the 70% per sample corruption setting in comparison to the no-LoRA setting
4.27 MM-IMDb: LoRA placement in the image branch for both input and fusion-level IB in the case of the MM-IMDb dataset
4.28 (a) MM-IMDb: Text corruption test results (70% per sample corruption by "[MASK]" tokens) for LoRA placement in the text branch and no-LoRA settings. (b) MM-IMDb: Text corruption test results (70% per sample corruption by "[MASK]" tokens) for LoRA placement in the image branch and no-LoRA settings. The green band indicates the results obtained by adding LoRA weight updates to the image branch.
4.29 MVSA-Single: LoRA placement in the text branch for both input and fusion-level IB in the case of the MVSA-Single dataset
4.30 (a) MVSA-Single: Image corruption test results (70% per sample corruption by random patch masking) for LoRA placement in the image branch and no-LoRA settings. (b) MVSA-Single: Image corruption test results (70% per sample corruption by random patch masking) for LoRA placement in the text branch and no-LoRA settings. The green band indicates the results obtained by adding LoRA weight updates to the text branch.
4.31 (a) MM-IMDb: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under no LoRA for text corruptions ("[MASK]"). (b) MM-IMDb: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under the LoRA setting for text corruptions ("[MASK]"). LoRA applied to the text branch.
4.32 (a) MVSA-Single: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under no LoRA for image corruptions (random patch masking). (b) MVSA-Single: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under the LoRA setting for image corruptions (random patch masking). LoRA applied to the image branch.
5.1 Connection between multimodal perception at scale and computational media understanding

List of Tables

2.1 Comparison of MovieCLIP with other available image and video datasets with visual scene classes. Natural: Images of natural scenes. Web-Videos: videos obtained from internet sources like YouTube. AV: whether publicly available or not. Avg or duration span of video data sources are provided with the respective units. NA: Number of scene classes not explicitly mentioned with the dataset.
2.2 Statistics of movie shots in the MovieCLIP dataset.
2.3 Mean average precision (mAP) and average Spearman correlation of different models on the human-verified evaluation set (N=1883 shots). NA: End-to-end models used instead of features. For 3D conv models, the backbone network is mentioned inside brackets.
2.4 Mean average precision of different models on the HVU dataset for multi-label scene classification (248 classes). Backbone for end-to-end models: 3D ResNet18.
2.5 Mean average precision of different models for multi-label genre classification (13 classes) on the Moviescope dataset. Except for Mtrailer, comparison results are reported from [59]. Abbreviations: Ac: Action, Ani: Animation, Bio: Biography, Com: Comedy, Cri: Crime, Drm: Drama, Fmy: Family, Fntsy: Fantasy, Hrrr: Horror, Myst: Mystery, Rom: Romantic, ScF: SciFi, Thrl: Thriller, fstVid: fastVideo, fstTConv: fastVideo + Temporal Conv.
2.6 Impact of MovieCLIP pretrained features vs Kinetics pretrained features for Mscene (HVU) and Mtrailer (Moviescope). Results reported are mean average precision (mAP) values. Model(Kin): Model with Kinetics400 pretrained features, where Model ∈ {Mscene, Mtrailer}
3.1 Comparison of MM-AU with other available advertisement benchmarks across different modalities. Annotation type: H: Human annotation, AL: Active Learning, SA: Semi-automatic, SG: Store generated. Duration: NA: Not applicable for images; mentioned in hours (h). #Samples: Number of video clips or images. #Shot: NA: Not applicable for images; number of shots detected from all the video samples. #Class: T: Topic, S: Sentiment, OE: Open-Ended, H: Humor, Ex: Exciting, Pr: Presentation, St: Style, Pl: Place, PS: Persuasion strategy, PC: Product categories, SM: Social messages
3.2 Data statistics of the MM-AU dataset. #social msg labels: total number of labels
3.3 Zero-shot performance comparison between various LLMs on the MM-AU dataset. Tasks: SM: Social message detection, TT: Tone transition, Topic: Topic categorization. NA∗: Information not available. F1: Macro-F1. Best performing results are marked in bold for respective tasks.
3.4 Comparative results between different unimodal and multimodal context fusion models across different tasks: Social message, Tone transition, and Topic categorization. CLIP-S: Shot level features extracted using CLIP. Modality: A: Audio, V: Visual, T: Text. Results are reported as an average of 5 runs with randomly selected seeds. Best performing results are marked in bold for respective tasks.
3.5 Parameter details of instruct-tuned decoder-only LMM models
4.1 Dataset distribution across genres

Abstract

In today's information-rich landscape, there has been a rapid rise in media content available through web-based platforms reliant on diverse modalities. Multimodal content can appear in various formats, including online news articles, cinematic productions, advertisements, short digital videos, and print images.
The proliferation of multimodal content across diverse platforms necessitates a large-scale understanding of its impact on individuals and society as a whole. Our ability to comprehend the underlying narratives in these sources depends on advancing core multimodal perception capabilities that rely on the integration of information from diverse modalities, including audio, visual, and textual streams. In this work, we explore how the core elements of multimodal perception, including modality connectedness, representation and reasoning, and heterogeneity, can enable media understanding across the dimensions of content tagging, narrative understanding, and handling diverse quality variations. Furthermore, in light of current advancements in large-scale multimodal processing, we identify certain key areas that can drive the next generation of media understanding.

Publications

1. MovieCLIP: Visual Scene Recognition in Movies; Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Haoyang Zhang, Yin Cui, Kree Cole-McLaughlin, Huisheng Wang, Shrikanth Narayanan; IEEE/CVF Winter Conference on Applications of Computer Vision 2023
2. Contextually-rich human affect perception using multimodal scene information; Digbalay Bose, Rajat Hebbar, Krishna Somandepalli, Shrikanth Narayanan; IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
3. MM-AU: Towards Multimodal Understanding of Advertisement Videos; Digbalay Bose, Rajat Hebbar, Tiantian Feng, Krishna Somandepalli, Anfeng Xu, Shrikanth Narayanan; ACM Multimedia 2023
4. Automatic Analysis of Asymmetry in Facial Paralysis Patients Using Landmark-Based Measures; Digbalay Bose, Krishna Somandepalli, Tymon Tai, Courtney Voelker, Shrikanth Narayanan, Amit Kochhar; Facial Plastic Surgery & Aesthetic Medicine, 2022
5. Understanding of Emotion Perception from Art; Digbalay Bose, Krishna Somandepalli, Souvik Kundu, Rimita Lahiri, Jonathan Gratch, Shrikanth Narayanan; 4th ICCV CLVL Workshop 2021
6. FedMultimodal: A Benchmark For Multimodal Federated Learning; Tiantian Feng, Digbalay Bose, Tuo Zhang, Rajat Hebbar, Anil Ramakrishna, Rahul Gupta, Mi Zhang, Salman Avestimehr, Shrikanth Narayanan; KDD 2023 Applied Data Science Track
7. A Dataset for Audio-Visual Sound Event Detection in Movies; Rajat Hebbar, Digbalay Bose, Krishna Somandepalli, Veena Vijai, Shrikanth Narayanan; IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023
8. Characterizing the effect of modality corruptions in multimodal encoders: A parameter efficient approach; Under preparation
9. Decoding ads through multimodal reasoning: Challenges and Opportunities; Under preparation

Chapter 1
Introduction

This chapter provides the necessary background material and details the organization of my proposed work.

1.1 Multi-modality

1.1.1 Modalities - Definition and Introduction

Modality refers to the way something happens around us or is perceived by humans [1]. The world around us is composed of multiple sensory modalities, as listed below, with the capabilities they enable:
• Visual: Watching objects and events in the current visual scene
• Language: Verbal interactions with other humans
• Audio: Speaking and hearing environmental sounds
• Touch: Feeling the texture of objects
• Smell: Distinguishing between pleasant and unpleasant odors.
For humans, the information from multiple sensory sources is processed by distinct regions of the brain for different perceptual tasks [2–4].
While visual input processing is handled by the occipital lobe of the brain, auditory and language inputs are processed by specific regions of the temporal lobe. The sub-division of the human neural processing system into regions that can process a wide variety of modalities indicates the heterogeneity among different sources along with underlying commonalities, as shown in Fig 1.1.

Figure 1.1: (a) Sensory modalities associated with real-world human perception (b) Specialized regions in the brain where sensory information comes together and is processed by different cortices

In artificial intelligence, a task is considered multi-modal if multiple sensory modalities are needed to solve it. The goal of artificial intelligence is to develop autonomous agents that can learn from diverse input modalities to solve complex reasoning tasks. As mentioned by Liang et al. [5], the key properties associated with learning from multimodal data are as follows:
• Heterogeneity: Heterogeneous nature of different modalities due to diversity in qualities, structures, and representations.
• Connectedness: Shared/common information between different modalities due to their interrelated nature.
• Interaction: Existence of optimal interactions/fusions between different modalities, suitable for a particular task.
Based on the above properties, the major challenges in multimodal learning [5] can be summarized as follows:
• Representation: Learning representations through fusion mechanisms that capture the heterogeneity of the modalities along with the shared information. For example, language is considered symbolic due to its word-based composition, whereas audio and visual modalities are represented through signals.
• Alignment: Identifying relationships between (sub-)elements of different modalities. This requires similarity computation between different modalities along with the handling of long-term dependencies.
• Reasoning: Combining knowledge from multiple sources (including external sources, i.e., knowledge graphs) in a multi-step inference process by exploiting the task structure.
• Generation: Translating between different modalities and summarizing multimodal data while preserving the salient content.
• Transference: Transferring cross-modal knowledge from secondary modalities to the primary modality in the presence of noise or limited data.

1.1.2 Rise of multi-modal content

With the rapid rise of heterogeneous internet networks across the globe, vast amounts of web-based content have been generated at different scales, variety, and velocity [6]. Web-based sources convey information to viewers through a combination of multiple modalities. For example, an online news article about a significant sports event contains text descriptions with accompanying images. Further, movies or TV shows hosted on online streaming platforms portray narratives through the combination of video and audio signals (music, speech, ambient sounds). The demand for media content, especially movies, TV shows, and advertisements, is expected to rise over the next three years, with the market value projected to reach 2.9 trillion US dollars, as shown in Fig 1.2. The growing demand for multi-modal media content has led to various descriptive tasks such as captioning [7], video summarization [8], and question-answering [9], aimed at enhancing user experience.

Figure 1.2: Expected value of the entertainment market industry from 2017 to 2026¹
In the following sections, we introduce the concept of context and how multiple modalities provide contextual information.

1.2 Multimodality and media content: A salient example

Multi-modal tasks require the processing of diverse contextual information followed by fusion for both macro and instance-level content understanding. In the case of multi-modal content, the contextual information is usually obtained through multiple input modalities. In Fig 1.3², we can see that the face crop of the person (visual modality) captures the local/facial context in the frame of interest. The global context is further captured by the visual modality through the surrounding scene, i.e., road, car (exterior). External modalities in the form of natural language descriptions (language modality), i.e., "Three women standing outside the car and showing their license", provide information about the interactions happening in the foreground. Further, this frame is part of a longer narrative, as shown in the sequence of frames. The frame sequence captures both short-term and long-term temporal context, where short-term context centers around local activities, including interactions between the woman and the policeman, followed by long-term temporal context involving multiple transitions within the narrative. By taking long-term temporal context into account, it can be seen that the clip ends with a spacecraft launch. Further, the sound event (audio modality) associated with the spacecraft launch captures additional short-term temporal context.

Figure 1.3: Outline diagram showing the presence of various contexts and associated modalities in a sample movie clip.

¹ https://www.enterpriseappstoday.com/stats/media-and-entertainment-industry-statistics.html
² Hidden Figures (2016): https://www.youtube.com/watch?v=W1VZ1-ZdQ7k&ab_channel=20thCenturyStudios

1.3 Multimodal perception: Computational perspective

While the sensory modalities function as separate channels, multimodal perception enables humans to interpret events and objects as if information from different senses is processed simultaneously. Neuroscience research links this ability to multisensory convergence zones, such as the superior temporal sulcus [10], and interactions between unimodal brain areas [11, 12], like the primary visual and auditory cortices. In artificial intelligence, multimodal perception involves integrating information from various modalities across different scales and structures, with applications in domains like autonomous navigation, social media, healthcare, and entertainment. For effective multimodal perception, as discussed by Liang et al. [5], the key elements related to the nature of modalities and fusion operations are outlined as follows:
• Modality heterogeneity: Modality heterogeneity can be attributed to the underlying diversity in qualities, structures, and representations of information across individual modalities. Each modality consists of unique basic elements with associated input spaces, e.g., characters in text and frames in a video snippet. Further, the respective modalities have different compositional structures, e.g., spatial structure in images, hierarchical structure in sentences, and temporal structure in signals.
Additionally, the amount of information contained within each modality can vary depending on the nature of the multimodal task. Apart from the signal information, modality-specific noise can occur at different levels due to natural factors (occlusions, imperfect keyboard typing, and unclear speech) or imperfections during data recording/collection (sensor failures). To enhance the multimodal perception capabilities of AI systems, it is essential to address these heterogeneous modality-driven factors as part of the integration phase.
• Modality connection: Apart from the heterogeneous nature of individual modalities, there exists shared complementary information [13] resulting in modality-based connections. The modality connections can manifest in diverse forms, including:
  – Statistical dependence/association: Examples include the co-occurrence of two modalities (presence of image and alt-text snippets from websites) or direct association in terms of causal, spatial, or temporal dependencies. Understanding the relationships between modalities is crucial for improving perception capabilities, as it enables the integration of complementary modality-specific information, enhances robustness through auxiliary modality guidance, and reinforces cross-modal coherence.
  – Semantic linkage: Semantic linkage between modalities can occur either as (a) semantic correspondence, i.e., elements associated with different modalities share the same meanings, or (b) semantic relations, i.e., a generalization of semantic correspondence through associated attributes defining the exact connection (semantic, logical, functional, or causal relations). Identifying semantic correspondences can improve perception capabilities for diverse V-L tasks, like vision-language grounding [14] and multimodal entity linkage [15]. Further, decoding semantic relations can enhance perception capabilities for higher-order reasoning tasks like visual rhetoric understanding [16] and image-text taxonomy alignment [17].
• Modality representation and reasoning: Multimodal perception relies on the fusion of modality-specific information for task-specific inference insights. The fusion operation provides a joint multi-modal representation that effectively decreases the number of operations needed for individual modality processing. Fusion is performed either in (a) an abstract space, where unimodal encoders are used for extracting modality-specific representations followed by specialized architectures for fusion, or (b) the raw input space, where minimal pre-processing is applied to the input modalities followed by fusion architectures. In the case of abstract representation-based fusion, common fusion operations include additive and multiplicative operations [1] and gated attention units [18] (a minimal late-fusion sketch is shown below). In the case of raw input space fusion, the modalities are concatenated at the input space and processed through a single network stack composed of transformers [19] or convolutional layers [20]. Existing studies [20, 21] have shown the relative benefits of early fusion as compared to abstract late fusion. The utility of the multimodal representations can be realized through reasoning mechanisms.
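Before turning to those reasoning mechanisms, the following is a minimal sketch, assuming a PyTorch setting, of an abstract-space (late) fusion layer of the kind described above, combining two precatemputed unimodal embeddings through concatenation and simple learned gates in the spirit of gated attention units. The embedding sizes, gating design, and 18-way output are illustrative placeholders, not the specific architectures used later in this thesis.

```python
# A minimal late-fusion sketch: unimodal encoders are assumed to have already
# produced fixed-size embeddings; this module gates and concatenates them.
import torch
import torch.nn as nn

class GatedLateFusion(nn.Module):
    """Concatenation-based fusion with a learned sigmoid gate per modality."""
    def __init__(self, dim_a: int, dim_b: int, hidden: int, num_classes: int):
        super().__init__()
        # Each gate looks at both embeddings and decides how much of its
        # modality to let through (values in [0, 1] per dimension).
        self.gate_a = nn.Sequential(nn.Linear(dim_a + dim_b, dim_a), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Linear(dim_a + dim_b, dim_b), nn.Sigmoid())
        self.classifier = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([emb_a, emb_b], dim=-1)
        fused = torch.cat([self.gate_a(joint) * emb_a,
                           self.gate_b(joint) * emb_b], dim=-1)
        return self.classifier(fused)

# Example: a 512-d visual embedding and a 768-d text embedding fused for an
# (assumed) 18-way classification task, batch size 4.
model = GatedLateFusion(dim_a=512, dim_b=768, hidden=256, num_classes=18)
logits = model(torch.randn(4, 512), torch.randn(4, 768))  # shape: (4, 18)
```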
Reasoning mechanisms can enable multimodal perception capabilities through:
  – Structure modeling: Structure modeling determines the prior structure that informs the reasoning process, including hierarchical (graphical structures [22] for visual question answering [23], image retrieval [24], and commonsense reasoning [25]), temporal (memory mechanisms for capturing temporal dependencies [26]), and interactive (sequential decision-making frameworks [27, 28]) setups.
  – Intermediate concepts: Intermediate concepts can include (1) dense-vector representations obtained from the composition of modality-specific representations, and (2) interpretable elements like cross-modal attention maps [29], discrete symbols [30], and natural language [31].
  – Inference schemes: Inference schemes refer to mechanisms that obtain insights about abstract concepts from the given multimodal inputs. Examples include multi-step logical reasoning for problems like multimodal entailment [32], geometric reasoning [33], and guided step-by-step problem solving [34].
  – External knowledge addition: External knowledge bases, ranging from language and symbolic graphs (Freebase [35], DBPedia [36], Yago [37], WordNet [38]) to multimodal knowledge graphs [39], provide prior compositional structure to guide the reasoning process. Example areas that benefit from external world knowledge include long-form temporal video understanding (MovieQA [40], TVQA [41], etc.) and visual commonsense reasoning (VCR [42], VisualCOMET [43]). Apart from prior structures in the form of knowledge bases, external knowledge is also available in the form of large pretraining databases like Conceptual Captions [44] and LAION-5B [45].

From a computational standpoint, enhancing multimodal perception requires addressing the elements mentioned above to enable widespread adoption across diverse domains. In my thesis, I have concentrated on addressing open challenges in media understanding, focusing on how various computational aspects of multimodal perception can support (1) diverse content tagging, (2) narrative comprehension, and (3) managing quality variations.

1.4 Challenges in media understanding

Media data can occur in diverse formats, including movies, ads, talk shows, newspaper print items, etc. While media sources vary in structure (temporal duration, semantic organization), we identify key challenges associated with media understanding as a whole:
• Diverse content tagging: For media consumption by viewers, content tags provide an initial bridge between the raw media data and semantic understanding. Based on the media source, the nature of the tags can be diverse. For example, in the case of feature-length movies, a thorough understanding requires, apart from broad genre tags, fine-grained tags like character IDs, interactions, visual scenes, etc. However, in the case of YouTube Shorts [46] or micro-videos, broad thematic tags provide enough information for content organization, including personalized recommendations. Since content tags provide both macro and micro-level insights into media sources, their automatic detection and downstream usage are crucial for media understanding at scale.
• Narrative understanding: Media content relies on narratives to engage viewers. Based on the media structure, narrative sequence understanding requires decoding temporal context at diverse scales. For movies, the narrative unfolds over an extended duration, necessitating substantial long-term context processing.
Advertisements rely on multiple short-term context switches integrated within a tight narrative. Moreover, beyond the objective narrative structure, symbolic elements are often employed to convey indirect messages about social or political issues, aiming to influence viewers. Automatic decoding of narratives not only reveals details about the media content itself but also reflects broader societal dynamics through character portrayals and representations [47].
• Quality variations: In real-world scenarios, media content is often affected by quality-driven issues like transmission errors, sensor malfunctions, and incomplete or unreliable data. These challenges can lead to information loss, reduced quality, and inconsistencies, making accurate analysis and interpretation more difficult. Further, since media sources consist of multiple modalities, there is a need to understand the impact of modality-driven corruptions on downstream media understanding tasks.

1.5 Organization and contributions

In this section, we provide the basic outline of the dissertation. The thesis statement is given as follows: Media understanding at scale along the dimensions of content tagging, narrative understanding, and diverse quality handling is enhanced by considering the interconnectedness among modalities in the representation space and the underlying modality-driven heterogeneity.

We rely on the core elements of multimodal perception, including inter-connectedness among the modalities, representation learning/reasoning, and underlying modality-specific heterogeneity, to tackle the previously mentioned challenges of media understanding: diverse content tagging, narrative understanding, and quality variations. An organization of how different elements of multimodal perception can help in tackling the media understanding challenges is shown in Fig 1.4. In terms of input sources, we consider multi-modal media content, including movies, advertisements, and social media posts. Examples of modality-specific information are as follows:
• Visual modality: Captures spatial/temporal context through visual semantic units, i.e., objects and persons captured in images and videos.
• Language modality: Captures spatial (semantic)/foreground context through foreground-based natural language descriptions of images and descriptions/narrations in text transcripts.
• Audio modality: Captures temporal context through audio events, background music, and speech.

Figure 1.4: Role of multimodal perception in media understanding: Connection between different components of multimodal perception and how it can help media understanding at scale.

The thesis organization is as follows:
1. Chapter 2: We explore the problem of visual scene tagging in multi-modal data, especially movies, by leveraging semantic vision-language knowledge from a pretrained multimodal model, i.e., CLIP [48], and domain-specific sources like screenplays and visual vocabularies. Further, we show the utility of visual scene-based representations in improving macro-level content understanding of trailers and web videos.
2. Chapter 3: We explore the role of multi-modal representations and reasoning mechanisms for narrative understanding in advertisement videos along the lines of topic, tone transitions, and social relevance.
Additionally, we explore potential opportunities and challenges in leveraging large multimodal models for achieving a unified understanding of the advertisement domain.
3. Chapter 4: We consider the impact of quality variations in multimodal media data through unimodal corruptions. Further, we propose broad strategies for tackling the mismatch between training-time and testing-time behavior in terms of parameter-efficient methods and information bottleneck-guided noise minimization.
4. Chapter 5: We propose future directions that aim at leveraging current advancements in multimodal data processing and identify core challenges that can enable the next generation of media understanding capabilities.

Chapter 2
Visual scene tagging through vision-language semantic connection

In this chapter, we consider the task of visual scene tagging in media content by leveraging semantic modality connections from vision-language models, visual vocabularies, and screenplays. Further, we also explore the role of pretrained visual scene representations in downstream macro-level multimodal content understanding tasks, i.e., genre classification. The overview of this work can be seen in Fig 2.1.

2.1 Broad questions and contributions

In this chapter, we will try to address the following questions:
• What is the role of visual scenes in media, especially movies?
• What are the challenges associated with visual scene tagging in movies?
• How can we leverage domain-specific sources and semantic representations to obtain a visual-scene taxonomy?
• How can we utilize semantic vision-language connections to tag media content at scale?
• How can we utilize the weakly tagged visual scene dataset called MovieCLIP to learn media-focused representations?

Figure 2.1: Role of multimodal perception in enabling visual scene tagging with key elements as modality connection and modality representation and learning

In this chapter, we will primarily see how vision-language semantic connections present in large-scale pre-trained contrastive multimodal models can enable weak visual scene tagging in movies at scale. Further, we show that the visual representations learned from the weakly tagged dataset called MovieCLIP enable downstream media-understanding tasks, e.g., genre classification of movie trailers and visual scene understanding of web videos.

2.2 Role of scene as contextual signal

Visual scene context refers to the global context in an image, including the relationship of the target objects with the environment/location and other co-occurring objects [49], [50]. Visual scene context drives the likelihood of finding particular objects spatially co-located with each other. For example, as shown in Fig 2.2, utensils are more likely to be present in a kitchen than in a train station.

Figure 2.2: Difference between visual scenes - train station and kitchen in terms of object placements.

Apart from the domain of natural scenes, understanding the visual scene context is also important in the case of media content [47], especially movies and curated short content like advertisements. In cinematic terms, mise-en-scène [51] refers to how the different elements of a film are depicted and arranged in front of the camera.
Key components of mise-en-scène include the actors with their different styles, the visual scenes where the interactions take place, set design including lighting and camera placement, and the accompanying costumes and makeup of the artists. The visual scene is considered a crucial component since it sets the mood and provides a background for the various actions performed by the actors in the scene. Visual scenes in movies are often tied to social settings like weddings, birthday parties, and workplace gatherings that provide information about character interactions. Accurate recognition of visual scenes can help uncover the bias involved in the portrayal of under-represented characters vis-a-vis different scenes, e.g., fewer women shown in the office as compared to the kitchen. For content tagging tasks like genre classification, visual scenes provide context information like battlefield portrayals in action/adventure movies, space shuttles in sci-fi movies, or courtrooms in dramas. In the following section, we highlight certain challenges associated with visual scene recognition, especially w.r.t. movies.
2.3 Visual scenes in movies: Challenges
Visual scene recognition, in the case of static images, is primarily driven by natural scenes due to large-scale datasets like SUN397 [52] and Places-2 [53]. However, there are certain inherent challenges in visual scene recognition for movies that need to be addressed, as shown in Fig 2.3.
Figure 2.3: Overview diagram highlighting the challenges associated with visual scene recognition in movies: (a) Domain mismatch between natural scene images (Source: http://places2.csail.mit.edu/explore.html) vs. frames from movies for living room (b) Movie-centric visual scene classes like prison, control room etc. that are absent from existing taxonomies (c) Change in visual scene between shots in the same movie clip.
Domain mismatch - scene images vs. movie frames: Visual scenes depicted in movies are distinct compared to natural scenes due to increased focus on actors, multiple activities, and viewpoint variations like extreme closeups or wide-angle shots. An example is shown in Fig. 2.3 (a) for images from the Places2 dataset [53] and movie frames from the Condensed Movies dataset [54].
Lack of completeness in scene taxonomy: Movies depict both real-life and fictional scenarios that span a wide variety of visual scenes. As shown in Fig. 2.3 (b), certain movie-centric visual scene classes like battlefield, control room, prison, war room, funeral, and casino are absent from existing public scene taxonomies associated with natural scene image and video datasets.
Lack of shot-specific visual scene annotations: Existing datasets like Condensed Movies [54] and VidSitu [55] provide a single visual scene label for the entire movie clip (around 2 minutes long), obtained through descriptions provided as part of the YouTube channel Fandango Movie clips1. In Fig. 2.3 (c), the provided description, "Johnny Five (Tim Blaney) searches for his humanity in the streets of New York.", mentions only the visual scene street, while the initial set of events takes place inside a church. Instead of considering a single scene label for the entire movie clip, shot-level visual scene annotation can help in tracking the scene change from church to street.
1 https://www.youtube.com/channel/UC3gNmTGu-TTbFPpfSs5kNkg
2.4 Contributions
In our work, we consider shots within a given movie clip as the fundamental units for visual scene analysis, since a shot consists of a consecutive set of frames related to the same content, whose starting and ending points are triggered by recording using a single camera [56]. Our contributions are as follows:
• Language guided movie-centric scene taxonomy: We develop a movie-centric scene taxonomy by leveraging scene headers (sluglines) from movie scripts (language-based sources) and existing video datasets with scene labels like HVU [57].
• Automatic shot tagging: We utilize our generated scene taxonomy to automatically tag around 1.12M shots from 32K movie clips using a pretrained vision-language model called CLIP [58] based on a frame-wise aggregation scheme.
• Multi-label scene classification: We develop multi-label scene classification baselines using the shot-level tagged dataset called MovieCLIP and evaluate them on an independent shot-level dataset curated by human experts.
• Macro-level content understanding: We further extract feature representations from the baseline models pretrained on MovieCLIP and explore their applicability in the macro-level content understanding task of movie genre classification from trailers [59].
2.5 Related work
Image datasets for visual scene recognition: Image datasets for scene classification like MIT Indoor67 [60] relied on categorizing a finite set of (67) indoor scene classes. A broad categorization into indoor, outdoor (natural), and outdoor (man-made) groups for 130K images across 397 subcategories was introduced by the SUN dataset [61]. For large-scale scene recognition, the Places dataset [53] was developed with 434 scene labels spanning 10 million images. The scene taxonomy considered in the Places dataset was derived from the SUN dataset, followed by the careful merging of similar pairs. It should be noted that the curation of large-scale visual scene datasets like Places relied on crowd-sourced manual annotations over multiple rounds.
Video datasets for visual scene recognition: While there has been considerable progress in terms of action recognition capabilities from videos due to the introduction of large-scale datasets like Kinetics [62], ActivityNet [63], AVA [64], and Something-Something [65], only a few large-scale datasets like HVU [57] and Scenes, Objects and Actions (SOA) [66] have focused on scene categorization with actions and associated objects. SOA was introduced as a multi-task multi-label dataset of social-media videos across 49 scenes with objects and actions, but its taxonomy curation involves free-form tagging by human annotators followed by automatic cleanup. HVU [57], a recently released public dataset of web videos with 248 scene labels, relied on initial tag generation based on cloud APIs followed by human verification.
Movie-centric visual scene recognition: In the domain of scene recognition from movies, Hollywood scenes [67] was first introduced with 10 scene classes extracted from headers in movie scripts across 3669 movie clips. A socially grounded approach was explored in Moviegraphs [68] with emphasis on the underlying interactions (relationships/situations) along with spatio-temporal localizations and associated visual scenes (59 classes). For holistic movie understanding tasks, the Movienet dataset [40] was introduced with the largest movie-centric scene taxonomy, consisting of 90 place (visual scene) tags with segment-wise human annotations of entire movies.
Instead of entire movies, short movie clips sourced from the YouTube channel of Fandango Movie clips were used for text-video retrieval in the Condensed Movies dataset [54], visual semantic role labeling [55], and pretraining object-centric transformers [19] for long-term video understanding in the LVU dataset [69]. While there is no explicit visual scene labeling, the raw descriptions available on YouTube with the movie clips have mentions of certain visual scene classes.
MovieCLIP, our curated dataset for visual scene context recognition, is built on top of movie clips available as a part of the Condensed Movies dataset [54]. A comparative overview of MovieCLIP and other image and video datasets with visual scene labels is shown in Table 2.1. In comparison with previous video-centric works, our taxonomy generation relies on domain-centric data sources like movie scripts and auxiliary world knowledge from web-video-based sources like HVU, with minimal human-in-the-loop supervision for taxonomy refinement.

Dataset | Domain | #classes | #samples | Annotation | Unit | AV
Scene 15 [70] | Natural | 15 | ∼6k | Manual | Image | ✓
MITIndoor67 [60] | Natural | 67 | 15620 | Manual | Image | ✓
SUN397 [61] | Natural | 397 | 130,519 | Manual | Image | ✓
Places [53] | Natural | 434 | 10m | Manual | Image | ✓
Hollywood Scenes [67] | Movies | 10 | 3669 | Automatic | Video clip (36.1s) | ✓
Moviegraphs [68] | Movies | 59 | 7637 | Manual | Video clip (44.28s) | ✗
SOA [66] | Web-Videos | 49 | 562K | Semi-automatic | Video clip (10s) | ✗
Movienet [40] | Movies | 90 | 42K | Manual | Scene segment (2 min) | ✗
HVU [57] | Web-Videos | 248 | 251k | Semi-automatic | Video clip (10s) | ✓
Condensed Movies [54] | Movies | NA | 33k | Automatic | Video clip (2 min) | ✓
VidSitu [55] | Movies | ∼50 | 14k | Manual | Video clip (10s) | ✓
LVU [69] | Movies | 6 | 723 | Automatic | Video clip (1∼3 min) | ✓
MovieCLIP | Movies | 179 | 1.12m | Automatic | Shot (3.54s) | ✓
Table 2.1: Comparison of MovieCLIP with other available image and video datasets with visual scene classes. Natural: images of natural scenes. Web-Videos: videos obtained from internet sources like YouTube. AV: whether publicly available or not. Average duration or duration span of the video data sources is provided with the respective units. NA: number of scene classes not explicitly mentioned with the dataset.

Knowledge transfer from pretrained multimodal models: Vision-language (V-L) based pretraining methods involve learning transferable visual representations based on various pretext tasks associated with image and text pairs. Examples of pretext tasks in the V-L domain include prediction of masked words in captions based on visual cues in ICMLM [71], pretraining image encoders based on a bicaptioning objective in VirTex [72], and contrastive alignment of image-caption pairs in CLIP [58]. Leveraging features from CLIP's visual and text encoders has improved existing vision-language tasks [73] and enabled open-vocabulary object detection [74] and language-driven semantic segmentation [75]. In our work, we use the pretrained visual and text encoders of CLIP and utilize it as a noisy annotator by tagging movie shots based on our curated visual scene taxonomy.
2.6 Language driven taxonomy curation
Based on the domain-specific challenges mentioned in Fig 2.3, we can see the need for a taxonomy curation phase for visual scenes in movies. In this section, we outline the process involved in curating a visual scene taxonomy based on the domain information present in movie scripts and the pre-existing scene information present in auxiliary video datasets.
2.6.1 Sources of visual scene information
Movie scripts have been used as external sources for describing and annotating videos through script and subtitle alignment methods in [76], [77], [78], [79]. Movie scripts contain sluglines that provide information about visual scenes, time of the day, and whether the action takes place in indoor or outdoor settings.
Figure 2.4: Sources of visual scene information: (a) Sample slugline from a movie script that contains visual scene, time of day, and location (b) Word cloud showing the distribution of visual scene labels in the HVU [57] dataset.
We parse 156k sluglines from an in-house set of 1434 movie scripts. For each slugline, we automatically extract entities after the "EXT." (exterior) or "INT." (interior) tags, like Hospital room, River, War room, etc. Using this procedure, we extract 173 unique visual scene labels. Our taxonomy generation process is motivated by visual scenes in movies, with sluglines from scripts as seed sources along with auxiliary sources. Since the set of labels from movie scripts is not exhaustive, we also consider auxiliary sources, especially web video datasets with visual scene labels like HVU [57]. We consider HVU as a source of additional labels since its taxonomy (248 visual scene classes) is semi-automatically curated for short trimmed videos, which are similar in nature to movie shots. We do not consider the Places2 [53] dataset since its taxonomy is primarily curated for natural scenes, which are distinct from movie-centric visual scenes. A sample slugline from a movie script and a word cloud showing the distribution of visual scene labels are shown in Fig 2.4.
2.6.2 Visual scene taxonomy curation
In order to develop a comprehensive taxonomy of visual scenes in movies, we develop an automatic way of merging taxonomies from movie sluglines and the auxiliary dataset, i.e., HVU, with minimal human-in-the-loop post-processing. The broad steps involved in the taxonomy generation are listed below:
Label space preprocessing: For simplicity, we denote the set of unique labels from movie sluglines (MS) as L_{MS} and its cardinality as N_{MS}. Similarly, we denote N_{HVU} to be the cardinality of the set L_{HVU}, i.e., the set of unique labels from the HVU dataset. For our case, we have N_{MS} = 173 and N_{HVU} = 248. We extract the intersecting set of labels between L_{MS} and L_{HVU}, denoted by the set L_{com}. The number of common labels between HVU and the movie slugline based taxonomy is N_{com} = 68. We remove the common set of labels L_{com} from both the label spaces of movie sluglines and HVU. This gives us non-intersecting sets of labels in movie sluglines and HVU, denoted by L_{MS} \ L_{com} and L_{HVU} \ L_{com} respectively. We combine these sets of labels to obtain a larger set of labels called L_{NC}, where NC refers to "not common":
L_{NC} = (L_{MS} \ L_{com}) ∪ (L_{HVU} \ L_{com})   (2.1)
Merging with common label space: In this step, we find labels in L_{NC} that are semantically close to the labels in L_{com}. We extract dense 384-D label representations using the MiniLM-L6-v2 sentence transformers model [80] for labels in L_{NC} and L_{com}. For each label in L_{NC} we compute cosine similarities with the set of labels in L_{com} based on the label representations. We merge those labels from L_{NC} with the similar labels in L_{com} whose top-1 cosine similarity values are greater than 0.6. We update the set of labels L_{NC} to L_{N} by removing the merged labels.
Examples of such merging, with respective cosine similarities and sources, are as follows:
• dune (L_{NC}) → desert (L_{com}) (0.66)
• tennis camp (L_{NC}) → tennis court (L_{com}) (0.62)
• restroom (L_{NC}) → bathroom (L_{com}) (0.80)
• rural area (L_{NC}) → village (L_{com}) (0.64)
• boardwalk (L_{NC}) → walkway (L_{com}) (0.67)
• television room (L_{NC}) → living room (L_{com}) (0.67)
• glacial lake (L_{NC}) → lake (L_{com}) (0.73)
Figure 2.5: The share of different sources (HVU, Movie Sluglines, Common labels and Human expert) in curating the label taxonomy. Example labels from different sources are shown in boxes with the pie chart.
Human-in-the-loop taxonomy refinement: A human expert inspects the labels in L_{N} and removes both generic scene labels such as body of water, coastal and oceanic landforms, horizon, landscape, and underground, as well as highly specific scenes such as coral reef, white house, piste, badlands, etc. We use label representations from the previous step to exploit semantic similarity between the labels remaining in L_{N}. For each relevant label in L_{N}, a threshold of 0.7 on the top-1 similarity score is used to filter out similar labels. While comparing two similar labels, the human expert relies on wiki definitions to merge the more specific label into the generic one. For example, by definition a bazaar is a special form of market selling local items (as per wiki) and is therefore merged into market. Other examples include:
• {stream, riverbed, creek, river} → river
• {hill, mountain, mountain pass, mountain range} → mountain
• {road, road highway, lane} → road
• {port, marina dock, harbor} → harbor
This results in a set of labels called L_{merge} from L_{N}. Further, the human expert is exposed to 1000 randomly sampled shots from movie clips in Condensed Movies [54] and reviews the current set of labels in the set L_{merge} ∪ L_{com}. Based on the video content, a set of scene labels (L_{human}) that are missing from the current set is added by the human expert. Thus, the final set of 179 visual scene labels is obtained as follows:
L_{final} = L_{com} ∪ L_{merge} ∪ L_{human}   (2.2)
Label source distribution: As shown in Fig. 2.5, the largest share is from movie sluglines (39%). Only 9% of the total labels are provided through feedback from the human expert. Instead of manually binning classes into broad categories like indoor, outdoor, or man-made, we discover groupings among classes through affinity propagation clustering [81] (based on label representations from sentence transformers). Certain clusters of visual scene labels are listed below, where the sports locations, water bodies, and performing arts locations are grouped together:
• Sport locations: Basketball court, Race track, Tennis court, Batting cage, Golf course
• Water bodies: River, Pool, Waterfall, Hot spring, Pond, Swamp, Lake
• Performing arts locations: Stage, Conference Room, Theater, Auditorium, Ballroom
• Natural landforms: Mountain, Desert, Valley
A detailed breakdown of visual scenes with respective sources is shown in Appendix (C). For visualizing certain groups of visual scene classes that are semantically close to each other, we use tSNE [82] to project the 384-D label embeddings obtained through the MiniLM-L6-v2 sentence transformers model [80] to 2 dimensions. Examples include the visual scene classes associated with study, water bodies, medical scene classes, natural landforms, automobile-based scene classes, and dining-scene classes, as seen in Fig 2.6.
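To make the embedding-based merging step concrete, the following is a minimal sketch of the label-merging procedure, assuming toy label lists and the publicly available all-MiniLM-L6-v2 sentence-transformers checkpoint (the exact checkpoint name and label sets used in our pipeline may differ); the 0.6 top-1 cosine similarity threshold follows the text.

```python
# Minimal sketch of the embedding-based label merging step (toy label lists;
# the actual slugline/HVU label sets are described in the text).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-D label embeddings

l_com = ["desert", "bathroom", "lake", "living room"]           # common labels (L_com)
l_nc = ["dune", "restroom", "glacial lake", "television room"]  # non-common labels (L_NC)

emb_com = encoder.encode(l_com, convert_to_tensor=True, normalize_embeddings=True)
emb_nc = encoder.encode(l_nc, convert_to_tensor=True, normalize_embeddings=True)

sim = util.cos_sim(emb_nc, emb_com)  # (len(l_nc), len(l_com)) cosine similarities

merged, kept = {}, []
for i, label in enumerate(l_nc):
    score, j = sim[i].max(dim=0)
    if score.item() > 0.6:            # merge into the closest common label
        merged[label] = l_com[int(j)]
    else:
        kept.append(label)            # survives into L_N for human refinement

print(merged)  # e.g., {'dune': 'desert', 'restroom': 'bathroom', ...}
print(kept)
```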
2.7 Role of vision-language semantic connection
Since manual tagging of Condensed Movies [54] is not feasible due to the duration of data involved, i.e., 1106 hrs, we ask the following question: Can we use additional knowledge sources to provide weak guidance for the tagging task? In this case, we leverage the pretrained knowledge available as part of CLIP [58]'s vision and language encoders to provide weak guidance for visual scene labeling, i.e., we utilize CLIP to weakly tag the movie video clips with scene labels from our curated taxonomy. CLIP utilizes internet-scale data, i.e., image-text pairs, to pre-train visual and text transformer [19] encoders on a weak-alignment pretext task. Based on a large batch size of 32,768 image-text pairs, the proxy training task is defined as follows: given an image, predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset. Since CLIP has been trained in a contrastive manner for alignment, it can be used to develop zero-shot classifiers for different tasks, including scene recognition ([61]), fine-grained classification ([83], [84], [85]), facial emotion recognition ([86]), object ([87], [88]) and action classification ([89], [90]).
Figure 2.6: tSNE plot of the 179 scene classes in the taxonomy. Certain representative groups of visual scene labels that are semantically close to each other are enclosed by circular/oval shapes.
In the following section, we outline how CLIP's pretrained encoders are used for tagging shots in the Condensed Movies dataset [54] to obtain a curated shot-centric dataset called MovieCLIP with visual scene labels.
Figure 2.7: 1. Contrastive pretraining framework associated with CLIP. 2. Usage of CLIP's pretrained vision and text encoders for zero-shot classification. Image source: [58]
2.8 MovieCLIP dataset
2.8.1 Shot detection from movie clips
Since a shot represents a continuous set of frames with minimal change in the visual scene, we consider movie shots for our subsequent analysis. Associating visual scene labels to shots can help in identifying cases where visual scene recognition is difficult, like close-up or extreme close-up scenarios, even within the same movie scene. A single movie scene involves changes in camera viewpoints across consecutive shots, thus making it difficult for CLIP to associate labels with high confidence for the entire movie scene. For shot detection, we use PySceneDetect2 to segment the movie clips in Condensed Movies with default parameters and content-aware detection mode. The overall statistics of the shot extraction process are shown in Table 2.2.

# movies | Years | # clips | # shots | Avg shots/clip | Avg duration
3574 | 1930-2019 | 32484 | 1124638 | 34.66 | 3.54s
Table 2.2: Statistics of movie shots in the MovieCLIP dataset.

2.8.2 CLIP-based visual scene labeling of movie shots
Figure 2.8: Overview schematic of the prompt-based visual scene labeling of a movie shot using CLIP's visual and text encoders. S is the similarity matrix where entry S_{lt} refers to the similarity value between textual embedding e_l and visual embedding v_t.
In this section, we describe how CLIP [58] can be used to associate visual scene labels with individual movie shots. Based on prompt engineering designs considered by GPT-3 [91], the addition of contextual phrases like "a type of pet" or "a type of food" in the prompts provides additional information to CLIP for zero-shot classification.
2 https://pyscenedetect.readthedocs.io/en/latest/
In a similar vein, we consider the visual scene specific prompt "A photo of a {label}, a type of background location" to tag individual frames in video clips with labels from our scene taxonomy. If a shot contains T frames, we utilize CLIP's visual encoder to extract frame-wise visual embeddings v_t (t = 1, ..., T). For each of the individual scene labels in our taxonomy, we utilize CLIP's text encoder to extract embeddings e_l (l = 1, 2, ..., L) for the background-specific prompts. We use the label-wise (prompt-specific) text and frame-wise visual embeddings to obtain a similarity score matrix S, whose entries S_{lt} are computed as follows:
S_{lt} = e_l^T v_t / (‖e_l‖_2 ‖v_t‖_2)   (2.3)
We compute an aggregate shot-specific score for individual scene labels by temporal average pooling over the similarity matrix, since the visual content within a shot remains fairly unchanged. The computation of the shot-specific score, called CLIPSceneScore_l for the l-th visual scene label, is shown in Eq. 2.4. The overall workflow of the process is illustrated in Fig 2.8.
CLIPSceneScore_l = (1/T) ∑_{t=1}^{T} S_{lt}   (2.4)
2.8.3 Prompt design choices
In terms of prompt designs, we consider the following templates customized for generic background information:
• A photo of {}, a type of background location
• A photo of {}, a type of location
Since multiple shots have people in focus, we also consider the following people-centric prompt templates:
• People at {}, a type of background location
• People at {}, a type of location
As shown in Fig 2.9, the prompts associated with generic background information tend to perform better in terms of associating visual scene labels to movie shots with higher confidence, when compared with the people-centric prompts. Further, inclusion of the contextual phrase "a type of background location" tends to perform better than "a type of location" in associating top-1 visual scene labels with higher CLIPSceneScore values. When people-centric prompts are used, the CLIP-based labeling scheme can result in incorrect associations like interrogation room in Fig 2.9 (a) and cockpit in Fig 2.9 (c). Hence, we consider "A photo of {}, a type of background location" as our final prompt choice.
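A minimal sketch of the per-shot CLIPSceneScore computation (Eqs. 2.3-2.4) is shown below. The frame paths, the small label subset, and the ViT-B/32 checkpoint are illustrative assumptions, not the exact settings of our tagging pipeline.

```python
# Sketch of per-shot CLIPSceneScore computation: frame-wise CLIP similarities
# followed by temporal average pooling (illustrative frames/labels/checkpoint).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

scene_labels = ["kitchen", "cockpit", "downtown", "dining room"]  # subset of taxonomy
prompts = [f"A photo of {l}, a type of background location" for l in scene_labels]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)       # e_l

def clip_scene_scores(frame_paths):
    """Average frame-label cosine similarities over the shot (Eq. 2.4)."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    with torch.no_grad():
        vis_emb = model.encode_image(frames)
        vis_emb = vis_emb / vis_emb.norm(dim=-1, keepdim=True)       # v_t
        sim = text_emb @ vis_emb.T                                   # S_{lt}, shape (L, T)
    return sim.mean(dim=1)                                           # CLIPSceneScore_l

# Hypothetical frame paths sampled from one detected shot (e.g., at 4 fps)
scores = clip_scene_scores(["shot_0001/frame_000.jpg", "shot_0001/frame_004.jpg"])
top = scores.topk(k=2)
print([(scene_labels[int(i)], round(float(s), 3)) for s, i in zip(top.values, top.indices)])
```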
Figure 2.9: Examples of various prompt templates and associated CLIPSceneScores for top-k visual scene labels (k=5), shown for four example shots (a)-(d).
Figure 2.10: (a) Sample frames from movie shots labeled by CLIP with high confidence (CLIPSceneScore ≥ 0.6; labels shown in yellow: downtown, cockpit, foundry, stadium). (b) Sample frames from movie shots tagged by CLIP with low confidence (labels shown in yellow).
2.8.4 Analysis of CLIP tagging
2.8.4.1 Qualitative examples
As shown in Fig 2.10a, CLIP performs well when distinctive elements of the visual scene are present, like background objects in indoor locations (living room) or appearance-based cues (green color background for forests).
For example, in Fig 2.10a, the presence of airplane windows indicates that the visual scene label associated with the given movie shot is cockpit. However, for shots involving blurry motion or an absence of background information due to the close-up of the involved characters, CLIP's confidence in associating visual scene labels is low (Fig 2.10 (b)).
Genre-wise association: We consider those shots whose top-1 CLIPSceneScore values are greater than or equal to 0.4 and show the top-5 scene labels in terms of occurrences for certain genres like western, sport, war, and musical.
Figure 2.11: Genre-wise distribution of different scene labels. For each genre, the top-5 scene labels are shown in terms of the number of occurrences in the top-1 label provided by CLIP [58] for shots in the MovieCLIP dataset. Threshold for the confidence score of the top-1 label = 0.4.
From Fig 2.11, we can see that relevant scenes are associated with genres through CLIP's labeling scheme. Some notable examples include {shooting range, desert} for western, {locker room, boxing ring} for sport, {ballroom, stage} for musical, and {cockpit, battlefield} for war.
2.8.4.2 Shot type distribution
We analyze the distribution of shot types in terms of scale for samples extracted from the Condensed Movies dataset [54]. We consider the shot scale type taxonomy in the MovieShots [92] dataset as follows:
• Long shot (LS): Shot taken from a long distance, like a quarter of a mile, with the least person close-up.
• Full shot (FS): Shot showing the entire human body.
• Medium shot (MS): Shot showing humans from the knees or waist up.
• Close-up shot (CS): Shot focused on a small segment like the face of a person.
• Extreme close-up shot (ECS): Shot showing smaller regions, like an image of an eye or mouth.
We use the samples in the MovieShots dataset to train a 2-layer LSTM [93] network (hidden dimension = 512) to classify between the 5 scale classes. For individual shots, we extract frame-wise features from a pretrained ViT-B/16 [94] network at 4 fps as inputs to the LSTM network. For training, validation, and testing we use a split of 23557, 6764, and 3332 shots. We use the trained model to predict the shot scale labels for the samples in the MovieCLIP dataset. From Figure 2.12 (a), we can see that for low-confidence shots in the MovieCLIP dataset having top-1 CLIPSceneScore ≤ 0.2, a significant proportion (86%) have person close-up ranging from moderate (MS) to very high (ECS). When combined, close-up (CS) and extreme close-up (ECS) shots constitute 50% of the total samples having top-1 CLIPSceneScore ≤ 0.2. However, for high-confidence shot samples in MovieCLIP, whose top-1 CLIPSceneScore values are greater than 0.4, the combined share of CS and ECS scale labels decreases to 24%. Further, the combined share of FS and LS rises from 14% in Fig 2.12 (a) to 35% in Fig 2.12 (b). This indicates that a major share of the shot samples having low CLIPSceneScore values have high (CS) to very high (ECS) person close-up, whereas the increasing share of FS and LS scale labels in Fig 2.12 (b) shows that CLIP needs more background information in the shots to tag visual scenes with high confidence.
2.8.5 Quality estimation through human verification
In order to estimate the reliability of the top-k labels provided by CLIP for movie shots, we conducted a verification task on Amazon Mechanical Turk.
We provided a pool of annotators with a subset of 2393 movie shots from the VidSitu [55] dataset, along with the top-5 scene labels.
Figure 2.12: (a) Distribution of predicted scale labels for shots (N=745059) in the MovieCLIP dataset having top-1 CLIPSceneScore ≤ 0.2 (b) Distribution of predicted scale labels for shots (N=107420) in the MovieCLIP dataset having top-1 CLIPSceneScore ≥ 0.4 and top-k CLIPSceneScore ≥ 0.1 (k=2,3,4,5). Scale labels include: ECS: Extreme Close-up shot, CS: Close-up shot, MS: Medium shot, LS: Long shot, FS: Full shot.
Figure 2.13: Schematic design of the mturk experiment used for human verification of visual scenes.
As shown in Fig 2.13, if none of the scene labels appear relevant for the given movie shot, the annotators choose the Not relevant option. After the human verification experiment is complete, we discard the shot samples with no agreements and obtain evaluation data consisting of 1883 shot samples. We also use the LSTM model pretrained on the MovieShots dataset to predict shot types for samples that have no agreements with human experts. The distribution of shot-wise predictions from the trained LSTM model is shown in Fig 2.14. We can see that 80% of shots having no agreements with human annotations belong to the shot categories having moderate (MS) to very high (ECS) person close-up.
Figure 2.14: Distribution of shot scale predictions among the shots having no agreements between human annotators and CLIP's labeling scheme. ECS: Extreme Close-up shot, CS: Close-up shot, MS: Medium shot, LS: Long shot, FS: Full shot.
2.9 Experiments and Results
2.9.1 Experimental Setup
For training and validation purposes, we retain those shot samples whose top-1 CLIPSceneScore is greater than or equal to 0.4 (approximately the 75th percentile), resulting in a clean subset. After top-1 filtering, we also consider labels from top-k (k = 2 to 5) whose CLIPSceneScore is greater than 0.1 to associate multiple labels per sample. This results in a set of 107k samples with train, val, and test splits of 73.8k, 23.2k, and 10.3k, having a non-intersecting set of ids with the human-verified evaluation set. Approximately 38.4% of the dataset is multi-label, covering 150 scene classes out of the 179 in the curated scene taxonomy. All the related experiments were conducted using the PyTorch [95] framework on 4 NVIDIA T4 GPUs. For training the respective models, we use the binary cross-entropy loss function. For evaluation, we use mean average precision (mAP) and Pearson correlation (averaged across samples) as metrics.
2.9.2 Visual scene recognition - Movies
Frame-wise aggregation models: For frame-wise aggregation, we extract dense embeddings from individual shots at 4 fps. We use two sets of embeddings: 512-dim embeddings from ResNet18 [96] pretrained on the Places2 dataset and 768-dim embeddings from a ViT-B/16 [94] model pretrained on ImageNet [87]. Following feature extraction, we perform temporal aggregation using an LSTM [93] with 2 layers and a hidden dimension of 512.
3D convolutional network models: We use I3D [97], R(2+1)D [98], and Slowfast [99] as baseline 3D convolutional models in a multi-label setup. The I3D [97] and Slowfast [99] models have a ResNet50 [96] backbone, whereas R(2+1)D [98] has a ResNet34 [96] backbone. All the models are initialized from Kinetics400 [62] pretrained weights. For finetuning I3D [97] and Slowfast [99], we use SGD with learning rates in {0.1, 1e-3} and a weight decay of 1e-4. For R(2+1)D [98] we use Adam [100] with a learning rate of 1e-4. Batch sizes for the models are varied between 16 and 32.
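As an illustration of the frame-wise aggregation baseline described above, the following is a minimal PyTorch sketch of a 2-layer LSTM over precomputed per-frame features with a multi-label binary cross-entropy objective; the dummy tensors, learning rate, and optimizer choice are assumptions for illustration only.

```python
# Minimal sketch of the frame-wise aggregation baseline: precomputed per-frame
# features (e.g., 768-D ViT-B/16 at 4 fps) -> 2-layer LSTM (hidden 512) ->
# multi-label logits over the scene taxonomy.
import torch
import torch.nn as nn

class ShotSceneLSTM(nn.Module):
    def __init__(self, feat_dim=768, hidden=512, num_classes=179):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):             # x: (batch, T, feat_dim)
        _, (h_n, _) = self.lstm(x)    # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])     # multi-label logits: (batch, num_classes)

model = ShotSceneLSTM()
criterion = nn.BCEWithLogitsLoss()    # binary cross-entropy over scene labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed hyperparameters

features = torch.randn(8, 14, 768)    # dummy batch: 8 shots x 14 frames x 768-D
targets = torch.zeros(8, 179)
targets[:, 5] = 1.0                   # dummy multi-label ground truth

logits = model(features)
loss = criterion(logits, targets)
loss.backward()
optimizer.step()
```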
Video transformer models: For video transformer models, we consider the base TimeSformer model [101] that takes 8 frames (224 x 224) as input. For finetuning the TimeSformer [101] model, we use SGD with a learning rate of 5e-3, a weight decay of 1e-4, and a batch size of 8. For a better speed-accuracy tradeoff, we use the Video Swin Transformer model [102], [103] called Swin-B with a clip size of 32 frames (224 x 224) as input. For finetuning, we use the AdamW [104] optimizer with a learning rate of 1e-4 and cosine annealing, with a batch size of 32.
Based on the results in Table 2.3, we can see that the 2-layer LSTM model trained using features from the ImageNet-21K pretrained ViT-B/16 performs better than the one using features extracted from the ResNet-18 model pretrained on the Places2 dataset. This shows that features from ViT-B/16 are more suited for scene recognition in the movie domain as compared to Places2. In terms of end-to-end models, video transformers, including the TimeSformer and Swin-B models, outperform 3D convolutional models. The Swin-B model performs better than the other models, obtaining an average correlation of 0.497 and a mean average precision of 44.4. Swin-B's performance is very close to ViT-B/16 due to the initialization of Swin-B with pretrained vision transformer weights.

Frame-wise aggregation
Model | Features | mAP | Correlation
LSTM (512, 2 layers) | Places2 (4 fps) | 24.15 | 0.29
LSTM (512, 2 layers) | ViT-B/16 (4 fps) | 43.10 | 0.42
3D convolutional networks
Model | Features | mAP | Correlation
SlowFast (R50) [99] | NA | 25.80 | 0.402
R(2+1)D (R34) [98] | NA | 26.73 | 0.40
I3D (R50) [97] | NA | 13.33 | 0.26
Video Transformers
TimeSformer [101] | NA | 36.87 | 0.46
Swin-B [102] | NA | 44.4 | 0.497
Table 2.3: Mean average precision (mAP) and average Spearman correlation of different models on the human-verified evaluation set (N=1883 shots). NA: end-to-end models used instead of features. For 3D conv models, the backbone network is mentioned inside brackets.

2.9.3 Downstream tasks
2.9.3.1 Visual scene recognition - web videos
We also explore knowledge transfer from models finetuned on MovieCLIP by evaluating performance on downstream multi-label scene classification with the HVU dataset [57]. For training and evaluation, we use 251k and 16k videos with 248 scene labels. We extract 1024-dim features from the best performing Swin-B model in Table 2.3. We train 3-layer fully connected models on the respective features with the following configuration:
Mscene: INP[1024] → FC[4096], DO(0.2) → FC[4096] → FC[248]
Here FC refers to a fully connected layer, and DO refers to dropout. From Table 2.4, we can see that Mscene exhibits better performance when compared to existing end-to-end models trained on HVU.

Model | mAP
3D-ResNet [57] | 50.6
3D-STCNet [57] | 51.9
HATNet [57] | 55.8
Mscene | 55.92
Table 2.4: Mean average precision of different models on the HVU dataset for multi-label scene classification (248 classes). Backbone for end-to-end models: 3D ResNet18.
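A minimal sketch of the fully connected probe Mscene described above is given below; the choice of ReLU activations is an assumption, since only the layer widths and dropout are specified in the text.

```python
# Sketch of the fully connected probe M_scene used for HVU transfer:
# 1024-D Swin-B features -> FC[4096] + dropout(0.2) -> FC[4096] -> FC[248].
import torch.nn as nn

def build_mscene(in_dim=1024, hidden=4096, num_classes=248, dropout=0.2):
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.ReLU(),                       # activation choice is an assumption
        nn.Dropout(dropout),
        nn.Linear(hidden, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_classes),  # multi-label logits for 248 HVU scene classes
    )

mscene = build_mscene()
# M_trailer (next subsection) follows the same pattern with FC[512] layers and 13 genre outputs.
```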
2.9.3.2 Macro-level content understanding - genre classification
As an additional downstream task, we consider multi-label genre classification of movie trailers in the Moviescope dataset [59]. Out of the original set of 4927 trailers, we could access 3900 videos from YouTube. Based on the provided splits, we use 2948, 410, and 542 videos for training, validation, and testing purposes, respectively. We use the 1024-dim features extracted from the best performing Swin-B model in Table 2.3. We train 3-layer fully connected models on the respective features with the following configuration:
Mtrailer: INP[1024] → FC[512], DO(0.2) → FC[512] → FC[13]
Even when the number of trailer videos used is a subset of the original split, Mtrailer exhibits similar genre-wise trends as the other models. From Table 2.5, we can see that Mtrailer shows better performance in Animation and Comedy as compared to genres like Biography and Mystery. When compared with fstTConv in Table 2.5, our fully connected model Mtrailer performs slightly worse due to the non-availability of the entire training data.

Model | Overall | Ac | Ani | Bio | Com | Cri | Drm | Fmy | Fntsy | Hrrr | Myst | Rom | ScF | Thrl
Mtrailer | 56.14 | 62.97 | 86.51 | 14.4 | 80.77 | 49.58 | 79.58 | 74.55 | 49.59 | 50.62 | 26.83 | 45.05 | 47.99 | 61.36
C3D [105] | 53.4 | 63.8 | 91.3 | 16.2 | 82.3 | 45.1 | 71.6 | 65.3 | 54.8 | 50.8 | 28.2 | 38.3 | 21.8 | 64.8
I3D [97] | 38.8 | 37.2 | 51.8 | 9.2 | 72.6 | 33.9 | 67.6 | 43.6 | 39 | 22.8 | 21.3 | 34.3 | 22.6 | 48.3
LSTM [59] | 48.4 | 47.5 | 86.8 | 12 | 79.2 | 33 | 72 | 64.5 | 54.4 | 22.7 | 24.7 | 40.4 | 36.5 | 54.8
Bi-LSTM [59] | 47.4 | 49.9 | 86.3 | 8.2 | 77.6 | 29.9 | 70.8 | 65.4 | 55.3 | 22.3 | 21.7 | 41.6 | 35.9 | 51.2
fstVid [59] | 56.5 | 61.4 | 94.8 | 23.9 | 81.5 | 41.7 | 77 | 67 | 62.6 | 36.1 | 30.4 | 48.4 | 48.2 | 62
fstTConv [59] | 58.9 | 64.7 | 95.7 | 21.2 | 83.5 | 49.1 | 78.9 | 68.6 | 68.9 | 42.7 | 29.2 | 46.8 | 51 | 64.8
Table 2.5: Mean average precision of different models for multi-label genre classification (13 classes) on the Moviescope dataset. Except for Mtrailer, comparison results are reported from [59]. Abbreviations: Ac: Action, Ani: Animation, Bio: Biography, Com: Comedy, Cri: Crime, Drm: Drama, Fmy: Family, Fntsy: Fantasy, Hrrr: Horror, Myst: Mystery, Rom: Romantic, ScF: SciFi, Thrl: Thriller, fstVid: fastVideo, fstTConv: fastVideo + Temporal Conv.

2.9.3.3 Impact of MovieCLIP pretraining
Figure 2.15: Sample scene classes in HVU where Mscene performs better in comparison to Mscene(Kin).
We consider the impact of MovieCLIP-based pretraining by fixing the fully connected architectures Mscene and Mtrailer and varying the input features. In the without-MovieCLIP (Kin) pretraining setting, we extract 1024-dim features from the Swin-B model pretrained on Kinetics400 for the HVU and Moviescope datasets. From Table 2.6, we can see that the performance of Mscene with MovieCLIP pretrained features is comparable to Mscene(Kin), even when the domain of Kinetics400 [62] is matched to HVU. Further, late fusion of the prediction logits of Mscene and Mscene(Kin) with equal weights improves the mAP to 57.73 for HVU, thus indicating the capture of complementary information when trained with movie data.

HVU Model | mAP | Moviescope Model | mAP
Mscene | 55.92 | Mtrailer | 56.14
Mscene(Kin) | 56.05 | Mtrailer(Kin) | 53.29
Late Fusion | 57.73 | Late Fusion | 56.29
Table 2.6: Impact of MovieCLIP pretrained features vs. Kinetics pretrained features for Mscene (HVU) and Mtrailer (Moviescope). Results reported are mean average precision (mAP) values. Model(Kin): model with Kinetics400 pretrained features, where Model ∈ {Mscene, Mtrailer}.
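The equal-weight late fusion of prediction logits mentioned above can be sketched as follows, using illustrative random tensors in place of the actual model outputs.

```python
# Sketch of equal-weight late fusion of prediction logits from the
# MovieCLIP- and Kinetics400-pretrained feature models (illustrative tensors).
import torch

logits_movieclip = torch.randn(16, 248)   # M_scene predictions for a batch
logits_kinetics = torch.randn(16, 248)    # M_scene(Kin) predictions for the same batch

fused = 0.5 * logits_movieclip + 0.5 * logits_kinetics
probs = torch.sigmoid(fused)              # per-class scores used for mAP evaluation
```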
We show some class-wise analysis in Fig 2.15 to showcase the classes where MovieCLIP pretrained features improve upon Kinetics400 pretrained features. From Fig 2.15, we can see that for certain classes present in our taxonomy, like tunnel, restaurant, apartment, attic, and concert hall, Mscene performs better when compared to Mscene(Kin). Similar trends can be seen for HVU scene classes that are part of broader scene classes in our taxonomy, like riverbed (part of river), mountain pass (part of mountain), and track (part of race track). In the case of Moviescope, Mtrailer results in improved performance (56.14) as compared to Mtrailer(Kin) (53.29), due to domain similarity with the MovieCLIP dataset.
2.10 Ethical implications
Visual scene recognition capabilities can help in uncovering biases associated with the portrayal of under-represented and marginalized characters in various settings. For example, women are portrayed more in indoor scenes like kitchen, living room, and hospital as compared to scenes like factory, laboratory, or battlefield. Further, characters from marginalized demographic groups are often depicted in the background w.r.t. common visual scenes, thus having a considerably smaller share of speaking time. Apart from the portrayal of characters, the usage of large-scale pretrained models like CLIP [58] can help diagnose the inherent biases associated with their predictions, since they are trained on free-form data curated from the web. The proposed scheme of utilizing CLIP for weakly tagging datasets can reduce the costs associated with large-scale human expert-driven annotation processes.
2.11 Takeaways
From visual scene context recognition in media content guided by multimodal signals, we have the following major takeaways:
• Domain-specific sources: Domain-specific linguistic sources like screenplays provide rich sources of information for taxonomy curation.
• Vision-language semantic connection: The pretrained knowledge in CLIP, in the form of a vision-language semantic connection, can enable scene tagging at scale.
• Context-change tracking: Shot-specific labels can enable tracking of visual scene context changes in dynamic content like movie videos.
• Macro-level content understanding: Pretrained visual scene representations can enable macro-level multimodal content understanding (i.e., genre classification) and scene recognition in web videos.
Chapter 3
Narrative understanding and message decoding in advertisements: Multimodal perception guided approach
In this chapter, we will explore how the core elements of multimodal perception, namely modality connections and representations, along with diverse reasoning mechanisms, enable the broad narrative understanding of advertisement videos. Specifically, we consider advertisement videos due to their unique narrative structure of multiple short-term temporal context changes that are linked by a longer narrative thread. In order to study the context-driven approaches, we introduce a new ads-specific benchmark called MM-AU with salient narrative-driven tasks for macro-level understanding, i.e., topic categorization, tone transition, and social relevance. An overview of this work can be seen in Fig 3.1. Additionally, we will explore the role of current large multimodal models in understanding advertisements, focusing on their current capabilities and potential limitations.
3.1 Broader questions and contributions
In this chapter, I plan to address the following questions related to narrative-driven understanding of media sources, especially advertisement videos. The broad questions considered in this chapter are listed as follows:
• How does the unique structure of advertisement videos help in large-scale computational understanding?
• How can we leverage connections between modalities, including representations and associated reasoning mechanisms, to enable narrative-task-driven understanding of ads?
• Can we approach the advertisement understanding task in a unified manner through large multimodal models? What are the relative advantages and associated shortcomings?
Figure 3.1: Role of multimodal perception in the narrative understanding of advertisement videos with key elements as modality connections, representations, and reasoning.
To answer the above-mentioned questions, we introduce an advertisement benchmark called MM-AU with the novel narrative-driven tasks of topic categorization, transition in perceived tone, and social message detection. Further, we propose a lightweight two-stage multimodal fusion model for task-specific modeling. Finally, we look at the possibility of a unified understanding of advertisement images through large multimodal models and identify potential limitations and the scope of future explorations.
3.2 Advertisements as medium
As a primary media source for disseminating information, advertisements (ads for short) have been utilized to promote products or convey messages about extant social or political issues. The utility of advertisements as a media source has been amplified by the wide variety of platforms like radio, television, newspaper print, video streaming, and social networking sites, presenting significant influences, whether direct or indirect, to viewers from diverse backgrounds [106]. As shown in Fig 3.2, the rising importance of advertisements in the current socio-economic scenario is evident from the expected increase in media ad spending from 225.79 billion dollars in 2020 to 322.11 billion dollars in 2024.
Figure 3.2: Increase in demand for advertisements over the last 4 years.
3.3 Structure of ads: Task taxonomy
For decoding ads, including images and videos, the specific nature of the advertisement domain should be taken into account. Some salient characteristics of advertisement videos/images can be listed as follows:
Advertisements - Context transitions: In terms of context transitions, advertisements lie between action-oriented short videos and feature-length movies. In Fig 3.3, we can see that multiple short-term context changes are happening due to the following sequence of activities/interactions:
• Police activity starting
Figure 3.3: Structure of an advertisement video with multiple short-term context changes and an overall long-term linkage.
Complex visual reasoning: Advertisement print/images or videos contain direct or indirect messages in order to attract the attention of viewers. For example, in Fig 3.4 (a), the message conveyed to the viewers is direct, in the form of joining the army or the motivational message "we can do it". However, in Fig 3.4 (b), understanding the ads requires decoding symbolic structures such as forest lungs as a symbol of deforestation, trailing blood as a symbol of animal cruelty, or the purple color of a pencil derived from an eggplant. An additional challenge in advertisement decoding involves the handling of non-photorealistic atypical objects, as shown in Fig 3.4 (c). The atypical portrayal of objects based on the combination of various parts and solid and liquid deformities serves as a rhetorical device to capture the viewer's attention [107–109]. The automatic detection of atypicality poses a significant challenge for AI systems, as it demands common-sense reasoning to distinguish between different forms and spatial comprehension of local regions within the context of the entire image.
Figure 3.4: Examples of diverse forms of messaging in advertisement print images/videos: (a) messages with simple directions, (b) messages requiring complex visual reasoning (forest lungs as a symbol of deforestation, trailing blood as a symbol of animal cruelty, color from an eggplant as an atypical object), and (c) messages requiring handling of non-photorealistic objects (atypical object categories like solid deformed objects and combinations of different parts). Image sources: [16], [107], [109]
Given the aforementioned characteristics of the advertisement medium, the information in ads can be viewed as structured across multiple levels, necessitating both broad and fine-grained visual understanding. As shown in Fig 3.5, the tasks consider a natural division of the related elements as follows:
• Thematic elements: The thematic element of ads requires a broad understanding in the form of semantic categories, i.e., topics.
• Fine-grained elements: Fine-grained understanding of ads can be enabled through diverse tasks like atypical object classification, social message detection, modeling persuasive strategies, and perceived sentiment/tone among the viewers.
Figure 3.5: Task taxonomy for advertisement understanding, spanning thematic elements (broad topics), fine-grained elements (atypicality, persuasion, social message, sentiments/tone), and inference/action (the underlying message of the ad or the desired action from the viewer).
In the next section, we elaborate on the narrative structure of ads and the need for novel tasks for web-based videos.
3.4 Advertisements - Narrative driven tasks and challenges
Objectively understanding the rich content in ads and their impact on the viewer experience and behavior is hence of great interest.
However, in enabling computational media understanding [47], advertisements present unique challenges in the form of condensed narrative structures [110]. Due to their relatively short duration when compared to feature-length movies, an advertisement video showcases a particular narrative structure in a tightly integrated sequence with different formats, including slice-of-life [111], drama [112], and transformational [113]. Further, reasoning about the narrative structure requires a multi-scale understanding of the underlying topic and fine-grained elements, including the sequence of events (of characters and interactions) and related messages. As shown in Fig 3.6, the key elements associated with the narrative structure of ads and the related challenges are listed as follows:
Figure 3.6: Schematic diagram showing illustrative examples of various tasks in the MM-AU (Multi-modal ads understanding) dataset. Multimodal understanding of ads along the lines of (a) Topic categorization (18 classes, e.g., car, food and drink, electronics, travel and transport, games, health), (b) Tone transition (e.g., start: positive, middle: negative, end: positive), (c) Social message detection, i.e., absence/presence of a social message (e.g., accidents, gender equality, refugee crisis, gun violence, environment damage, smoking).
Topic: Topic understanding enables personalized categorization and retrieval for customers, along with key insights into the representation of genders [114] and different demographic groups with respect to target classes like healthcare, retail, travel, etc. Topic categorization involves the handling of both inter- and intra-topic diversity between the videos in terms of human-object interactions and a wide variety of items/products, as shown in Fig 3.6 (a).
Tone transition: The affective tone associated with an advertisement video refers to the feeling perceived by a viewer [115]. Associating the appropriate tone with an ad enhances its persuasiveness, thus enabling the associated brand to expand its reach to a wide range of customers. While a positive tone centers around optimistic elements associated with hope and success [116], portrayals of negative tone are tied to sad narratives involving fear and suffering. However, due to the narrative structure, the perceived affective tone exhibits transitions during the duration of an advertisement video, accompanied by changes in visuals and background music. In Fig 3.6 (b), the video starts on a positive note with a happy person taking a picture, followed by a perceived negative tone in the middle due to the suffering of the person. The advertisement ends on a positive note, with the person being saved by an incoming vehicle.
Social message: Advertisements act as a major source of information about pressing social issues for consumers. Brands conveying messages about various social issues, including but not limited to gender inequalities, racial discrimination, and environmental conservation, are viewed favorably by consumers across different age groups [116]. In terms of advertisement videos, social message portrayal is characterized by a huge diversity in depiction, as shown in Fig 3.6 (c), due to underlying categories like smoking, accidents, gun violence, etc.
In this work, we introduce a multilingual multimodal benchmark called MM-AU for the macro-level understanding of advertisement videos across the tasks of topic categorization, social message detection, and tone transition detection.
Due to the inherent structure of ads involving transitions in temporal context driven through multiple modalities, we propose context-guided attention mechanisms for the previously mentioned macro-level tasks. Our contributions can be listed as follows:
• Topic classification: We merge existing taxonomies for topic annotations from prior ads datasets and publicly available websites like Ads of the World1 to obtain a condensed set of topic labels for the advertisement videos.
• Tone transition detection: We introduce a novel benchmark task of tone transition detection in advertisement videos by obtaining crowdsourced feedback from human annotators.
• Social message detection: We provide weak human expert-based labels for detecting the presence/absence of social messages in advertisement videos.
• Language-based reasoning: We explore zero-shot baselines for the three benchmark tasks through applications of large language models on ad transcripts.
• Context guided attention: We provide multiple context-guided attention baselines to benchmark the performance for the three macro-level tasks (topic classification, transition detection, and social message) and highlight future possibilities of exploration.
1 https://www.adsoftheworld.com/
3.5 Related work
Narrative understanding: Narratives [117] play an important role in enabling effective human communication and organizing the daily sequence of events. Advertisements centered on narratives [118] influence consumers by providing a concrete story arc centered around specific themes, protagonists, and their actions. Kim et al. [110] introduced an integrated theory of understanding narratives in advertisements based on key variables like emotive response, ad hedonic value, ad credibility, and perceived goal facilitation. Lien et al. [119] explored narrative ads from the lens of persuasion and the relationship with different advertisement mediums - verbal or visual. In the realm of computational narrative understanding, prior works have focused on language-based approaches for marking high-level structures in short stories [120], most reportable events (MRE) in Reddit comment threads [121], and primary processes in movie scripts, newspaper articles, etc. [122].
Affect modeling in videos: Advertisement brands tend to invoke emotional reactions [123] in viewers by influencing their actions, i.e., purchasing a particular product. In the domain of television commercials, a combination of physiological, symbolic, and self-report measures was explored in [124] to determine the emotional responses of viewers. The role of facial expressions in decoding the preferences of viewers, including purchase intent and smile responses, has been explored through large-scale studies in [125], [126]. Apart from facial expressions, the role of CNN-based audio-visual and EEG descriptors from the viewers has been explored in a multi-task setup [127–129] for arousal and valence prediction in advertisement videos. Existing video-based affect datasets like DEAP [130] and VideoEmotion [131] also focused on single arousal, valence, and dominance ratings as well as discrete emotion labels for music and user-generated videos, respectively. In the domain of continuous affect modeling, datasets with frame-level annotations have been introduced across a wide variety of domains, including naturalistic and induced clips (HUMAINE [132]), movies (COGNIMUSE [133], LIRIS-ACCEDE [134]), and online videos (EEV [135]).
Further extensions of continuous affect modeling based on independent and self-reports have been 48 explored for daily emotional narratives in SENDv1 dataset [136]. For advertisements, climax annotations (presence + rough timestamps) were provided by human annotators on a subset of the Video Ads dataset [16] in [137] along with climax-driven modeling strategies to predict sentiment labels at the video level. In our proposed benchmark MM-AU, based on the standard definition in [116], we ask human annotators to denote the perceived tone in the advertisement video across segments approximately marking the start, middle, and end. The perceived tone transition enables tracking of the narrative dynamics in advertisement videos by considering the interactions between various contextual streams through different modalities, i.e. audio, visual, and narrations/spoken interactions (through transcripts). Advertisement benchmarks: Dataset Annotation type Duration #Samples #Shot #Class Modalities Languages Tasks Video Ads Dataset (I) H NA 64832 NA 38 (T), 30 (S), AR (OE), H(2), Ex(2) Images English Image level classification Video Ads Dataset (V) H 144.87 3477 NA 38 (T), 30 (S), AR (OE), H(2), Ex(2) Video English Video level classification Tencent-AVS H 142.1h 12k 121.1k 25 (Pr), 34 (St), 23 (Pl) Video/Audio/ASR/OCR Chinese and English Scene level classification Ads-persuasion dataset H + AL NA 3000 NA 21 (PS) Images English Image level classification E-MMAD SG descriptions 1021.6 h 120984 NA 4863 (PC) Video Chinese and English Video level captioning MM-AU H + SA 147.8h 8399 216.4k 18 (T), 3 (Tone), 2 (SM) Video/Audio/ASR Multilingual (65 languages) Video level classification Table 3.1: Comparison of MM-AU with other available advertisement benchmarks across different modalities. Annotation type: H: Human annotation, AL: Active Learning, SA: Semi-automatic, SG: Store generated. Duration: NA: Not applicable for images; Mentioned in hours(h). #Samples: Number of video clips or images. #Shot: NA: Not applicable for images; Number of shots detected from all the video samples. #Class: T: Topic, S: Sentiment, OE: Open-Ended, H: Humor, Ex: Exciting, Pr: Presentation, St: Style, Pl: Place, PS: Persuasion strategy, PC: Product categories, SM: Social messages While there has been progress in terms of movie understanding due to the introduction of largescale multimodal benchmark datasets like Condensed Movies [54], MovieNet [40], MAD [138], Movie-cuts [139], MovieCLIP [140] and SAM-S [141], only few benchmarks have focused on large-scale understanding of advertisements across broad and fine-grained content. Hussain et al. [16] introduced the benchmark Video-Ads dataset to facilitate understanding of images and videos along the lines of broad topics, induced sentiments, and action/intent reasoning. The images in the Video-Ads dataset were utilized in [142] for computational modeling of persuasion across 21 categories in marketing domain. Regarding large-scale ads understanding, Tencent-AVS dataset [143] was proposed to enable multi-modal scene level categorization into semantic classes like 49 presentation, places and styles. While the previously mentioned datasets focused on classification tasks, E-MMAD [144] introduced the task of informative caption generation from advertisements across 120k e-commerce videos. MM-AU, our curated multilingual dataset utilizes publicly available videos from Ads of the World along with a subset from Video-Ads dataset and an in-house video catalog from Cannes Lion archive [145]. 
We provide 18 broad topic categories by combining existing taxonomies (Cannes Lion, Ads of World, and Video-Ads dataset). Further, we rely on human expert annotators to label transitions in perceived tone along with the absence/presence of social messages in 8.4K advertisement videos. A comparative overview of MM-AU and other advertisement datasets is shown in Table 3.1. Semantic video understanding: Existing large-scale video datasets, including Kinetics[146], Moments-in-time [147], ActivityNet [148], AVA [149] have focused mainly on classifying entity driven actions from in-the-wild short videos. Higher-level semantic labels beyond actions like topics, concepts, events, and video types were exploited for large-scale video-level categorization in datasets like Youtube-8M [150], Holistic-visual understanding (HVU) [151] and 3MASSIV [152]. Our proposed benchmark MM-AU explores the domain of semantic video understanding in ads by considering broad categories like topic, presence/absence of social message, and fine-grained affective labels of perceived tone transition. Multimodal representation learning: Multimodal representation learning [5] centers around the fusion of information from different modalities at multiple scales, including early, late, and midfusion. Prior works related to ads have utilized multimodal-LSTMs [153], segment-level autoencoders [154] or joint cross-modal embedding [155] approaches for learning multimodal representations for a variety of tasks. With the advent of transformer [19] based multimodal models like PerceiverIO [156], attention bottlenecks[157], and VATT [158], information fusion at the input token space, followed by joint encoders, have become more prevalent. A multi-task attentionbased approach was explored in [159] for jointly predicting topic and sentiment labels associated 50 with advertisement images. A NextVLAD [160] based approach [161] combined with globallocal attention was utilized for predicting scene-specific labels in the Tencent [143] ads benchmark dataset. 3.6 MM-AU benchmark 3.6.1 Data sources We consider multiple ads-specific sources for curating our proposed MM-AU dataset. As a primary source, we consider Ads-of-the-world (AOW)2 video hosting website since it contains a richlycurated catalog of ads in various formats like film, print, digital, and video spanning across multiple countries. As auxiliary sources, we consider additional videos from the Cannes Lion Film Festival archive and Video-Ads dataset [16]. We filter the videos based on unique video ids associated with their public links to ensure no duplicates across three sources. The share of different sources in curating the combined list of 8399 advertisement videos is shown in Fig 3.7 Ads of the world 75.1% Video-Ads dataset 11.4% Cannes-Lion 13.5% Figure 3.7: Share of different ad sources in MM-AU dataset. Ads of the world (6304 videos), Cannes Lion (1135), Video-Ads dataset (960) 2https://www.adsoftheworld.com/ 51 3.6.2 Annotations through crowd-sourcing We employ a semi-automatic process for tagging the advertisement videos with broad topic categories. For the detection tasks of tone transition and social message, we use Amazon Mechanical Turk 3 to obtain responses from a pool of 36 human annotators. For selecting a pool of workers with the requisite expertise, we hosted an initial pilot study where the workers are instructed to mark the tone transition labels and presence/absence of social message in the given set of videos. 
Further, in the final annotation phase, three annotators independently annotate each sample for the tone-transition and social message detection tasks. The annotation process details associated with the respective tasks are listed below: Tone transition: The annotators are instructed to mark the perceived tone labels associated with the start, middle, and ending segments of the advertisement videos. To reduce the burden associated with the task, no instructions are provided to mark the timestamps associated with the respective segments. Based on the tone definition considered in [116], we provide the following Input video (hosted through web) Human annotators Instructions for social message detection annotation Instructions for tone transition annotation Social message detection annotation Tone transition detection annotation Figure 3.8: Outline of the annotation framework for tone transition and social message detection problem descriptions to aid the annotation process: 3https://www.mturk.com/ 52 • Positive tone: An advertisement video segment has a positive tone if it contains: optimistic elements portraying hope and success or positive imagery based on uplifting music and visuals. Examples include girls overcoming negative stereotypes and succeeding in sports or a blind person being able to navigate easily through city roads by using an app. • Negative tone: An advertisement video segment has a negative tone if it contains: sad narrative showing suffering, fear, destruction or depressing music and distressing visuals. Salient themes associated with negative tone include domestic violence, environmental damage, human trafficking, crisis from wars etc. If a segment does not contain the above-mentioned characteristics, the annotators are instructed to mark the perceived tone as neutral. To determine the reasoning involved in marking the tone labels associated with the segments, the annotators are also asked to provide explanations regarding their choices. In Fig 3.8, we show the outline of the framework provided to the annotators for marking the tone associated with the start, middle, and end segments and the absence/presence of a social message. We provide a sample example to the annotators regarding tone transition and associated explanations, as shown in Fig 3.9. As seen in Fig 3.9, the beginning (start) segment has a perceived negative tone because police activity is being shown. The middle segment also shows a negative tone because vehicles are being destroyed, followed by a positive tone at the ending portion because the kids are playing with toys. Social message detection: For social message detection, the annotators are instructed to check for the absence/presence of social messages in the given video. Based on the social message framing in ads [116], we provide the following definition to guide the annotation process: • An advertisement video has a social message if it provides awareness about any social issue. Examples include gender equality, drug abuse, police brutality, workplace harassment, domestic violence, child labor, homelessness, hate crimes etc. To simplify the annotation process, we ask the annotators to mark Yes/No for indicating the presence/absence of social messages in the videos instead of marking the exact categories in the curated 53 I think the beginning tone is negative because it shows police activity. I think the middle tone is negative because vehicles are being destroyed. 
I think the ending tone is positive because two kids are playing with toys and the music is happy. Figure 3.9: Example provided to the annotators showing the tone transition and associated explanations list of social issues [162]. The outline of the framework provided to annotators for marking the presence/absence of social messages is shown in Fig 3.8. Further, annotators are also provided with example videos showing different forms of social messages. In Fig 3.10 (a), frame transitions are shown from an example video urging everyone to vote since voters having bias can cast votes in their absence. In Fig 3.10 (b), sample frames in the sequence are shown from another example video highlighting the importance of equal opportunities for everyone in sports. Topic categorization: We annotate topic categories using the existing taxonomies from Ads-ofthe-world (AOW), Cannes Lions Film Festival [145], and Video-Ads [16] datasets. We denote the taxonomies associated with Cannes Lions Film Festival and Video-Ads datasets as Cannescoding [CC] and Video-Ads [VA] coding schemes. We extract the available tags associated with 6304 videos in Ads-of-the-world [AOW] and retain the top 40 tags based on frequency. Then we manually merge the filtered topic tags from AOW with similar labels in Cannes-coding [CC] 54 (a) (b) Figure 3.10: Example videos associated with absence/presence of social message provided to the annotators. (a) Frame transition associated with an example video urging people to vote (b) Frame transition associated with an example video emphasizing equal opportunities for everyone in sports. and Video-Ads [VA] coding schemes. Some examples of merged labels from different sources are listed as follows with the final parent topic category: • Publications media: Media & Publications [CC]; Media and arts [VA]; TV Promos, Music, Media, Movies [AOW] • Games: Games and toys [VA]; Gaming [AOW] • Sports: Sports equipment and activities [VA]; Sports [AOW] • Clothing: Clothing, Footwear & Accessories [CC]; Clothing and accessories [VA]; Personal Accessories [AOW] A detailed list of mapping between the AOW, CC and VA coding schemes is included as part of the Appendix (D). Our final merged topic taxonomy consists of 18 categories as follows: • Games, Household, Services, Misc, Sports, Banking, Clothing, Industrial and agriculture, Leisure, Publications media, Health, Car, Electronics, Cosmetics, Food and Drink, Awareness, Travel and transport, Retail 55 Dataset Filtering: During the annotation process, we employ certain checks to maintain the quality of the annotated data. We reject those tone transition annotations with very short explanations (single words) or long generic descriptions of ads copied from the internet. Further, we also flag tone-transition annotations with the copied content across the start, middle, and end segments. For topic categorization, we merge categories with low frequencies, i.e., Alcohol and Restaurant, into the broad category of Food and Drink. 3.6.3 Statistics MM-AU consists of 8399 annotated videos with a total of 147 hours of curated data. A detailed overview of MM-AU with total duration, number of tags, and year coverage is shown in Table 3.2. The distribution of topics is shown in Fig 3.12, with Food and Drink, Awareness, and Electronics Attribute Value #videos 8399 #explanations 74970 #topics 18 #social msg labels 25197 #tone labels 75,591 #duration 147.79 hrs #avg duration 63.35s year 2006-2019 #annotators 36 #countries 99 Table 3.2: Data statistics of MM-AU dataset. 
#social msg labels: total number of labels being the top-3 dominant categories. In the case of perceived tone labels, we obtain a high majority agreement among annotators in marking the start (91.2%), middle (91.6%), and the ending (94.5%) segments of the videos with perceived tone labels. Since annotating the presence/absence of social messages is a comparatively less subjective task than perceived tone labeling, we obtain a majority agreement (99%) among the annotators. In terms of tone labels for start, middle, and end segments, we can see from Fig 3.13, that the dominant perceived tone for the advertisements is positive, with its share rising from 60.2% (start) to 81.3% (end). This can be explained due 56 No 91.0% Yes 9.0% Figure 3.11: Distribution of the social message absence (No) and presence (Yes) labels in MM-AU Topic Counts 0 500 1000 1500 2000 Food and drink Awareness Electronics Car Clothing Leisure Travel and transport Services Health Games Banking Household Cosmetics Misc Sports Industrial and Retail Publications media Figure 3.12: Distribution of topics in MM-AU dataset to the fact that advertisements are primarily designed to persuade viewers to buy certain products or act toward certain social issues. However, from Fig 3.13, we can see that the share of negative tone labels increases from 15.5% to 19.6% due to the narrative structure of the ads, where the middle segment portrays negative elements like human suffering, environmental damage, etc to set up the final conclusion. From Fig 3.11, we can see that 9.0% of the videos, i.e. 759 contain social 57 Positive 60.2% Neutral 24.3% Negative 15.5% (a) Start segment Positive 67.8% Neutral 12.6% Negative 19.6% (b) Middle segment Positive 81.3% Neutral 11.1% Negative 7.6% (c) End segment Figure 3.13: Distribution of majority perceived tone labels (among 3 annotators) across (a) start, (b) middle, and (c) ending segments in MM-AU dataset (8399 videos) Travel and transport 1.4% Car 1.6% Services 2.1% Food and drink 2.5% Health 3.2% Electronics 4.1% Awareness 77.6% (a) Distribution of Topics wrt social message No transition 37.5% Transition 62.5% (b) Distribution of Tone transition wrt social message Figure 3.14: Distribution of topics and perceived tone transition across videos having the presence of social message (739 videos) in MM-AU dataset messages, as marked by Yes label. Out of 759 videos, 62.5% exhibit transition in perceived tone with the share of negative tone rising from 32.3% to 43.3% in the middle segments. We further explore the intersection between the presence of social message and associated topics and tone transition (including tone labels for different segments of the video). From Fig 3.14 (a), we can see that out of 739 videos having social messages, 77.9% have Awareness as the associated topic label. The Awareness topic label includes subcategories related to social issues involving environmental damage, animal rights, smoking, alcohol abuse, domestic violence, refugee crisis, cyberbullying, etc. Further, as shown in Fig 3.14 (b), we observe a greater incidence of transitions in perceived tone (62.5%) associated with videos having social messages. A detailed breakdown of segment-wise perceived tone labels for videos having social messages is shown in Fig 3.15. We can see an increase in the perceived negative tone from 32.3% to 43.3%. 
This can be attributed to the narrative structure of the advertisement videos having the presence of a 58 Positive 46.4% Neutral 21.3% Negative 32.3% (a) Start segment Positive 46.4% Neutral 10.3% Negative 43.3% (b) Middle segment Positive 68.9% Neutral 10.9% Negative 20.2% (c) End segment Figure 3.15: Distribution of perceived tone labels across the (a) start, (b) middle, and (c) ending segments in MM-AU dataset for videos having social message (739 videos) social message where the middle segment primarily portrays negative elements i.e. environmental damage, human suffering etc to set up the conclusion. 3.7 Task definition Social message detection: Based on the social message annotations (majority), the presence/absence of social message (SM) for ith video is defined as: SMi = 0 No (Absence of SM) 1 Yes (Presence of SM) (3.1) Our aim is to learn a network fSM to predict social message presence/absence SMˆ i for the ith video. The above definition results in 759 and 7640 videos marked with the presence (1) and absence (0) of social messages, respectively. Tone transition: Based on the start, middle, and end perceived tone labels (majority), we define the transition for ith video as follows: Tri = 0 Starti = Midi = Endi 1 else (3.2) 59 Our aim is to learn a network fTr to predict binary tone transition Trˆ i for the ith video. MM-AU dataset has 3854 and 4545 videos marked with Transition (1) and No Transition (0), respectively. Topic categorization: For topic categorization, we aim to learn a multimodal network fTopic to predict Topic ˆ i for the ith video out of 18 target categories. 3.8 Advertisements: Modality information In the case of advertisements, we utilize the available modality-driven contextual information for broad narrative-driven understanding based on the previously mentioned tasks of topic categorization, social message detection, and tone transition. The modality-specific information can be obtained as follows: • Visual modality: Shots with different camera viewpoints and semantic divisions help in tracking the short-term visual context switches. • Audio modality: Audio provides information about the background music and salient audio events. • Language modality: The extracted text transcripts from the audio stream provide complementary information through narrations and descriptions. In the following section, we describe how the fusion and reasoning operations are performed over the three salient modalities to obtain task-specific outputs. 3.9 Two-stage multimodal fusion and reasoning For the input modalities a, v, t, we outline the separate context-guided fusion operations as follows: 60 3.9.1 Stage 1: Foreground and temporal context modality fusion We outline the operations for foreground (language) and temporal (visual) context fusion as follows: etext = Textpro j(ftext(t)) evis = Vispro j(fvisual(v)) TVlogits = AvgPool(TxTV([etext; evis])) (3.3) Here, TxTV is the text-visual fusion block used for combining foreground (text) and temporal (visual) contexts. fvisual and ftext refer to the pretrained visual and text encoders. The linear projection layers for visual and text modalities are denoted by Vispro j and Textpro j. For the temporal Figure 3.16: Proposed architecture for the fusion of different modalities associated with temporal and foreground contexts. 
61 context fusion block, we have the following sequence of operations: eaud = Audpro j(faud(a)) evis = Vispro j(fvisual(v)) AVlogits = AvgPool(TxAV([eaud; evis])) (3.4) Here, TxAV is the audio-visual fusion block used for combining temporal context information from (audio) and (visual) modalities. fvisual and faud refer to the pretrained visual and audio encoders. The linear projection layers for visual and audio modalities are denoted by Vispro j and Audpro j. For the fusion blocks, we consider the PerceiverIO transformer architecture [156] due to the following reasons: • Reduction in operational complexity: In the case of PerceiverIO, the input embedding matrix having a significantly large sequence length is mapped to a limited group of latent vectors followed by attention operation on the latent vectors. This decreases the operational complexity of the attention operation. In Fig 3.17, N is smaller than input sequence length M. • Generalization to a wide variety of inputs: PerceiverIO architecture has been shown to be highly generalizable to a wide variety of inputs, including video, audio, and point clouds. In our proposed approach, we train the fusion blocks TxAV and TxTV separately due to overfitting issues faced while using a single fusion block for three modalities. 3.9.2 Stage 2: Reasoning mechanism Further, the combinations of different modalities (audio-visual) and (text-visual) contain complementary information that further improves narrative-level understanding across the tasks in the MM-AU benchmark. In order to capture the complementary information, we perform the taskspecific reasoning through logit fusion by freezing the TxAV and TxTV blocks using the following strategies: 62 Figure 3.17: PerceiverIO architecture [156] • A-max: predclass = argmaxi (TVlogits(i) +AVlogits(i))/2 • D-max: predclass = argmaxi max{(TVlogits(i),AVlogits(i))} Here i ∈ {1,...Class}, where Class refers to the number of classes associated with the task. 3.10 Experiments 3.10.1 Experimental Setup For training, validation, and testing purposes, we consider a split of 5877 (70%), 830 (10%), and 1692 (20%) videos. The modality-specific details are as follows: Visual: For the visual modality, we segment the videos into shots using PySceneDetect4 . We use the shots to extract frame-wise features at 4 fps using CLIP’s [48] pretrained visual encoder (fvisual), ViT-B/32. Further, we average pool the frame-wise visual representations to obtain shot representations. For modeling temporal context information through visual modality, we consider the shot as the basic unit due to its ability to capture short-term temporal variations. Text: For the text modality, we use Whisper [163] multilingual model (large) to extract the transcripts. We translate the multilingual transcripts to English using GPT-4 [164] (temp=0.02, max tokens=2048) by providing the following translation-specific prompt: Please provide an English 4https://github.com/Breakthrough/PySceneDetect 63 translation of this transcript. We use the pretrained BERT[165] model as the text encoder (ftext). The translated transcripts contain details about the foreground context in terms of narrations regarding interactions. Audio: For the audio modality, we use the Audio-Spectrogram transformer (AST) [166] model (faudio) pretrained on AudioSet [167] for extracting features at 10-sec intervals with a step size of 512. We conduct our experiments in a distributed manner using the Pytorch [168] framework on 4 2080ti GPUs. 
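To make the two-stage pipeline of Section 3.9 concrete, the sketch below shows how pre-extracted shot-level CLIP, AST, and BERT features could be fused and then combined at the logit level. This is a minimal illustration, not the exact implementation: a small latent cross-attention block stands in for the full PerceiverIO fusion encoder, and the hidden width, sequence lengths, and 18-way topic head are illustrative assumptions.

```python
# Minimal sketch: latent-attention fusion blocks (stand-ins for PerceiverIO TxTV/TxAV)
# followed by the A-Max / D-Max logit-fusion reasoning of Section 3.9.2.
import torch
import torch.nn as nn

class LatentFusionBlock(nn.Module):
    """Cross-attend a long multimodal token sequence onto a few latents, refine with
    self-attention, average-pool, and emit task logits."""
    def __init__(self, dim=256, num_latents=16, num_heads=8, num_layers=4, num_classes=18):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=2 * dim,
                                               batch_first=True)
        self.self_attn = nn.TransformerEncoder(enc_layer, num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                          # tokens: (B, M, dim)
        lat = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        lat, _ = self.cross_attn(lat, tokens, tokens)   # (B, N, dim) with N << M
        lat = self.self_attn(lat)
        return self.head(lat.mean(dim=1))               # AvgPool over latents -> logits

class PairwiseFusion(nn.Module):
    """Stage 1: project two modality streams to a shared width and fuse them."""
    def __init__(self, dim_a, dim_b, dim=256, num_classes=18):
        super().__init__()
        self.proj_a, self.proj_b = nn.Linear(dim_a, dim), nn.Linear(dim_b, dim)
        self.fusion = LatentFusionBlock(dim=dim, num_classes=num_classes)

    def forward(self, feats_a, feats_b):                # (B, Ta, dim_a), (B, Tb, dim_b)
        tokens = torch.cat([self.proj_a(feats_a), self.proj_b(feats_b)], dim=1)
        return self.fusion(tokens)

# Stage 2: logit-level reasoning over the (frozen) text-visual and audio-visual blocks.
def a_max(tv_logits, av_logits):
    return ((tv_logits + av_logits) / 2).argmax(dim=-1)

def d_max(tv_logits, av_logits):
    return torch.maximum(tv_logits, av_logits).argmax(dim=-1)

if __name__ == "__main__":
    # Illustrative feature sizes: CLIP ViT-B/32 shots (512-d, up to 35 shots),
    # BERT text tokens (768-d, up to 256), AST audio segments (768-d, up to 14).
    vis, txt, aud = torch.randn(2, 35, 512), torch.randn(2, 256, 768), torch.randn(2, 14, 768)
    tx_tv = PairwiseFusion(dim_a=768, dim_b=512)        # text + visual
    tx_av = PairwiseFusion(dim_a=768, dim_b=512)        # audio + visual
    tv_logits, av_logits = tx_tv(txt, vis), tx_av(aud, vis)
    print(a_max(tv_logits, av_logits), d_max(tv_logits, av_logits))
```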
We use accuracy and macro-F1 for evaluation metrics. 3.10.2 Language-based reasoning We investigate the zero-shot capabilities of foundational large language models [169] by applying GPT-4 [164], Opt-IML [170], Flan-T5 (XXL,XL,L) [171] and Alpaca [172] on the translated transcripts. We report the results on 1670 non-empty transcripts out of the test split of 1692 samples for zero-shot evaluation. For GPT-4, we use the following task-specific prompts: • SM: An advertisement video has a social message if it provides awareness about any social issue. Example of social issues: gender equality, drug abuse, police brutality, workplace harassment, domestic violence, child labor, environmental damage, homelessness, hate crimes, racial inequality etc. Based on the given text transcript, determine if the advertisement has any social message. Please provide answers in Yes and No. • TT: Based on the given text transcript from the advertisement, determine if the advertisement has any transitions in tones. Possible tone labels are: positive, negative, and neutral. Please respond by saying Transition or No transition • Topic: Associate a single topic label with the transcript from the given set: <Topic list> Here SM, TT, and Topic refer to the benchmark tasks of Social message detection, tone transition, and topic categorization. <Topic list> refers to the condensed list of 18 topic categories curated 64 for the MM-AU benchmark. Further details about the prompting strategies are included as part of the Appendix (E). 3.10.3 Unimodal and Multimodal context fusion For the supervised unimodal baselines, we consider the following model choices: • LSTM [93]: 2 layers and hidden dimension = 256 • MHA [19]: 4 layers, 4 heads and hidden dimension = 256 For the fusion blocks TxAV and TxTV, we adopt a lightweight PerceiverIO structure composed of 4 encoder layers, 16 latent vectors, 8 heads, and a hidden latent dimensionality of 256. We use binary cross-entropy for social message and tone transition detection tasks and multi-class cross-entropy for topic categorization. For training the unimodal and multimodal models, we use a batch size of 16 with Adam [100] or AdamW [104] as optimizers and lr ∈ {1e−4,1e−5}. While training the supervised models, we fix the maximum sequence lengths for visual(shots), audio, and text modalities at 35, 14, and {256,512}, respectively. During multimodal fusion, when text is missing in the transcripts due to the absence of speech, we replace the text with a string of [MASK] tokens. 3.11 Results 3.11.1 Language-only reasoning Based on results in Table 3.3, we can see that GPT-4 exhibits superior zero-shot performance (F1 and Accuracy) across all the tasks when compared to other large language models. Further, there is a trend towards improved model performance with model scaling, except for Alpaca. The poor performance of Alpaca can be attributed to a lack of self-instruct data associated with complex reasoning tasks from transcripts. Instruction finetuning coupled with model scaling improves the zero-shot performance (F1) of T-5 models from 49.31% to 62.7% and 42.39% to 44.15% for social message and tone transition tasks respectively. 
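As an illustration of how the translated transcripts are queried in this zero-shot setting, the sketch below applies the task prompts described above and maps free-form answers back onto the task label space. It assumes the OpenAI chat-completions client and a simple string-matching fallback; the exact prompting and parsing pipeline may differ (see Appendix E).

```python
# Minimal sketch of transcript-only zero-shot evaluation with a chat LLM.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "SM": ("An advertisement video has a social message if it provides awareness about "
           "any social issue. Based on the given text transcript, determine if the "
           "advertisement has any social message. Please provide answers in Yes and No."),
    "TT": ("Based on the given text transcript from the advertisement, determine if the "
           "advertisement has any transitions in tones. Possible tone labels are: positive, "
           "negative, and neutral. Please respond by saying Transition or No transition"),
}
LABELS = {"SM": ["Yes", "No"], "TT": ["Transition", "No transition"]}

def zero_shot_predict(transcript: str, task: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system", "content": PROMPTS[task]},
            {"role": "user", "content": f"Transcript: {transcript}"},
        ],
    )
    answer = response.choices[0].message.content.strip()
    # Map the free-form answer onto the task label space (longest matching label wins).
    matches = [lab for lab in LABELS[task] if lab.lower() in answer.lower()]
    return max(matches, key=len) if matches else LABELS[task][-1]

# Example usage on a translated transcript:
# zero_shot_predict("Every child deserves a classroom. Donate today.", task="SM")
```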
65 Configurations SM TT Topic Model Params Acc F1 Acc F1 Acc F1 GPT-4 [164] NA∗ 87.6 65.66 58.56 58.33 33.29 29.21 Flan-T5-XXL [171] 11B 85.69 62.7 54.79 44.15 30.54 24.23 Flan-T5-XL [171] 3B 65.51 49.31 54.67 42.39 27.18 24.1 Alpaca [172] 7B 10.77 10.56 46.88 39.1 11.19 11.68 Opt-IML [170] 1.3B 37.07 32.49 54.37 35.22 22.22 19.08 Flan-T5-L [171] 780M 8.32 7.76 54.43 35.25 26.82 19.42 Random Baseline - 49.57 39.61 49.95 49.88 5.71 4.71 Majority Baseline - 90.96 47.63 54.11 35.11 23.04 2.08 Table 3.3: Zero shot performance comparison between various LLMs on MM-AU dataset. Tasks: SM: Social message detection, TT: Tone transition, Topic: Topic categorization. NA∗ : Information not available. F1: Macro-F1. Best performing results are marked in bold for respective tasks. We provide class-wise comparisons between Flan-T5-XXL and GPT-4 for the three benchmark tasks of social message, tone transition detection, and topic categorization. From Fig 3.18 (a), we can see that GPT-4 obtains a higher F1-score (38.21%) as compared to Flan-T5-XXL (33.52%) for the Yes class signifying the presence of social message in a complete zero-shot setting. Further for the tone-transition task, GPT-4 obtains a significantly higher F1-score (55.18%) for the Transition class than Flan-T5-XXL (19.75%). In terms of topic categorization, we can see from Fig 3.19 that GPT-4 performs better than Flan-T5-XXL in the case of all categories, except for the Awareness class. For the minority topic categories i.e. Retail, Publications media, Industrial and Agriculture, GPT-4 performs slightly better than Flan-T5-XXL. The poor performance of GPT-4 and Flan-T5-XXL in the Misc category can be attributed to the grouping of multiple diverse subcategories like Petfood, Business and equipment, Politics into a single large category. Future work will involve the usage of expanded topic taxonomy to mark the transcripts with respective categories by incorporating reasoning mechanisms like chain of thought prompting [173]. 3.11.2 Unimodal vs Multimodal fusion From Table 3.4, we can see that supervised unimodal models (MHA, LSTM) show improved performance as compared to simple random and majority baselines. In terms of unimodal models, 66 F1-score 0 25 50 75 100 Yes No Flan-T5-XXL GPT-4 (a) F1-score for social message detection task class wise F1-score 0 20 40 60 80 No transition Transition Flan-T5-XXL GPT-4 (b) F1-score for tone transition detection task class wise Figure 3.18: Comparisons between Flan-T5-XXL and GPT-4 in terms of class-wise F1-score for the social message (Yes: Presence of social message, No: Absence of social message), Tone transition detection tasks. 
Topic F1 score 0 20 40 60 Games Household Services Misc Sports Banking Clothing Industrial and agriculture Leisure Publications media Health Car Electronics Cosmetics Food and drink Awareness Travel and transport Retail Flan-T5-XXL GPT-4 Figure 3.19: Comparisons between Flan-T5-XXL and GPT-4 in terms of class-wise F1-score for the topic categorization task MHA model trained on shot-level visual features (denoted by CLIP-S) perform far better than audio features (AST) in social message detection (F1: 72.28 vs F1: 55.33) and topic categorization 67 Configurations Acc F1 Acc F1 Acc F1 Model Features Modality Social message Tone transition Topic categorization Random NA NA 49.57±0.28 39.61±0.28 49.95±0.30 49.88±0.30 5.71±0.16 4.71±0.24 Majority NA NA 90.96 47.63 54.11 35.11 23.04 2.08 Unimodal LSTM CLIP-S V 90.57±0.47 68.65±1.70 61.93±0.37 61.65±0.37 52.86±0.43 36.48±1.58 MHA AST A 89.07±2.02 55.33±3.96 59.78±1.56 58.60±0.96 25.11±1.39 15.48±1.57 MHA CLIP-S V 90.41±2.24 72.28±1.66 61.74±1.07 61.48±1.11 61.30±0.89 47.74±1.27 Multimodal TxAT AST + BERT A + T 90.24±0.81 64.02±0.81 62.98±0.56 62.29±0.83 42.99±0.65 30.77±0.96 (1) TxAV CLIP-S +AST A + V 91.62±0.58 70.05±0.67 64.01±0.54 63.72±0.66 61.62±0.46 48.67±0.64 (2) TxTV CLIP-S +BERT T + V 92.23±0.57 74.03±1.00 63.96±0.99 63.48±0.84 63.27±0.59 50.58±1.32 A-Max(1,2) CLIP-S +BERT +AST A + V + T 92.51±0.46 73.17±1.00 65.05±0.36 64.67±0.33 65.92±0.54 54.22±1.14 D-Max(1,2) CLIP-S +BERT +AST A + V + T 92.52±0.46 73.21±0.98 65.01±0.39 64.63±0.32 65.51±0.58 53.67±1.24 Table 3.4: Comparative results between different unimodal and multimodal context fusion models across different tasks: Social message and Tone transition, Topic categorization. CLIP-S: Shot level features extracted using CLIP. Modality: A: Audio, V: Visual, T: Text. Results are reported as an average of 5 runs with randomly selected seeds. Best performing results are marked in bold for respective tasks. tasks. (F1: 47.74 vs F1: 15.48). For the tone transition task, MHA model trained on audio features (AST) shows close performance (F1:58.60 vs F1: 61.48) as compared to visual features, due to the dependence of the tone transition task on the ambient music. Multimodal models: For multimodal models, we observe that the fusion of text and visual modalities through a Perceiver-IO-based encoder (TxTV) performs better (F1:74.03) for social message detection compared to audio-visual or audio-text fusion. This can be attributed to the presence of socially relevant descriptors in the transcripts and video shots. However, the fusion of audio with visual signals (TxAV) improves the performance in the tone-transition detection task (F1: 63.72). For topic categorization, we find that the fusion of text and visual modalities (TxTV) performs better than other paired modalities (F1:50.58) due to topic-specific identifiers in transcripts and shots. Logit fusion: Our proposed approach based on logit fusion strategies: Average-Max (A-Max) and Dual-Max (D-Max) exhibits similar performance across all tasks. We obtain gain for tonetransition (F1:64.67) based on the A-Max fusion strategy of TxAV and TxTV models. In terms of class-wise metrics, A-Max fusion improves the transition class average F1-score to 61.28% as compared to TxAV (60.72%) and TxTV (60.18%), as seen in Fig 3.20 (b). However, we observe from Fig 3.20 (a), that both average-Max (A-Max) and dual-Max (D-Max) fusion strategies do not help for social message detection (i.e., Yes class). 
In the case of topic categorization, logit fusion 68 through A-Max of TxAV and TxTV models results in the best performance (F1: 54.22). Further, A-Max fusion results in improvements over TxAV and TxTV across 15 topic categories (out of 18), with noticeable gains in the minority categories i.e., Retail (∼6.2%), Industrial & Agriculture (∼5%), Household (∼7%). Similar trends can be observed for D-Max fusion, with improvements obtained across 14 topic categories (out of 18). The detailed comparisons between TxAV, TxTV, A-Max and D-Max can be found in Fig 3.21. Models F1-score 0 25 50 75 100 AV-Tx TV-Tx A-Max D-Max No Yes (a) F1-score for social message detection task classwise Models F1-score 0 20 40 60 80 AV-Tx TV-Tx A-Max D-Max No transition Transition (b) F1-score for tone transition detection task classwise Figure 3.20: Comparisons between TxAV(AV-Tx), TxTV(TV-Tx), A-Max, D-Max in terms of classwise F1-score for the social message (Yes: Presence of social message, No: Absence of social message), Tone transition detection tasks. An average of 5 runs is considered for the F1 score. 3.12 A closer look at unified understanding of advertisements Based on the tasks proposed in the MM-AU benchmark, we can see that lightweight fusion mechanisms can enable a broad narrative-level understanding of the advertisement videos. However, we can see that with novel tasks for different facets of narrative understanding, we would require multiple task-specific models. Further, each task has a dependence on diverse modality connections. This raises a question: Can we enable advertisement understanding through unified models ?. In this regard, we consider multimodal foundation models [174] since the models can be adapted to a wide range of downstream tasks. As mentioned in [175], foundation models provide a unified approach to explore novel task-specific capabilities due to pretaining data scale 69 Topic F1-score 0 25 50 75 100 Household Services Misc Banking Clothing Industrial and agriculture Leisure Health Car Electronics Cosmetics Awareness Travel and transport Retail A-Max AV-Tx TV-Tx D-max Figure 3.21: Comparisons between TxAV (AV-Tx), TxTV(TV-Tx), A-Max, D-Max in terms of class-wise F1-score for the topic categorization task. An average of 5 runs is considered for the F1 score. and transfer learning mechanisms. As defined in [174], multimodal foundation models inherit all properties of foundation models and provide emphasis on vision or vision-language modalities. With the advent of large-scale pretraining followed by task-specific adaptation in the NLP community through models like BERT [176], GPT-2 [177], RoBERTa [178], DeBERTa [179], there have been similar efforts in the multimodal domain through vision-language pretraining like CLIP [48], ALIGN [180], BEIT [181]. As shown in Fig 3.22, with the rise in large language model assistants like ChatGPT [182], LLama family of models [183, 184] the trend of multimodal pretraining + adaptation is shifting towards the alignment of visual modality with the input language modality through connector networks like Q-former [185], perceiver-resampler [186] etc. The alignment with large language models has led to the introduction of a broad family of large multimodal models (LMMs), including open-source LLaVa [187], Qwen-VL [188], and Phi-3.5[189], with diverse VL capabilities ranging from descriptive captioning and question answering to visual grounding. 
70 In order to examine the general reasoning capabilities of LMMs, specialized multimodal benchmarks like MME, MMBench, and MMVet have been curated to examine the performance across diverse dimensions of spatial reasoning, OCR, social relation detection, recognition etc. However, there is a lack of complex media-centered tasks in these existing multimodal benchmarks. As shown in Fig 3.4 and 3.5, ads require complex reasoning about symbolic elements across diverse tasks that require broad to fine-grained analysis. This motivates the need to benchmark the capabilities of LMMs w.r.t two broad questions: • Do LMMs understand ads as a domain broadly? • Is there any variability across the nature of tasks? CC12M[1]: Image-text pairs LAION-5B [2]:Image-text pairs Foundation model Classification tasks (examples): Visual question answering Video-language inference etc Downstream fine-tuning Generative tasks (examples): Image/Video captioning Visual dialogue Multimodal machine translation Pre-training using image-text matching, caption decoding, word-region alignment [3] Task: Image to text generation [3] Model architecture [3] Training objective and attention mask [3] General pretraining and fine-tuning paradigm Large multimodal models (LMMs): Bridge with language models Figure 3.22: Multimodal foundation models landscape: From large scale VL pretraining of unified or modality-specific models to pretraining and alignment of vision modalities with the large language model encoder. 3.12.1 Ads reasoning benchmark tasks We consider three different representative benchmark tasks based on the Video-Ads (VA) dataset from Hussain et.al [16]. Video-Ads dataset consists of 64,382 advertisement images. The related tasks are listed as follows: 71 Chocolate Software Animal right Cars Object replacement Solid deformed object Regular object Liquid deformed object Eager: hungry, thirsty Creative: inventive, productive Concreteness: Details about the product Active: energetic, adventurous Fashionable: Trendy, elegant (a) (b) (c) Figure 3.23: (a) Image examples associated with the broad theme (topics) along with labels (b) Image examples associated with the fine-grained atypicality labels (c) Image examples showcasing the persuasion strategies 3.12.1.1 Topics: Broad theme understanding For broad theme understanding, the task is to classify the ad images into representative topic categories from a list of 38 categories. Some notable categories include chocolates, sports, education, cars etc. A detailed list of topics associated with the VA dataset is present in Appendix D under the category marked as VA. The train/val and test split consist of 51k and 12805 images, respectively. 3.12.1.2 Atypicality detection For atypicality detection and fine-grained classification, 3928 image ads were annotated for finegrained atypical categories (notable ones) [107] as follows: • Texture Replacement 1: – Definition: Object’s texture borrowed from another object, e.g., kiwi inside apple. • Object inside object: – Definition: One object is inside another object, e.g., jar in an onion. • Combination of parts: 72 – Definition: Object composed of parts from different objects, e.g., deer head with hand horn. • Solid deformed objects: – Definition: Solid objects that have been deformed from the original shape by bending, twisting, e.g., human arm bent. • Liquid deformed objects: – Definition: Bent and deformed objects but for liquids, e.g., beer as player. A detailed list of atypicality can be found in Appendix F. 
We consider both (a) Binary: Atypical vs regular and (b) Multi-class: 10 class classification problem. The train/val and test split consist of 3137 and 791 images, respectively. 3.12.1.3 Persuasion strategies For computational modeling of persuasion, we use the augmented Video-Ads dataset introduced by Kumar et.al [142]. The common persuasion strategies that are employed to make the ad images appear persuasive are Concreteness, Eager and Fashionable. A detailed list of persuasion strategies and their definitions is available in G. The task is to predict the relevant persuasion strategies in a multi-label fashion. The train/val and test split consist of 2750 and 250 images, respectively. 3.12.2 LMMs: Reasoning outline We design the multimodal reasoning mechanism based on input task specification T. Here, the task can be broad-theme-oriented or fine-grained. 73 3.12.2.1 Zero-shot reasoning In the case of the zero-shot reasoning mechanism, we test the native zero-shot capabilities of the large multimodal models. Instead of handcrafting the prompts for the LMMs, we use an instructtuned language model agent to generate the appropriate prompts for the LMMs based on task information. The usage of a language model agent to structure the input prompts for a large multimodal model is inspired by language-model guided controller approaches for diverse tasks, e.g., HuggingGPT [190], diverse prompt generation for text-to-image models [191]. We provide the task specification T = {L,M} to the language model fLM where L refers to the list of labels and M refers to additional metadata, including label definitions and annotation manual. fLM generates the system and instruction prompts i.e. S and I for the large multimodal model FLMM. FLMM generates the response R based on S and I and the image < img >. S,I = fLM(T,Q) (3.5) R = FLMM(S,I,< img >) (3.6) Since the responses of different LMMs might not align with the actual desired response ( single class label for classification) and can have diverse responses, we use a separate language model gLM to parse the response R to extract the final answer A. A = gLM(R,L) (3.7) gLM only utilizes the label information to match the closest label from response R. 3.12.2.2 Supervised finetuning While the previous approaches are purely-zero shot and don’t involve any training, we consider supervised fine-tuning or instruction fine-tuning to adapt the large multimodal models to provide the desired output response. For supervised finetuning, we consider a data formatting phase where 74 we reuse the system (S) and instruction (I) prompts for different multimodal models (FLMM) from the zero-shot reasoning stage to organize the query, image and answer (Q,I,A) triplets as responses from a chat system. For supervised fine-tuning, we try to maximize the log-likelihood of generating an answer Ai associated with image Ii and question Qi by keeping the base language model parameters θbase fixed and fine-tuning additional parameters θLoRA through low-rank fine-tuning [192]. LLoRA = − N ∑ i=1 logP(Ai | Qi ,Ii ;θbase,θLoRA) (3.8) In the inference phase, the supervised fine-tuned model is used for predicting the answer for given image and task-specific queries. We don’t use a language-model-based parsing scheme since the answers of the supervised tuned model are aligned to the label space of the given task. 
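The following is a minimal sketch of the low-rank supervised fine-tuning objective in Eq. 3.8, using the Hugging Face peft library as a stand-in for the actual training setup; the backbone name, LoRA hyperparameters, and target modules are illustrative assumptions, and the image tokens that accompany each (Q, A) pair in the real pipeline are omitted for brevity.

```python
# Minimal sketch: freeze theta_base and train only the low-rank adapters theta_LoRA
# to maximize log P(A | Q; theta_base, theta_LoRA).  Assumptions: text-only (Q, A)
# pairs, an illustrative backbone, and approximate prompt masking.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2-7B-Instruct"            # assumed stand-in for the LMM backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()               # only a small fraction is trainable

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

def sft_step(question: str, answer: str) -> float:
    """One optimization step; only the answer tokens contribute to the loss."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100       # approximate masking of the prompt span
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example: one step on a topic-categorization style (Q, A) pair.
# sft_step("Which topic best describes this ad?", "Food and Drink")
```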
3.12.3 LMM Models outline For investigating the reasoning capabilities of LMMs, we consider open-source instruct-tuned decoder-only models listed as follows: Model Parameter Phi-3.5-vision [189] 4.2B Llava-1.5-7b [187] 7B Llava-v1.6-mistral-7b [187] 7B Llava-v1.6-vicuna-7b [187] 7B Qwen2-VL-7B [193] 7B Llama-3.2-11B [183] 11B Llava-1.5-13b [187] 13B Table 3.5: Parameter details of instruct-tuned decoder only LMM models For efficient inference, we use the paged attention-based LLM serving system called vLLM [194]. For supervised-fine tuning, we use the SwiFT [195] library with LoRA. For generating 75 the structures system and instruction prompts, we use GPT4o as the expert language model (fLM). For response parsing, we use Llama-3-8B as gLM (parser language model). Keeping in mind the resource constraints, we only use models with parameter ranges from 4.2B to 7B. 3.12.4 Experiments 3.12.4.1 Topics: Broad theme understanding Model Accuracy 0 10 20 30 40 50 60 70 80 90 llava-1.5-7b-ZS llava-v1.6-vicuna-ZS llava-1.5-13b-ZS llava-v1.6-mistral-7b-ZS Supervised baseline [16] Phi-3.5-vision-ZS Qwen2-VL-7B-ZS Llama-3.2-11B-Vision-ZS Phi-3.5-vision-sft llava-v1.6-mistral-7b-sft Qwen2-VL-7B-sft Figure 3.24: ZS (zero-shot) and SFT results for topics prediction (38 class category) In order to compare with the previous supervised baseline, we use accuracy as the metric. From Fig 3.24, we can see that the zero-shot reasoning through open-source LMMs (Phi-3.5-Vision, Qwen2-VL-7B, Llama-3.2-11B) performs better than prior supervised baseline. This can be attributed to the presence of pre-training + instruction-tuned data for the models with categories aligned to topics. For SFT (supervised fine-tuning) associated with 7B models, we consider the top 2 models in terms of zero-shot performance, i.e. Qwen2-VL-7B and Llava-v1.6-mistral-7b. We find that SFT improves the performance of 4.5 to 7B models with fewer gains for Qwen2-VL7B. 76 3.12.4.2 Atypicality detection Model F1 score 0 20 40 60 80 Majority binary llava-v1.6-vicuna-7b-ZS Qwen2-VL-7B-ZS llava-v1.6-mistral-7b-ZS Phi-3.5-vision-ZS llava-1.5-7b-ZS llava-1.5-13b-ZS Llama-3.2-11B-Vision-ZS Phi-3.5-vision-sft llava-v1.6-mistral-7b-sft Qwen2-VL-7B-sft Figure 3.25: ZS (zero-shot) and SFT results for binary atypicality prediction (2 class category) We use macro F1-score to compare the performance of the LMMs. We first compare the binary atypicality prediction performance where the task is to predict between binary categories: Regular object and Atypical object. From Fig 3.25, we find that lightweight models like Phi-3.5 show competitive performance when compared with 11-B and 7B models. With SFT finetuning, significant gains are observed for the ZS performance of lava-v1.6-mistral (+10.38%) and Qwen2-VL (+22.93%) LMM models. However, when we go from binary atypicality classification to the fine-grained (10 class) setting, the ZS performance of the LMMs drops significantly, as seen in Fig 3.26. SFT improves the performance of the 4.2B to 7B models with noticeable gains for Qwen2-VL (+9%) and lava-v1.6-mistral (+14.3%). Comparatively worse performance of LMMs in terms of fine-grained atypicality classification (labels mentioned in Sec 3.12.1.2) can be attributed to the lack of grounded region understanding and commonsense reasoning abilities. 
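For reference, a minimal sketch of the response-parsing step (gLM in Eq. 3.7) served through vLLM is shown below; the parser prompt template and the fallback rule are illustrative assumptions rather than the exact pipeline.

```python
# Minimal sketch: map a free-form LMM response R onto the closest label in L using a
# text-only parser model served with vLLM (Llama-3-8B-Instruct assumed, per Sec. 3.12.3).
from vllm import LLM, SamplingParams

parser_llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=16)

def parse_response(lmm_response: str, labels: list[str]) -> str:
    prompt = (
        "You map a model response to exactly one label.\n"
        f"Labels: {', '.join(labels)}\n"
        f"Response: {lmm_response}\n"
        "Answer with one label only:"
    )
    output = parser_llm.generate([prompt], sampling)[0].outputs[0].text.strip()
    # Fall back to the first label if the parser produced something outside the label set.
    return next((lab for lab in labels if lab.lower() in output.lower()), labels[0])

# Example: map a verbose LMM answer for binary atypicality onto the label space.
# parse_response("The ad shows a kiwi texture inside an apple, so it is atypical.",
#                ["Regular object", "Atypical object"])
```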
77 Model F1 -Macro 0 5 10 15 20 25 Majority class Random llava-1.5-13b-ZS llava-1.5-7b-ZS llava-v1.6-vicuna-7b llava-v1.6-mistral-7b-ZS Qwen2-VL-7B-ZS Phi-3.5-vision-ZS Llama-3.2-11B-Vision-ZS Phi-3.5-vision-sft Qwen2-VL-7B-sft llava-v1.6-mistral-7b-sft Atypicality fine-grained categorization Figure 3.26: ZS (zero-shot) and SFT results for fine-grained atypicality prediction (10 class category) 3.12.4.3 Persuasion modeling Since persuasion modeling is a multi-label task, we use an average of sample-wise Jaccard scores to compare the models. From Fig 3.27, we can see that the fine-grained zero-shot (ZS) understanding of persuasion strategies is poor. This can be attributed to the over-prediction of irrelevant persuasion strategies associated with the advertisement images when the models are operated in a purely zero-shot manner. With SFT, we obtain noticeable gains across the three models, i.e., lava-v1.6-mistral (+24.87%), Qwen2-VL (+22.84%), and Phi-3.5 (+19.63%). 3.13 Takeaways We introduce a benchmark called MM-AU for a macro-level understanding of advertisement videos. Further, we consider the narrative-driven tasks of topic categorization, tone transition, and social message detection. The major takeaways from this work are as follows: 78 Model Jaccard score (sample wise) 0 10 20 30 40 llava-1.5-13b-ZS llava-v1.6-vicuna-7b-ZS llava-1.5-7b-ZS llava-v1.6-mistral-7b-ZS Qwen2-VL-7B-ZS Llama-3.2-11B-Vision-ZS Phi-3.5-vision-ZS Phi-3.5-vision-sft llava-v1.6-mistral-7b-sft Qwen2-VL-7B-sft Figure 3.27: Multi-label persuasion prediction (10 class category) • Advertisement structure: The content in advertisement images and videos is structured into broad thematic elements and fine-grained details, addressed through a variety of tasks. • Modality driven context-fusion: Temporal and foreground context fusion through multiple modalities enable macro-level understanding of advertisement videos. • Narrative-driven understanding: The proposed tasks can be further extended to include explicit modeling of context-based narrative transitions. • Importance of modalities: As seen from the individual task-specific results, not all modalities contain useful signals. For a given macro-level narrative task, there is a need to find the optimal fusion of context streams that can depend on a subset of input modalities. • Unified understanding through large-multimodal models: In terms of unified understanding, understanding broad thematic elements through large multimodal models is easier due to the presence of similar data in pre-training and instruction tuning settings. Finegrained tasks (atypicality, persuasion) are harder due to the lack of symbolic understanding and spatial understanding. 79 • Inclusion of complex instruction fine-tuning data: Apart from general visual understanding tasks, complex reasoning data should be part of instruction tuning multimodal datasets. Large multimodal models can gain improved spatial-context understanding and commonsense reasoning capabilities. 80 Chapter 4 Characterizing the effect of data quality variations in multimedia understanding: A parameter efficient approach In this chapter, we will characterize the impact of modality heterogeneity in multimedia understanding tasks where the goal is to study the behavior of the models under non-ideal settings, e.g., variations in the quality of multimedia sources. In terms of quality variations, we will be considering commonly occurring corruptions in the individual modalities. 
We will study the behavior of widely used multimodal fusion models under diverse unimodal corruptions. Further, we will explore low-resource parameter-efficient strategies to handle modality corruptions and examine their relative advantages and limitations. An overview of this work can be seen in Fig 4.1.

4.1 Broader questions and contributions

In this chapter, we will try to address the following questions:
• How do variations in media data quality affect multimodal models for downstream understanding tasks?
• Under a low-resource setting, how can we handle quality variations in testing scenarios for multimodal models in a lightweight manner?

Specifically, we will focus on parameter-efficient techniques and information-theoretic methods to mitigate the impact of modality corruption during training/testing mismatched scenarios.

Figure 4.1: Variations in data quality require handling of modality heterogeneity, semantic modality connections, and proposal of novel training strategies to learn robust representations.

Further, we will check the role of corruption inclusion during the training phase and its role in mitigating training/testing mismatch.

4.2 Multimodal perception and quality variations in data

Multimodal perception relies on the fusion of information from diverse modalities, thus making use of cross-modal correlations in the form of shared and complementary information. In spite of the high-quality multimedia data available through diverse resources, the quality of the underlying modalities can vary due to environmental factors and sensor malfunctions. As mentioned in Zhang et al. [196], the salient properties of low-quality multimodal content in multimedia data can be summarized as follows:
• Modality-specific or aligned noise: Modality-specific noise can occur due to environmental factors, transmission losses, and sensor malfunctions. In the case of the visual modality, electronic noise can result in the loss of details. Distortion in the audio modality can occur due to environmental conditions. For the textual modality, noise can arise due to text redaction, removal of certain content, or replacement with irrelevant or semantically misaligned words. Apart from modality-specific factors, cross-modal noise can be present in multimedia data due to weak alignment between images and text caused by irrelevant content in web sources (social media, video-hosting platforms).
• Incomplete modalities: In the incomplete modality scenario, certain modalities may be absent due to equipment damage or data loss during transmission. Further, due to privacy concerns, certain modalities cannot be collected in real-world data, e.g., image-acquisition restrictions in certain areas, personally identifiable information, etc. In the medical domain, incomplete modalities can also occur naturally due to the lack of imaging capabilities at certain sites and the prohibitive cost of acquiring specific data like PET scans [197, 198].
• Imbalance between modalities: Different modalities depict the same concept from different views, thus resulting in correlation among the modality-specific content.
However, the information content in the respective modalities can be vastly different depending on the nature of the perception task. For example, in the action recognition task, the visual modality tends to dominate over the audio modality. This results in shortcuts during the multimodal learning process, leading to spurious correlations or incomplete exploration of inter-modality correlations. Gradient balancing methods like Gradient-blending [199] and OGM-GE [200] aim to balance the contributions of the respective modalities in order to learn calibrated multimodal correlations.
• Dynamic variations: Dynamic variations in input modality quality can occur due to poor sensor calibrations and unexpected environmental conditions. For example, in a self-driving application during nighttime, the thermal sensor carries more information than the RGB sensor. The situation changes during the daytime, when the RGB images carry more information.

An overview of the modality-specific properties associated with low-quality data is shown in Fig 4.2. In this chapter, we will be considering the impact of modality-specific noise on multimodal models and associated mitigation strategies.

Figure 4.2: Characteristics of modality-specific factors associated with low-quality multimedia data. Image source: [196].

4.3 Related work

Noise and multimodal representations: Prior studies like [197, 201] have shown the adverse impact of noise, including modality corruption, on multimodal fusion operations. Zhang et al. [202] proposed an uncertainty-based fusion strategy with provable guarantees for low-quality multimodal data. Further, Zhang et al. [202] showed that existing fusion mechanisms like MMBT (Kiela et al., 2019) and TMC [203] fail adversely under high-noise regimes across diverse datasets. Prior works in the area of noisy thermal and RGB fusion have considered total-variation-based approaches [204, 205].

Parameter-efficient fine-tuning: Pretrained language models are adapted for diverse tasks through parameter-efficient methods [206] that fine-tune a small portion of model parameters, keeping the original set of parameters untouched. Adapter tuning [207] is one of the earliest approaches, relying on the insertion of small adapter modules into the pretrained language models during fine-tuning. Other parameter-efficient approaches include the prepending of layer-wise trainable parameters, i.e., Prefix-tuning [208], and updating of the bias terms, i.e., BitFit [209]. In the area of reparameterization-based parameter-efficient methods, LoRA [192] decomposes task-specific weight updates into the product of low-rank matrices. He et al. [210] combine the parameter-efficient methods into a unified framework based on method-wise differences and possible connections. In the vision-language domain, LoRA is used to update the parameters of CLIP [211] in a few-shot manner for diverse V-L tasks. Gou et al. [212] utilize a mixture of LoRA experts for diverse V-L instruction fine-tuning tasks. Further, for controllable image generation, LoRA is used to learn concept sliders [213] for pretrained text-to-image models.

Information bottleneck-based representation learning: The information bottleneck (IB) [214, 215] has been previously explored in order to obtain the optimal balance between representation compression and predictive power during the learning process.
Apart from the standard variational approach to information bottleneck [216], alternative formulations have explored the usage of classbased conditioning [217], maximum entropy [218], Hilbert-Schmidt criterion [219]. In the natural language processing domain, Information Bottleneck (IB) has been utilized for low-resource finetuning of BERT for natural language inference and sentiment analysis tasks in [220]. Additional usage of information bottleneck includes its usage as a task-specific regularizer for learning adversarially robust representations in language models [221, 222]. In the multimodal domain, IB has been used to learn discriminative minimally sufficient representations [223] for sentiment analysis and emotion recognition tasks. Further, IB as a regularizer has been used to learn multimodal representations robust to linguistic variations and image corruptions for visual question-answering tasks [224]. In the domain of handling noisy modalities and associated aleatoric uncertainty, bottleneck-based joint fusion strategies [225] have been used after an expensive modality alignment phase. Additional usage of IB as an efficient multimodal denoisingbased fusion mechanism has been considered for multimodal intent detection [226], multimodal omics data handling [227]. In the case of multimodal fusion, existing studies have shown the usage of IB-based objectives for learning discriminative representations with limited studies on modality corruptions. In our 85 study, we want to check: (A) Can IB can act as a standalone method to handle modality-specific corruptions (B) How parameter efficient fine-tuning of encoders (esp. LoRA) actually affects IBbased objectives? (C) What is the optimal combination in a low resource setting in terms of IB and parameter-efficient methods to handle modality-specific corruptions? Large scale robustness studies: Earliest works in the robustness analysis of deep-neural networks include benchmark datasets like ImageNet-C and Icons-50 [228], ImageNet-A [229] that consider adversarial and out-of-distribution scenarios. Further specialized robustness-oriented benchmarks include out-o-domain distribution scenarios [230, 231] across multiple tasks, including object detection, image classification, and pose estimation. In the NLP domain, prior works have considered the sensitivity of language models to multiple input perturbations [232], out-ofdomain distribution effects [233], and adversarial corruptions [234]. Existing studies in the multimodal domain have focussed on adversarial image hijacks for vision-language models [235], consistency, and robustness evaluation of image-to-text [236], text-to-image, and speech-to-text models under common corruptions [237] and effects of broad modality corruptions on video-language models [238]. Current extensions have also considered the impact of common modality-specific corruptions on pretrained multimodal generative models [239] when modified through parameterefficient update methods, including LoRA, adapters, and compacters. 4.4 Multimodal fusion architectures In terms of multimodal fusion architectures, we consider DualEncoder models composed of two separate streams for processing image and text modalities. The architecture designs of the two DualEncoder models are shown in Fig 4.3. The motivation for choosing DualEncoder models was to make sure that we investigate the role of unimodal corruptions separately. 
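To make the Concat-MLP fusion design concrete, the following is a minimal PyTorch sketch of the fusion head on top of two frozen unimodal encoders. It assumes 768-dimensional text and image embeddings and the layer sizes indicated in Fig 4.3 (dropout of 0.3, a 1536-to-512 projection with ReLU, and a 512-to-768 fused embedding); the classifier head, class count, and variable names are illustrative assumptions rather than the exact thesis implementation.

```python
import torch
import torch.nn as nn

class ConcatMLPFusion(nn.Module):
    """Concat-MLP fusion: concatenate unimodal embeddings, project to a fused embedding,
    and predict task logits. Dimensions follow the sketch in Fig 4.3 (assumed)."""

    def __init__(self, embed_dim: int = 768, hidden_dim: int = 512,
                 num_classes: int = 23, dropout: float = 0.3):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(2 * embed_dim, hidden_dim),  # 1536 -> 512
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),      # 512 -> 768 fused embedding
        )
        # Task head producing logits (e.g., 23 MM-IMDb genres or 3 MVSA sentiments).
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, text_embed: torch.Tensor, image_embed: torch.Tensor):
        x_fusion = torch.cat([text_embed, image_embed], dim=-1)  # (B, 1536)
        z_fusion = self.fusion(x_fusion)                          # (B, 768)
        return self.classifier(z_fusion), z_fusion


if __name__ == "__main__":
    fusion = ConcatMLPFusion()
    text_embed = torch.randn(4, 768)    # e.g., frozen BERT [CLS] embeddings
    image_embed = torch.randn(4, 768)   # e.g., frozen CLIP vision embeddings
    logits, z_fusion = fusion(text_embed, image_embed)
    print(logits.shape, z_fusion.shape)  # torch.Size([4, 23]) torch.Size([4, 768])
```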
86 Image Text Image Text Concat-MLP fusion Fused embedding dropout = 0.3 fc 2 x 768 = 1536 Intermediate embed 1536 x 512 Relu fc Fused embed 512 x 768 512 Figure 4.3: Dual stream architectures (DualEncoder) denoted by Concat-MLP fusion operations. 4.5 Datasets For our experiments, we use commonly available media-centered image-text classification benchmarks, i.e., MM-IMDb [240] and MVSA-Single [241]. 4.5.1 MM-IMDb MM-IMDb dataset [240] consists of 25,959 samples composed of movie poster images and paired text plots along with associated metadata, including year, language, writer, director aspect ratio etc. The poster images and plots are sourced from IMDb through movie IDs provided as part of the MovieLens20M [242] dataset. The multimodal task involves the prediction of genre labels in a 87 multi-label fashion. The distribution of genre labels w.r.t. train/val/test splits (15552, 2608, 7799) are listed in Table 4.1. Genre Train Dev Test Genre Train Dev Test Drama 8424 1401 4142 Family 978 172 518 Comedy 5108 873 2611 Biography 788 144 411 Romance 3226 548 1590 War 806 128 401 Thriller 3113 512 1567 History 680 118 345 Crime 2293 382 1163 Music 634 100 311 Action 2155 351 1044 Animation 586 105 306 Adventure 1611 278 821 Musical 503 85 253 Horror 1603 275 825 Western 423 72 216 Documentary 1234 219 629 Sport 379 64 191 Mystery 1231 209 617 Short 281 48 142 Sci-Fi 1212 193 586 Film-Noir 202 34 102 Fantasy 1162 186 585 Table 4.1: Dataset distribution across genres 4.5.2 MVSA-Single MVSA-Single [241] consists of 2592 samples composed of paired images and tweets. The multimodal task involves the prediction of sentiments (positive, negative, neutral) associated with image-text pairs in a single-label multi-class fashion. The image-text pairs were annotated by a group of annotators. The train/val/test splits consist of 1555, 518, and 519 paired image-text samples. We utilize the publicly available splits provided in the QMF [202] repository 1 . Sample examples from the datasets MM-IMDb and MVSA-Single can be seen in Fig 4.4 1https://github.com/QingyangZhang/QMF 88 Plot: Two parallel stories are told. In the first, a group of research scientists from a variety of backgrounds are investigating the strange appearance of items in remote locations …… Genre labels: Drama, Sci-Fi (a) Sample example from MM-IMDb ● Tweet: There may be times when we are powerless to prevent injustice, but there must never be a time when we fail to protest ● Sentiment category: Positive ● Tweet: @nytimesworld: More intense and frequent natural disasters displaced 19 million worldwide in 2014. ● Sentiment category: Negative (b) Sample example from MVSA-Single Figure 4.4: Sample multimodal examples (image-text) pairs from MM-IMDb and MVSA-Single datasets 4.6 Quality variations: Modality corruptions In terms of quality variations, we consider commonly occurring sources of corruption that result in noisy modalities. These corruptions can occur due to multiple factors, including sensor malfunctioning, environmental factors, and transmission errors. Further, these corruptions are considered for testing the robustness of large multimodal models, e.g., MMCBench [237]. Since we consider only image and text modalities for now in this study, the modality-specific corruptions are listed as follows: Image: For the image modality, we mask an image by adding 16 × 16 black patches at random locations. Detailed pseudocode for the random patch masking is available in Algorithm 1. 
Another instance of image-specific corruption occurs due to the addition of per-pixel Gaussian noise. Let the original image be represented as I, where I(x, y) is the intensity of the pixel at location (x, y). The noise is drawn from a Gaussian distribution N (µ,σ 2 ), where: • µ is the mean of the noise, • σ 2 is the variance (σ is the standard deviation). Let N(x, y) ∼ N (µ,σ 2 ) represent the noise added to each pixel (x, y). 89 The noisy image I ′ is given by: I ′ (x, y) = I(x, y) +A·N(x, y), Where: • A is a scaling factor that controls the amplitude of the noise, • N(x, y) is sampled independently for each pixel. Algorithm 1 Image Corruption with Random Patch Masking Require: Image I of size H ×W ×C, patch size P, corruption fraction f (0 ≤ f ≤ 1) Ensure: Corrupted image I ′ 1: Compute NH ← ⌊H/P⌋, NW ← ⌊W/P⌋ 2: Compute N ← NH ·NW ▷ Calculate total number of patches in the image 3: Compute K ← ⌊ f ·N⌋ ▷ Compute fraction of indices 4: Randomly select K unique indices {i1,i2,...,iK} from {0,1,...,N −1} ▷ Find the indices that will be used for adding black patches 5: for each selected index ik do 6: r ← ⌊ik/NW ⌋, c ← ik mod NW 7: I ′ (r·P : (r +1)·P, c ·P : (c+1)·P, :) ← 0 ▷ Set the randomly located patch to zero 8: end for 9: Return I′ Text: For text modality, we mask the text by adding ”[MASK]” at random locations in the entire text. Detailed pseudocode for the random text word masking is available in Algorithm 2. Further, we also consider the commonly occurring text corruption where instead of substituting with ”[MASK]” tokens, a fraction of words denoted by f is randomly removed. 4.7 Behavior of multimodal fusion models under corruptions In this section, we characterize the behavior of multimodal fusion models under different types of unimodal corruptions. Based on the fusion blocks shown in Fig 4.3, the overall structure of the 90 Algorithm 2 Corrupting Text by Masking a Fraction of Words Require: Text string S, corruption fraction f where 0 ≤ f ≤ 1 Ensure: Corrupted text T ′ 1: T ← ExtractText(S) ▷ Preprocessing to remove unwanted components 2: W ← Split(T,’ ’) ▷ Split text into a list of words 3: L ← |W| ▷ Number of words 4: K ← ⌊ f ·L⌋ ▷ Number of words to mask 5: I ← RandomChoice({1,...,L},K,replace=false) ▷ Random set of indices to mask 6: for i from 1 to L do 7: if i ∈ I then 8: W′ [i] ← ”[MASK]” 9: else 10: W′ [i] ← W[i] 11: end if 12: end for 13: T ′ ← Join(W′ ,’ ’) ▷ Combine the masked words back into a string 14: return T ′ dual encoder model is shown in Fig 4.5. In the multimodal dual encoder model, we use CLIP’s [48] semantically aligned visual encoder and bert-base-uncased [176] textual encoder 2 . Text (plot, caption, tweet) Image Xfusion or or BERT CLIP Vision encoder Fusion MLP Ylogits Frozen completely Fully-fine tuned Parameter efficient fine-tuning Figure 4.5: Multimodal dual encoder architecture We first ask the following question regarding the effect of variations in media quality (i.e. modality corruptions) and study the behavior for the two datasets MMIMDb and MVSA-Single w.r.t. the dual encoder architecture shown in Fig 4.5. 2https://huggingface.co/google-bert/bert-base-uncased 91 How do variations in media data quality in terms of modality corruptions affect multimodal models for downstream understanding tasks? We follow the policy of training the dual encoder model with clean data and testing with corrupted data with unimodal corruptions (text or image). 
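For reference, the corruption operations described in Section 4.6 (random patch masking per Algorithm 1, per-pixel Gaussian noise addition, and word masking or removal per Algorithm 2) can be implemented as in the following Python sketch. It assumes images are H x W (x C) NumPy arrays; the 8-bit pixel range used for clipping and the default fractions are illustrative assumptions.

```python
import random
import numpy as np

def mask_random_patches(image: np.ndarray, patch_size: int = 16, fraction: float = 0.5) -> np.ndarray:
    """Algorithm 1 (sketch): black out a random fraction of non-overlapping P x P patches."""
    corrupted = image.copy()
    h, w = image.shape[:2]
    n_h, n_w = h // patch_size, w // patch_size
    n_patches = n_h * n_w
    k = int(fraction * n_patches)
    for idx in random.sample(range(n_patches), k):  # K unique patch indices
        r, c = idx // n_w, idx % n_w
        corrupted[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size] = 0
    return corrupted

def add_gaussian_noise(image: np.ndarray, amplitude: float = 25.0,
                       mu: float = 0.0, sigma: float = 1.0) -> np.ndarray:
    """Per-pixel Gaussian noise: I'(x, y) = I(x, y) + A * N(x, y), with N(x, y) ~ N(mu, sigma^2)."""
    noise = np.random.normal(mu, sigma, size=image.shape)
    noisy = image.astype(np.float32) + amplitude * noise
    return np.clip(noisy, 0, 255).astype(image.dtype)  # assumes 8-bit pixel values

def corrupt_words(text: str, fraction: float = 0.5,
                  mask_token: str = "[MASK]", remove: bool = False) -> str:
    """Algorithm 2 (sketch): replace (or, if remove=True, drop) a random fraction of words."""
    words = text.split()
    k = int(fraction * len(words))
    chosen = set(random.sample(range(len(words)), k))
    if remove:
        words = [w for i, w in enumerate(words) if i not in chosen]
    else:
        words = [mask_token if i in chosen else w for i, w in enumerate(words)]
    return " ".join(words)
```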
4.7.1 MM-IMDb The training and testing setting for the MMIMDb dataset is mentioned here: • Training: Training with clean paired poster images and movie plot texts. • Testing: Testing with clean paired poster images and corrupted movie plot texts (either with ”[MASK]” corruptions and word removals) We consider text corruption for MM-IMDb since it is the dominant modality for MM-IMDb since the text plot contains more information as compared to the poster images. An outline of the training and testing setting for MM-IMDb is shown in Fig 4.6. Text (plot) Image Xfusion Y BERT logits CLIP Vision encoder Fusion MLP Fully-fine tuned Training phase: models trained with CE loss Clean: No corruptions Fusion MLP Testing phase: models tested with corrupted text and clean image Concat-MLP fusion Concat-MLP fusion Text (plot) Image BERT CLIP Vision encoder Xfusion Ylogits Frozen Corrupted Figure 4.6: MM-IMDB training and testing setting under text modality corruption (testing) 92 (a) (b) Figure 4.7: (a) MM-IMDb: Performance of the dual encoder multimodal model when tested with [MASK] corruptions. k% per sample corruption: k% of words in text replaced by “[MASK]” (b) MM-IMDb: Performance of the dual encoder multimodal model when tested with word removal corruptions. k% per sample corruption: k% of words in text removed. An average of 5 runs is shown for every test data corruption fraction. From Fig 4.7 (a) and 4.7 (b), we can see that the performance of the dual-encoder multimodal model is adversely affected due to both ”[MASK]” and word removal corruptions. As the fraction of per sample word corruption (”[MASK]”) increases (from 30% to 70%), the drop in average F1 score increases from 11.31% to 26.26%. For word removal, as a fraction of per sample word corruption increases (from 30% to 70%), the drop in average F1 score increases from 3.88% to 15.01%. This suggests that ”[MASK]” corruption adversely affects the performance of the multimodal dual encoder model as compared to word removal. 4.7.2 MVSA-Single The training and testing setting for MVSA-Single dataset is mentioned here: • Training: Training with clean images and tweets. • Testing: Testing with corrupted images (either with random patch masking and gaussian corruptions) and corrupted tweets. 93 We consider image corruption for MVSA-Single since the sentiment-analysis task requires information from both modalities for prediction with a slightly higher preference for the image modality. An outline of the training and testing setting for MVSA-Single is shown in Fig 4.8. Text (tweet) Image Xfusion Y BERT logits CLIP Vision encoder Fusion MLP Training phase: models trained with CE loss Fusion MLP Testing phase: models tested with corrupted image and clean text Concat-MLP fusion Concat-MLP fusion Text (tweet) Image BERT CLIP Vision encoder Xfusion Ylogits Fully-fine tuned Clean: No corruptions Frozen Corrupted Figure 4.8: MVSA-Single training and testing setting under image modality corruption (testing) (a) (b) Figure 4.9: (a) MVSA-Single: Performance of the dual encoder multimodal model when tested with random patch (16 × 16) corruptions. k% per sample corruption: k% of 16x16 patches in image masked by black patches (b) MVSA-Single: Performance of the dual encoder multimodal model when tested with per-pixel Gaussian noise addition with varying amplitude. Average of 5 runs are shown for every test data corruption fraction. 
94 From Fig 4.9 (a) and 4.9 (b), we can see that the performance of the dual-encoder multimodal model is adversely affected by both random patch corruptions and per-pixel Gaussian noise addition. As the fraction of per sample patch corruptions increase (from 30% to 70%), the drop in average F1 score increases from 7.53% to 14.61%. For per-pixel Gaussian noise addition, as the amplitude of the noise is increased, the drop in average F1 score increases from 2.27% to 5.52%. This suggests that for MVSA-Single, random patch masking-based corruption adversely affects the performance of the multimodal dual encoder model as compared to gaussian noise addition. 4.8 Low-resource robust multimodal fusion Based on the previous testing scenarios in the case of MM-IMDb and MVSA-Single, we can see that the unimodal corruptions, when introduced in the testing phase, adversely affect the performance of pretrained models. Further, in the previous settings, we can see that the unimodal encoders were kept frozen. This is primarily done to ensure that we can operate in a low-resource setting where we need to update a minimal amount of parameters. Based on this low-resource constraint, we want to address the following question: Under a low-resource parameter update setting, how can we bridge the gap in training and testing scenarios for multimodal models when faced with corrupted data? This motivates us to focus on low-resource multimodal fusion strategies where we want to mitigate this training and testing gap by updating a minimal number of parameters. An outline of the low-resource multimodal fusion scheme that we will be considering is shown in Fig 4.10. 4.8.1 Parameter-efficient finetuning Parameter-efficient fine-tuning [206] for large vision and language models can be classified into broadly (a) addition-based methods like adapters, prefix tuning, prompt tuning, (b) specificationbased methods, (c) parameterization-based methods like LoRA. 95 Low-resource robust multimodal fusion Adaptation of unimodal representations Information flow control Delta tuning Parameter-efficient adaptation methods LoRA: Reparameterization method Information bottleneck for representation Input-level: Modality specific bottleneck Fusion-level: Fusion-specific bottleneck Figure 4.10: Outline of the low resource multimodal fusion approach that focuses on reparameterization method, i.e., LoRA and information flow control through bottleneck guided objectives. 4.8.1.1 LoRA-Multimodal setup LoRA [192] was introduced as a re-parameterization method for language models that relies on the addition of minimal task-specific parameters. If the frozen pretrained weights are denoted by Wo, LoRA adds the weight update ∆W as the product of two low-rank matrices A and B. W ′ = Wo +∆W = Wo +AB (4.1) Here, A ∈ R d×r and B ∈ R r×n where A and B are the rank −r matrices that are learned during the optimization process. A schematic representation of the LoRA-based additional weight update is shown in Fig 4.11 In the case of the dual encoder multimodal architecture, we aim to use LoRA for updating the unimodal encoders one branch at a time. We don’t use LoRA to update both unimodal encoders in order to minimize the number of trainable parameters. Further, we want to investigate how minimal unimodal adaptation can influence downstream testing performance. An outline of the LoRA application for respective modalities is shown in Fig 4.12. 
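As a concrete illustration of the weight update in Eq 4.1, the following is a minimal PyTorch sketch of a LoRA-augmented linear layer: the pretrained weight W_o stays frozen and only the low-rank factors A and B are trained. The α/r scaling and the zero initialization of B follow the original LoRA paper and are assumptions here, as is the wrapper-style implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W' = W_o + (alpha / r) * A B."""

    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():   # freeze the pretrained weight W_o (and bias)
            p.requires_grad = False
        d, n = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.Parameter(0.01 * torch.randn(d, r))  # A in R^{d x r}
        self.lora_B = nn.Parameter(torch.zeros(r, n))          # B in R^{r x n}; zero init => no update at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the learned low-rank path x A B.
        return self.base(x) + self.scaling * ((x @ self.lora_A) @ self.lora_B)
```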
Figure 4.11: Outline of LoRA-based weight update using learnable matrices A and B.

Figure 4.12: Outline of the LoRA-multimodal setup where individual modality branches are updated by LoRA-based weight adaptation schemes.

We outline the exact settings of the LoRA-based weight updates for MM-IMDb and MVSA-Single in Section 4.9.

4.8.2 Information Bottleneck (IB)

The information bottleneck principle [214] formulates the goal of deep learning as a trade-off between compression and predictive power. If a given task has associated inputs X and labels Y, and the intermediate layer representations of a deep learning model are denoted by T, then the optimal task-specific representation can be obtained through the following Lagrangian formulation:

\[ \mathcal{L}_{IB} = I(Y;T) - \beta\, I(X;T) \qquad (4.2) \]

Here, the objective aims at:
• Increasing predictive power: maximize the mutual information (MI) between labels Y and representations T.
• Representation compression: minimize the mutual information (MI) between inputs X and representations T.

The parameter β controls the relative contributions of the two objective terms. In a multimodal setting, e.g., the dual encoder model considered in Fig 4.3, the multimodal representation is obtained by combining the unimodal representations from the textual and visual streams. In the case of complex fusion mechanisms, the unimodal representations are often ignored, resulting in the under-utilization of the predictive power of the unimodal streams. Further, the unimodal branches contain noise factors that can be irrelevant to the given task. Since we are interested in mitigating the impact of modality-specific corruptions in the testing scenario, the motivations for applying IB (at the fusion and input level) are as follows:
• Minimize input noise dependency: reduce the impact of modality-specific noise so that the dependency on it in the testing scenario is minimized.
• Maximize representation power: maximize the task-specific discriminative power of the multimodal representations.

Further, as mentioned in the previous section, parameter-efficient updates of the unimodal encoders modulate the representation power of the pretrained encoders for the given task. Apart from standard IB-based formulations, we want to investigate (A) the impact of LoRA on IB-based representation learning schemes and (B) the relative advantages of IB-only or LoRA-only parameter-efficient methods and any associated optimal placements. We provide details of the two IB formulations, which are inspired by the early and late-fusion IB schemes of [223].

4.8.2.1 Fusion-IB: Multimodal setup

In the Fusion-IB multimodal setup, we utilize IB to maximize the representation power of the fused embedding in the dual encoder model and minimize the impact of modality-specific spurious correlations and noise that are propagated by the fusion operation. The mutual-information-based objective function can be written as follows:

\[ \text{Loss} = \max\big( I(Y; Z_{fusion}) - \beta\, I(X_{fusion}; Z_{fusion}) \big) \qquad (4.3) \]

Since estimating mutual information in high dimensions is difficult, we utilize a variational approximation of the above loss function to optimize the dual-encoder multimodal model.
Based on the derivation steps mentioned in [216], the variational approximation of the above loss function is given as follows:

\[ \text{Fusion-IB}_{Loss} = \min\big( \mathrm{CE}(Y; Y_{logits}) + \beta\, \mathrm{KL}\big( p(Z_{fusion} \mid X_{fusion}) \,\|\, \mathcal{N}(\mu, \sigma) \big) \big) \qquad (4.4) \]

Here, \mathcal{N}(\mu, \sigma) is the prior spherical Gaussian distribution. Based on the reparameterization trick used in [243], Z_fusion is sampled from X_fusion as follows (with ε ∼ N(0, I)):

\[ Z_{fusion} = \mu_{fusion} + \sigma_{fusion} \times \varepsilon \qquad (4.5) \]

An outline of the fusion-based IB is shown in Fig 4.13. As shown in Fig 4.13, the additional parameters that need to be updated are the weights associated with the two MLPs for estimating μ_fusion and σ_fusion.

Figure 4.13: Outline of the Fusion-level IB multimodal setup: (a) mutual information (MI) based IB formulation (b) variational approximation of the MI-based formulation.

For individual datasets, including MVSA-Single and MM-IMDb, we will discuss Fusion-IB in the context of (A) LoRA update of the unimodal encoders and (B) no application of LoRA for the unimodal encoder updates.

4.8.2.2 Input-IB: Multimodal setup

In the Input-IB multimodal setup, we draw inspiration from the IB objective to minimize the impact of noise in the unimodal branches. When noise impact minimization is considered for the textual branch, the mutual-information-based objective can be written as follows:

\[ \text{Input-IB}_{Loss} = \min\big( \mathrm{CE}(Y; Y_{logits}) + \beta\, I(X_{text}; Z_{text}) \big) \qquad (4.6) \]

As mentioned in the previous section on Fusion-IB, we utilize a variational approximation to the Input-IB loss, since mutual information is difficult to estimate in high dimensions. The variational approximation can be written as:

\[ \text{Input-IB}_{Loss} = \min\big( \mathrm{CE}(Y; Y_{logits}) + \beta\, \mathrm{KL}\big( p(Z_{text} \mid X_{text}) \,\|\, \mathcal{N}(\mu, \sigma) \big) \big) \qquad (4.7) \]

Based on the reparameterization trick used in [243], Z_text is sampled from X_text as follows:

\[ Z_{text} = \mu_{text} + \sigma_{text} \times \varepsilon \qquad (4.8) \]

We do not consider a separate cross-entropy loss component from the input (text) branch, as in standard IB, since the goal is to minimize the noise impact on the unimodal representation that will be used for the multimodal fusion operation. As part of future studies, we plan to include a proper unimodal IB component to check whether it adds additional robustness during testing with corrupted modalities. An outline of the Input-IB setup is shown in Fig 4.14. While the example considered here takes the text branch to showcase the overall formulation, a similar setting holds for the image branch as well.

Figure 4.14: Outline of the Input-IB multimodal setup: (a) mutual information (MI) based IB formulation (b) variational approximation of the MI-based formulation.
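To make the variational formulations in Eqs 4.4–4.8 concrete, the following is a minimal PyTorch sketch of an IB layer and its loss: two small heads predict μ and a log-variance, Z is sampled via the reparameterization trick, and a KL term against a standard spherical Gaussian prior is added to the task loss. The single-linear-layer heads, the log-variance parameterization, and the use of cross-entropy as the task loss (binary cross-entropy would be used for the multi-label MM-IMDb task) are assumptions for illustration, not the exact thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalIBLayer(nn.Module):
    """Map an embedding X to a stochastic bottleneck Z = mu + sigma * eps (Eqs 4.5 / 4.8)."""

    def __init__(self, in_dim: int = 768, z_dim: int = 768):
        super().__init__()
        self.mu_head = nn.Linear(in_dim, z_dim)
        self.logvar_head = nn.Linear(in_dim, z_dim)  # predict log(sigma^2) for stability

    def forward(self, x: torch.Tensor):
        mu, logvar = self.mu_head(x), self.logvar_head(x)
        eps = torch.randn_like(mu)                   # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps       # reparameterization trick
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
        return z, kl

def ib_loss(logits: torch.Tensor, targets: torch.Tensor, kl: torch.Tensor,
            beta: float = 1e-4) -> torch.Tensor:
    """Eqs 4.4 / 4.7: task loss plus a beta-weighted KL penalty on the bottleneck.

    The IB layer is placed either on the fused embedding (Fusion-IB) or on a single
    unimodal embedding before fusion (Input-IB); the loss form is the same.
    """
    return F.cross_entropy(logits, targets) + beta * kl
```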
As shown in Fig 4.14, the additional parameters that need to be updated are the weights associated with two MLPs for estimating µtext and σtext. For individual datasets, including MVSA-Single and MM-IMDb, we will discuss Input-IB in the context of (A) LoRA update of the unimodal encoders and (B) no application of LoRA for the unimodal encoder updates. For both input and fusion-IB formulations, we sweep the β parameter from [1e − 4,1e − 3]. We find the optimal β parameter to be 1e−4 for both settings. With increasing β parameter, the performance drops sharply for both datasets. 4.9 Low resource multimodal fusion and corruptions 4.9.1 LoRA: Multimodal setup 4.9.1.1 MM-IMDb For MM-IMDb, the training and testing setting for the MMIMDb dataset is mentioned here: • Training: Training with clean paired poster images and movie plot texts (Cross-entropy loss). • Testing: Testing with clean paired poster images and corrupted movie plot texts (with ”[MASK]” corruptions) For MM-IMDb, the two model configurations with the previous training and testing settings are listed as follows: • no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) • LoRA: BERT encoder (LoRA), CLIP Vision encoder (frozen) For the BERT encoder, we apply LoRA (rank=16,α = 8) to the query and value matrices of the attention layers. For MM-IMDb, we use ”[MASK]” corruptions because, in comparison to word 102 Text (plot) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion LoRA Text (plot) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Frozen completely Fully-fine tuned Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (LoRA), CLIP Vision encoder (frozen) Testing phase (corrupted data): ● Text corrupted with [MASK], Image clean or Figure 4.15: LoRA-Multimodal setting for MMIMD dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: ”[MASK]” corruptions are applied to the BERT encoder (text branch), and image input is kept clean. removal, the testing performance is significantly worse, as shown in section 4.7.1. From Fig 4.16, we can see that as per-sample text corruption increases, the gap between LoRA and no LoRA curves decreases. Due to the text modality dominance of the genre classification task in MMIMDb, LoRA adaptation of the text branch during training provides additional robustness as compared to the no LoRA setting. If we consider 100% corruption in test data (i.e. all samples are corrupted with ”[MASK]”), the impact of LoRA adaptation in the text branch decreases as the per-sample corruption increases from 30% to 70%. 4.9.1.2 MVSA-Single For MVSA-Single, the training and testing settings are mentioned here: • Training: Training with clean images and tweets (Cross-entropy loss). 103 (a) (b) (c) Figure 4.16: MM-IMDb (LoRA: Multimodal setup): Performance of dual-encoder multimodal model under LoRA and no-LoRA model configurations for k% per sample corruption (a) k=30 (b) k=50 (c) k=70 • Testing: Testing with corrupted images (with random black patch-based masking) and clean tweets. 
For MVSA-Single, the two model configurations with the previous training and testing settings are listed as follows: • no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) • LoRA: BERT encoder (frozen), CLIP Vision encoder (LoRA) For the CLIP Vision encoder, we apply LoRA (rank=8,α = 16) to the query and value projection linear layers. 104 Text ( tweet) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion LoRA Text (tweet) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Frozen completely Fully-fine tuned or Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (frozen), CLIP Vision encoder (LoRA) Testing phase (corrupted data): ● Text clean, Image corrupted (black patches with patch size=16) Figure 4.17: LoRA-Multimodal setting for MVSA-Single dataset. LoRA: LoRA was applied to the CLIP Visual encoder (visual branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: The text input(tweets) is kept clean, and the image input is corrupted by random patch-based masking (16 × 16 patches). From Fig 4.18, we can see that as the per-sample image corruption increases, the gap between LoRA (rank = 16,al pha = 8) and no LoRA curves increases. Thus, the adaptation of the CLIP Vision encoder (image branch) through LoRA adversely affects the test-time performance with image corruption. Unlike the case with MM-IMDb, the LoRA adaptation of the CLIP’s Vision encoder i.e. the corrupted test branch, doesn’t help in test-time adaptation to image corruptions. 4.9.2 Fusion-IB and Input-IB: Multimodal setup 4.9.2.1 MM-IMDb For MM-IMDb, the training and testing setting for the Fusion-IB multimodal setup is mentioned here: 105 (a) (b) (c) Figure 4.18: MVSA-Single (LoRA: Multimodal setup): Performance of dual-encoder multimodal model under LoRA and no-LoRA model configurations for k% per sample corruption (a) k=30 (b) k=50 (c) k=70 • Training: Training with clean paired poster images and movie plot texts by utilizing the (Fusion-IB) loss in Eq 4.4. • Testing: Testing with clean paired poster images and corrupted movie plot texts (with ”[MASK]” corruptions) For Input-IB multimodal setup, the training and testing setting is mentioned here: • Training: Training with clean paired poster images and movie plot texts by utilizing the (Input-IB) loss in Eq 4.7. • Testing: Testing with clean paired poster images and corrupted movie plot texts (with ”[MASK]” corruptions) 106 For both Input-IB and Fusion-IB, the two model configurations with the previous training and testing settings are listed as follows: • no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) • LoRA: BERT encoder (LoRA), CLIP Vision encoder (frozen) Text (plot) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion LoRA Frozen completely Fully-fine tuned Text (plot) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (LoRA), CLIP Vision encoder (frozen) Testing phase (corrupted data): ● Text corrupted with [MASK], Image clean or Figure 4.19: Fusion-IB: Multimodal setup for MMIMDb dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. 
Testing: ”[MASK]” corruptions are applied to the BERT encoder (text branch), and image input is kept clean. For the BERT encoder, we apply LoRA (rank=8,α = 16) to the query and value matrices of the attention layers. The outlines of both Fusion-IB and Input-IB training/testing scenarios are shown in Fig 4.19 and 4.20. From Fig 4.21, we can see that as the per-sample corruption increases, input-level IB closes the gap with cross-entropy fine-tuning (no LoRA) at higher rates of test data corruption (60% and onwards). Since the text modality is dominant for the genre classification task in MMIMDb, the 107 Text (plot) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion IB layer Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (LoRA), CLIP Vision encoder (frozen) Text (plot) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion IB layer Testing phase (corrupted data): ● Text corrupted with [MASK], Image clean LoRA Frozen completely Fully-fine tuned or Figure 4.20: Input-IB: Multimodal setup for MMIMDb dataset. LoRA: LoRA was applied to the BERT encoder (text branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. ”[MASK]” corruptions are applied to the BERT encoder (text branch), and image input is kept clean. effect of input-noise minimization through input-IB is more prominent in comparison to fusion level IB, even when the BERT encoder (text encoder) is not updated via LoRA. In order to check if the LoRA-based parameter update of the BERT encoder helps the input and fusion-based IB settings, we consider the case of k = 70% per sample corruption. From Fig 4.22, we can see that the performance of input-level IB while testing improves upon the application of LoRA during training in the text branch (dotted-red) line. This can be attributed to the importance of the text modality for the genre classification task in MMIMDb. Due to the dominance of text, the LoRA adaptation and input noise minimization through input-IB make the text representations robust against test-time modality corruptions. 108 (a) (b) (c) Figure 4.21: MMIMDb: Performance of Input-IB (no-LoRA) and Fusion-IB (no-LoRA) in comparison to normal cross-entropy fine-tuning setting (no-LoRA) for k% per sample corruption (a) k=30 (b) k=50 (c) k=70 4.9.2.2 MVSA-Single For MVSA-Simgle, the training and testing setting for the Fusion-IB multimodal setup is mentioned here: • Training: Training with clean paired images and tweets by utilizing the (Fusion-IB) loss in Eq 4.4. • Testing: Testing with clean tweets and corrupted images (with random patch masking corruptions) For Input-IB multimodal setup, the training and testing setting is mentioned here: 109 (a) (b) Figure 4.22: MM-IMDb: (a) Performance of input-IB (no-LoRA), Fusion-IB (no-LoRA) and fine-tuning setting (no-LoRA) for 70% per sample corruption setting (b) Performance of input-IB (LoRA), Fusion-IB (LoRA) and fine-tuning setting (LoRA) for 70% per sample corruption setting in comparison to the no-LoRA setting • Training: Training with clean paired poster images and tweets by utilizing the (Input-IB) loss in Eq 4.7. 
• Testing: Testing with clean tweets and corrupted images (with random patch masking corruptions) For both Input-IB and Fusion-IB, the two model configurations with the previous training and testing settings are listed as follows: • no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) • LoRA: BERT encoder (frozen), CLIP Vision encoder (LoRA) From Fig 4.25, we can see that as the per-sample corruption increases, fusion-level IB performs better than cross entropy fine-tuning (no LoRA) at higher rates of test data corruption (60% and onwards). Since both the modalities are of similar importance for the sentiment classification task in MVSA-Single, the effect of input-noise minimization through input-IB is not prominent in comparison to fusion level IB. Further, we can see that since the LoRA update of CLIP’s visual 110 Text (tweet) Image Xfusion Ylogits or BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion LoRA Frozen completely Fully-fine tuned Text (tweet) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (frozen), CLIP Vision encoder (LoRA) Testing phase (corrupted data): ● Text clean, Image corrupted (black patches with patch size=16) Figure 4.23: Fusion-IB: Multimodal setup for MVSA-Single dataset. LoRA: LoRA was applied to the CLIP’s vision encoder (image branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: Random patch (16 × 16) corruptions are applied to the image branch, and text input is kept clean. encoder adversely affects test-time robustness to random patch masking( Fig 4.18), LoRA’s inclusion with IB (input and fusion-IB) brings down the performance when compared to the no-LoRA setting, as seen in Fig 4.26. 4.9.3 Optimal LoRA placement Based on the previous results shown for MM-IMDb and MVSA-Single, we can infer the following: • w/o LoRA: Without any LoRA update of the unimodal encoders, Input and Fusion level IB provide additional robustness with increasing fraction of corrupted test samples. • w LoRA: However, LoRA, when used to update the corrupted unimodal branch, either helps or adversely affects the performance in the presence of corruption. This brings us to the following question: 111 Text (tweet) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion Training phase (clean data): ● no-LoRA: BERT encoder (frozen), CLIP Vision encoder (frozen) ● LoRA: BERT encoder (frozen), CLIP Vision encoder (LoRA) Testing phase (corrupted data): ● Text clean, Image corrupted (black patches with patch size=16) LoRA Frozen completely Fully-fine tuned IB layer or Text (tweet) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion IB layer Figure 4.24: Input-IB: Multimodal setup for MVSA-Single dataset. LoRA: LoRA was applied to the CLIP’s vision encoder (image branch) in the training phase with clean data. no-LoRA: Image and text encoders are kept frozen. Testing: Random patch (16 × 16) corruptions are applied to the image branch, and text input is kept clean. What is the optimal placement of LoRA adaptation for tackling modality corruption? Since we are considering a dual-encoder multimodal model and considering unimodal corruptions in our test setting, we can focus on the unimodal branch that doesn’t experience corruption during testing. 
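In practice, this branch-wise placement amounts to choosing which unimodal encoder receives the LoRA adapters while the other stays frozen. A minimal sketch using the Hugging Face peft library is shown below; the thesis does not specify its implementation, so the library choice, checkpoint names, and target module names are assumptions, while the ranks mirror the settings reported earlier in this chapter (query/value projections, e.g., r=16, α=8 for the text branch and r=8, α=16 for the image branch).

```python
from transformers import BertModel, CLIPVisionModel
from peft import LoraConfig, get_peft_model

def build_encoders(adapt_branch: str = "text"):
    """Return (text_encoder, image_encoder) with LoRA attached to one branch only."""
    text_encoder = BertModel.from_pretrained("google-bert/bert-base-uncased")
    image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    # Freeze both branches; LoRA re-enables only its own low-rank parameters.
    for p in text_encoder.parameters():
        p.requires_grad = False
    for p in image_encoder.parameters():
        p.requires_grad = False

    if adapt_branch == "text":
        # LoRA on the query/value projections of BERT self-attention.
        cfg = LoraConfig(r=16, lora_alpha=8, target_modules=["query", "value"])
        text_encoder = get_peft_model(text_encoder, cfg)
    elif adapt_branch == "image":
        # LoRA on the query/value projections of the CLIP vision transformer.
        cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
        image_encoder = get_peft_model(image_encoder, cfg)

    return text_encoder, image_encoder
```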
Even if the non-corrupted modality is not dominant, we can assume that the information present is more reliable than the corrupted modality. This motivates us to explore LoRA-based weight updates in the non-corrupted branch. Prior works have used modality-specific LoRA-based weight updates for downstream tasks like emotion recognition [244], visual instruction tuning [245] etc. 112 (a) (b) (c) Figure 4.25: MVSA-Single: Performance of Input-IB (no-LoRA) and Fusion-IB (no-LoRA) in comparison to normal cross-entropy fine-tuning setting (no-LoRA) for k% per sample corruption (a) k=30 (b) k=50 (c) k=70 4.9.3.1 MM-IMDb For MM-IMDb, we apply LoRA-based weight update to the image branch and compare the results with the previous LoRA-Multimodal and IB variants (input-IB and Fusion-IB), where the LoRA was used to update the BERT-base-encoder (text branch). An outline of LoRA application to the image branch for Input and Fusion-IB is shown in Fig 4.27. From Fig 4.28 (b), we can see that the application of LoRA in the image branch improves the performance of Fusion-IB and CE setting, where CE denotes the standard fine-tuning by cross-entropy (LoRA: Multimodal setup with LoRA applied to text branch in Fig 4.15). This shows the potential of adapting non-corrupted unimodal representations for recovering performance during test time. However, the Input IB (LoRA in text branch) is still slightly better due to text dominance in Fig 4.28 (b). 113 (a) (b) Figure 4.26: MVSA-Single: (a) Performance of Input-IB (no-LoRA), Fusion-IB (no-LoRA) and fine-tuning setting (no-LoRA) for 70% per sample corruption setting (b) Performance of input-IB (LoRA), Fusion-IB (LoRA) and fine-tuning setting (LoRA) for 70% per sample corruption setting in comparison to the no-LoRA setting Text (plot) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion IB layer Text (plot) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion LoRA Frozen completely Fully-fine tuned Input-IB: ● Training phase: BERT encoder (frozen), CLIP Vision encoder (LoRA) Fusion-IB: ● Training phase: BERT encoder (frozen), CLIP Vision encoder (LoRA) Figure 4.27: MM-IMDb: LoRA placement in the image branch for both input and fusion-level IB in the case of MM-IMDb dataset 4.9.3.2 MVSA Single For MVSA-Single, we apply LoRA-based weight update to the text branch and compare the results with the previous LoRA-Multimodal and IB variants (input-IB and Fusion-IB), where the LoRA 114 Input IB (LoRA in text branch) (a) Input IB (LoRA in text branch) (b) Figure 4.28: (a) MM-IMDb: Text corruption test results (70% per sample corruption by ”[MASK]” tokens) for LoRA placement in the text branch and no-LoRA settings. (b) MM-IMDb: Text corruption test results (70% per sample corruption by ”[MASK]” tokens) for LoRA placement in the image branch and no-LoRA settings. Green band indicates the results obtained by adding LoRA weight updates to the image branch. was used to update the CLIP’s vision encoder. An outline of LoRA application to the text branch for Input and Fusion-IB is shown in Fig 4.29. From Fig 4.30 (b), we can see that the application of LoRA in the text branch improves the performance of Fusion-IB, Input-IB and CE setting, where CE denotes the standard fine-tuning by cross-entropy (LoRA: Multimodal setup with LoRA applied to image branch in Fig 4.17). 
This means that the adaptation of text representations through LoRA during the training phase provides the required information to counter the effect of image corruption. Thus, adapting a non-corrupted branch using LoRA during clean data training can enhance robustness when testing with corrupted data. 4.9.4 Corrupted vs Clean training Till now, we have considered the clean training and corrupted testing setting, where unimodal corruptions are introduced to test the performance of the models. In an ideal setting, if we have prior knowledge regarding the nature of corruption, a possible alternative involves the introduction of similar modality-based corruption during training. 115 Text (tweet) Image Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion Zfusion IB layer Text (tweet) Image Xfusion Ylogits BERT CLIP Vision encoder Fusion MLP Concat-MLP fusion IB layer Zfusion LoRA Frozen completely Fully-fine tuned Input-IB: ● Training phase: BERT encoder (LoRA), CLIP Vision encoder (frozen) Fusion-IB: ● Training phase: BERT encoder (LoRA), CLIP Vision encoder (frozen) Figure 4.29: MVSA-Single: LoRA placement in the text branch for both input and fusion-level IB in the case of MVSA-Single dataset Fusion IB (no LoRA in image branch) (a) Fusion IB (LoRA in image branch) Fusion IB (LoRA in text branch) (b) Figure 4.30: (a) MVSA-Single: Image corruption test results (70% per sample corruption by random patch masking) for LoRA placement in the image branch and no-LoRA settings. (b) MVSASingle: Image corruption test results (70% per sample corruption by random patch masking) for LoRA placement in the text branch and no-LoRA settings. Green band indicates the results obtained by adding LoRA weight updates to the text branch. How is the test behavior affected when corruptions are introduced in the training phase? 116 4.9.4.1 MM-IMDb For corrupted setting, we add “[MASK]” during training with the following setting: • Fraction of total sample corruption: Randomly select 30% of samples per batch for corruption. • Per sample corruption: Per sample corrupt 50% of the text with “[MASK]”. From Fig 4.31, we can see that under corrupted training, the test-time performance of Input-IB, Fusion-IB, and CE (LoRA-Multimodal setting) improve with and without application of LoRAbased parameter update in the text encoder. In Fig 4.31(b), we can see that the performance of both Fusion-IB and CE improve with the inclusion of corruption during training when compared to clean training. Thus normal cross-entropy (CE) based fine-tuning of the dual encoder model with corrupted data acts as a robust strategy for handling test-time corruptions. Corrupted data (training) testing performance (w/o LoRA in text branch) Input IB (w/o LoRA in text branch) (a) Input IB (LoRA in text branch) Corrupted data (training) testing performance with (LoRA in text branch) (b) Figure 4.31: (a) MM-IMDb: Corrupted training results for Input-IB, Fusion-IB and CE (LoRAMultimodal setting) under no LoRA for text corruptions (“[MASK]”). (b) MM-IMDb: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under LoRA setting for text corruptions (“[MASK]”).LoRA applied to the text branch. 117 4.9.4.2 MVSA-Single For corrupted setting, we add random patch-based masking (16 × 16) during training with the following setting: • Fraction of total sample corruption: Randomly select 30% of samples per batch for corruption. • Per sample corruption: Per sample corrupt 50% of the image with 16×16 black patches. 
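A minimal sketch of how this corruption injection can be wired into the training loop, using the 30% batch fraction and 50% per-sample rate listed above; mask_random_patches and corrupt_words refer to the hypothetical helpers sketched earlier in this chapter, and the function signature is illustrative.

```python
import random

def corrupt_training_batch(texts, images, corrupt_modality="image",
                           sample_fraction=0.3, per_sample_fraction=0.5):
    """Corrupt a random subset of samples in a batch before encoding.

    sample_fraction: fraction of samples in the batch to corrupt (e.g., 0.3).
    per_sample_fraction: corruption strength within a chosen sample (e.g., 0.5).
    Assumes the corrupt_words / mask_random_patches helpers sketched earlier.
    """
    n = len(texts)
    chosen = set(random.sample(range(n), int(sample_fraction * n)))

    out_texts, out_images = list(texts), list(images)
    for i in chosen:
        if corrupt_modality == "text":
            out_texts[i] = corrupt_words(texts[i], fraction=per_sample_fraction)        # "[MASK]" insertion
        else:
            out_images[i] = mask_random_patches(images[i], patch_size=16,
                                                fraction=per_sample_fraction)           # black patches
    return out_texts, out_images
```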
Corrupted data (training) testing performance (w/o LoRA in image branch) Fusion-IB (w/o LoRA in image branch) (a) Corrupted data (training) testing performance (w LoRA in image branch) (b) Figure 4.32: (a) MVSA-Single: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under no LoRA for image corruptions (random patch masking). (b) MVSA-Single: Corrupted training results for Input-IB, Fusion-IB and CE (LoRA-Multimodal setting) under LoRA setting for image corruptions (random patch masking).LoRA applied to the image branch. From Fig 4.32, we can see that under corrupted training, the test-time performance of Input-IB, Fusion-IB, and CE (LoRA-Multimodal setting) improve with and without application of LoRAbased parameter update in the image encoder. In the case, where LoRA fine-tuning of the image encoder adversely affects the test time performance 4.32 (b), we can see that the training with corrupted data improves the performance of all IB variants and normal CE based finetuning. In the case of the LoRA application to image encoder, we can see the performance under corrupted 118 training. Similar to the MM-IMDb dataset, normal cross-entropy (CE) based fine-tuning of the dual encoder model with corrupted data can act as a robust strategy for handling test-time corruptions. 4.10 Major takeaways Based on the modality heterogeneity handling approach in terms of unimodal corruptions, the major takeaways from this chapter are as follows: • Mismatch in train/test behavior: We show that the performance of a dual-encoder multimodal fusion model composed of pretrained unimodal encoders is adversely affected when they encounter unknown noise in the testing phase. • Role of information bottleneck: We show that input and fusion level noise factor minimization through information bottleneck can provide some additional robustness against test-time modality-specific corruptions. • Role of Low-rank parameter update (LoRA): We show that when LoRA is applied to the modality branch that gets corrupted during testing, it can either improve or adversely affect the test-time robustness, depending on the nature of the task. Further, we find that when LoRA is used to update the unimodal encoder associated with the non-corrupted branch, the test-time robustness improves in the presence of unknown noise. This shows the importance of unimodal representations in downstream multimodal performance under non-ideal scenarios. • Modality-specific noise injection during training: If the modality-specific noise behavior is known as apriori, the ideal training setting involves the injection of a small amount of unimodal noise during training followed by evaluation using corrupted data. With noisy training, freezing the unimodal encoders followed by fine-tuning of the fusion block is enough to obtain robust behavior during testing. 119 4.11 Future directions and ongoing scope of work Based on the directions pursued in this chapter, certain areas that I am currently exploring as part of future work are as follows: • Bottleneck-only approach for dual encoders: How much information bottleneck can push in terms of test-time robustness without the application of LoRA-based unimodal encoder updates? How does the behavior change in the presence of alternative bottleneck formulations like HSIC [246]? • Inclusion of other parameter-efficient methods: How do the current findings vary for the representative datasets when other parameter-efficient methods are considered, including adapters and compacters? 
Further, how does the behavior change in the presence of other LoRA variants, including DoRA [247], VeRA [248]? • Dual-encoders vs single-encoders: Can similar findings be replicated for multimodal models based on single encoders that process multiple modalities as single integrated sequences? • Impact of adversarial corruptions: Apart from natural corruptions, how do the models behave in the presence of unimodal adversarial inputs? • Aligned vs non-aligned encoders: Instead of non-aligned encoders, how does the behavior change if we consider semantically aligned multimodal encoders like ImageBind [249]? • Role of different unimodal encoders: How does the test-time behavior of dual-encoder models vary based on the type of unimodal encoders used, such as those focused on semantic alignment versus reconstruction-based encoders? • Bottleneck extension to large multimodal models: Can information bottleneck (IB)-based formulations for noise reduction be applied to enhance the robustness of the modality alignment layer in large multimodal models? 120 • Corruption application to unimodal streams: Apart from unimodal corruption introduced one at a time, how does the behavior change when both modalities are corrupted ? ‘ 121 Chapter 5 Conclusion Here, we would like to revisit the overall structure for my thesis that links the core elements of multimodal perception with salient challenges in media understanding. Media sources Modality connection Modality representation and reasoning Modality heterogeneity Diverse content tagging Narrative understanding Quality variations Challenges in media understanding Multimodal perception Figure 5.1: Connection between multimodal perception at scale and computational media understanding Media content can be diverse, and the tags associated with media sources can vary from broad to fine-grained elements. Media tags can act as initial metadata for narrative understanding. Instead of considering content tagging and narrative understanding as two separate buckets, they can be combined under a broader direction called Fine-grained narrative understanding. 122 5.1 Fine-grained narrative understanding One key component of fine-grained narrative understanding is long-form summarization that relies on faithful description of media content and requires the following: • Multi-level tags: Similar to the visual scenes approach in Chapter 2, fine-grained tags include open-vocabulary object categories, character descriptions (appearance, IDs, etc.), locations along with broad tags like presentation styles, themes, etc. Based on the current advancements regarding large multimodal models, multiple tag prediction can be considered as a sequence-tosequence prediction task with video frames (from short snippets) considered as input sequences. • Tag-based description grounding: Current video-language models like VideoLLaVA [250], VideoLLama [251], Vid2Seq [252], Pllava [253] can generate short coherent generic descriptions of video snippets. The conversion of generic descriptions to content-specific short descriptions is possible by integrating the multi-level tags through an aggregation scheme via large language models. • Long-term linkage of descriptions: For enabling long-form video understanding, the shortsnippet descriptions should be combined into a coherent video-level description. Prior works that have considered language-model-based aggregation include LLoVi [254] and MM-Vid [255]. 
The above-mentioned steps for generating fine-grained narratives can be used to augment existing media-centered benchmarks associated with movies and advertisements. Based on the advertisement benchmark MM-AU curated in Chapter 3, a possible direction involves the extension into a long-term temporal video understanding benchmark along the lines of LV-Bench [256], CinePhile [257] and TemporalBench [258], since the existing benchmarks lack media-centered reasoning tasks from diverse domains. 123 5.2 Characterizing modality heterogeneity in multimodal learning In Chapter 4, we considered the simple case of the dual encoder multimodal model and showed how simple unimodal corruptions can adversely affect test-time performance. We explored parameterefficient strategies for mitigating the train and test-time gap through information-bottleneck guided objectives and low-rank updates of unimodal encoders. Based on our findings, certain directions can be considered as part of future research: • Role of unimodal encoders: Since the current dual-encoder model considered depends on the unimodal representations, future work should include understanding the role of unimodal encoder properties (reconstruction-based vs semantically aligned ) and relationship with test-time performance. • Consideration of novel modality combinations: Apart from the vision-text modalities, future work involves testing the behavior of audio-visual models for action recognition and audio-event detection tasks. • Generalization to more corruptions: In the current framework, we have shown the performance of the models under the limited family of corruptions. Future work involves the consideration of diverse modality-specific corruptions (audio, visual, and text) as mentioned in MMCBench [237]. • Alternative kernel-based bottleneck formulations: Due to the difficulty in estimating high dimensional mutual information, existing literature has considered kernel-based estimation of random variable dependence through Hilbert Schmidt criterion (HSIC) [219, 246] as an alternative for adversarial robustness. A potential extension of our current framework involves comparing HSIC and IB under the presence or absence of parameter-efficient updates within the existing multimodal setup. • Role of parameter efficient updates: While prior works [239] have addressed the robustness properties of parameter efficient adaptation methods for vision-language generative models, 124 there is a lack of understanding how these adaptation methods work for unimodal branches in a standard multimodal setting. While in this thesis, we have shown initial experiments regarding the role of LoRA (with and without IB) and how it impacts test time performance, future work involves the inclusion of other parameter-efficient methods under similar settings like adapters, prefix-tuning, compacters, etc. 125 Bibliography 1. Baltruaitis, T., Ahuja, C. & Morency, L.-P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 423–443 (2017). 2. Alain, C., Arnott, S. R., Hevenor, S. J., Graham, S. & Grady, C. L. “What” and “where” in the human auditory system. Proceedings of the National Academy of Sciences of the United States of America 98, 12301–12306 (2001). 3. Bornkessel-Schlesewsky, I., Schlesewsky, M., Small, S. L. & Rauschecker, J. P. Neurobiological roots of language in primate audition: common computational properties. Trends in Cognitive Sciences 19, 142–150 (2015). 4. Wallace, M. N., Johnston, P. W. 
& Palmer, A. R. Histochemical identification of cortical areas in the auditory region of the human brain. Experimental Brain Research 143, 499– 508 (2002). 5. Liang, P. P., Zadeh, A. & Morency, L.-P. Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions. ArXiv abs/2209.03430 (2022). 6. Gao, J., Li, P., Chen, Z. & Zhang, J. A Survey on Deep Learning for Multimodal Data Fusion. Neural Computation 32, 829–864 (2020). 7. Abdar, M., Kollati, M., Kuraparthi, S., Pourpanah, F., McDuff, D. J., Ghavamzadeh, M., et al. A Review of Deep Learning for Video Captioning. ArXiv abs/2304.11431 (2023). 8. Apostolidis, E., Adamantidou, E., Metsai, A. I., Mezaris, V. & Patras, I. Video Summarization Using Deep Neural Networks: A Survey. Proceedings of the IEEE 109, 1838–1863 (2021). 126 9. De Faria, A. C. A. M., de Castro Bastos, F., da Silva, J. V. N. A., Fabris, V. L., de Sousa Uchoa, V., de Aguiar Neto, D. G., et al. Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature 2023. 10. Calvert, G. A., Hansen, P. C., Iversen, S. & Brammer, M. J. Detection of Audio-Visual Integration Sites in Humans by Application of Electrophysiological Criteria to the BOLD Effect. NeuroImage 14, 427–438 (2001). 11. Driver, J. & Noesselt, T. Multisensory Interplay Reveals Crossmodal Influences on ‘SensorySpecific’ Brain Regions, Neural Responses, and Judgments. Neuron 57, 11–23 (2008). 12. Macaluso, E. & Driver, J. Multisensory spatial interactions: a window onto functional integration in the human brain. Trends in Neurosciences 28, 264–271 (2005). 13. Williams, P. L. & Beer, R. D. Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515 (2010). 14. Koh, J. Y., Salakhutdinov, R. & Fried, D. Grounding Language Models to Images for Multimodal Inputs and Outputs. ICML (2023). 15. Liu, H., Song, Y., Wang, X., Xiangru, Z., Li, Z., Song, W., et al. Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval. arXiv preprint arXiv:2403.13317 (2024). 16. Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., Agha, Z., et al. Automatic Understanding of Image and Video Advertisements. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1100–1110 (2017). 17. Marsh, E. E. & White, M. D. A taxonomy of relationships between images and text. J. Documentation 59, 647–672 (2003). 18. Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D. & Salakhutdinov, R. Gated-Attention Architectures for Task-Oriented Language Grounding. ArXiv abs/1706.07230 (2017). 19. Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is All you Need in Advances in Neural Information Processing Systems (eds Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., et al.) 30 (Curran Associates, Inc., 2017). 127 20. Gadzicki, K., Khamsehashari, R. & Zetzsche, C. Early vs Late Fusion in Multimodal Convolutional Neural Networks. 2020 IEEE 23rd International Conference on Information Fusion (FUSION), 1–6 (2020). 21. Barnum, G. M., Talukder, S. & Yue, Y. On the Benefits of Early Fusion in Multimodal Representation Learning. ArXiv abs/2011.07191 (2020). 22. Peng, C., He, J. & Xia, F. Learning on Multimodal Graphs: A Survey. ArXiv abs/2402.05322 (2024). 23. Teney, D., Liu, L. & van den Hengel, A. Graph-Structured Representations for Visual Question Answering. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3233–3241 (2016). 24. Zheng, J., Liang, M., Yu, Y., Du, J. 
& Xue, Z. Multimodal Knowledge Graph-Guided CrossModal Graph Network for Image-Text Retrieval in 2024 IEEE International Conference on Big Data and Smart Computing (BigComp) (2024), 97–100. 25. Wang, Z., You, H., Li, L. H., Zareian, A., Park, S., Liang, Y., et al. SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning in AAAI Conference on Artificial Intelligence (2021). 26. Rajagopalan, S. S., Morency, L.-P., Baltruaitis, T. & Gocke, R. ¨ Extending Long Short-Term Memory for Multi-View Structured Learning in European Conference on Computer Vision (2016). 27. Bisk, Y., Holtzman, A., Thomason, J., Andreas, J., Bengio, Y., Chai, J. Y., et al. Experience Grounds Language. ArXiv abs/2004.10151 (2020). 28. Luketina, J., Nardelli, N., Farquhar, G., Foerster, J. N., Andreas, J., Grefenstette, E., et al. A Survey of Reinforcement Learning Informed by Natural Language. ArXiv abs/1906.03926 (2019). 29. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L. & Girshick, R. B. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1988– 1997 (2016). 128 30. Amizadeh, S., Palangi, H., Polozov, A., Huang, Y. & Koishida, K. Neuro-symbolic visual reasoning: Disentangling in International Conference on Machine Learning (2020), 279– 290. 31. Zeng, A., Wong, A. S., Welker, S., Choromanski, K., Tombari, F., Purohit, A., et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language. ArXiv abs/2204.00598 (2022). 32. Suzuki, R., Yanaka, H., Yoshikawa, M., Mineshima, K. & Bekki, D. Multimodal Logical Inference System for Visual-Textual Entailment. ArXiv abs/1906.03952 (2019). 33. Chen, J., Tang, J., Qin, J., Liang, X., Liu, L., Xing, E. P., et al. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. arXiv preprint arXiv:2105.14517 (2021). 34. Yang, J., Chen, X., Jiang, M., Chen, S., Wang, L. & Zhao, Q. VisualHow: Multimodal problem solving in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022), 15627–15637. 35. Bollacker, K., Evans, C., Paritosh, P., Sturge, T. & Taylor, J. Freebase: a collaboratively created graph database for structuring human knowledge in Proceedings of the 2008 ACM SIGMOD international conference on Management of data (2008), 1247–1250. 36. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. & Ives, Z. Dbpedia: A nucleus for a web of open data in international semantic web conference (2007), 722–735. 37. Rebele, T., Suchanek, F., Hoffart, J., Biega, J., Kuzey, E. & Weikum, G. YAGO: A multilingual knowledge base from wikipedia, wordnet, and geonames in The Semantic Web–ISWC 2016: 15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part II 15 (2016), 177–185. 38. Fellbaum, C. in Theory and applications of ontology: computer applications 231–243 (Springer, 2010). 39. Zhu, X., Li, Z., Wang, X., Jiang, X., Sun, P., Wang, X., et al. Multi-modal knowledge graph construction and application: A survey. IEEE Transactions on Knowledge and Data Engineering 36, 715–735 (2022). 129 40. Huang, Q., Xiong, Y., Rao, A., Wang, J. & Lin, D. MovieNet: A Holistic Dataset for Movie Understanding in The European Conference on Computer Vision (ECCV) (2020). 41. Lei, J., Yu, L., Bansal, M. & Berg, T. L. Tvqa: Localized, compositional video question answering. arXiv preprint arXiv:1809.01696 (2018). 42. 
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. From recognition to cognition: Visual commonsense reasoning in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), 6720–6731. 43. Park, J. S., Bhagavatula, C., Mottaghi, R., Farhadi, A. & Choi, Y. Visualcomet: Reasoning about the dynamic context of a still image in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 (2020), 508–524. 44. Sharma, P., Ding, N., Goodman, S. & Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2018), 2556–2565. 45. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems 35, 25278–25294 (2022). 46. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., et al. Youtube8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016). 47. Somandepalli, K., Guha, T., Martinez, V. R., Kumar, N., Adam, H. & Narayanan, S. Computational Media Intelligence: Human-Centered Machine Analysis of Media. Proceedings of the IEEE 109, 891–910. doi:10.1109/JPROC.2020.3047978 (2021). 48. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. Learning Transferable Visual Models From Natural Language Supervision in International Conference on Machine Learning (2021). 49. Bar, M. Visual objects in context. Nature Reviews Neuroscience 5, 617–629 (2004). 50. Qiao, X., Zheng, Q., Cao, Y. & Lau, R. W. H. Object-Level Scene Context Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 5280–5292 (2021). 130 51. Bordwell, D. & Thompson, K. Film Art: An Introduction in (1979). 52. Xiao, J., Hays, J., Ehinger, K. A., Oliva, A. & Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 3485–3492 (2010). 53. Zhou, B., Lapedriza, A., Khosla, A., Oliva, A. & Torralba, A. Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017). 54. Bain, M., Nagrani, A., Brown, A. & Zisserman, A. Condensed Movies: Story Based Retrieval with Contextual Embeddings 2020. 55. Sadhu, A., Gupta, T., Yatskar, M., Nevatia, R. & Kembhavi, A. Visual Semantic Role Labeling for Video Understanding in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021). 56. Helm, D. & Kampel, M. Shot Boundary Detection for Automatic Video Analysis of Historical Films in New Trends in Image Analysis and Processing – ICIAP 2019 (eds Cristani, M., Prati, A., Lanz, O., Messelodi, S. & Sebe, N.) (Springer International Publishing, Cham, 2019), 137–147. 57. Diba, A., Fayyaz, M., Sharma, V., Paluri, M., Gall, J., Stiefelhagen, R., et al. Large Scale Holistic Video Understanding in Computer Vision – ECCV 2020 (eds Vedaldi, A., Bischof, H., Brox, T. & Frahm, J.-M.) (Springer International Publishing, Cham, 2020), 593–610. 58. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., et al. Learning Transferable Visual Models From Natural Language Supervision. CoRR abs/2103.00020 (2021). 59. Cascante-Bonilla, P., Sitaraman, K., Luo, M. & Ordonez, V. 
Moviescope: Large-scale Analysis of Movies using Multiple Modalities. ArXiv abs/1908.03180 (2019). 60. Quattoni, A. & Torralba, A. Recognizing indoor scenes in 2009 IEEE Conference on Computer Vision and Pattern Recognition (2009), 413–420. doi:10.1109/CVPR.2009.5206537. 61. Xiao, J., Ehinger, K. A., Hays, J., Torralba, A. & Oliva, A. SUN Database: Exploring a Large Collection of Scene Categories. International Journal of Computer Vision 119, 3–22. doi:10.1007/s11263-014-0748-y (2016). 131 62. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. The Kinetics Human Action Video Dataset. CoRR abs/1705.06950 (2017). 63. Fabian Caba Heilbron Victor Escorcia, B. G. & Niebles, J. C. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 961–970. 64. Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., et al. Ava: A video dataset of spatio-temporally localized atomic visual actions in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 6047–6056. 65. Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., et al. The ”something something” video database for learning and evaluating visual common sense. CoRR abs/1706.04261 (2017). 66. Ray, J., Wang, H., Tran, D., Wang, Y., Feiszli, M., Torresani, L., et al. Scenes-ObjectsActions: A Multi-task, Multi-label Video Dataset in Computer Vision – ECCV 2018 (Springer International Publishing, Cham, 2018), 660–676. 67. Marszałek, M., Laptev, I. & Schmid, C. Actions in Context in IEEE Conference on Computer Vision & Pattern Recognition (2009). 68. Vicol, P., Tapaswi, M., Castrejon, L. & Fidler, S. MovieGraphs: Towards Understanding Human-Centric Situations from Videos in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018). 69. Wu, C.-Y. & Krahenb ¨ uhl, P. ¨ Towards Long-Form Video Understanding in CVPR (2021). 70. Fei-Fei, L. & Perona, P. A Bayesian hierarchical model for learning natural scene categories in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) 2 (2005), 524–531 vol. 2. doi:10.1109/CVPR.2005.16. 71. Sariyildiz, M. B., Perez, J. & Larlus, D. Learning Visual Representations with Caption Annotations in European Conference on Computer Vision (ECCV) (2020). 72. Desai, K. & Johnson, J. VirTex: Learning Visual Representations from Textual Annotations in CVPR (2021). 132 73. Shen, S., Li, L. H., Tan, H., Bansal, M., Rohrbach, A., Chang, K.-W., et al. How Much Can CLIP Benefit Vision-and-Language Tasks? ArXiv abs/2107.06383 (2021). 74. Gu, X., Lin, T.-Y., Kuo, W. & Cui, Y. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation in (2021). 75. Li, B., Weinberger, K. Q., Belongie, S. J., Koltun, V. & Ranftl, R. Language-driven Semantic Segmentation. ArXiv abs/2201.03546 (2022). 76. Cour, T., Jordan, C., Miltsakaki, E. & Taskar, B. Movie/Script: Alignment and Parsing of Video and Text Transcription in Computer Vision – ECCV 2008 (eds Forsyth, D., Torr, P. & Zisserman, A.) (Springer Berlin Heidelberg, Berlin, Heidelberg, 2008), 158–171. 77. Everingham, M., Sivic, J. & Zisserman, A. Hello! My name is... Buffy” – Automatic Naming of Characters in TV Video in BMVC (2006). 78. Laptev, I., Marszalek, M., Schmid, C. & Rozenfeld, B. Learning realistic human actions from movies in 2008 IEEE Conference on Computer Vision and Pattern Recognition (2008), 1–8. 
doi:10.1109/CVPR.2008.4587756. 79. Rohrbach, A., Rohrbach, M., Tandon, N. & Schiele, B. A dataset for Movie Description. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3202–3212 (2015). 80. Reimers, N. & Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERTNetworks in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2019). 81. Frey, B. J. & Dueck, D. Clustering by Passing Messages Between Data Points. Science 315, 972–976. doi:10.1126/science.1136800 (2007). 82. Van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008). 83. Parkhi, O. M., Vedaldi, A., Zisserman, A. & Jawahar, C. V. Cats and Dogs in IEEE Conference on Computer Vision and Pattern Recognition (2012). 133 84. Bossard, L., Guillaumin, M. & Van Gool, L. Food-101 – Mining Discriminative Components with Random Forests in European Conference on Computer Vision (2014). 85. Krause, J., Stark, M., Deng, J. & Fei-Fei, L. 3D Object Representations for Fine-Grained Categorization in 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13) (Sydney, Australia, 2013). 86. Barsoum, E., Zhang, C., Canton Ferrer, C. & Zhang, Z. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution in ACM International Conference on Multimodal Interaction (ICMI) (2016). 87. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (2009). 88. Fei-Fei, L., Fergus, R. & Perona, P. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. Computer Vision and Pattern Recognition Workshop (2004). 89. Carreira, J., Noland, E., Hillier, C. & Zisserman, A. A Short Note on the Kinetics-700 Human Action Dataset. CoRR abs/1907.06987 (2019). 90. Soomro, K., Zamir, A. R. & Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR abs/1212.0402 (2012). 91. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. Language Models are Few-Shot Learners in Advances in Neural Information Processing Systems 33 (Curran Associates, Inc., 2020), 1877–1901. 92. Rao, A., Wang, J., Xu, L., Jiang, X., Huang, Q., Zhou, B., et al. A Unified Framework for Shot Type Classification Based on Subject Centric Lens in The European Conference on Computer Vision (ECCV) (2020). 93. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural computation 9, 1735– 1780 (1997). 134 94. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021). 95. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E. & Garnett, R.) 8024–8035 (Curran Associates, Inc., 2019). ´ 96. He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385 (2015). 97. Carreira, J. & Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset in (2017), 4724–4733. doi:10.1109/CVPR.2017.502. 98. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y. & Paluri, M. 
A closer look at spatiotemporal convolutions for action recognition in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (2018), 6450–6459. 99. Feichtenhofer, C., Fan, H., Malik, J. & He, K. Slowfast networks for video recognition in Proceedings of the IEEE international conference on computer vision (2019), 6202–6211. 100. Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2015). 101. Bertasius, G., Wang, H. & Torresani, L. Is Space-Time Attention All You Need for Video Understanding? ArXiv abs/2102.05095 (2021). 102. Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., et al. Video Swin Transformer. arXiv preprint arXiv:2106.13230 (2021). 103. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint arXiv:2103.14030 (2021). 104. Loshchilov, I. & Hutter, F. Decoupled Weight Decay Regularization in ICLR (2019). 105. Tran, D., Bourdev, L., Fergus, R., Torresani, L. & Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks (2014). 135 106. Pardun, J. C. Advertising and Society : an Introduction (John Wiley & Sons, Inc., New York, NY, USA, 2013). 107. Guo, M., Hwa, R. & Kovashka, A. Detecting persuasive atypicality by modeling contextual compatibility in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), 972–982. 108. Liebhart, K. & Bernhardt, P. Political storytelling on Instagram: Key aspects of Alexander Van der Bellen’s successful 2016 presidential election campaign. Media and Communication 5, 15–25 (2017). 109. Hollis, N. Why Good Advertising Works (Even When You Think It Doesn’t) https://www. theatlantic.com/business/archive/2011/08/why- good- advertising- workseven-when-you-think-it-doesnt/244252/. 2011. 110. Kim, E. A., Ratneshwar, S. & Thorson, E. Why Narrative Ads Work: An Integrated Process Explanation. Journal of Advertising 46, 283–296 (2017). 111. Mick, D. Toward a Semiotic of Advertising Story Grammars in (1987). 112. Leong, S. M., Ang, S. H. & Heng, L. Using Drama to Persuade: the Effects of Involvement and Ad Form on Persuasion. ACR Asia-Pacific Advances (1994). 113. Puto, C. P. & Wells, W. D. Informational and Transformational Advertising: the Differential Effects of Time. ACR North American Advances (1984). 114. Google. Diversity and inclusion in advertisement videos 115. Veer, E. & Pervan, S. How the tone and wording of advertisements interact. International Journal of Advertising 27, 191–207 (2008). 116. Brooks, M. E., Craig, C. M. & Bichard, S. L. Exploring Ads of the World: How Social Issues Are Framed in Global Advertisements. Howard Journal of Communications 31, 150–170 (2020). 117. Fisher, W. R. Human Communication As Narration: Toward a Philosophy of Reason, Value and Action in (1987). 136 118. Escalas, J. E. ADVERTISING NARRATIVES: What are they and how do they work? in (1998). 119. Lien, N.-H. & Chen, Y.-L. Narrative ads: The effect of argument strength and story format. Journal of Business Research 66, 516–522 (2013). 120. Li, B. A., Cardier, B., Wang, T. & Metze, F. Annotating High-Level Structures of Short Stories and Personal Anecdotes. ArXiv abs/1710.06917 (2017). 121. Ouyang, J. & McKeown, K. Modeling Reportable Events as Turning Points in Narrative in Conference on Empirical Methods in Natural Language Processing (2015). 122. Boyd, R. L., Blackburn, K. G. & Pennebaker, J. W. The narrative arc: Revealing core narrative structures through text analysis. 
Science Advances 6 (2020). 123. Holbrook, M. B. & O’Shaughnessy, J. J. The role of emotion in advertising. Psychology & Marketing 1, 45–64 (1984). 124. Micu, A. C. & Plummer, J. T. Measurable Emotions: How Television Ads Really Work. Journal of Advertising Research 50, 137–153 (2010). 125. McDuff, D. J., Kaliouby, R. E., Cohn, J. F. & Picard, R. W. Predicting Ad Liking and Purchase Intent: Large-Scale Analysis of Facial Responses to Ads. IEEE Transactions on Affective Computing 6, 223–235 (2014). 126. Teixeira, T. S., Picard, R. W. & Kaliouby, R. E. Why, When, and How Much to Entertain Consumers in Advertisements? A Web-Based Facial Tracking Field Study. Mark. Sci. 33, 809–827 (2014). 127. Shukla, A., Gullapuram, S. S., Katti, H., Yadati, K., Kankanhalli, M. & Ramanathan, S. Affect Recognition in Ads with Application to Computational Advertising. Proceedings of the 25th ACM international conference on Multimedia (2017). 128. Shukla, A., Gullapuram, S. S., Katti, H., Yadati, K., Kankanhalli, M. & Ramanathan, S. Evaluating content-centric vs. user-centric ad affect recognition. Proceedings of the 19th ACM International Conference on Multimodal Interaction (2017). 137 129. Shukla, A., Gullapuram, S. S., Katti, H., Kankanhalli, M., Winkler, S. & Ramanathan, S. Recognition of Advertisement Emotions With Application to Computational Advertising. IEEE Transactions on Affective Computing 13, 781–792 (2019). 130. Koelstra, S., Muhl, C., Soleymani, M., Lee, J.-S., Yazdani, A., Ebrahimi, T., ¨ et al. DEAP: A Database for Emotion Analysis ;Using Physiological Signals. IEEE Transactions on Affective Computing 3, 18–31 (2012). 131. Jiang, Y.-G., Xu, B. & Xue, X. Predicting Emotions in User-Generated Videos in AAAI Conference on Artificial Intelligence (2014). 132. Douglas-Cowie, E., Cowie, R., Sneddon, I., Cox, C., Lowry, O., McRorie, M., et al. The HUMAINE Database: Addressing the Collection and Annotation of Naturalistic and Induced Emotional Data in Affective Computing and Intelligent Interaction (2007). 133. Zlatintsi, A., Koutras, P., Evangelopoulos, G., Malandrakis, N., Efthymiou, N., Pastra, K., et al. COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization. EURASIP Journal on Image and Video Processing 2017, 1–24 (2017). 134. Baveye, Y., Dellandrea, E., Chamaret, C. & Chen, L. L. LIRIS-ACCEDE: A Video Database ´ for Affective Content Analysis. IEEE Transactions on Affective Computing 6, 43–55 (2015). 135. Sun, J. J., Liu, T., Cowen, A. S., Schroff, F., Adam, H. & Prasad, G. EEV Dataset: Predicting Expressions Evoked by Diverse Videos. ArXiv abs/2001.05488 (2020). 136. Ong, D. C., Wu, Z., Tan, Z.-X., Reddan, M. C., Kahhale, I., Mattek, A., ´ et al. Modeling Emotion in Complex Stories: The Stanford Emotional Narratives Dataset. IEEE Transactions on Affective Computing 12, 579–594 (2019). 137. Ye, K., Buettner, K. & Kovashka, A. Story Understanding in Video Advertisements in British Machine Vision Conference (2018). 138. Soldan, M., Pardo, A., Alcazar, J. L., Caba, F., Zhao, C., Giancola, S., ´ et al. MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022), 5026–5035. 138 139. Pardo, A., Heilbron, F. C., Alc’azar, J. L., Thabet, A. K. & Ghanem, B. MovieCuts: A New Dataset and Benchmark for Cut Type Recognition in European Conference on Computer Vision (2021). 140. 
Bose, D., Hebbar, R., Somandepalli, K., Zhang, H., Cui, Y., Cole-McLaughlin, K., et al. MovieCLIP: Visual Scene Recognition in Movies in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (2023), 2083–2092. 141. Hebbar, R., Bose, D., Somandepalli, K., Vijai, V. & Narayanan, S. S. A dataset for AudioVisual Sound Event Detection in Movies. ArXiv abs/2302.07315 (2023). 142. Singla, Y. K., Jha, R. A., Gupta, A., Aggarwal, M., Garg, A., Bhardwaj, A., et al. Persuasion Strategies in Advertisements: Dataset, Modeling, and Baselines. ArXiv abs/2208.09626 (2022). 143. Jiang, J., Li, Z., Xiong, J., Quan, R., Lu, Q. & Liu, W. Tencent AVS: A Holistic Ads Video Dataset for Multi-Modal Scene Segmentation. IEEE Access 10, 128959–128969 (2022). 144. Zhang, Z., Hou, X., Niu, K., Huang, Z., Ge, T., Jiang, Y., et al. Attract me to Buy: Advertisement Copywriting Generation with Multimodal Multi-structured Information. ArXiv abs/2205.03534 (2022). 145. Cannes. Cannes Lions 146. Carreira, J. & Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4724–4733 (2017). 147. Monfort, M., Zhou, B., Bargal, S. A., Andonian, A., Yan, T., Ramakrishnan, K., et al. Moments in Time Dataset: One Million Videos for Event Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 502–508 (2018). 148. Heilbron, F. C., Escorcia, V., Ghanem, B. & Niebles, J. C. ActivityNet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 961–970 (2015). 139 149. Gu, C., Sun, C., Vijayanarasimhan, S., Pantofaru, C., Ross, D. A., Toderici, G., et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6047–6056 (2017). 150. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, A., Toderici, G., Varadarajan, B., et al. YouTube-8M: A Large-Scale Video Classification Benchmark. ArXiv abs/1609.08675 (2016). 151. Diba, A., Fayyaz, M., Sharma, V., Paluri, M., Gall, J., Stiefelhagen, R., et al. Large scale holistic video understanding in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16 (2020), 593–610. 152. Gupta, V., Mittal, T., Mathur, P., Mishra, V., Maheshwari, M., Bera, A., et al. 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21032–21043 (2022). 153. Vedula, N., Sun, W., Lee, H., Gupta, H., Ogihara, M., Johnson, J., et al. Multimodal Content Analysis for Effective Advertisements on YouTube. 2017 IEEE International Conference on Data Mining (ICDM), 1123–1128 (2017). 154. Somandepalli, K., Martinez, V., Kumar, N. & Narayanan, S. Multimodal Representation of Advertisements Using Segment-Level Autoencoders in (Association for Computing Machinery, Boulder, CO, USA, 2018), 418–422. doi:10.1145/3242969.3243026. 155. Ye, K., Nazari, N. H., Hahn, J., Hussain, Z., Zhang, M. & Kovashka, A. Interpreting the Rhetoric of Visual Advertisements. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 1308–1323 (2019). 156. Jaegle, A., Borgeaud, S., Alayrac, J.-B., Doersch, C., Ionescu, C., Ding, D., et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs. ArXiv abs/2107.14795 (2021). 157. 
Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C. & Sun, C. Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems 34, 14200–14213 (2021). 158. Akbari, H., Yuan, L., Qian, R., Chuang, W.-H., Chang, S.-F., Cui, Y., et al. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text in Neural Information Processing Systems (2021). 140 159. Zhang, H., Luo, Y., Ai, Q., Wen, Y. & Hu, H. Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning in Proceedings of the 28th ACM International Conference on Multimedia (Association for Computing Machinery, Seattle, WA, USA, 2020), 430–438. doi:10.1145/3394171.3413582. 160. Lin, R., Xiao, J. & Fan, J. NeXtVLAD: An Efficient Neural Network to Aggregate Framelevel Features for Large-scale Video Classification in ECCV Workshops (2018). 161. Weng, Z., Meng, L., Wang, R., Wu, Z. & Jiang, Y.-G. A Multimodal Framework for Video Ads Understanding. Proceedings of the 29th ACM International Conference on Multimedia (2021). 162. Ciment, J. Social Issues in America: An Encyclopedia (M.E. Sharpe, Armonk, NY, 2006). 163. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. ArXiv abs/2212.04356 (2022). 164. OpenAI. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023). 165. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805 (2019). 166. Gong, Y., Chung, Y.-A. & Glass, J. AST: Audio Spectrogram Transformer in Proc. Interspeech 2021 (2021), 571–575. doi:10.21437/Interspeech.2021-698. 167. Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., et al. Audio Set: An ontology and human-labeled dataset for audio events. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776–780 (2017). 168. Paszke, A., Gross, S., Massa, F., et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. ArXiv abs/1912.01703 (2019). 169. Zhao, W. X., Zhou, K., Li, J., et al. A Survey of Large Language Models. ArXiv abs/2303.18223 (2023). 141 170. Iyer, S., Lin, X., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. ArXiv abs/2212.12017 (2022). 171. Chung, H. W., Hou, L., Longpre, S., et al. Scaling Instruction-Finetuned Language Models. ArXiv abs/2210.11416 (2022). 172. Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., et al. Stanford Alpaca: An Instruction-following LLaMA model https://github.com/tatsu-lab/stanford_alpaca. 2023. 173. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H.-h., Xia, F., et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv abs/2201.11903 (2022). 174. Li, C., Gan, Z., Yang, Z., Yang, J., Li, L., Wang, L., et al. Multimodal Foundation Models: From Specialists to General-Purpose Assistants. Found. Trends Comput. Graph. Vis. 16, 1– 214 (2023). 175. Bommasani, R., Hudson, D. A., Adeli, E. & et.al. On the Opportunities and Risks of Foundation Models. ArXiv abs/2108.07258 (2021). 176. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding in NAACL (2019). 177. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. & Sutskever, I. Language Models are Unsupervised Multitask Learners (2019). 178. 
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv [Preprint](2019). arXiv preprint arXiv:1907.11692 (1907). 179. He, P., Liu, X., Gao, J. & Chen, W. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654 (2020). 180. Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., et al. Scaling up visual and vision-language representation learning with noisy text supervision in International conference on machine learning (2021), 4904–4916. 142 181. Bao, H., Dong, L., Piao, S. & Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021). 182. OpenAI. ChatGPT: A Conversational AI Accessed via OpenAI API at https://openai.com/chatgpt. 2024. 183. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024). 184. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023). 185. Li, J., Li, D., Savarese, S. & Hoi, S. C. H. BLIP-2: Bootstrapping Language-Image Pretraining with Frozen Image Encoders and Large Language Models. ArXiv abs/2301.12597 (2023). 186. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35, 23716–23736 (2022). 187. Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Advances in neural information processing systems 36 (2024). 188. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., et al. Qwen-vl: A frontier large visionlanguage model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023). 189. Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024). 190. Shen, Y., Song, K., Tan, X., Li, D., Lu, W. & Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems 36 (2024). 191. Brade, S., Wang, B., Sousa, M., Oore, S. & Grossman, T. Promptify: Text-to-image generation through interactive prompt exploration with large language models in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (2023), 1–14. 143 192. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021). 193. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., et al. Qwen2-VL: Enhancing VisionLanguage Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191 (2024). 194. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (2023). 195. Zhao, Y., Huang, J., Hu, J., Wang, X., Mao, Y., Zhang, D., et al. SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning. ArXiv abs/2408.05517 (2024). 196. Zhang, Q., Wei, Y., Han, Z., Fu, H., Peng, X., Deng, C., et al. Multimodal fusion on lowquality data: A comprehensive survey. arXiv preprint arXiv:2404.18947 (2024). 197. Zhang, D., Shen, D., Initiative, A. D. N., et al. 
Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer’s disease. NeuroImage 59, 895–907 (2012). 198. Liu, F., Wee, C.-Y., Chen, H. & Shen, D. Inter-modality relationship constrained multimodality multi-task feature selection for Alzheimer’s Disease and mild cognitive impairment identification. NeuroImage 84, 466–475 (2014). 199. Wang, W., Tran, D. & Feiszli, M. What Makes Training Multi-Modal Networks Hard? CoRR abs/1905.12681 (2019). 200. Peng, X., Wei, Y., Deng, A., Wang, D. & Hu, D. Balanced Multimodal Learning via On-thefly Gradient Modulation in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022). 201. Huang, Z., Niu, G., Liu, X., Ding, W., Xiao, X., Wu, H., et al. Learning with noisy correspondence for cross-modal matching. Advances in Neural Information Processing Systems 34, 29406–29419 (2021). 144 202. Zhang, Q., Wu, H., Zhang, C., Hu, Q., Fu, H., Zhou, J. T., et al. Provable dynamic fusion for low-quality multimodal data in International conference on machine learning (2023), 41753–41769. 203. Han, Z., Zhang, C., Fu, H. & Zhou, J. T. Trusted multi-view classification with dynamic evidential fusion. IEEE transactions on pattern analysis and machine intelligence 45, 2551– 2566 (2022). 204. Wang, W.-W., Shui, P.-L. & Feng, X.-C. Variational models for fusion and denoising of multifocus images. IEEE Signal Processing Letters 15, 65–68 (2008). 205. Kumar, M. & Dass, S. A total variation-based algorithm for pixel-level image fusion. IEEE Transactions on Image Processing 18, 2137–2143 (2009). 206. Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., et al. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904 (2022). 207. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., et al. Parameter-efficient transfer learning for NLP in International conference on machine learning (2019), 2790–2799. 208. Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021). 209. Zaken, E. B., Ravfogel, S. & Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199 (2021). 210. He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T. & Neubig, G. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 (2021). 211. Zanella, M. & Ben Ayed, I. Low-Rank Few-Shot Adaptation of Vision-Language Models in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2024), 1593–1603. 212. Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., et al. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379 (2023). 145 213. Gandikota, R., Materzynska, J., Zhou, T., Torralba, A. & Bau, D. ´ Concept sliders: Lora adaptors for precise control in diffusion models in European Conference on Computer Vision (2025), 172–188. 214. Tishby, N. & Zaslavsky, N. Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW), 1–5 (2015). 215. Hu, S., Lou, Z., Yan, X. & Ye, Y. A Survey on Information Bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024). 216. Alemi, A. A., Fischer, I., Dillon, J. V. & Murphy, K. Deep variational information bottleneck. 
arXiv preprint arXiv:1612.00410 (2016). 217. Fischer, I. The conditional entropy bottleneck. Entropy 22, 999 (2020). 218. An, S., Jammalamadaka, N. & Chong, E. Maximum entropy information bottleneck for uncertainty-aware stochastic embedding in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023), 3809–3818. 219. Wang, Z., Jian, T., Masoomi, A., Ioannidis, S. & Dy, J. Revisiting hilbert-schmidt information bottleneck for adversarial robustness. Advances in Neural Information Processing Systems 34, 586–597 (2021). 220. Mahabadi, R. K., Belinkov, Y. & Henderson, J. Variational Information Bottleneck for Effective Low-Resource Fine-Tuning. ArXiv abs/2106.05469 (2021). 221. Wang, B., Wang, S., Cheng, Y., Gan, Z., Jia, R., Li, B., et al. InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective in International Conference on Learning Representations (2021). 222. Zhang, C., Zhou, X., Wan, Y., Zheng, X., Chang, K.-W. & Hsieh, C.-J. Improving the adversarial robustness of NLP models by information bottleneck. arXiv preprint arXiv:2206.05511 (2022). 223. Mai, S., Zeng, Y. & Hu, H. Multimodal information bottleneck: Learning minimal sufficient unimodal and multimodal representations. IEEE Transactions on Multimedia 25, 4121– 4134 (2022). 146 224. Jiang, J., Liu, Z.-y. & Zheng, N. Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering. International Journal of Computer Vision (2022). 225. Gao, Z., Jiang, X., Xu, X., Shen, F., Li, Y. & Shen, H. T. Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024), 26876–26885. 226. Zhu, Z., Cheng, X., Chen, Z., Chen, Y., Zhang, Y., Wu, X., et al. InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing in Proceedings of the 32nd ACM International Conference on Multimedia (2024), 515–524. 227. Lee, C. & Van der Schaar, M. A variational information bottleneck approach to multi-omics data integration in International Conference on Artificial Intelligence and Statistics (2021), 1513–1521. 228. Hendrycks, D. & Dietterich, T. G. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697 (2018). 229. Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J. & Song, D. Natural Adversarial Examples. CVPR (2021). 230. Zhao, B., Yu, S., Ma, W., Yu, M., Mei, S., Wang, A., et al. Ood-cv: A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images in European conference on computer vision (2022), 163–180. 231. Michaelis, C., Mitzkus, B., Geirhos, R., Rusak, E., Bringmann, O., Ecker, A. S., et al. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484 (2019). 232. Moradi, M. & Samwald, M. Evaluating the robustness of neural language models to input perturbations. arXiv preprint arXiv:2108.12237 (2021). 233. Yuan, L., Chen, Y., Cui, G., Gao, H., Zou, F., Cheng, X., et al. Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and LLMs evaluations. Advances in Neural Information Processing Systems 36, 58478–58507 (2023). 147 234. Shreya, G. & Khapra, M. M. A survey in adversarial defences and robustness in nlp. arXiv preprint arXiv:2203.06414 (2022). 235. Bailey, L., Ong, E., Russell, S. & Emmons, S. 
Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv:2309.00236 (2023). 236. Qiu, J., Zhu, Y., Shi, X., Wenzel, F., Tang, Z., Zhao, D., et al. Benchmarking robustness of multimodal image-text models under distribution shift. Journal of Data-centric Machine Learning Research (2023). 237. Zhang, J., Pang, T., Du, C., Ren, Y., Li, B. & Lin, M. Benchmarking large multimodal models against common corruptions. arXiv preprint arXiv:2401.11943 (2024). 238. Schiappa, M., Vyas, S., Palangi, H., Rawat, Y. & Vineet, V. Robustness analysis of videolanguage models against visual and language perturbations. Advances in Neural Information Processing Systems 35, 34405–34420 (2022). 239. Chen, S., Gu, J., Han, Z., Ma, Y., Torr, P. & Tresp, V. Benchmarking robustness of adaptation methods on pre-trained vision-language models. Advances in Neural Information Processing Systems 36 (2024). 240. Arevalo, J., Solorio, T., Montes-y-Gomez, M. & Gonz ´ alez, F. A. Gated multimodal units ´ for information fusion. arXiv preprint arXiv:1702.01992 (2017). 241. Niu, T., Zhu, S., Pang, L. & El Saddik, A. Sentiment analysis on multi-view social data in MultiMedia Modeling: 22nd International Conference, MMM 2016, Miami, FL, USA, January 4-6, 2016, Proceedings, Part II 22 (2016), 15–27. 242. Harper, F. M. & Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 1–19 (2015). 243. Kingma, D. P. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013). 244. Cai, Y., Wu, Z., Jia, J. & Meng, H. LoRA-MER: Low-Rank Adaptation of Pre-Trained Speech Models for Multimodal Emotion Recognition Using Mutual Information in Proc. Interspeech 2024 (2024), 4658–4662. 148 245. Jiao, P., Zhu, B., Chen, J., Ngo, C.-W. & Jiang, Y.-G. Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning. arXiv preprint arXiv:2411.12787 (2024). 246. Ma, W.-D. K., Lewis, J. & Kleijn, W. B. The HSIC bottleneck: Deep learning without backpropagation in Proceedings of the AAAI conference on artificial intelligence 34 (2020), 5085–5092. 247. Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., et al. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353 (2024). 248. Kopiczko, D. J., Blankevoort, T. & Asano, Y. M. Vera: Vector-based random matrix adaptation. arXiv preprint arXiv:2310.11454 (2023). 249. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., et al. Imagebind: One embedding space to bind them all in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023), 15180–15190. 250. Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., et al. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122 (2023). 251. Zhang, H., Li, X. & Bing, L. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858 (2023). 252. Yang, A., Nagrani, A., Seo, P. H., Miech, A., Pont-Tuset, J., Laptev, I., et al. Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning in CVPR (2023). 253. Xu, L., Zhao, Y., Zhou, D., Lin, Z., Ng, S. K. & Feng, J. Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994 (2024). 254. Zhang, C., Lu, T., Islam, M. M., Wang, Z., Yu, S., Bansal, M., et al. 
A simple llm framework for long-range video question-answering. arXiv preprint arXiv:2312.17235 (2023). 255. Lin, K., Ahmed, F., Li, L., Lin, C.-C., Azarnasab, E., Yang, Z., et al. Mm-vid: Advancing video understanding with gpt-4v (ision). arXiv preprint arXiv:2310.19773 (2023). 149 256. Wang, W., He, Z., Hong, W., Cheng, Y., Zhang, X., Qi, J., et al. Lvbench: An extreme long video understanding benchmark. arXiv preprint arXiv:2406.08035 (2024). 257. Rawal, R., Saifullah, K., Farre, M., Basri, R., Jacobs, D., Somepalli, G., ´ et al. Cinepile: A long video question answering dataset and benchmark. arXiv preprint arXiv:2405.08813 (2024). 258. Cai, M., Tan, R., Zhang, J., Zou, B., Zhang, K., Yao, F., et al. TemporalBench: Towards Finegrained Temporal Understanding for Multimodal Video Models. arXiv preprint arXiv:2410.10818 (2024). 259. Wolf, T., Debut, L., Sanh, V., Chaumond, J., et al. Transformers: State-of-the-Art Natural Language Processing in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (Association for Computational Linguistics, Online, 2020), 38–45. 150 Appendices C MovieCLIP dataset C.1 Scene classes distribution wrt sources Here Movie Slugline refers to the set of labels obtained exclusively from sluglines in movie scripts. Common Label refers to the set of common labels between taxonomy considered in HVU [57] and Movie Slugline. Here HVU [57] refers to the set of labels obtained exclusively from the taxonomy used for curating HVU [57] dataset. Human expert refers to the set of labels added by the human expert during the taxonomy refinement procedure. • Movie Slugline: tent, computer room, truck, study, gas station, cafe, shuttle, courthouse, elevator, tower, dorm, station, club, lobby, mall, salon, prison, bus, stairs, theater, car, booth, locker room, hangar, closet, farmhouse, post office, townhouse, ship, loft, yard, zoo, funeral, art gallery, castle, subway, lounge, train, morgue, museum, wagon, manor, mansion, library, pool, cellar, cab, safe house, classroom, helicopter, police station, courtroom, city hall, fire station, corridor, control room, airport, cabin, war room, plane, press room, cottage, residence, penthouse, inn, church, suburban, interrogation room, conference room • Common label: tunnel, bakery, shack, building, baseball field, hotel, desert, factory, bathroom, downtown, restaurant, village, playground, boxing ring, gym, bridge, beach, workshop, cave, clinic, arena, garden, stage, office, attic, bowling alley, apartment, deck, cockpit, dining room, basketball court, grove, ballroom, forest, house, barn, alley, park, bay, golf course, chapel, home, parking, bar, kitchen, school, swamp, basement, walkway, bedroom, 151 garage, lake, bank, living room, room, auditorium, street, valley, casino, hall, waterfall, warehouse, tennis court, farm, hospital, palace, estate, river • HVU: archaeological site, shore, batting cage, animal shelter, plaza, hot spring, harbor, bullring, sandbank, town, mountain, retail, courtyard, sea, road, shooting range, pond, stadium, foundry, skyline, amusement park, market, laboratory, race track, kindergarten, ice rink • Human expert: agriculture field, makeup studio, grassland, construction site, graveyard, automotive repair, overpass, studio, boat, fair, balcony, battlefield, banquet, phone booth, concert hall, meadow D MM-AU Benchmark D.1 Topic categories We provide the mapping between Cannes(CC) [145], Ads of the World (AOW)1 and Video-Ads (VA) [16] coding schemes for 
obtaining the final set of topic categories as follows: • Games: Games and toys [VA]; Gaming [AOW] • Household: Household: Home Appliances, Furnishing [CC]; Cleaning products, Home improvements and repairs, Home appliances [VA] • Services: Other services i.e. dating, tax, legal, loan, religious, printing, catering, etc. [VA]; Professional Services [AOW]. • Misc: Miscellaneous, Business equipment and services [CC]; Petfood, Political candidates (Politics) [VA]; Pets [AOW] • Sports: Sports equipment and activities [VA]; Sports [AOW] 1https://www.adsoftheworld.com/ 152 • Banking: Banking and services [CC]; Financial services [VA]; Finance [AOW] • Clothing: Clothing, Footwear & Accessories [CC]; Clothing and accessories [VA]; Personal Accessories [AOW] • Industrial and agriculture: Industrial, Agriculture Public Interest, Agriculture Professional Services [AOW] • Leisure: Entertainment & Leisure [CC]; Gambling (lotteries, casinos, etc.) [VA]; Recreation, Gambling [AOW] • Publications & media: Media & Publications [CC]; Media and arts [VA]; TV Promos, Music, Media, Movies [AOW] • Health: Healthcare & Pharmacy [CC]; Health care and medications [VA]; Health, Pharmaceutical [AOW] • Car: Cars & Automotive Products & Services [CC]; Car [VA]; Automotive [AOW] • Electronics: Home electronics and audio-visual [CC]; Electronics, Phone, TV and internet service providers [VA]; Electronics [AOW] • Cosmetics: Cosmetics & Toiletries [CC]; Beauty products and cosmetics, Baby products [VA]; Beauty [AOW] • Food and drink: Savoury Foods, Sweet Foods & Snacks, Non Alcoholic drinks, Alcoholic drinks [CC]; Chocolate, Chips, Seasoning, Coffee, Soda, juice, milk, energy drinks, water, Alcohol [VA]; Food, Non-Alcoholic Drinks, Confectionery, Alcoholic drinks [AOW] • Awareness: Charities and non-profit [CC]; Environment, Animal rights, Human rights, Safety, Smoking, Alcohol Abuse, Domestic Violence, Self-esteem, cyberbullying [VA]; Education, Agency Self-Promo [AOW] 153 • Travel and transport: Travel & Transport [CC]; Vacation and travel [VA]; Transport, Hospitality [AOW] • Retail: Retail & e-commerce [CC]; Shopping (department stores, drug stores, groceries, etc.) [VA]; Retail Services [AOW] The taxonomy sources are listed within [.] for respective subcategories for the final list of topic categories. E MM-AU Experiments E.1 Language based reasoning We investigate the zero-shot performance of several large language models i.e. GPT-4[164], Opt-IML [170], Flan-T5 (XXL,XL,L) [171] and Alpaca [172] on the benchmark tasks associated with MM-AU dataset. For zero-shot evaluation, we report the results on 1670 non-empty transcripts out of the test split of 1692 samples. E.1.1 Flan-T5: For Flan-T5, we use the following prompts for the social message (SM), tone transition (TT), topic categorization(Topic) tasks: • TT: <Text from transcript> Based on the given text transcript from the advertisement, determine if the advertisement has any transitions in tones. OPTIONS: - Transition - No transition ANSWER: 154 • SM: <Text from transcript> An advertisement video has a social message if it provides awareness about any social issue. Examples of social issues: gender equality, drug abuse, police brutality, workplace harassment, domestic violence, child labor, environmental damage, homelessness, hate crimes, racial inequality etc. Based on the given text transcript, determine if the advertisement has any social message. 
OPTIONS: - Yes - No ANSWER: • Topic: <Text from transcript> Associate a single topic label with the transcript from the given set: OPTIONS: - Games - Household - Services - Sports - Banking - Clothing - Industrial and agriculture - Leisure - Publications media - Health - Car 155 - Electronics - Cosmetics - Food and drink - Awareness - Travel and transport - Retail ANSWER: E.1.2 OPT: For OPT, we use the following prompt templates for different tasks: • TT: Instruction: In this task, you are given a transcription of an advertisement, determine if the advertisement has any transitions in tones. Transcription: <Text from transcript> OPTIONS: - Transition - No transition Answer: • SM: In this task, you are given a transcription of an advertisement. An advertisement video has a social message if it provides awareness about any social issue. Example of social issues: gender equality, drug abuse, police brutality, workplace harassment, domestic violence, child labor, environmental damage, homelessness, hate crimes, racial inequality etc. Your task is to give label ”Yes” if the advertisement given has any social message, otherwise give label ”No”. Transcription: <Text from transcript> Answer: 156 • Topic: In this task, you are given a transcription of an advertisement. Your task is to associate a single topic label with the transcript from the given set. Transcription: <Text from transcript> OPTIONS: - Games - Household - Services - Sports - Banking - Clothing - Industrial and agriculture - Leisure - Publications media - Health - Car - Electronics - Cosmetics - Food and drink - Awareness - Travel and transport - Retail Answer: 157 E.1.3 alpaca: For alpaca, we use the following prompt templates for different tasks: • TT: Instruction: In this task, you are given a transcription of an advertisement determine if the advertisement has any transitions in tones. Transcription: <Text from transcript> Options: - Transition - No transition Answer: • SM: Instruction: In this task, you are given a transcription of an advertisement. An advertisement video has a social message if it provides awareness about any social issue. Example of social issues: gender equality, drug abuse, police brutality, workplace harassment, domestic violence, child labor, environmental damage, homelessness, hate crimes, racial inequality etc. Based on the given text transcript, determine if the advertisement has any social message. Transcription: <Text from transcript> Options: - Yes - No Answer: • Topic: Instruction: In this task, you are given a transcription of an advertisement. Your task is to associate a single topic label with the transcript from the given set. Transcription: <Text from transcript> Options: 158 - Games - Household - Services - Sports - Banking - Clothing - Industrial and agriculture - Leisure - Publications media - Health - Car - Electronics - Cosmetics - Food and drink - Awareness - Travel and transport - Retail Answer: For GPT-4 we use the recently released API to pass the prompts for individual tasks. For Flan-T5 and Opt-IML, we use the publicly available models as a part of Huggingface [259] library. For alpaca, we use the publicly available implementation in Github [172]. The large language models sometimes assign a label to the prediction that does not lie within the set of valid labels for the respective tasks. In the case of those samples, we randomly assign a label from the task-specific label taxonomy. 
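As a rough illustration of the evaluation loop described above, the following minimal sketch shows how the tone transition (TT) prompt could be scored with one of the publicly available Flan-T5 checkpoints through the Hugging Face transformers library [259]. The checkpoint name (google/flan-t5-xl), the helper function names, and the greedy decoding settings are illustrative assumptions rather than the exact configuration used for the reported results; the prompt text follows the Flan-T5 TT template in E.1.1, and the random fallback mirrors the label-assignment rule stated in the previous paragraph.

# Minimal sketch (assumptions noted above): zero-shot TT prediction from a transcript.
import random
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

TT_OPTIONS = ["Transition", "No transition"]

def tone_transition_prompt(transcript: str) -> str:
    # Mirrors the Flan-T5 TT template listed in E.1.1.
    options = "\n".join(f"- {o}" for o in TT_OPTIONS)
    return (f"{transcript}\n"
            "Based on the given text transcript from the advertisement, "
            "determine if the advertisement has any transitions in tones.\n"
            f"OPTIONS:\n{options}\nANSWER:")

def predict_tone_transition(transcript: str) -> str:
    inputs = tokenizer(tone_transition_prompt(transcript),
                       return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=8)
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    # Predictions outside the valid label set are replaced by a random valid
    # label, as done for all language models in the zero-shot evaluation.
    return answer if answer in TT_OPTIONS else random.choice(TT_OPTIONS)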
F Atypicality labeling

We provide a detailed list of atypicality labels, including definitions as listed in [107].

• Texture Replacement 1:
– Definition: Object's texture borrowed from another object, e.g., kiwi inside apple.
• Texture Replacement 2:
– Definition: Texture created by combining several small objects, e.g., owl from beans.
• Object inside object:
– Definition: One object is inside another object, e.g., jar in an onion.
• Object with missing part:
– Definition: Object is missing a part, e.g., a dog without a mouth.
• Combination of parts:
– Definition: Object composed of parts from different objects, e.g., deer head with hand horn.
• Solid deformed objects:
– Definition: Solid objects that have been deformed from the original shape by bending, twisting, e.g., human arm bent.
• Liquid deformed objects:
– Definition: Bent and deformed objects but for liquids, e.g., beer as player.
• Object replacement:
– Definition: Object appearing in the context normally associated with another, e.g., cigarettes placed in the context where bullets occur.
• Others:
– Definition: Any other atypicality not covered by the above categories.
• Regular object:
– Definition: If the image does not contain atypical objects, please mark this category.

G Persuasion strategies

• Guarantees:
– Definition: Guarantees reduce risk and people try out such products more often.
• Authority:
– Definition: Authority indicated through expertise, source of power, third-party approval, credentials, and awards.
• Trustworthiness:
– Definition: Trustworthiness indicates honesty and integrity of the source through tropes like years of experience, "trusted brand", numbers, and statistics.
• Social Identity:
– Definition: Normative influence, which involves conformity with the positive expectations of "another," who could be "another person, a group, or one's self" (includes self-persuasion, fleeting attraction, alter-casting, and exclusivity).
• Social Proof:
– Definition: Informational influence by accepting information obtained from others as evidence about reality, e.g., customer reviews and ratings.
• Reciprocity:
– Definition: By obligating the recipient of an act to repayment in the future, the rule for reciprocation begets a sense of future obligation, often unequal in nature.
• Foot-in-the-door:
– Definition: Starting with small requests followed by larger requests to facilitate compliance while maintaining cognitive coherence.
• Overcoming resistance:
– Definition: Overcoming resistance (reactance) by postponing consequences to the future, by focusing resistance on realistic concerns, by forewarning that a message will be coming, by acknowledging resistance, and by raising self-esteem and a sense of efficacy.
• Concreteness:
– Definition: Using concrete facts, evidence, and statistics to appeal to the logic of consumers.
• Anchoring and Comparison:
– Definition: A product's value is strongly influenced by what it is compared to.
• Social Impact:
– Definition: Emphasizes the importance or bigger (societal) impact of a product.
• Scarcity:
– Definition: People assign more value to opportunities when they are less available. This happens due to psychological reactance to losing freedom of choice when things are less available, or because people use availability as a cognitive shortcut for gauging quality.
• Anthropomorphism:
– Definition: When a brand or product is seen as humanlike, people will like it more and feel closer to it.
• Active:
– Definition: Feeling energetic, adventurous, vibrant, enthusiastic, playful.
• Amazed:
– Definition: Feeling surprised, astonished, awed, fascinated, intrigued.
• Cheerful:
– Definition: Feeling delighted, happy, joyful, carefree, optimistic.
• Creative:
– Definition: Inventive, productive.
• Eager:
– Definition: Feeling of hunger, thirst, passion.
• Fashionable:
– Definition: Trendy, elegant, beautiful, attractive.
• Feminine:
– Definition: Womanly, girlish.
• Unclear:
– Definition: If the ad strategy is unclear or it is not in English.
• Others:
– Definition: Other strategies that are not listed above.
• Emotion:
– Definition: Aesthetics, feeling, and other non-cognitively demanding features used for persuading consumers.
Abstract
In today’s information-rich landscape, there has been a rapid rise in media content available through web-based platforms reliant on diverse modalities. Multimodal content can appear in various formats, including online news articles, cinematic productions, advertisements, short digital videos, and print images. The proliferation of multimodal content across diverse platforms necessitates a large-scale understanding of its impact on individuals and society as a whole. Our ability to comprehend the underlying narratives in these sources depends on advancing core multimodal perception capabilities that rely on the integration of information from diverse modalities, including audio, visual, and textual streams.
In this work, we explore how the core elements of multimodal perception, including modality connectedness, representation and reasoning, and heterogeneity, can enable media understanding across the dimensions of content tagging, narrative understanding, and handling diverse quality variations. Furthermore, in light of current advancements in large-scale multimodal processing, we identify certain key areas that can drive the next generation of media understanding.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Learning shared subspaces across multiple views and modalities
Establishing cross-modal correspondences for media understanding.
Creating cross-modal, context-aware representations of music for downstream tasks
Multimodal and self-guided clustering approaches toward context aware speaker diarization
Computational narrative models of character representations to estimate audience perception
Multimodal reasoning of visual information and natural language
Visual representation learning with structural prior
Semantically-grounded audio representation learning
Towards understanding language in perception and embodiment
Improving language understanding and summarization by leveraging auxiliary information through self-supervised or unsupervised learning
Incorporating large-scale vision-language corpora in visual understanding
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
Speech recognition error modeling for robust speech processing and natural language understanding applications
Understanding and generating multimodal feedback in human-machine story-telling
Multimodal representation learning of affective behavior
Understanding sources of variability in learning robust deep audio representations
Multimodality, context and continuous dynamics for recognition and analysis of emotional states, and applications in healthcare
Computational methods for modeling nonverbal communication in human interaction
Modeling dyadic synchrony with heterogeneous data: validation in infant-mother and infant-robot interactions
Generative foundation model assisted privacy-enhancing computing in human-centered machine intelligence
Asset Metadata
Creator
Bose, Digbalay
(author)
Core Title
Multimodal perception guided computational media understanding
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical and Computer Engineering
Degree Conferral Date
2024-12
Publication Date
01/10/2025
Defense Date
12/18/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer vision, deep learning, media understanding, modality corruption, modality representation, multimodal learning
Format
theses
(aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Narayanan, Shrikanth (committee chair), Jenkins, Keith (committee member), Thomason, Jesse (committee member)
Creator Email
dbose@usc.edu,digbose92@gmail.com
Unique identifier
UC11399F8KL
Identifier
etd-BoseDigbal-13728.pdf (filename)
Legacy Identifier
etd-BoseDigbal-13728
Document Type
Dissertation
Rights
Bose, Digbalay
Internet Media Type
application/pdf
Type
texts
Source
20250110-usctheses-batch-1232 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu