Establishing cross-modal correspondences for media understanding

by

Rahul Sharma

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2023

Copyright 2023 Rahul Sharma

Acknowledgements

First and foremost, I express my immense gratitude to my Ph.D. advisor, Prof. Shrikanth Narayanan, for providing me with such a tremendous opportunity for lifetime learning. Shri's unique advisement style has motivated and encouraged me to ask the right and impactful questions. It inculcated a systematic approach to solving problems, which helped me grow professionally and personally. I want to thank him for keeping confidence in me through this program and letting me learn and advance at my own pace, helping me develop a researcher's instinct.

I thank my Ph.D. dissertation and qualifying exam committee members, Prof. Antonio Ortega, Prof. Aiichiro Nakano, Prof. C.-C. Jay Kuo, and Prof. Mahdi Soltanolkotabi, for making time to serve on my committees. I am grateful for their critical and valuable feedback, which helped me to ask harder questions and better shape this thesis. I want to thank the whole SAIL family, current students and those who have graduated, for numerous fruitful discussions that helped me comprehensively solve the problems involved.

Furthermore, I want to express my gratitude to all the academic advisors, particularly Tanya Acevedo-lam and Diane Demitras, who tirelessly worked behind the scenes, taking care of all administrative responsibilities. I also want to thank my mentors during my internship at Amazon, Dr. Rahul Gupta and Dr. Anil Ramakrishna, who provided me with industrial exposure and helped me learn to work more efficiently. I also want to extend my gratitude to my advisors at the Indian Institute of Technology Kanpur, Prof. Tanaya Guha and Prof. Gaurav Sharma, who introduced me to academic research and guided me to land this opportunity to pursue a Ph.D. at USC.

Lastly, I want to acknowledge the contributions of my sweet family in the form of my constant support system. Without their love and support, I would not have come this far. I thank my parents for their sacrifices, my brother for happily fulfilling my part of the social responsibilities, and my lovely wife for pushing me through the last mile. I am grateful to them for standing by me and motivating me throughout.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Background
  1.2 Thesis statement
  1.3 Contributions
  1.4 Thesis outline

Chapter 2: Prior Work
  2.1 Cross-modal learning
  2.2 Weakly supervised object detection (WSOD)
  2.3 Active speaker localization
  2.4 Sound source localization
  2.5 Face and speaker recognition
    2.5.1 Face recognition
    2.5.2 Speaker Recognition
  2.6 Speaker diarization

Chapter 3: Cross modal video representations for weakly supervised active speaker localization
  3.1 Introduction
  3.2 Related Work
    3.2.1 Cross-modal learning
    3.2.2 Weakly supervised object detection (WSOD)
    3.2.3 Active speaker localization
    3.2.4 Sound source localization
  3.3 Methodology
    3.3.1 Cross-modal visual representations
    3.3.2 Weakly supervised active speaker localization
    3.3.3 Audio-assisted active speaker detection
  3.4 Experiments and evaluations
    3.4.1 Qualitative analysis
    3.4.2 Quantitative Analysis
    3.4.3 Ablation studies
  3.5 Summary and Future Work

Chapter 4: Unsupervised cross-modal identity association for establishing speech-face correspondences
  4.1 Introduction
  4.2 Related Work
  4.3 Methodology
    4.3.1 Problem Formulation
    4.3.2 Stage 1: Speech-face Assignment
    4.3.3 Stage 2: Off-screen speaker correction
  4.4 Performance Evaluation
    4.4.1 Implementation details
    4.4.2 Stage 1: Speech-face assignation
    4.4.3 Stage 2: Off-screen speaker correction
    4.4.4 Comparisons with state-of-the-art
    4.4.5 Ablation studies
    4.4.6 Discussion
  4.5 Summary and Conclusion

Chapter 5: Audio-visual activity guided cross-modal identity association for active speaker detection
  5.1 Introduction
  5.2 Methodology
    5.2.1 Problem Setup
    5.2.2 Speakers' cross-modal identity association (SCMIA)
    5.2.3 Audio-visual activity guidance (GSCMIA)
  5.3 Experiments and implementation details
    5.3.1 Implementation details
    5.3.2 Performance Evaluation
    5.3.3 Ablation studies
  5.4 Conclusion

Chapter 6: Using Active Speaker Faces for Diarization in TV shows
  6.1 Introduction
  6.2 Related Work
    6.2.1 Active speaker detection
    6.2.2 Face and speaker recognition
    6.2.3 Speaker diarization
  6.3 Methodology
    6.3.1 Problem formulation
    6.3.2 Active speaker detection (ASD)
    6.3.3 Speaker diarization
  6.4 Experiments
    6.4.1 Active speaker detection performance
    6.4.2 Speaker diarization performance
  6.5 Conclusion

Chapter 7: Audio visual character profiles for detecting background characters in entertainment media
  7.1 Introduction
  7.2 Background Character Dataset
  7.3 Proposed Approach
    7.3.1 Problem formulation
    7.3.2 Visual active speaker score
    7.3.3 Profile matching score
    7.3.4 Iterative profile matching
    7.3.5 Background character detection
  7.4 Performance Evaluation
    7.4.1 Active speaker localization
    7.4.2 Background character detection
  7.5 Conclusions

Chapter 8: Conclusion and Future Work

References

Appendices

Chapter A: Toward visual voice activity detection for unconstrained videos
  A.1 Introduction
  A.2 Proposed Approach
    A.2.1 Problem Formulation
    A.2.2 Network Architecture
    A.2.3 Visualizations
  A.3 Performance Evaluation
    A.3.1 Implementation Details
    A.3.2 VAD Performance
    A.3.3 Visualization Analysis of Learned Representations
  A.4 Discussion and Conclusion

List of Tables

3.1 Fraction of overlapping speech in various datasets.
3.2 Comparison with the state-of-the-art methods on AVA active speaker validation set in terms of mean average precision.
3.3 Performance comparison of the audio-assisted MIL-based framework with the baselines on videos from VPCD, reported in % mean average precision (mAP).
3.4 Comparison of the speaker-wise weighted F1 scores (%) for all the speakers in Columbia dataset.
4.1 Fraction of overlapping speech (%) for various videos.
4.2 F1-scores at various stages for the videos from VPCD. Reported F1 scores are averaged over all the episodes for the TV shows (Friends, TBBT, and Sherlock).
4.3 Performance for the off-screen speech segment classification for the videos in VPCD, in terms of area under the ROC curve (auROC).
4.4 Comparison with the state-of-the-art methods on AVA active speaker validation set in terms of mean average precision.
4.5 Comparison of the speaker-wise weighted F1 scores for all the speakers in Columbia dataset.
5.1 Performance of various strategies for ASD when guided with Talknet (R. Tao et al., 2021). The reporting metric is mean average precision (%).
5.2 Performance of various strategies for ASD when guided with Syncnet (J. S. Chung & Zisserman, 2016). The reporting metric is mean average precision (%).
5.3 Constituents of positive and effective positive guides and their exactness. All values are shown in %.
5.4 Comparison of the speaker-wise weighted F1 scores for all the speakers in Columbia dataset.
5.5 Constituents of negative and effective negative guides and their exactness. All values are shown in %.
5.6 Effect of using the negative guides: performance enhancement for speech segments with off-screen speakers.
6.1 Active speaker detection performance improvement with iterative profile matching strategy.
6.2 Speaker diarization performance, DER (lower is better), using face-tracks compared against audio-only system.
6.3 Variation in speaker diarization performance (using face clustering) with the quality of active speaker detection.
7.1 Performance for active speaker localization using audio-visual character profiles.
7.2 Performance for background character detection.

List of Figures

3.2.1 The crossmodal architecture with 3D CNNs and stacked convolutional BiLSTM layers. The network observes the raw visual frames and is trained to predict the presence of speech (PoS) activity in audio modality.
3.3.1 Class activation maps for the positive class imposed on the input frames, showing the localization ability of the learned embeddings. Sample frames from videos of Row 1: AVA, Row 2: Friends, and Row 3: TBBT.
3.3.2 The overview of the weakly supervised active speaker localization system. The MIL framework observes the visual representations from the cross-modal architecture and face bounding boxes for each frame, and is trained to predict the presence of speech in a video segment.
3.3.3 Example of a speaker-homogeneous speech segment and corresponding temporally overlapping face tracks.
3.4.1 Illustration of localization performance of the crossmodal embeddings, specifically for frames with more than one face. Row 1: AVA active speaker (movie), Row 2: TBBT (TV show), Row 3: Friends (TV show), and Row 4: Columbia dataset.
3.4.2 Qualitative comparison of CAMs for the last convolutional layer against the last convolutional-LSTM layer for the case of speech and non-speech events.
3.4.3 Distributions of the number of faces in each frame for various datasets.
3.4.4 Performance comparison of audio-assisted and visual-only formulations for videos from VPCD and AVA, in mean average precision (%).
3.4.5 Performance comparison of the audio-assisted system (using system VAD) against the oracle speaker-homogeneous speech segments.
38 ix 3.4.6Performanceoftheaudio-assistedframeworkfor3groupsoffacesizes: small(<1%), medium (1-5%) and large (>5%), for VPCD and AVA. . . . . . . . . . . . . . . . . 39 3.5.1Sample frames showing positive class activation maps for animated videos, and pro- vides a non-trivial signal for animated characters. . . . . . . . . . . . . . . . . . . . . 40 4.1.1a) Speech-identity distance matrix (SD) and b) Face-identity Distance matrix (FD) for the movie Hidden Figure. The active speaker faces are acquired form the ground truth. The speech segments are ordered to gather the speech segments of each speaker together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.1.2The overview of the proposed framework: We gather temporally overlapping faces for each speaker-homogeneous speech segment. Using the speaker recognition em- beddings, we construct a speech distance matrix. From the set of possible sequences of corresponding active speaker faces, we select the sequence of faces such that the face-identity distance matrix, obtained using the face-recognition features, displays a high resemblance with the speech-identity distance matrix. . . . . . . . . . . . . . 44 4.4.1DistancematricesforthemovieHiddenFiguresatvariousstagesofthesystemalong with the value of objective function, Corr(FD). a) Speech-identity distance matrix (SD) b)Random: Face-identity distance matrix (FD) for randomly initialized ASD. c) Stage1: FD with speech-face assignation, maximizing Corr(FD). d) Stage2: FD post removing the speech segments with off-screen speaker. e) Ground truth: FD obtained using ground truth active speaker faces. . . . . . . . . . . . . . . . . . . . . 54 4.4.2Evolution of the objective function Corr(FD) at different stages for the videos in VPCD.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.3Plot showing the comparison between the distributions of row-wise correlations be- tween SD and FD for speech segments with off-screen ( S off ) and on-screen (S on ) speakers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4.4Increase in precision with a slight drop in recall introduced by correcting the speech- face assignments for speech segments with off-screen speakers (stage2). . . . . . . . 59 x 4.4.5Comparisonofthesystem’sperformanceagainstthescenariowithidealcasespeaker- homogeneousspeechsegments,reportedintermsofmAPforvideofromVPCD.The system’s performance relative to the ideal case scenario is shown beneath each video. 61 4.4.6Computational time (in logarithmic scale) for the system against the performance (reported in mAP) for various values of the partition length L for the videos in VPCD. 63 4.4.7Comparison of active speaker detection performance for systems using different speech/facerecognitionarchitectures, onvideosfromtheVPCDdataset. Forspeech embeddings, Resnet is superior to x-vectors and for face Resnet is superior to the Facenet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.8Distance matrices for Columbia dataset with Corr(FD). a) Speech distance matrix SD. b) FD with ground truth active speaker faces. c) FD post stage2 for proposed system. d) FD post stage 2 for the system assisted with 15% ground truth active speaker faces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.9Sample frames from Columbia dataset with manually marked identities of the faces and the name of the current speaker. 
a) Frames showing the incorrect speech-face associations made by the proposed system. b) Frames showing the corrected speech-face associations when assisted with 15% ground truth faces.
5.2.1 Overview of the audio-visual activity guided speaker identity association across modalities, GSCMIA. a) Construction of positive and negative guides from audio-visual activity. b) Guiding SCMIA using positive and negative guides.
5.3.1 Comparison of SCMIA and TalkNet (R. Tao et al., 2021) predictions for positive (top row) and negative samples (bottom row).
5.3.2 Variation in performance, reported in mAP, of SCMIA and GSCMIA with the context length (L).
6.3.1 Active speaker localization using the HiCA architecture and GradCAMs (Sharma, Somandepalli, & Narayanan, 2022).
6.4.1 Distance matrices for the speech-face associations, using cosine distance among face-track embeddings, for different sets of speech-face instances. Selecting all (b), top 75% (c), 50% (d), and 25% (e) samples in order of ASD scores.
6.4.2 Distance matrices for simulated ASD output. a) and c): Simulated output with all samples. b) and d): Simulated output with just the correct samples.
7.2.1 Left: Sample frames with background characters marked in green. Right: Number of tracks in each episode.
7.4.1 Performance for active speaker localization (ROC) for different iterations of the profile matching algorithm.
7.4.2 Increase in high confidence instances saturates with iterations.
7.4.3 Distribution of characters among high confidence instances (HCI), for two extreme steps of iterative profile matching.
A.1.1 Complete architecture of the proposed hierarchically context aware (HiCA) framework.
A.3.1 Examples of missed, matched and extra boxes in a frame.
A.3.2 Training stats for the HiCA network, averaged over each epoch.
A.3.3 Qualitative localization performance of the proposed HiCA network for various test videos.
A.4.1 Trend of F-score for different experiments, and matching percentage for expt 3.

Abstract

An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who, when, how, and where someone is talking or not. Speaker activity can be automatically discerned from the rich multimodal information present in media content. It is, however, a challenging problem due to the wide variety and contextual variability of media content and the need for labeled data. In this dissertation, I present two strategies that utilize the cross-modal information in media videos to establish a correspondence between the speech in the audio modality and the faces in the video modality, such that the face is the source of the underlying speech in the audio.
First, I present a cross-modal neural network to model the observed audio-visual activity, which carries implicit information about the speaker's spatial location in the visual frames. Avoiding the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, I formulate a weakly supervised system for localizing active speakers in movie content. Second, I present a strategy that leverages speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). I express the speech segments in terms of their associated speaker identity distances from all other speech segments to capture a relative identity structure for the video. I then assign an active speaker's face to each speech segment from the concurrently appearing faces such that the obtained set of active speaker faces displays a similar relative identity structure.

The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing. At the same time, the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, I further investigate their ability to complement each other. I propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. It enables a comprehensive active speaker detection framework relying on no manual annotations. I evaluate all the proposed frameworks on three benchmark datasets (the Visual Person Clustering dataset, the AVA active speaker dataset, and the Columbia dataset), consisting of videos from entertainment and broadcast media, and show competitive performance relative to state-of-the-art fully supervised methods.

Chapter 1: Introduction

Tremendous variety and amounts of multimedia content are created, shared, and consumed every day and across the world, with a great influence on our everyday lives. These span various domains, from entertainment and education to commerce and politics, and take various forms; for example, in the entertainment realm these include film, television, streaming, and online media. There is an imminent need for creating human-centered media analytics to illuminate the stories being told through these various content forms and to understand their human impact, both societal and economic. Recent efforts to address this need have led to the emergence of computational media intelligence (CMI) (Somandepalli et al., 2021), which deals with building a holistic understanding of the persons, places, and topics involved in telling stories in multimedia, and how they impact the experiences and behavior of individuals and society at large.

Creating such rich media intelligence requires the ability to automatically process and interpret large amounts of media content across modalities (audio, video, language, etc.), each modality with its strengths and limitations to help understand the story being told. The ability to process multiple modalities hence becomes essential to learn robust models for media content analysis. It should be noted that humans concurrently process and experience different aspects of the presented media: sights, sounds, and language use, to develop a holistic understanding of the story presented (Klemen & Chambers, 2012).
For example, several studies in psychology and neuroscience have shown evidence for how visual perception in humans is intertwined with other senses such as sound and touch. These mechanisms can be altered even at early stages of development of the primary visual cortex (e.g., (Shams & Kim, 2010)). This integration of multiple sensory modalities to holistically perceive visual stimuli is a widely studied field in human psychology, referred to as crossmodal perception (Schmiedchen, Freigang, Nitsche, & Rübsamen, 2012). Recently, there have been several works focused on computationally harnessing the idea of crossmodal perception in the audio-visual domain. Most of these studies build on the naturally existing relations between the audio and the corresponding visual frames in produced media content (Arandjelovic & Zisserman, 2017, 2018; Owens & Efros, 2018; Zhao et al., 2018).

The when and where constructs are fundamental pillars of CMI for developing a holistic understanding of a scene, as they direct us to locate the events of interest in time and space. In general, the events that we observe in multimedia content are spatio-temporal phenomena that manifest in both the audio and visual modalities. In the temporal domain, the construct of when addresses the time at which the event is observed in the audio modality in the form of sound, while in the spatial domain, determining the visual actions pertaining to the observed sound relies on the visual signal. The concurrent occurrence of events in the audio and video modalities motivates us to address the problem of connecting the observations in the audio modality with their visual actions in the visual modality, to develop a holistic understanding of the underlying event. This problem is referred to as sound-source localization in the literature (Barzelay & Schechner, 2007; Fisher III, Darrell, Freeman, & Viola, 2001; Hershey & Movellan, 2000).

In this work, we are interested in developing tools to understand the representation and portrayal of characters in entertainment media, primarily Hollywood movies and TV shows. By design, dialogue constitutes nearly 50-60% of a movie screenplay, making speech the most prominent sound event and the most crucial for understanding the story being told. In this thesis, we target the problem of sound source localization for a specific sound event: speech. We address the problem of audio-visual speech event localization in entertainment media, which essentially finds the source face for the speech in the audio modality among the possible faces appearing in the visual modality. We refer to this as active speaker detection (ASD).

1.1 Background

The conventional approaches for active speaker localization in videos utilize the knowledge that motion in the lip region of a human face is the potential source of the speech in the audio modality, and try to capture this activity (J. S. Chung & Zisserman, 2016). These approaches involve the complex process of explicitly extracting lip regions from human faces; they then quantify the activity in the captured lip regions and study its interactions with the underlying speech in the audio modality. Various methods have emerged that utilize the core idea of exploring audio-visual activity. Based on how they model the audio-visual activities in videos, these approaches fall into the following broad categories:

• Unsupervised methods: These methods measure the motions in the lip region and study their correlations with the speech waveform.
• Supervised methods: The advances in deep learning methodologies and the introduction of large-scale active speaker datasets have led to a large number of methods modeling audio-visual information in a supervised fashion. These methods range from using visual-only information to combining audio-visual information in cross-modal and multi-modal formulations. They involve training large convolutional neural networks (CNNs) to model the audio-visual information, and use LSTMs and graph neural networks (GNNs) to capture short-term and long-term temporal contexts for active speaker detection.

• Self-supervised methods: These methods exploit the inherent synchrony between the audio and visual signals for the generalized task of sound source localization. Active speaker detection is a direct application of such methodologies: the active speaker faces are the sound sources for speech audio events. They model the audio-visual activity using CNNs and LSTMs in a contrastive problem formulation that determines whether the audio-visual pair under consideration is synchronized.

Apart from explicitly studying the visual activity and its interaction with the audio, another source of information concerning speech-face correspondence is the character's cross-modal identity. As mentioned earlier, the speech event is a spatio-temporal phenomenon that manifests in both the audio and visual modalities. The speech in the audio modality shares its identity with the speaker's face in the visual modality, implying that the speech and the speaker's face identify the same character. The utilization of identity information connecting the speech and the active speaker's face is fairly unexplored in the literature.

Entertainment media videos, such as Hollywood movies and TV shows, display highly dynamic interactions of characters among themselves and with the environment. Due to the very nature of these videos, they pose some serious challenges to the above-discussed methodologies for active speaker detection. The challenges are as follows:

1. Video content challenges: The highly dynamic interaction of characters leads to character faces appearing in various poses and sizes. A significant portion of these faces appear in profile or other extreme postures, which limits the ability of the employed systems to model the visual activity. The frameworks that rely on explicit lip region extraction are more prone to errors due to extreme facial postures. The varying and unknown number of faces appearing in each frame further adds to the challenges.

2. Audio content challenges: The audio in entertainment media videos consists of music and a wide variety of artificially created sounds and sound effects along with the speech. A prominent portion of movies and TV shows consists of speech utterances overlapping with music or noise in the background, such as crowd laughter or traffic noises. These instances impair the ability of methods that rely explicitly on speech-face synchronization to disambiguate the speaker's face.

3. Annotation challenges: The supervised methods modeling the audio-visual activity employ large CNNs and LSTMs, and such models require massive data for training. Acquiring such large-scale annotations for movies is tedious and expensive in terms of human effort and time. For instance, manually annotating active speakers in each frame of a 90-minute film requires 15 hours of an expert annotator's time. As with all fully-supervised models, the trained models lack transferability.
Media videos from a different domain or demographic origin will require retraining the model with new annotated data.

In this thesis, I focus on developing methods for solving the cross-modal task of establishing speech-face correspondences specifically for challenging entertainment media videos. I propose to use the complementary sets of information concerning speakers available in the form of audio-visual activity and characters' cross-modal identities. This thesis follows the notion of weaker-than-full supervision while capturing cross-modal information and laying out methods for active speaker detection, thus requiring no manual annotations.

1.2 Thesis statement

Cross-modal activity modeling and cross-modal identity association provide complementary sets of information toward establishing speech-face correspondence for holistic media understanding.

The following concepts are primary to this thesis:

• Cross-modal correspondence: A speech event is a phenomenon that manifests in both the audio and visual modalities. We refer to the speech event as a latent event that emits observations in the audio modality, in the form of speech activity in the audio waveform, and in the visual modality, in the form of the active speaker's face (or lip region activity). At any point in time, multiple faces in the visual modality and more than one speech segment in the audio modality might exist. In this thesis, my primary focus is to establish an association between the observations in the audio modality and the visual modality such that they are emitted by the same underlying latent event (speech event).

• Cross-modal activity modeling: One of the most intuitive sources of information for determining the source face of an underlying speech utterance is studying the lip region activity and its interaction with the speech waveform. We refer to the speech part of the audio waveform as the audio activity, and to the motion in the lip region of the active speaker's face as the visual activity. We call all methods of measuring, or learning to represent, such activities corresponding to speech events modeling audio-visual activity. The stream of work that models activity in one modality using information from the other is called cross-modal activity modeling.

• Cross-modal identity association: Speech and faces are known to possess biometric information and thus can identify a person. The inherent correspondence between the audio and the visual modality concerning speech sound events can be established using the fact that the speech and the corresponding active speaker's face must identify the same person. We refer to the method of using the character's cross-modal identity information, gathered from the audio modality (speech) and the visual modality (faces), to establish the required speech-face correspondence as cross-modal identity association.

1.3 Contributions

Following are the primary contributions made by this thesis:

1. Cross-modal representation learning for active speaker detection: We formulate a cross-modal learning task particularly suited for localizing active speaker faces in the visual frames. We first propose a novel architecture for cross-modal representation learning, consisting of 3D convolutional neural networks (CNNs) and stacked convolutional Bi-LSTMs. It enables the system to capture multi-scale temporal context and introduces an ability to learn hierarchical abstractions in the presented information, thus making the system interpretable at several levels.
We further propose a multiple instance learning-based weakly supervised learning setup formalizing the active speaker detection task. We evaluate the active speaker localization task on a public dataset and show reasonable performance compared to fully-supervised systems, validating that cross-modal representation learning can establish correspondences in cross-modal observations.

2. Cross-modal identity association for speech-face correspondence: We propose a generalized framework to establish correspondence in observations across multiple modalities by harnessing the information available in their co-occurrence patterns. We use the framework to gather the correspondence between the speech in the audio modality and the faces in the visual modality, particularly utilizing the character identity information available in both speech and faces. The proposed system relies on pretrained speaker recognition and face recognition systems but does not require any form of manual annotation or training. Thus the proposed unsupervised framework for active speaker detection can be applied to any domain without constraints. We validated the performance on benchmark public datasets consisting of movies and TV shows.

3. Audio-visual activity guided cross-modal identity association: The two approaches to establishing speech-face correspondence in entertainment media videos described above, i) cross-modal activity modeling and ii) cross-modal identity association, have their limitations. The audio-visual activity information gets confused with other frequently occurring vocal activities, such as laughing and chewing. At the same time, the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the sources of information for the two methods are independent, we investigate their complementary nature. We propose a novel unsupervised framework to guide the cross-modal identity association framework with the audio-visual activity for active speaker detection. We show that incorporating the two systems leads to an improved active speaker detection system in terms of performance and computational requirements.

1.4 Thesis outline

The remainder of the thesis is organized as follows:

• Chapter 2 highlights the previous research relevant to this thesis. It discusses, in particular, efforts to utilize cross-modal information in the context of media understanding. It mentions several works introducing weakly supervised training setups and self-supervised formulations for the generalized problem of sound source localization. Furthermore, it details the datasets and methods for active speaker localization in visual frames.

• Chapter 3 presents the weakly supervised framework for active speaker detection utilizing cross-modal activity modeling. It details the cross-modal representation learning task and presents the system's performance on benchmark datasets.

• Chapter 4 discusses the unsupervised cross-modal identity association for establishing speech-face correspondences. It further highlights the superior performance compared to SOTA self-supervised systems.

• Chapter 5 details the framework to incorporate the audio-visual activity information to support the cross-modal identity association.
It establishes that the sets of information derived from audio-visual activity modeling and cross-modal identity association have mutually exclusive components and thus act complementary to each other, further enhancing the system's performance.

• Chapter 6 presents character diarization as an application of active speaker detection toward understanding media content. It emphasizes that active speaker detection improves face and speech clustering to enhance character diarization.

• Chapter 7 presents background character detection as another application of the proposed active speaker detection framework.

• Chapter 8 concludes this thesis by summarizing the presented methods for active speaker detection and further highlights future directions of research.

Chapter 2: Prior Work

2.1 Cross-modal learning

There has been a recent surge of studies focused on cross-modal machine perception, especially in media content analysis. The idea of cross-modal learning primarily revolves around modeling one modality guided by another. In (Somandepalli, Martinez, Kumar, & Narayanan, 2018), the authors target video advertisement classification using cross-modal autoencoders, reconstructing one modality from the other. In more recent work, (H. Xu, Zeng, Wu, Tan, & Gan, 2020) proposed a cross-modal relation-aware network for audio-visual event localization involving a self-attention mechanism where the query is derived from one modality while the key-value pairs come from the other. Another work (Song, Ning, Zhang, & Wu, 2021) targets the problem of fake news detection using a cross-modal residual network, where the text modality guides the attention for learning visual representations and vice versa. In our earlier work (Sharma, Somandepalli, & Narayanan, 2019), we proposed a cross-modal problem setup for the task of visual VAD involving a hierarchically context-aware network (HiCA), which observes the visual frames and predicts the audio VAD labels.

2.2 Weakly supervised object detection (WSOD)

WSOD refers to the training setup where only image-level labels are provided for supervision, as opposed to bounding box labels in fully-supervised scenarios. Recent research in WSOD can be broadly categorized into two directions: i) class activation maps (CAMs), and ii) multiple instance learning (MIL) based setups. CAM-based methods leverage the relationship between CNN embeddings and the class posteriors to compute localization maps. One of the earlier approaches (Bazzani, Bergamo, Anguelov, & Torresani, 2016) used the idea that the recognition score will drop if the object of interest is artificially masked out of the input image. The idea of CAMs (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2016) was initially proposed to compute the discriminative image regions for a class of interest in the case of linear prediction layers. Grad-CAM (Selvaraju et al., 2017) was later introduced, generalizing CAMs by using the gradients of the posteriors with respect to the activations of the pertinent layer. Furthermore, GradCAM++ (Chattopadhay, Sarkar, Howlader, & Balasubramanian, 2018) introduced a weighted average of pixel-wise gradients to improve the coverage of detections and dealt with multiple occurrences of the same object.

MIL setups pose the input image for classification as a bag of instances, where the instances are object proposals. In an early attempt, (Bilen & Vedaldi, 2016) proposed a two-stream CNN, with one stream predicting bag scores and the other computing the instance-level scores; a minimal sketch of this bag-of-instances scoring idea is shown below.
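To make the bag-of-instances formulation concrete, the following is a minimal, illustrative sketch in Python (PyTorch); it is not the implementation of any of the cited works, and all names are made up. Per-proposal instance scores are aggregated into a single bag-level score, so only an image-level label is needed for supervision.

```python
import torch
import torch.nn as nn

class MILHead(nn.Module):
    """Minimal MIL head: scores each object proposal (instance) and aggregates
    the instance scores into one bag-level prediction, so that only image-level
    labels are needed for supervision. Illustrative sketch only."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)  # per-proposal score

    def forward(self, proposal_feats: torch.Tensor):
        # proposal_feats: (num_proposals, feat_dim) for one image (one bag)
        instance_logits = self.instance_scorer(proposal_feats).squeeze(-1)
        bag_logit = instance_logits.max()  # max-pooling: bag follows the strongest instance
        return bag_logit, instance_logits

mil = MILHead(feat_dim=512)
feats = torch.randn(30, 512)                 # features of 30 proposals from one image (dummy)
bag_logit, inst_logits = mil(feats)
label = torch.tensor(1.0)                    # image-level label: object present
loss = nn.functional.binary_cross_entropy_with_logits(bag_logit, label)
loss.backward()                              # gradients reach the instance scores through the max
```

Max-pooling is only one possible aggregation; the cited works use more elaborate multi-stream, multi-stage, and smoothed variants of this basic idea.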
Recently, (Tang, Wang, Bai, & Liu, 2017) proposed a multistage instance classifier (MIDN) to predict tighter object detection boxes, which was further enhanced to improve the coverage of detections by using two MIDNs (Gao et al., 2019). To alleviate the non-convexity issues associated with MIL, (F. Wan et al., 2019) proposed to use a combination of smoothed loss functions.

2.3 Active speaker localization

Earlier works (Everingham, Sivic, & Zisserman, 2006) in active speaker detection largely focus on using the activity in the lip region available in the visual modality. In another approach (J. S. Chung & Zisserman, 2016), the authors proposed to use the synchrony between cropped images of lip regions and the associated audio to determine active speakers. Furthermore, (Chakravarty, Mirzaei, Tuytelaars, & Van hamme, 2015) introduced the use of cues from upper body motion to determine an active speaker, which they further refined using personalized voice models (Chakravarty, Zegers, Tuytelaars, & Van hamme, 2016). Recently, (Roth et al., 2020) proposed a large-scale dataset (the AVA active speaker dataset), consisting of movies and the corresponding active speaker annotations, along with baseline performance using a supervised framework. Several frameworks have since followed (Alcázar et al., 2020; J. S. Chung, 2019; Y.-H. Zhang, Xiao, Yang, & Shan, 2019) to improve performance on the AVA dataset, but all these works are restricted to supervised frameworks. To overcome the need for expensive annotations, (Afouras, Owens, Chung, & Zisserman, 2020) proposed a self-supervised framework trained for the task of audio-visual correspondence using optical flow information.

2.4 Sound source localization

The problem of active speaker localization falls within the general domain of sound source localization, but for a particular audio event: speech. The core idea driving research in this direction is to exploit the existing audio-visual correspondence in the media content. Earlier efforts (Barzelay & Schechner, 2007; Fisher III et al., 2001; Hershey & Movellan, 2000) used canonical correlation analysis to model the audio-visual correspondence. Recent research has been dominated by self-supervised deep learning methods, where researchers try to capture the audio-visual correspondence using various proxy tasks. One such proxy task (Zhao, Gan, Ma, & Torralba, 2019; Zhao et al., 2018) uses the additive nature of audio and reconstructs the sound for each pixel by learning a mask for the audio spectrogram. Another proxy task (Owens & Efros, 2018) predicts the time alignment of a given audio and video pair. The work by (Arandjelovic & Zisserman, 2018) used the audio-visual correspondence to predict a localization score for every pixel, and (Afouras, Asano, Fagan, Vedaldi, & Metze, 2021) extended the same formulation for object detection. Furthermore, (H. Xu et al., 2020) proposed a cross-modal attention mechanism for audio event classification and used the learned attention for modeling the localization task. The majority of these works qualitatively established the localization ability gained from the inherent audio-visual correspondence but lack quantitative evaluation. In this work, we present a qualitative as well as a thorough quantitative analysis of the acquired localization ability of the visual embeddings.

2.5 Face and speaker recognition

2.5.1 Face recognition

Approaches in face recognition can be divided into two categories: i) the face verification task (G. B. Huang, Ramesh, Berg, & Learned-Miller, 2007; Lu & Tang, 2015), designed to predict whether a given pair of images belongs to the same person;
ii) the face identification task (Guillaumin, Verbeek, & Schmid, n.d.; Guo, Zhang, Hu, He, & Gao, 2016; Parkhi, Vedaldi, & Zisserman, 2015): given a gallery set and a query set of images, the framework finds the most similar images in the gallery set for each image in the query set. A common approach to face recognition is metric learning (Guillaumin et al., n.d.), where a deep neural network is trained for the task of face verification or identification. Here we use the features extracted from an SENet-50 (Hu, Shen, & Sun, 2018) pretrained on MS-Celeb-1M (Guo et al., 2016) and fine-tuned on VGGFace2 (Cao, Shen, Xie, Parkhi, & Zisserman, 2018).

2.5.2 Speaker Recognition

Speaker recognition is a well-explored field, with earlier approaches focused on learning speaker embeddings using variants of the softmax classification loss (J. S. Chung, Huh, Mun, et al., 2020; Z. Huang, Wang, & Yu, 2018; Kenny, Stafylakis, Ouellet, Alam, & Dumouchel, 2013; Nagrani, Chung, Xie, & Zisserman, 2020; Yu, Fan, & Li, n.d.). Recent efforts developing metric learning objectives to learn an embedding space with small intra-class and large inter-class distances have shown promising results for speaker recognition (J. S. Chung, Huh, Mun, et al., 2020; Q. Wang, Downey, Wan, Mansfield, & Moreno, n.d.; C. Zhang, Koishida, & Hansen, 2018), which were later combined with the PLDA framework (Kenny et al., 2013) to improve the discriminative power of the classification loss. Further improvements to the softmax classification were proposed in the form of angular softmax (Z. Huang et al., 2018), involving cosine similarity, and additive margin versions of the same (Yu et al., n.d.), aimed at increasing inter-class variance. In this work, we use the embeddings from a ResNet-34 (He, Zhang, Ren, & Sun, 2016) trained on VoxCeleb2 (J. S. Chung, Nagrani, & Zisserman, 2018) with an angular objective (J. S. Chung, Huh, Mun, et al., 2020).

2.6 Speaker diarization

Speaker diarization in the audio modality has been extensively addressed and remains an active area of research (Park et al., 2022). In general, diarization frameworks consist of multistage paradigms involving voice activity detection, speaker embedding extraction, and then clustering of the speech regions in the embedding space (Pal et al., n.d.; Park, Han, Kumar, & Narayanan, 2020; Q. Wang et al., n.d.). Recently, there has been an increase in end-to-end neural speaker diarization systems (Z. Huang et al., n.d.). Works have also leveraged the visual cues available in lip motion and facial attributes for speaker diarization (Everingham et al., 2006). There exist works addressing the character naming problem in TV shows using additional meta-information from the cast list, subtitles, and transcripts, making it a relaxed version of speaker diarization (Nagrani & Zisserman, 2018). Specific to diarization in TV shows, previous methods have used visual patterns among the shots (Bost, Linares, & Gueye, 2015), face clustering, and talking face detection (Bredin & Gelly, 2016; J. S. Chung, Huh, Nagrani, Afouras, & Zisserman, 2020) to complement audio-driven speaker embeddings. Large-scale audio-visual diarization datasets around TV shows (J. S. Chung, Huh, Nagrani, et al., 2020) and feature films (E. Z.
Xu, Song, Feng, Ye, & Shou, 2021) have also emerged, involving a semi-automatic annotation process.

Chapter 3: Cross modal video representations for weakly supervised active speaker localization

(Part of this work has appeared in Sharma, Rahul, Krishna Somandepalli, and Shrikanth Narayanan, "Toward visual voice activity detection for unconstrained videos," 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019; more details in Appendix A. A longer version of this work has been published in R. Sharma, K. Somandepalli and S. Narayanan, "Cross modal video representations for weakly supervised active speaker localization," IEEE Transactions on Multimedia, doi: 10.1109/TMM.2022.3229975.)

3.1 Introduction

Tremendous variety and amounts of multimedia content are created, shared, and consumed every day and across the world, with a great influence on our everyday lives. These span various domains, from entertainment and education to commerce and politics, and take various forms; for example, in the entertainment realm these include film, television, streaming, and online media. There is an imminent need for creating human-centered media analytics to illuminate the stories being told through these various content forms and to understand their human impact, both societal and economic. Recent efforts to address this need have led to the emergence of computational media intelligence (CMI) (Somandepalli et al., 2021), which deals with building a holistic understanding of the persons, places, and topics involved in telling stories in multimedia, and how they impact the experiences and behavior of individuals and society at large.

Creating such rich media intelligence requires the ability to automatically process and interpret large amounts of media content across modalities (audio, video, language, etc.), each modality with its strengths and limitations to help understand the story being told. The ability to process multiple modalities hence becomes essential to learn robust models for media content analysis. It should be noted that humans concurrently process and experience different aspects of the presented media: sights, sounds, and language use, to develop a holistic understanding of the story presented (Klemen & Chambers, 2012). For example, several studies in psychology and neuroscience have shown evidence for how visual perception in humans is intertwined with other senses such as sound and touch. These mechanisms can be altered even at early stages of development of the primary visual cortex (e.g., (Shams & Kim, 2010)). This integration of multiple sensory modalities to holistically perceive visual stimuli is a widely studied field in human psychology, referred to as crossmodal perception (Schmiedchen et al., 2012). Recently, there have been several works focused on computationally harnessing the idea of crossmodal perception in the audio-visual domain. Most of these studies build on the naturally existing relations between the audio and the corresponding visual frames in produced media content (Arandjelovic & Zisserman, 2017, 2018; Owens & Efros, 2018; Zhao et al., 2018).

The when and where constructs are fundamental pillars of CMI for developing a holistic understanding of a scene, as they direct us to locate the action of interest in time and space. In this paper, we address the problem of visual speech event localization in (Hollywood) movies, which essentially detects speech activity in space (where), signifying active speakers' faces in the visual frames.
Inspired by the cross-modal integration in humans, and to address the challenges of partial observability and dynamic variability of the audio and visual modalities, we developed a cross-modal neural network that can efficiently fuse the complementary information of the visual and audio modalities to effectively localize a visual speech event.

In our preliminary work (Sharma et al., 2019), we introduced a cross-modal problem formulation for the task of visual voice activity detection. We proposed a 3D convolutional network that observes the raw visual frames of a video segment and predicts the posterior for segment-level audio voice activity detection (VAD). We further established that the learned embeddings were capable of localizing humans in the visual frames. In this work, we further advance the proposed framework for localizing active speakers in space (visual frames). The novel contributions reported include the following:

1. We introduce an enhanced cross-modal architecture consisting of 3D convolutional neural networks (CNNs) and stacked convolutional Bi-LSTMs. This enables the system to capture multi-scale temporal context and introduces an ability to learn hierarchical abstractions in the presented information. The presence of convolutional operations throughout the architecture, in the CNNs as well as in the Bi-LSTMs, enables the system to preserve the spatiotemporal information, thus making it interpretable at several levels.

2. We present an end-to-end trainable cross-modal system for active speaker localization in visual frames, trained in a weakly supervised fashion. The proposed setup utilizes a multiple instance learning formulation designed for detecting the presence of speech in the audio while considering the locations of active speaker faces in the visual modality as the key instances.

3. We propose an audio-assisted active speaker detection formulation, which uses high-level information from the audio stream and integrates it with the active speaker posteriors obtained from the visual information, as a post-processing step. Furthermore, we evaluate the system's performance on three benchmark datasets comprising videos from movies (AVA active speaker dataset (Roth et al., 2020)), TV shows (Visual Person Clustering dataset (Brown, Kalogeiton, & Zisserman, 2021)), and a panel discussion (Columbia dataset (Chakravarty et al., 2016)), and demonstrate performance comparable to fully-supervised methods.

3.2 Related Work

3.2.1 Cross-modal learning

There has been a recent surge of studies focused on cross-modal machine perception, especially in media content analysis. The idea of cross-modal learning primarily revolves around modeling one modality guided by another. In (Somandepalli et al., 2018), the authors target video advertisement classification using cross-modal autoencoders, reconstructing one modality from the other. In more recent work, (H. Xu et al., 2020) proposed a cross-modal relation-aware network for audio-visual event localization involving a self-attention mechanism where the query is derived from one modality while the key-value pairs come from the other. Another work (Song et al., 2021) targets the problem of fake news detection using a cross-modal residual network, where the text modality guides the attention for learning visual representations and vice versa. In our earlier work (Sharma et al., 2019), we proposed a cross-modal problem setup for the task of visual VAD involving a hierarchically context-aware network (HiCA), which observes the visual frames and predicts the audio VAD labels.
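To make this cross-modal problem setup concrete, the following is a minimal sketch in Python (PyTorch). It is a toy stand-in, not the HiCA network or the architecture proposed in this chapter, and all names are illustrative: the model observes only the visual frames of a segment and is supervised with a segment-level voice activity label obtained from the audio track.

```python
import torch
import torch.nn as nn

class ToyVisualVAD(nn.Module):
    """Toy cross-modal setup: visual frames in, audio-derived VAD posterior out.
    Illustrative stand-in only, not the HiCA or the proposed architecture."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv3d(3, emb_dim, kernel_size=(3, 7, 7),
                              stride=(1, 2, 2), padding=(1, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d(1)      # global average pooling
        self.head = nn.Linear(emb_dim, 1)        # segment-level speech logit

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, time, height, width) raw visual frames
        feats = self.pool(torch.relu(self.conv(frames))).flatten(1)
        return self.head(feats).squeeze(-1)

model = ToyVisualVAD()
frames = torch.randn(2, 3, 16, 112, 112)      # two 16-frame clips (dummy data)
vad_labels = torch.tensor([1.0, 0.0])         # segment-level VAD labels derived from audio
loss = nn.functional.binary_cross_entropy_with_logits(model(frames), vad_labels)
loss.backward()   # the network never sees the audio; audio only supplies the supervision
```

Because only segment-level labels cross the modality boundary, no frame-level or bounding-box annotation is needed for training in this formulation.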
3.2.2 Weakly supervised object detection (WSOD)

WSOD refers to the training setup where only image-level labels are provided for supervision, as opposed to bounding box labels in fully-supervised scenarios. Recent research in WSOD can be broadly categorized into two directions: i) class activation maps (CAMs), and ii) multiple instance learning (MIL) based setups. CAM-based methods leverage the relationship between CNN embeddings and the class posteriors to compute localization maps. One of the earlier approaches (Bazzani et al., 2016) used the idea that the recognition score will drop if the object of interest is artificially masked out in the input image. The idea of CAMs (Zhou et al., 2016) was initially proposed to compute the discriminative image regions for a class of interest in the case of linear prediction layers. Grad-CAM (Selvaraju et al., 2017) was later introduced, generalizing CAMs by using the gradients of the posteriors with respect to the activations of the pertinent layer. Furthermore, GradCAM++ (Chattopadhay et al., 2018) introduced a weighted average of pixel-wise gradients to improve the coverage of detections and dealt with multiple occurrences of the same object.

MIL setups pose the input image for classification as a bag of instances, where the instances are object proposals. In an early attempt (Bilen & Vedaldi, 2016), a two-stream CNN was proposed, with one stream predicting bag scores while the other computes the instance-level scores. More recently, (Tang et al., 2017) proposed a multistage instance classifier (MIDN) to predict tighter object detection boxes, which was further enhanced to improve the coverage of detections by using two MIDNs (Gao et al., 2019). To alleviate the non-convexity issues associated with MIL, (F. Wan et al., 2019) proposed to use a combination of smoothed loss functions.

3.2.3 Active speaker localization

Earlier works (Everingham et al., 2006) in active speaker detection largely focus on using the activity in the lip region available in the visual modality. In other approaches (Afouras et al., 2020; Arandjelovic & Zisserman, 2018; J. S. Chung & Zisserman, 2016; Owens & Efros, 2018; Zhao et al., 2018), authors proposed to use the synchrony between cropped images of lip regions and the associated audio to determine active speakers. Furthermore, (Chakravarty et al., 2015) introduced the use of cues from upper body motion to determine an active speaker, which they further refined using personalized voice models (Chakravarty et al., 2016). Recently, (Roth et al., 2020) proposed a large-scale dataset (the AVA active speaker dataset), consisting of movies and the corresponding active speaker annotations, along with a baseline performance using a supervised framework. Several frameworks have since followed (Afouras et al., 2020; Alcázar et al., 2020; J. S. Chung, 2019; León-Alcázar, Heilbron, Thabet, & Ghanem, 2021; Y.-H. Zhang et al., 2019) for improving the performance on the AVA dataset, but all these works are restricted to supervised frameworks. To overcome the need for expensive annotations, (Afouras et al., 2020; J. S. Chung & Zisserman, 2016) proposed self-supervised frameworks trained for the task of audio-visual correspondence.

3.2.4 Sound source localization

The problem of active speaker localization falls within the general domain of sound source localization, but for a particular audio event: speech. The core idea driving the research in this direction is to exploit the existing audiovisual correspondence in the media content.
Earlier efforts (Barzelay & Schechner, 2007; Fisher III et al., 2001; Hershey & Movellan, 2000) used canonical correlation analysis to model the audio-visual correspondence. Recent research has been dominated by self-supervised deep learning methods, where researchers try to capture the audio-visual correspondence using various proxy tasks. One such proxy task (Zhao et al., 2019, 2018) uses the additive nature of audio and reconstructs the sound for each pixel by learning a mask for the audio spectrogram. Another proxy task (Owens & Efros, 2018) predicts the time alignment of a given audio and video pair. The work by (Arandjelovic & Zisserman, 2018) used the audio-visual correspondence to predict a localization score for every pixel, and (Afouras et al., 2021) extended the same formulation for object detection. Furthermore, (H. Xu et al., 2020) proposed a cross-modal attention mechanism for audio event classification and used the learned attention for modeling the localization task. The majority of these works qualitatively established the localization ability gained from the inherent audio-visual correspondence but lack quantitative evaluation. In this work we present a qualitative as well as thorough quantitative analysis of the acquired localization ability of the visual embeddings.

Figure 3.2.1: The cross-modal architecture with 3D CNNs and stacked convolutional Bi-LSTM layers. The network observes the raw visual frames and is trained to predict the presence of speech (PoS) activity in the audio modality.

3.3 Methodology

3.3.1 Cross-modal visual representations

Problem Formulation

The work in this paper is especially motivated by the application of active speaker localization in media content such as entertainment media, notably Hollywood movies. From a computer vision perspective, movie videos are challenging due to the rich variety and high dynamics of the content, with a potentially variable number of persons in both the foreground and the background. Supervised modeling of such videos requires large amounts of labeled data in the form of bounding-box annotations. In particular, training an audiovisual system for the person localization task in a supervised fashion requires large-scale bounding box annotations, which are tedious and expensive to acquire. Inspired by the recent success of cross-modal representations in understanding media content (Song et al., 2021; H. Xu et al., 2020), we formulate our problem in a cross-modal fashion where we model a function of the audio modality, i.e., talking/non-talking person, by directly observing the visual frames. This helps us circumvent the widespread issue of a drop in performance when jointly modeling multiple modalities compared to uni-modal systems (W. Wang, Tran, & Feiszli, 2020), which arises primarily due to the difference in the rate of generalization for different modalities.

In our preliminary work (Sharma et al., 2019), we trained a cross-modal network for predicting segment-level audio voice activity by using the visual information and established that the learned embeddings implicitly acquired a capability to localize humans in the visual frames.
Motivated by the attained localization ability, in this work we modified the cross-modal formulation described in (Sharma et al., 2019) such that the learned embeddings can localize active speakers in the visual frames. To do so, we propose a modified formulation of the learning task: predicting the presence of speech (PoS) for a video segment by observing the visual frames. For a given video segment v_i of t seconds, we define PoS as a step function of the duration of voice activity:

\mathrm{PoS} = \begin{cases} 1, & \text{duration of voice activity} > 0 \\ 0, & \text{duration of voice activity} = 0 \end{cases} \quad (3.3.1)

The task of predicting PoS is specifically chosen with the hypothesis that the neural network will assess the active speaker regions in the visual frames as the most salient cues for detecting the PoS in the video segment. Segment-wise voice activity labels (VAD) can be ambiguous for video segments with partially present speech. Depending on the definition of segment-wise VAD, the VAD label may be false for video segments with a small fraction of speech, even though active speaker faces are still present in the video segment. The PoS labels are more relaxed and help resolve such ambiguity in the video segment-level labels. In our experimental setup, we use data from Hollywood movies for training under this formulation and thus utilize the readily available movie subtitles to acquire the PoS labels, involving no manual annotations. The relaxed nature of the PoS labels (compared to VAD) makes it easier to obtain finer labels.

Formally, given a video V, we partition the video into smaller segments v_i of t seconds each. For each v_i, we acquire a label y_i indicating the PoS in the video segment. The network sees k such small segments at once and is trained for the mapping problem v_i → y_i. In the current setup, t = 1 sec and k = 10.

\{v_i, \ldots, v_{i+k}\} \rightarrow \{y_i, \ldots, y_{i+k}\}, \qquad y_i \in \{0, 1\} \quad (3.3.2)

Cross-modal network architecture

To model the visual signal in a cross-modal fashion, in preliminary work (Sharma et al., 2019) we introduced a Hierarchical Context-Aware (HiCA) architecture providing temporal context at different levels, modeling the short-term context using 3D CNNs and the long-term context using a Bi-LSTM. Furthermore, we qualitatively and quantitatively established that the trained representations were selective to human faces and the human body. In this work, we enhance the decentralized temporal context of the HiCA architecture by employing three stacked convolutional Bi-LSTMs on top of the 3D CNNs to provide multi-scale temporal context. The introduction of the stacked Bi-LSTMs is motivated by the fact that stacked LSTM networks introduce hierarchical levels of abstraction, as established in various works (Hermans & Schrauwen, 2013) in the field of Natural Language Processing. The convolutional Bi-LSTMs enable integrated interpretability for the architecture, since they preserve the spatial and temporal structure of the input. Such a model also enables visualizing the learned representations at different levels of the stacked Bi-LSTMs, allowing analysis of the learned hierarchical abstractions. The elaborated neural network architecture is shown in Figure 3.2.1.

The network is trained on a set of 268 Hollywood movies released during the period 2014-18. The videos are sampled at 24 frames per second and lowered in resolution to 180 x 360 pixels. The presence of speech labels are implicitly obtained using the readily available movie subtitles, since they correspond to the human speech dialogues present in the movies.
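To make the labeling step concrete, the following is a minimal sketch of how segment-level PoS labels can be derived from time-aligned speech intervals (e.g., subtitle alignments); the function names, the interval representation, and the optional duration tolerance (described just below) are illustrative assumptions rather than the exact tooling used in this work.

```python
# Hypothetical sketch: derive per-segment PoS labels (Eq. 3.3.1) from
# time-aligned speech intervals. Names and the tolerance parameter are assumptions.
from typing import List, Tuple

def overlap(a_start, a_end, b_start, b_end):
    """Duration of overlap between two intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def pos_labels(speech_intervals: List[Tuple[float, float]],
               video_duration: float,
               segment_len: float = 1.0,
               min_fraction: float = 0.0) -> List[int]:
    """PoS label per segment: 1 if voice activity covers more than
    min_fraction of the segment, else 0 (min_fraction=0 gives Eq. 3.3.1)."""
    labels = []
    t = 0.0
    while t < video_duration:
        seg_end = min(t + segment_len, video_duration)
        voiced = sum(overlap(t, seg_end, s, e) for s, e in speech_intervals)
        labels.append(int(voiced > min_fraction * (seg_end - t)))
        t += segment_len
    return labels

# Example: a 5-sec clip with speech from 1.2s to 2.4s
print(pos_labels([(1.2, 2.4)], 5.0))   # -> [0, 1, 1, 0, 0]
```

Grouping k = 10 consecutive labels then yields the training pairs of the mapping in Eq. 3.3.2.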
The obtained labels are coarse and do not employ any manual annotations. The subtitles are first processed to remove special sounds by removing the content quoted within [.] or {.}. It has been observed that the acquired subtitles are not accurately time-aligned with the audio. We used the Gentle forced aligner (https://lowerquality.com/gentle/), a Kaldi-based (https://kaldi-asr.org/) tool to align speech and text, which time-aligns the subtitles and audio and provides a confidence score with each alignment. We discard the parts of the videos that have not been aligned with high enough confidence (empirically determined). We further compute a binary label for each t-sec video segment using the presence of subtitles as a proxy for the presence of speech. To provide a tolerance for subtitle alignment errors, we assign a video segment a positive PoS label only if it has subtitles appearing for more than 10% of its duration.

Figure 3.3.1: Class activation maps for the positive class imposed on the input frames, showing the localization ability of the learned embeddings. Sample frames from videos of Row 1: AVA, Row 2: Friends, and Row 3: TBBT.

After pre-processing and time-aligning the subtitles with the audio, we obtained, on average, nearly 70% of the movie duration with a high enough speech-subtitle alignment confidence score. We used k = 10 and t = 1 sec, which were driven heuristically, ensuring that the CNNs and LSTMs observe enough temporal context to learn. Our training set consists of nearly 360 hours of video data, which comprises 130k samples (1.3 million video-label pairs, since each sample consists of 10 pairs). The network has been optimized to minimize the cross-entropy loss using an accelerated SGD optimizer for nearly 1 million iterations with a batch size of 8. The data consisting of the PoS labels for the 268 Hollywood movies will be publicly available to promote research.

Visualizing representations

The foremost factor motivating the use of convolutional networks throughout the cross-modal architecture is the ability of CNNs to enhance the interpretability of the learned embeddings. In this work we use an extension of Grad-CAM (Selvaraju et al., 2017) to 3D CNNs to visualize the information learned by the visual embeddings. We first differentiate the output sigmoid score, the posterior p̂ for PoS, with respect to each of the filter activations F_m of the pertinent convolutional layer with m filters. The obtained gradients are aggregated across the temporal and spatial dimensions to obtain the contribution of each filter towards the presence of speech event, as shown in Eqn. 3.3.3 (Z is an averaging factor). The filter activations of the convolutional layer in consideration are averaged in accordance with the computed weights α_m and rectified linearly to obtain the final class activation maps, C.

\alpha_m = \frac{1}{Z} \sum_i \sum_j \sum_k \frac{\partial \hat{p}}{\partial F_m^{ijk}}, \qquad C = \mathrm{ReLU}\Big(\sum_m \alpha_m F_m\Big) \quad (3.3.3)

The proposed cross-modal system consists of 3D CNNs and ConvBiLSTMs, which preserve the temporal and spatial information throughout the network, and thus enable the CAMs to take advantage of this spatio-temporal information to qualitatively analyze the abstracted information in the various convolutional layers. We compute the CAMs for the output of various layers for a given video segment and linearly interpolate them across the temporal and spatial dimensions to obtain maps matching the spatio-temporal resolution of the given video segment.
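The following is a minimal PyTorch-style sketch of the CAM computation in Eqn. 3.3.3 for a 3D convolutional activation tensor; the tensor shapes and variable names are illustrative assumptions, not the exact implementation used here.

```python
# Hypothetical sketch of Eq. 3.3.3: Grad-CAM extended to 3D activations.
# `activations` is the output of a chosen 3D conv layer with shape (M, T, H, W)
# and `pos_posterior` is the scalar sigmoid output for PoS.
import torch
import torch.nn.functional as F

def grad_cam_3d(activations: torch.Tensor, pos_posterior: torch.Tensor,
                out_size=None) -> torch.Tensor:
    # Gradients of the PoS posterior w.r.t. every filter activation F_m^{ijk}
    grads = torch.autograd.grad(pos_posterior, activations, retain_graph=True)[0]
    # alpha_m: average the gradients over the temporal and spatial axes (1/Z sum)
    alphas = grads.mean(dim=(1, 2, 3))                      # shape (M,)
    # C = ReLU(sum_m alpha_m * F_m): weighted sum of filter maps, rectified
    cam = F.relu((alphas[:, None, None, None] * activations).sum(dim=0))
    if out_size is not None:
        # Interpolate to the spatio-temporal resolution of the input segment
        cam = F.interpolate(cam[None, None], size=out_size,
                            mode="trilinear", align_corners=False)[0, 0]
    return cam  # shape (T, H, W) or `out_size`
```

The same routine can, in principle, be applied to the output of any convolutional or ConvBiLSTM layer to compare the abstractions they learn, as done in the qualitative analysis below.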
Fig. 3.3.1 shows the positive-class activation maps for selected key-frames from various videos not seen by the network earlier, and shows that the CAMs provide a non-trivial signal for localizing active speakers. The corresponding video segments are provided in the supplementary materials.

3.3.2 Weakly supervised active speaker localization

In this section, we propose a systematic method to utilize the learned visual embeddings for active speaker localization in a weakly supervised manner. We present a multiple instance learning (MIL) setup optimized for the proxy task of the presence of speech (PoS) by observing the learned cross-modal embeddings (§3.3.1) and face bounding boxes, modeling the active speaker faces as the key instances. The setup is inspired by recent works in weakly supervised object detection (Gao et al., 2019; Z. Ren et al., 2020; Tang et al., 2017; F. Wan et al., 2019).

MIL problem formulation

Multiple instance learning falls under the domain of supervised learning scenarios where, given labeled bags, each bag consisting of multiple instances, we learn a mapping from bags to labels. Particularly for a binary classification task, the bag is assigned a positive label if at least one of the instances in the bag is positive, and a negative label otherwise.

Figure 3.3.2: The overview of the weakly supervised active speaker localization system. The MIL framework observes the visual representations from the cross-modal architecture and face bounding boxes for each frame, and is trained to predict the presence of speech in a video segment.

This scenario fits appropriately with our problem formulation described in §3.3.1. We define a small video segment, v_t, as a bag, and all the faces appearing during the video segment as instances. We train the system for the presence of speech (PoS) events, thus assigning v_t (the bag) a positive label only if at least one of the faces (instances) corresponds to the active speaker. The proxy task of PoS in the MIL setup is specifically chosen so as to make it consistent with the earlier cross-modal setup (§3.3.1). With such a setup, the cross-modal architecture observes the raw visual frames and is trained for the PoS labels. The output embeddings, along with face proposals, further become the input for the MIL system, which is also trained for the PoS task. This consistency in the two setups' learning tasks makes them compatible to be trained in an end-to-end fashion: the combined CNN + MIL architecture observes the raw visual frames and predicts the PoS tags. Due to computational constraints, in this work we restrict ourselves to training the two components separately.

Implementation details

The MIL system observes the visual representations, obtained from the last convolutional layer output of the proposed cross-modal architecture (§3.3.1), and face detection boxes obtained using the RetinaFace (Deng, Guo, Ververas, Kotsia, & Zafeiriou, 2020) detector, for each frame, to predict the PoS in the video segment. The face detections are sampled to match the temporal resolution of the visual representations, which is 6 fps.
We employ an ROI pooling layer (X. Wang, Shrivastava, & Gupta, 2017) to generate instance-level descriptors, which then pass through a set of fully connected layers terminating in a sigmoid activation layer. Thus, we obtain the instance-level predictions, which are further pooled using a modified linear softmax (Y. Wang, Li, & Metze, 2019) to produce the bag-level predictions. Since we froze the weights of the cross-modal (HiCA) architecture while training the MIL system, we introduce a trainable 3D-CNN block that observes the visual embeddings from the HiCA architecture, and its output representations further act as the input to the MIL system. This enables fine-tuning of the initially learned embeddings for the MIL system. The complete architecture is shown in Fig. 3.3.2.

In a recent work (Y. Wang et al., 2019) it was suggested that in MIL system training, max-pooling of the instance posteriors to obtain the bag posteriors shows a selective behavior, highlighting one of the instances among all others. It was also pointed out that linear softmax pooling boosts the larger posteriors while suppressing the smaller posteriors at the same time. For the application of active speaker localization, we assume the case of non-overlapping speakers, i.e., there can be at most one active speaker in each frame. This requires the selective behavior of the pooling method, selecting one instance (face) at the frame level. Concurrently, there will likely be more than one frame in the video containing instances of active speakers. Thus we require linear-softmax-like behavior for inter-frame pooling, boosting the posteriors of the more confident frames. We propose to use a combination of the two pooling methods: we pool the instances in each frame using max-pooling to obtain the frame-level posteriors, and we further pool the frame-level posteriors using a linear softmax pooling operation to obtain the video-level posterior score. Let the instance posterior for the i-th face in frame f be denoted as ρ̂_fi. The bag posterior, ρ̂, is obtained as shown in Eq. 3.3.4. We optimize the MIL system for the cross-entropy loss, Loss_MIL, between the bag posteriors ρ̂ and the corresponding PoS labels (y).

\hat{\rho} = \frac{\sum_f (\max_i \hat{\rho}_{fi})^2}{\sum_f (\max_i \hat{\rho}_{fi})}, \qquad \mathrm{Loss}_{MIL} = \sum_{\text{bags}} -y \log(\hat{\rho}) \quad (3.3.4)
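As a concrete illustration of the pooling in Eq. 3.3.4, the following is a minimal PyTorch-style sketch, assuming the instance posteriors are given as a list of per-frame tensors; the tensor layout and function names are illustrative assumptions, and the loss is written out in the full binary cross-entropy form.

```python
# Hypothetical sketch of the hybrid MIL pooling (Eq. 3.3.4): max-pool the
# face-instance posteriors within each frame, then linear-softmax-pool the
# frame-level posteriors to get the bag (video segment) posterior.
import torch

def bag_posterior(frame_posteriors, eps: float = 1e-8) -> torch.Tensor:
    """frame_posteriors: list of 1-D tensors, one per frame, holding the
    sigmoid posteriors rho_{fi} of the faces detected in that frame."""
    frame_scores = torch.stack([p.max() for p in frame_posteriors])  # max_i rho_fi
    return (frame_scores ** 2).sum() / (frame_scores.sum() + eps)    # linear softmax

def mil_loss(frame_posteriors, y: float) -> torch.Tensor:
    """Binary cross-entropy between the bag posterior and the PoS label y in {0, 1}."""
    rho = bag_posterior(frame_posteriors).clamp(1e-6, 1 - 1e-6)
    return -(y * torch.log(rho) + (1 - y) * torch.log(1 - rho))

# Example: a 3-frame bag with 2, 3, and 1 detected faces respectively
frames = [torch.tensor([0.1, 0.8]), torch.tensor([0.2, 0.3, 0.9]), torch.tensor([0.4])]
print(bag_posterior(frames))   # weighted toward the more confident frames
```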
Figure 3.3.3: Example of a speaker-homogeneous speech segment and the corresponding temporally overlapping face tracks.

3.3.3 Audio-assisted active speaker detection

The proposed MIL framework provides a systematic way to obtain active speaker posteriors for each face bounding box in every frame, derived from just the visual information in the frames. Since active speaker detection is inherently a multi-modal task, in this section we propose a framework to combine the information from the audio modality with the visual modality (the obtained posteriors from the MIL framework) as a post-processing step.

We start by extracting the active voice regions from the audio, employing an off-the-shelf voice activity detector (Bredin et al., 2020). We further partition the obtained voiced regions such that each voice segment consists of speech from only one speaker; we call these speaker-homogeneous speech segments and denote the set of all such speech segments as S = {s_n}. We use simple heuristics rather than sophisticated neural network systems to obtain speaker-homogeneous speech segments. We partition the voice-active segments by the scene boundaries, inspired by speaker change being one of the prominent movie-cut attributes (Pardo, Heilbron, Alcázar, Thabet, & Ghanem, 2021); this decreases the likelihood of observing a speaker change within the obtained partitions. We further partition the obtained segments to have a maximum duration of 1 sec. Since the number of speaker changes in a video is constant, partitioning the entire audio into a larger number of (shorter) segments effectively reduces the fraction of speech segments containing a speaker change.

From the visual signal, we extract face tracks using RetinaFace (Deng et al., 2020) for face detection and SORT (Bewley, Ge, Ott, Ramos, & Upcroft, 2016) for tracking, and denote the set of all face tracks as F_all = {f_i}. Using the start and end times of the speech segments, s_n ∈ S, and the obtained face tracks, we collect the set of temporally overlapping face tracks for each s_n and denote them as F_n = {f_k}. We now formulate active speaker detection as a speech-face assignation task: selecting a face track for each speaker-homogeneous speech segment s_n, from the set of temporally overlapping face tracks f_k ∈ F_n, as the active speaker face. We show an example of a speaker-homogeneous speech segment and temporally overlapping face tracks in Figure 3.3.3. We denote the speech-face assignations as:

F_n = \{f_k \mid f_k \text{ temporally overlaps } s_n\} \quad (3.3.5)

(s_n \longleftrightarrow f_n) \mid f_n \in F_n : f_n \text{ is the active speaker face for } s_n \quad (3.3.6)

As a speech-face assignation criterion, we compute an active-speaker likelihood for each face track, f_k ∈ F_n, in the form of a score α_k: the average of the MIL framework-based posteriors ρ_i of the constituting face bounding boxes. We assign the face track with the maximum score α_k as the active speaker face for the speech segment s_n if the score is greater than a heuristically driven threshold (τ). On the contrary, the scenario where none of the temporally overlapping face tracks for s_n shows a significant active speaker likelihood, max(α_k) < τ, signifies the case of an off-screen speaker. The speech-face assignation process is described in Algorithm 1.

\alpha_k = \frac{1}{T} \sum_i \rho_i \quad (3.3.7)

s_n \longleftrightarrow f_n \mid f_n = \operatorname{argmax}_{f_k \in F_n} \alpha_k \;\text{ and }\; \alpha_k > \tau \quad (3.3.8)

Table 3.1: Fraction of overlapping speech in various datasets.
    Videos                          Fraction of overlapping speech
    AVA active speaker dataset      2.31%
    Friends (VPCD)                  1.39%
    TBBT (VPCD)                     4.42%

The proposed framework takes advantage of the higher-level information from the audio modality to post-process the face-box-level active speaker posteriors obtained from the MIL-based framework. By formulation, speech-face assignation enables two-fold constraints on the active speaker posteriors:

1. It restricts a face to be an active speaker's face only when speech is present in the audio modality.

2. Speech-face assignation being one-to-one enables the constraint that there can be only one speaker at any time, thus enforcing non-overlapping speech. This constraint is inspired by the nature of entertainment media videos, which are designed to have non-overlapping speech. In Table 3.1 we show the fraction of overlapping speech in various datasets and observe a marginal fraction for all.

Additionally, selecting one face track for the entire speech segment enforces continuity of the active speaker's face between frames and thus adds smoothness to the active speaker predictions.
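To make the assignation criterion concrete, here is a minimal Python sketch of Eqs. 3.3.7-3.3.8 (formalized in Algorithm 1 below); the data structures for segments, tracks, and posteriors are illustrative assumptions.

```python
# Hypothetical sketch of the speech-face assignation (Eqs. 3.3.7-3.3.8).
# Each face track carries the MIL posteriors rho_i of its face boxes that
# temporally overlap the speech segment; tau is the heuristic threshold.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class FaceTrack:
    track_id: str
    posteriors: List[float]   # MIL posteriors of the constituting face boxes

def track_score(track: FaceTrack) -> float:
    """alpha_k = (1/T) * sum_i rho_i  (Eq. 3.3.7)."""
    return sum(track.posteriors) / max(len(track.posteriors), 1)

def assign_speech_faces(overlapping: Dict[str, List[FaceTrack]],
                        tau: float = 0.5) -> Dict[str, Optional[str]]:
    """For each speaker-homogeneous segment s_n, pick argmax_k alpha_k if it
    exceeds tau, else mark the segment as having an off-screen speaker."""
    assignments = {}
    for seg_id, tracks in overlapping.items():
        if not tracks:
            assignments[seg_id] = None          # no face on screen
            continue
        best = max(tracks, key=track_score)
        assignments[seg_id] = best.track_id if track_score(best) > tau else None
    return assignments
```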
We want to emphasize that having only one speaker at any time enforces the non-overlapping speech constraint: although there can be more than one face present in the frames, only one of them can be an active speaker face.

Algorithm 1: Speech-face assignation framework
1   Obtain S = {s_n}                          // speaker-homogeneous speech segments
2   for each s_n ∈ S do
3       Obtain F_n = {f_k}                    // temporally overlapping face tracks
4       for each f_k ∈ F_n do
5           α_k = (1/T) Σ_i ρ_i               // face track scores
6       end
7       f_n = argmax_{f_k ∈ F_n} α_k
8       if α_n > τ then
9           s_n ←→ f_n                        // speech-face assignment
10      end
11  end

3.4 Experiments and evaluations

3.4.1 Qualitative analysis

Here we evaluate, from a qualitative perspective, the hypothesis that the visual embeddings learned to detect the presence of speech in the audio modality can localize active speakers in the visual frames. We visualize the salient regions in the video frames for the positive class, i.e., the presence of a speech event in the audio modality, using the methodology described in §3.3.1. Fig. 3.3.1 shows the CAMs imposed on the frames in the form of heatmaps and demonstrates that the positive class activations concentrate around human faces.

Multiple faces in a frame: An approach for generalized sound source localization was recently proposed by (Owens & Efros, 2018), where a neural network for audio-visual synchrony was trained, a framework that appears to be the closest to our case. This work presents class activation maps for various videos in AudioSet (Gemmeke et al., 2017) and shows that the learned audio-visual representations are selective to human faces and moving lips in the case of speech events. However, the majority of cases presented in that work consist of just one human face in the frame; moreover, they do not provide extensive quantitative analysis to support the claim. As we specialize the cross-modal embeddings for the presence of speech events, it becomes interesting to see what happens when more than one human face is present in a frame. In Fig. 3.4.1, we present the CAMs for selected frames corresponding to the positive class, for videos from various datasets not seen by the network during training. The frames are particularly selected to have more than one face present. The figure illustrates that the learned crossmodal embeddings are able to select the active speaker even when more than one face is present in the frame. The sample videos with imposed CAMs can be found in the supplementary material.

Importance of stacked LSTMs: In this work, we enhance the HiCA (Sharma et al., 2019) architecture with three additional stacked convolutional Bi-LSTMs, with the idea that this preserves the spatial and temporal information and achieves hierarchical abstraction. To qualitatively validate the advantages of the stacked convolutional Bi-LSTM layers, we compare the CAMs at two hierarchical levels: one at the end of the 3D convolutional network (Conv CAMs) and the other at the end of the stacked LSTMs (LSTM CAMs).

Figure 3.4.1: Illustration of the localization performance of the crossmodal embeddings, specifically for frames with more than one face. Row 1: AVA active speaker (movie), row 2: TBBT (TV show), row 3: Friends (TV show), and row 4: Columbia dataset.

Figure 3.4.2: Qualitative comparison of CAMs for the last convolutional layer against the last convolutional-LSTM layer for the case of speech and non-speech events (panels: Conv vs. LSTM CAMs; the red box marks the active speaker).

In Fig. 3.4.2, we present CAMs for the two hierarchical levels under two different scenarios:
1. Speech event: We present frames with more than one face present, from the set of speech events. To make it more informative, we manually marked the active speakers in the frames using a red box. It can be observed that the activations in the case of the Conv CAMs extend to the non-speaker faces as well, while the LSTM CAMs correct the activations to concentrate just on the active speaker.

2. Non-speech event: In this scenario, we present the CAMs for frames corresponding to non-speech events. We observe that the Conv CAMs concentrate on the faces visible in the frames irrespective of their activity, while the LSTM CAMs correct the undesired activations and are selective to speech events.

It can be inferred that the group of 3D convolutional layers is selecting the available faces in the frames, and the stacked LSTMs, as they can observe a longer context, are narrowing down to selecting the active speakers.

3.4.2 Quantitative Analysis

In the earlier section, we visually established that the learned cross-modal visual representations can successfully localize active speakers. This section formally quantifies the performance of the embeddings for active speaker detection using various benchmark datasets. This work targets the active speaker detection problem for videos in entertainment media; thus, we focus on evaluating the performance of the proposed system on datasets consisting of videos from movies and TV shows. We use three widely used datasets for evaluation purposes: i) the AVA active speaker dataset (Roth et al., 2020) (movies), ii) the Visual Person Clustering dataset (Brown et al., 2021) (TV shows), and iii) the Columbia dataset (Chakravarty et al., 2016) (a panel discussion).

We employ the MIL framework (§3.3.2) to obtain face-wise active speaker posteriors, which uses the cross-modal visual representations (§3.3.1) obtained by window-wise inference with a window length of 10 sec and a stride of 0.5 sec. The proposed audio-assisted system (§3.3.3) provides the speech-face assignations utilizing the face-track-wise active speaker likelihood computed as the mean of the obtained face-wise posteriors of the constituent faces. For the face tracks temporally overlapping with any speech segment, we extend the face track active speaker likelihood score α_k to all the face bounding boxes constituting the overlapping part. On the contrary, all other face boxes, including the constituents of face tracks not overlapping with any speech segment, are scored 0. We use the altered scores for the face boxes to compute the mean average precision (mAP) to report the system's performance. We report the performance of two trivial baselines and an audio-visual self-supervised system along with the proposed system's performance to add further context.

• Random face baseline: Using the audio-assisted framework, instead of using the active speaker likelihood score derived from the MIL-based framework, we select a face track randomly from the temporally overlapping face tracks and assign it as the active speaker face for the underlying speaker-homogeneous speech segment.

• Largest face baseline: Using the same audio-assisted framework, we select the largest face track, in terms of area, among the temporally overlapping face tracks for each speaker-homogeneous speech segment as the active speaker face. Entertainment media videos focusing on speaking characters is the primary motivation behind such a baseline.
• Syncnet: Syncnet (J. S. Chung & Zisserman, 2016) is the closest to this work; it employs a self-supervised framework trained for audio-visual synchronization. It inherently studies the visual activity of every face and its synchronization with the underlying speech to provide an active speaker score for each face box. We also present the performance of Syncnet assisted with the proposed audio post-processing (§3.3.3): we use the audio-assisted framework and select the face track with the maximum Syncnet score, computed as an average of the constituent face boxes' scores.

To provide insight into the evaluation datasets, we present the distribution of the number of faces in each frame. Figure 3.4.3 shows the plots for all video datasets and indicates that a significant portion of the videos has more than one face on the screen, emphasizing the non-trivial nature of the videos. Particularly for the Columbia dataset, we observe that the majority of the frames have multiple faces, making it a difficult scenario for the trivial baselines.

Figure 3.4.3: Distributions of the number of faces in each frame for various datasets.

AVA active speaker dataset

The AVA active speaker dataset (Roth et al., 2020) is one of the few large-scale benchmark datasets for active speaker detection. It consists of face-bounding-box-wise active speaker annotations for a 15-min duration of each of 161 international movies. We use the official implementation provided by the authors to compute the mean average precision and report it for the proposed audio-assisted MIL-based strategy and the mentioned baselines in Table 3.2. We also present the performance of other state-of-the-art methods in Table 3.2. We observe that the proposed audio-assisted MIL-based framework outperforms the random face baseline and the largest face baseline by a significant margin. It even outperforms the solid Syncnet baselines; the primary reason is AVA's noisy audio conditions. In nearly 65% of the data, when a speaker is visible, the audio is accompanied either by noise or music (Roth et al., 2020), which affects Syncnet's performance. The proposed system, relying on visual embeddings, is relatively indifferent to the audio noise.

Table 3.2: Comparison with the state-of-the-art methods on the AVA active speaker validation set in terms of mean average precision.
    Methods                                                     Strategy            mAP (%)
    Syncnet (J. S. Chung & Zisserman, 2016)                     Self-supervised     40.5
    Roth et al. (Roth et al., 2020)                             Supervised          79.2
    Zhang et al. (Y.-H. Zhang et al., 2019)                     Supervised          84.0
    Alcazar et al. (Alcázar et al., 2020)                       Supervised          87.1
    Chung et al. (J. S. Chung, 2019)                            Supervised          87.8
    Alcazar et al. (Alcazar, Cordes, Zhao, & Ghanem, 2022)      Weakly supervised   76.2
    Random Face                                                 Unsupervised        47.2
    Largest face                                                Unsupervised        51.0
    Audio-assisted Syncnet (J. S. Chung & Zisserman, 2016)      Self-supervised     59.7
    Audio-assisted MIL                                          Weakly supervised   67.3

We note that the proposed system is not up to the mark with other state-of-the-art methods, primarily due to the significant difference in the employed strategies. The systems mentioned in Table 3.2 (Alcázar et al., 2020; J. S. Chung, 2019; Roth et al., 2020; Y.-H. Zhang et al., 2019) employ an early integration of audio and visual information and model active speaker detection as a fully-supervised task, training the systems on the AVA active speaker dataset. On the contrary, the proposed system models the visual information without using active speaker annotations and utilizes the higher-level audio information for post-processing.
Additionally, the movies in the AVA active speaker dataset are international movies shot earlier; thus, they differ from the contemporaneous Hollywood movies (used to train the proposed cross-modal network) in terms of cinematography and illumination conditions, further affecting the system's performance.

Visual Person Clustering Dataset

To evaluate the proposed system's performance on videos aligning more closely with entertainment media, we use the videos from the Visual Person Clustering Dataset (Brown et al., 2021), which consists of annotations for widely watched Hollywood TV shows and movies. The annotations constitute manually verified bounding boxes for instances of primary characters (along with identity) and timestamps of their corresponding speech activity. Although VPCD covers a wide variety of videos from movies (Hidden Figures and About Last Night) and TV shows (Friends, TBBT, Sherlock), the annotations are limited to primary characters. Thus, the videos with more secondary characters have non-exhaustive annotations. Due to the limited number of characters in TBBT and Friends, the VPCD annotations exhaustively cover all active speaker instances; we thus use the videos from TBBT (6 episodes) and Friends (25 episodes) for evaluation.

Table 3.3: Performance comparison of the audio-assisted MIL-based framework with the baselines on videos from VPCD, reported in % mean average precision (mAP).
    Methods                                                     Friends (25 episodes)   TBBT (6 episodes)
    Random face                                                 52.8                    60.8
    Largest face                                                59.6                    66.3
    Syncnet (J. S. Chung & Zisserman, 2016)                     63.5                    70.4
    Audio-assisted Syncnet (J. S. Chung & Zisserman, 2016)      77.1                    80.0
    Audio-assisted MIL                                          75.8                    81.6

Table 3.4: Comparison of the speaker-wise weighted F1 scores (%) for all the speakers in the Columbia dataset.
    Methods                                          Abbas   Bell   Boll   Lieb   Long   Sick   Avg
    Chakravarty et al. (Chakravarty et al., 2016)    -       82.9   65.8   73.6   86.9   81.8   78.2
    Shahid et al. (Shahid, Beyan, & Murino, 2019)    -       89.2   88.8   85.8   81.4   86.0   86.2
    Syncnet (J. S. Chung & Zisserman, 2016)          -       93.7   83.4   86.8   97.7   86.1   89.5
    Afouras et al. (Afouras et al., 2020)            -       92.6   82.4   88.7   94.4   95.9   90.8
    S-VVAD (Shahid, Beyan, & Murino, 2021)           -       92.4   97.2   92.3   95.5   92.5   94.0
    Random Face                                      63.8    54.2   57.1   53.3   49.5   51.9   55.0
    Largest Face                                     96.9    41.7   71.4   98.6   41.3   33.8   64.0
    Audio-assisted MIL                               82.7    73.6   61.8   81.7   81.1   82.7   77.3

In Table 3.3 we report the performance for the baselines and the audio-assisted MIL-based framework, averaged over the episodes of the TV shows. We note that the proposed system outperforms the random and largest face baselines and the native Syncnet by a significant margin, consistently for all the videos, while performing comparably to the audio-assisted Syncnet baseline. We point out that Syncnet and the proposed system show significantly superior performance for the TV shows compared to the AVA videos, which we attribute to the structured nature of TV shows against the in-the-wild AVA videos.

Columbia Dataset

The Columbia dataset consists of active speaker annotations for a 35-min duration of a video recording of a panel discussion featuring six speakers, where the speakers take turns. Notably, the scenario of a panel discussion differs from entertainment media videos in the form of fewer shot changes and lower dynamics in camera movement, and it focuses on capturing the speaker, particularly with frontal face poses. In Table 3.4 we present the performance of the system and baselines along with various other state-of-the-art systems.
The performance is reported using a weighted F1-score for each speaker, a widely used metric for the Columbia dataset.

Figure 3.4.4: Performance comparison of audio-assisted and visual-only formulations for videos from VPCD and AVA, in mean average precision (%). (Visual-only: 69.7, 76.1, 51.5; Audio: 75.3, 81.5, 63.7; Audio-assisted (Proposed): 75.8, 81.6, 67.3; for Friends, TBBT, and AVA, respectively.)

We note that the audio-assisted MIL-based framework consistently performs significantly better than the random face baseline for all speakers. The largest face baseline for two speakers, Abbas and Lieb, performs significantly better than any other system. This is coincidental, due to the position of the speakers relative to the camera; additionally, static camera shots play a significant role. For instance, when Abbas takes the turn to speak, the camera pans into a frame where Abbas's face is marginally bigger than the others due to the camera position. It remains so throughout his turn; thus, the largest face baseline gives a high F1 score. The same is true when Lieb takes his turn. On the contrary, the proposed system performs reasonably well for all the speakers. Again, all the other methods in some form model the audio-visual information in a fully-supervised manner, though not necessarily training on the Columbia dataset. This naturally leads to the displayed higher performance of these systems (Afouras et al., 2020; Chakravarty et al., 2016; J. S. Chung & Zisserman, 2016; Shahid et al., 2019, 2021).

3.4.3 Ablation studies

Audio-assisted formulations

In this work, to take advantage of the high-level audio information, we posed the problem of active speaker detection as a speech-face assignation task. To each speaker-homogeneous speech segment, we assign an active speaker face track, from the set of temporally overlapping face tracks, based on the MIL-based posteriors of the constituting face boxes. In this section, we compare the performance of the proposed formulation with two baseline formulations:

• Visual-only baseline: Relying on information from just the visual signal, we use the face-wise posteriors obtained from the MIL-based framework as the active speaker scores for each box. Using no further post-processing step, we report the performance in terms of mean average precision.

• Audio post-processing: We obtain the face-wise posterior scores from the MIL-based framework and the voice-active regions in a video using an off-the-shelf VAD system (Bredin et al., 2020). As a post-processing step, we set the posteriors of the boxes lying outside the VAD regions to 0, while keeping the ones in the VAD regions unaltered. In addition to the MIL posteriors, this baseline imposes the constraint that active speaker faces lie in the voice-active regions only (a minimal sketch of this masking step is given at the end of this comparison).

In Figure 3.4.4, we present the performance comparison of the above baseline formulations and the proposed speech-face assignation framework. We observe a significant performance enhancement with the audio post-processing on top of the MIL-based face-wise posteriors for all the videos, signifying the importance of the constraint that an active speaker's face can only be present if there is speech in the audio modality. Furthermore, the proposed speech-face assignation framework consistently improves the performance further for all videos, and more significantly for the AVA videos. The observed advantage is due to the additional constraint of having at most one speaker at any time (non-overlapping speech) and the speaker continuity through the speaker-homogeneous speech segment offered by the proposed formulation.
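As a concrete illustration of the audio post-processing baseline described above, here is a minimal sketch that masks face-box posteriors falling outside the VAD regions; the data layout is an illustrative assumption.

```python
# Hypothetical sketch of the audio post-processing baseline: zero out the
# MIL posteriors of face boxes whose timestamps fall outside any VAD region.
from typing import List, Tuple

def mask_posteriors(box_times: List[float], posteriors: List[float],
                    vad_regions: List[Tuple[float, float]]) -> List[float]:
    def in_vad(t: float) -> bool:
        return any(start <= t <= end for start, end in vad_regions)
    return [p if in_vad(t) else 0.0 for t, p in zip(box_times, posteriors)]

# Example: boxes at 0.5s and 2.0s, speech only between 1.5s and 3.0s
print(mask_posteriors([0.5, 2.0], [0.9, 0.7], [(1.5, 3.0)]))  # -> [0.0, 0.7]
```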
We also note the significant improvement in Syncnet performance when assisted with audio using the proposed system, for all the videos (Tables 3.2 and 3.3), even though Syncnet uses an early-stage audio-visual fusion.

Effect of VAD performance

The proposed audio-assisted framework relies on an off-the-shelf VAD system (Bredin et al., 2020) and uses simple heuristics to obtain speaker-homogeneous speech segments. In this section, we explore the impact of the VAD system's performance on active speaker detection. We compare the performance of the audio-assisted active speaker detection system with an ideal-case scenario where we acquire speaker-homogeneous speech segments using the ground truth. Unlike AVA, the annotations in VPCD contain timestamps for character-wise speech segments, which enables obtaining ideal-case speaker-homogeneous speech segments; thus we use videos from VPCD for this study.

Figure 3.4.5: Performance comparison of the audio-assisted system (using system VAD) against the oracle speaker-homogeneous speech segments. (Visual-only: 76.1, 69.7; Proposed System: 81.6, 75.8; Oracle: 89.0, 82.4; in mean average precision (%) for TBBT and Friends, respectively.)

In Figure 3.4.5, we show the performance of the proposed audio-assisted system and the one with oracle speaker-homogeneous speech segments. To add further context, we plot the performance of the earlier described visual-only system, utilizing no audio post-processing step. We note that there is a significant difference between the performance of the proposed system and the ideal-case scenario, indicating scope for improvement in active speaker detection performance with advancements in VAD systems.

Effect of face size

Here we investigate the effect of the face sizes in the videos on the system's performance. We divide all the faces in a video into three sets:

• Small: faces that occupy less than 1% of the frame.
• Medium: faces that occupy between 1-5% of the frame.
• Large: faces that occupy greater than 5% of the frame.

Figure 3.4.6: Performance of the audio-assisted framework for 3 groups of face sizes: small (<1%), medium (1-5%), and large (>5%), for VPCD and AVA. (Small: 44.0, 55.6, 35.0; medium: 82.1, 86.6, 49.2; large: 79.2, 84.3, 77.9; in mean average precision (%) for Friends, TBBT, and AVA, respectively.)

In Figure 3.4.6, we present the proposed audio-assisted system's performance for face boxes in the three groups. We observe that the performance for the small boxes is significantly lower than the others for all the datasets. The reason is the lower resolution of the final convolution layer, which is 12 × 23: small boxes constituting less than 1% of the frame end up being represented by at most 2 points in the embedding space. The lack of representative information for small boxes leads to degraded performance. The performance for medium and large boxes is nearly equivalent for the TV shows, while more enhanced for large faces in the AVA videos.

Figure 3.5.1: Sample frames showing positive-class activation maps for animated videos, providing a non-trivial signal for animated characters.

3.5 Summary and Future Work

In this paper, we present a cross-modal framework for learning visual representations capable of localizing an active speaker in the visual frames. We further formalized a system for active speaker localization in a weakly supervised manner, requiring no manual annotations. The consistency in the problem formulation for the cross-modal network and the MIL setup makes the system end-to-end trainable.
We evaluated the performance of audio-visual speech event localization on three benchmark datasets comprising a variety of videos and demonstrated good active speaker detection performance, given its weakly-supervised formulation. The presented system is self-contained in the sense that it can be adapted to any domain in a straightforward manner: it requires no manual annotations, just coarse voice activity labels, which can be obtained using off-the-shelf VAD systems. One of the immediate extensions of our work is to adapt the system to animated content for animated character discovery, as illustrated in Fig. 3.5.1. Since the system connects the speech to its spatial source in the visual frames, it can be extended to jointly model the audio and visual modalities for diarization.

Chapter 4: Unsupervised cross-modal identity association for establishing speech-face correspondences

This work has been submitted to IEEE Transactions on Image Processing and is under review. It is available on arXiv: Sharma, Rahul, and Shrikanth Narayanan, "Unsupervised active speaker detection in media content using cross-modal information," arXiv preprint arXiv:2209.11896 (2022).

4.1 Introduction

Speech-related activity in a video manifests spatio-temporally in both the audio and visual modalities. Mapping speech activity in the temporal domain, related to when, in time, someone is speaking, is often approached from the audio modality. On the other hand, in the spatial domain, determining where in a frame the speaker is present relies on the visual signal. In this work, we use the speech activity information obtained from the audio modality to find the corresponding speech activity in the visual modality. We address the problem of finding the speaking face (if any) from the set of candidate faces appearing in the video. We refer to this task as active speaker detection (ASD).

ASD plays a crucial role in computational media intelligence (CMI) (Somandepalli et al., 2021), enabling the study of the representation and portrayals of characters in media and their societal impact. Specifically, ASD is a crucial component of tools to discern who is speaking when in a video, such as for primary character detection, background character detection (Sharma & Narayanan, 2022a), speaker diarization (Bredin & Gelly, 2016; J. S. Chung, Huh, Nagrani, et al., 2020; Sharma & Narayanan, 2022c), and other human-computer-interaction applications. This work focuses on ASD in entertainment media content, notably Hollywood movies and TV shows.

ASD in entertainment media content is a challenging task due to the unknown and varying number of speakers and faces in the visual frames. The speaking faces appear in diverse poses because of the rich dynamics of character interactions. Face detectors fail when speakers appear in extreme poses, including looking away from the camera or moving out of the frame. Non-ideal conditions in the audio modality, such as background (including off-screen) speakers and noisy speech, also contribute to the problem's complexity.

The inherent multimodal nature of speech activity has motivated numerous approaches for active speaker detection that use the connections between the visual and audio modalities. The idea of localizing the visual actions in the video frames responsible for the audio activity, notably speech, dominates these approaches.
Audio-visual speech modeling typically involves capturing the activity in the lip region of the detected faces in the visual frames, explicitly (Bendris, Charlet, & Chollet, 2010; Bredin & Gelly, 2016) or implicitly (Owens & Efros, 2018; Zhao et al., 2018), and studying its correlation with the speech waveform to discern the active speaker. Most of these approaches share the same drawbacks: i) they employ computationally expensive 3D convolutional neural networks and recurrent neural networks to model the visual activity; ii) they train the models in a fully-supervised manner requiring manual annotations, which are tedious and expensive to acquire; and iii) a severe limitation is that the visual activity of facial gestures from other activities, such as laughing, chewing, etc., can easily be confused with speech activity, thus adding false positives.

To complement the noisy visual activity information, we propose to use speaker identity information for active speaker face-voice assignment. Using the knowledge that face and voice both possess information about a speaker's identity, we present an unsupervised framework that exploits the fact that both the active speaker's face in the visual modality and the concurrent voice in the audio modality refer to the same speaker. We use well-researched and readily available face and speech recognition systems to represent the faces and speech, respectively, so that both capture the speaker's identity. Using the obtained representations, we construct a speech-identity distance matrix, capturing the relative identity structure of the speech segments through the entire video. Figure 4.1.1a shows an example of such a distance matrix. We assign an active speaker face to each speech segment such that the obtained set of corresponding active speaker faces displays a similar relative identity structure in the form of a face-identity distance matrix; an example is shown in Figure 4.1.1b. Figure 4.1.2 shows an overview of the proposed strategy. Furthermore, studying the similarity between the two distance matrices (speech and face), we present a simple strategy to address the speech segments with speakers present off-screen.

We evaluate the system's performance for active speaker detection on three benchmark datasets: i) the AVA active speaker (Roth et al., 2020) and ii) Visual Person Clustering (Brown et al., 2021) datasets, consisting of videos of TV shows and movies, and iii) the Columbia (Chakravarty & Tuytelaars, 2016) dataset, consisting of a panel discussion video. The implementation code and the required dependencies to reproduce all the results in this work will be publicly available. The contributions of this work are as follows:

1. We present a novel framework to harness the speaker identity information, present within the speech in the audio modality and the speaker's face in the visual modality, for the speech-face assignment task. This framework can complement state-of-the-art visual-activity-based active speaker detection systems, as the acquired information does not depend solely on the visual activity present in a frame.

2. We formulate the active speaker detection task as a cross-modal unsupervised optimization task deriving information from pretrained speaker recognition and face recognition models. This eliminates the need for manual annotations and does not require additional training of any expensive models.

3.
We present an extensive performance evaluation of the system, validating the strategy's applicability to various challenging domains, including videos from TV shows, Hollywood movies, international movies, and panel discussions. Furthermore, we present ablation studies to study the performance of the proposed system and to highlight potential limitations.

Figure 4.1.1: a) Speech-identity distance matrix (SD) and b) face-identity distance matrix (FD) for the movie Hidden Figures. The active speaker faces are acquired from the ground truth. The speech segments are ordered to gather the speech segments of each speaker together.

Figure 4.1.2: The overview of the proposed framework: We gather temporally overlapping faces for each speaker-homogeneous speech segment. Using the speaker recognition embeddings, we construct a speech distance matrix. From the set of possible sequences of corresponding active speaker faces, we select the sequence of faces such that the face-identity distance matrix, obtained using the face-recognition features, displays a high resemblance with the speech-identity distance matrix.

4.2 Related Work

An intuitive solution for detecting an active speaker is to study the activity in the lip region of faces in a video; this idea has led to a wide range of methods, from using just the visual signal (Bendris et al., 2010) to posing it as multi-modal (Arandjelovic & Zisserman, 2018; Bendris et al., 2010; J. S. Chung, 2019; J. S. Chung & Zisserman, 2016; Owens & Efros, 2018) and cross-modal (Chakravarty et al., 2015; Chakravarty & Tuytelaars, 2016; Sharma et al., 2019, 2022) formulations. One of the earlier works proposed explicitly detecting the lip region and quantifying the observed activity to classify faces as talking or non-talking in TV shows (Bendris et al., 2010). Along similar lines, sync-net (J. S. Chung & Zisserman, 2016) proposed jointly modeling the activity in the explicitly detected lip region and the concurrent audio stream using a ConvNet architecture and training it to detect speech-face synchronization, leading to a self-supervised active speaker detection system. Following the success of modeling audio-visual synchronization, various self-supervised methods have emerged targeting the task of sound source localization (Afouras et al., 2020; Arandjelovic & Zisserman, 2018; Owens & Efros, 2018; Zhao et al., 2018). These methods were qualitatively shown to detect an active speaker's face as the location of the speech sound event.

The nature of speech activity, interleaving the audio and visual modalities, has led to various cross-modal approaches, predicting primarily audio-centric tasks such as voice activity detection (Chakravarty et al., 2015; Chakravarty & Tuytelaars, 2016; Sharma et al., 2019, 2022) using the visual input. Earlier works (Chakravarty et al., 2015; Chakravarty & Tuytelaars, 2016) modeled the facial activity using HoG features and trained SVM-based classifiers using weak supervision from audio VAD labels.
More recently, the authors in (Sharma et al., 2019, 2022) introduced a more sophisticated 3D ConvNet architecture to model the visual frames and trained a weakly-supervised system for active speaker detection in the more challenging Hollywood movie video domain. Lately, the introduction of large-scale audio-visual datasets, such as AVA active speaker (Roth et al., 2020) and Active Speakers in the Wild (Y. J. Kim et al., 2021), consisting of manual active speaker annotations, has led to the emergence of a series of fully-supervised methods. These methods rely on the large-scale nature of labeled data to model the audio-visual activity by training a wide variety of massive ConvNet architectures, ranging from 3D convolutions (J. S. Chung, 2019; C. Huang & Koishida, 2020) to graph convolutional networks (Alcázar et al., 2020; León-Alcázar et al., 2021).

Unlike the extensively explored idea of modeling the visual activity, efforts in the direction of utilizing speaker identity information for active speaker detection are relatively limited. The work of Hoover et al. (Hoover, Chaudhuri, Pantofaru, Sturdy, & Slaney, 2018) proposed to cluster the speech and faces separately, such that each cluster contains instances of a single speaker, and then to derive information from the temporal co-occurrence of speech and face clusters to match the speech and face of the same speaker. However, for the audio-visual diarization problem (Park et al., 2022), the speaker's identity information from speech and faces is widely explored (Bredin & Gelly, 2016; J. S. Chung, Huh, Nagrani, et al., 2020; Kapsouras, Tefas, Nikolaidis, & Pitas, 2015), for instance, using the temporal co-occurrence of the speech and the face identities (obtained using face clusters) to derive the speech identities (Kapsouras et al., 2015). Inspired by the same, in this work we propose to use well-established speaker recognition (J. S. Chung, Huh, Mun, et al., 2020; J. S. Chung et al., 2018) and face recognition (Guo et al., 2016; Hu et al., 2018) systems to capture the speaker's identity information from speech and faces, and we rely on the speech and the corresponding active speaker's face identifying the same character to derive speech-face associations.

4.3 Methodology

4.3.1 Problem Formulation

The aim is to generate a speech-face association such that the associated face, if there is one, for every speech activity in the audio modality is the source of the underlying speech. We formulate the problem definition under the following assumptions: i) the video content under consideration is shot to capture the speaker's face, which stands valid especially in the case of entertainment media videos, and ii) at any point in time, there can be at most one person speaking; thus, no overlapping speech. To investigate the validity of assuming no overlapping speech, we tabulate the fraction of overlapping-speech occurrences for several videos from the VPCD and AVA active speaker datasets in Table 4.1. We observe that the fraction is insignificant for all the videos, thus validating the assumption.

Table 4.1: Fraction of overlapping speech (%) for various videos.
    Videos                 Fraction of overlapping speech (%)
    Friends                1.39
    TBBT                   4.42
    Sherlock               0.79
    Hidden Figures         0.44
    About Last Night       0.21
    AVA active speaker     2.31

We use contiguous segments of speech activity from one speaker in the audio modality as the fundamental unit of our analysis. By definition, such a voiced segment consists of the voice of only one speaker, and we call it a speaker-homogeneous voiced segment. We denote the set of all the speaker-homogeneous segments through a video as S_all ≡ {s_1 ... s_n, ... s_N}.
Similarly, we use a face-track (locations of one face appearing in consecutive frames) as the smallest unit for the visual modality. For each s_n, we gather the temporally overlapping face-tracks available in the visual modality and denote them as F_n ≡ {f_1, ..., f_k, ..., f_K}.

We call f_k the active speaker face for s_n, denoted as (s_n ←→ f_k), if face-track f_k in the video modality is the source of speech segment s_n in the audio modality. However, a speech segment, s_n, may have no face tracks available in the visual modality in its course, F_n = ϕ; therein, the speaker is off-screen. Such segments are not relevant to this work and hence are purged from S_all. We only consider the speech segments which have one or more face-tracks overlapping in time, whether the active speaker's face is on-screen (S_on) or off-screen (S_off), denoted in equations (4.3.1) and (4.3.2).

S_{on} \equiv \{ s_n \mid \exists f_k \in F_n : (s_n \longleftrightarrow f_k) \} \tag{4.3.1}

S_{off} \equiv \{ s_n \mid F_n \neq \phi \ \text{and} \ \nexists f_k \in F_n : (s_n \longleftrightarrow f_k) \} \tag{4.3.2}

We formulate the task of active speaker detection (ASD) as the problem of finding an active speaker face-track for all the speaker-homogeneous speech segments, if there exists one. We denote the set of active speaker faces, A, as:

A \equiv \{ a_n \mid a_n = f_k \ \text{and} \ f_k \in F_n \ \text{and} \ (s_n \longleftrightarrow f_k) \} \tag{4.3.3}

Such a formulation using the speaker-homogeneous speech segments helps reduce the ASD task to choosing one face track out of temporally overlapping tracks of (possibly multiple) faces for each speech segment. It also satisfies the constraint that active speaker faces can only be present when there is speech activity in the audio modality. At the same time, using a face track as one entity, rather than each face, helps ensure that there can be at most one active speaker face in a frame, which aligns with our assumption of non-overlapping speakers.

This formulation intends to find an active speaker face for each speech segment, s_n, but as mentioned earlier, the speech segments may have off-screen speakers (s_n ∈ S_off), which may introduce false positives. Since the information separating the speech segments into the sets S_on and S_off is not readily available, we propose a two-stage process. We first devise a system that provides an associated active speaker face-track for all speech segments in S_all ≡ S_on ∪ S_off and call it stage1. This is then followed by a system to remove the speech segments with off-screen speakers from the set S_all, to correct the introduced errors, and call it stage2.

4.3.2 Stage1: Speech-face Assignment

Speech in the audio modality and faces in the visual modality are widely known to possess an individual character's identity information. Utilizing this, we propose to select an active speaker face track for each speech segment s_n from the set of temporally overlapping face tracks, a_n ∈ F_n, imposing that the selected face and the underlying speech depict the same character identity. Since the identity captured in the audio modality (using speech) cannot be directly compared with the identity captured in the visual modality (using faces), we compute an identity distance matrix for the two modalities independently, bringing them into a shared space of relative identities.

We start with initializing the set of active speaker face tracks A by randomly selecting a face track for each speech segment from the set of face tracks temporally overlapping with the speech segment, a_n = f_k | f_k ∈ F_n ∀ s_n ∈ S_all. Then we construct representations for the speech segments and the active speaker faces to capture the character identity information.
For speech segments, we extract speaker recognition embeddings using a ResNet34 (He et al., 2016) pretrained on VoxCeleb2 (J. S. Chung et al., 2018) with an angular learning objective (J. S. Chung, Huh, Mun, et al., 2020). We represent the faces f_k using SENet-50 (Hu et al., 2018), pretrained on MS-Celeb-1M (Guo et al., 2016), widely used in face recognition systems.

Using the obtained representations, we construct a distance matrix for each modality. In the audio modality, we represent each speech segment, s_n, in terms of its distances from all other speech segments. We compute the distance between two speech segments s_i and s_j as the cosine distance between their speaker recognition embeddings. We call this matrix the speech-identity distance matrix, SD, and show it in equation 4.3.4. Similarly, in the visual modality, we construct a face-identity distance matrix, FD, representing each active speaker face, a_n, in terms of its distance from all other active speaker faces. We compute the distance between two active speaker faces a_i and a_j as the cosine distance between their obtained face recognition embeddings, as shown in equation 4.3.4. While computing these distance matrices, we preserve the order of the speech segments and corresponding active speaker faces, implying that the i-th row of SD and FD represents the speech segment, s_i, and the corresponding active speaker face, a_i, respectively.

SD[i,j] = \frac{s_i \cdot s_j}{\lVert s_i \rVert \lVert s_j \rVert} \qquad FD[i,j] = \frac{a_i \cdot a_j}{\lVert a_i \rVert \lVert a_j \rVert} \tag{4.3.4}

The use of identity-capturing speaker recognition representations for each speech segment, s_n, enables the distance matrix SD to capture the relative identity structure with context from the entire video. For instance, speech segments s_i and s_j from the same speaker will show a smaller value in SD[i,j] than those from different speakers. Similarly, the face-identity distance matrix FD constructed using the face recognition representations captures the relative identity of the active speaker faces in the visual modality. Figure 4.1.1 shows the SD and FD for the Hollywood movie Hidden Figures, where we have constructed the matrix FD using the ground truth active speaker faces. For visualization purposes, we gather the speech segments of each speaker together. Since the active speaker's face and the concurrent speech must identify the same person, we hypothesize that the relative identity structure captured by FD should have a high resemblance with SD. We observe this similarity in Figure 4.1.1, in the form of similar low-distance square formations along the diagonals of both matrices.

The i-th row in both matrices, depicted as SD[i] and FD[i], represents the identity of the underlying person (character) in terms of its distance from all other characters in the video. Since the components of the two rows come from distances computed in two independent dimensions (the audio and visual modality), they may differ in scale and thus may not be directly comparable. However, as the components of each row vector show the distance relative to other identities and all the identities are supposed to be the same across the two matrices, we expect the row vectors from the two matrices to show a linear relationship. So we propose to use Pearson's correlation as the measure to quantify the similarity between the two matrices.
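As an illustration of the computation described above, the two identity distance matrices and the row-wise Pearson measure can be sketched with standard numerical tooling. The snippet below is a minimal sketch, not the thesis's released code: it assumes the speaker and face recognition embeddings are already available as NumPy arrays, and the function and variable names are illustrative. Note that the row-wise Pearson correlation is unchanged whether both matrices store cosine similarities or cosine distances, since the same affine transform is applied to both rows.

import numpy as np
from scipy.spatial.distance import cdist

def identity_distance_matrix(embeddings):
    # Pairwise cosine distances between identity embeddings; row i describes
    # segment/face i relative to all others (its relative identity).
    return cdist(embeddings, embeddings, metric="cosine")

def row_wise_correlation(SD, FD):
    # Average of row-wise Pearson correlations between the two matrices (cf. eq. 4.3.5).
    corrs = []
    for sd_row, fd_row in zip(SD, FD):
        sd_c = sd_row - sd_row.mean()
        fd_c = fd_row - fd_row.mean()
        denom = np.sqrt((sd_c ** 2).sum() * (fd_c ** 2).sum())
        corrs.append((sd_c * fd_c).sum() / denom if denom > 0 else 0.0)
    return float(np.mean(corrs))

# speech_emb: (N, d_s) speaker recognition embeddings, one per speech segment
# face_emb:   (N, d_f) face recognition embeddings of the currently assigned faces
# SD = identity_distance_matrix(speech_emb)
# FD = identity_distance_matrix(face_emb)
# objective = row_wise_correlation(SD, FD)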
We formalize the ASD task as an optimization problem where we select the active speaker face for each speech segment from the set of temporally overlapping face tracks, such that the Pearson's correlation between the speech-identity matrix and the face-identity matrix, computed as an average of the row-wise correlations, is maximized. The formulation is denoted below in equation 4.3.5, where SD[i] and FD[i] denote the vectors corresponding to the i-th rows of SD and FD respectively and N is the number of rows.

Corr(FD) = \frac{1}{N} \sum_{i} \frac{\sum \big(SD[i] - \overline{SD[i]}\big)\big(FD[i] - \overline{FD[i]}\big)}{\sqrt{\sum \big(SD[i] - \overline{SD[i]}\big)^2 \sum \big(FD[i] - \overline{FD[i]}\big)^2}} \tag{4.3.5}

A \equiv \underset{\{a_n \,\mid\, a_n = f_k \ \text{and} \ f_k \in F_n\}}{\arg\max} \; Corr(FD) \tag{4.3.6}

The notable point is that the face-identity distance matrix, FD, is symmetric; the rows are not independent. A change in the active speaker's face for the i-th speech segment implies a change in the i-th row of FD, which is reflected in all the other rows of FD in the form of the altered i-th column. The changed FD further reflects a change in the objective function Corr(FD). Thus, in the ideal case scenario, the active speaker faces for all the speech segments should be jointly selected to maximize Corr(FD). Without loss of generality, let us consider that there are N speech segments in a video and, on average, k faces that temporally overlap each speech segment. The complexity of computing the objective function Corr(FD) for a given set of active speaker faces, A, would be O(N^2), and there would exist k^N such possible sets, A (all possible combinations of potential active speaker faces). Thus, maximizing Corr(FD) jointly for all active speaker faces will cost O(k^N N^2). For a typical Hollywood movie, k ≈ 3 and N ≈ 900, making the cost of joint optimization prohibitively high and out of the scope of current computing standards.

We propose to approximate the optimization process by iteratively finding the active speaker's face corresponding to one speech segment at a time to maximize the objective function Corr(FD), keeping the active speaker faces for all other speech segments fixed. Once we have iterated through all the speech segments (an epoch), we repeat the process until Corr(FD) converges. This approximation reduces the complexity of the optimization from an exponential to a cubic order of N, O(EkN^3), where E is the number of epochs, kN is the number of times Corr(FD) is computed in each epoch, and O(N^2) is the complexity of computing Corr(FD). Furthermore, assuming that even a portion of a movie has enough information to disambiguate between the speech and speakers' faces, we propose to split the speech segments into p partitions and optimize them separately. This helps further reduce the complexity of the process to O(EkN^3/p^2), where p > 1. The process is detailed in Algorithm 2.

Algorithm 2: Stage1: Speech-face assignation
1   Random initialization: A ≡ {a_n | a_n ∈ F_n} ∀ s_n ∈ S_all;
2   Compute SD and FD;                          // using eq. (4.3.4)
3   objective ← Corr(FD);                       // using eq. (4.3.5)
4   while objective increases do
5       for each a_i ∈ A do
6           a_i = argmax_{f_k ∈ F_n} Corr(FD);  // i-th row
7           Update FD;
8           Update A;
9       end
10      objective ← Corr(FD);
11  end

4.3.3 Stage2: Off-screen speaker correction

Maximizing the objective function, Corr(FD), we assign a corresponding active speaker face to all the speech segments under consideration, i.e., s_n ∈ S_all, which includes an active speaker face even for the speech segments having off-screen speakers, s_n ∈ S_off. This incorrect assignment of speakers' faces introduces false positives.
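For concreteness, the Stage-1 loop approximated by Algorithm 2 can be sketched as a coordinate-wise search over the speech segments. This is a minimal sketch under illustrative names, not the thesis's released implementation; it assumes the identity_distance_matrix and row_wise_correlation helpers from the earlier snippet, with face_track_emb[n] holding the face recognition embeddings of the candidate tracks in F_n.

import numpy as np

def stage1_assignment(SD, face_track_emb, max_epochs=10):
    # face_track_emb[n]: (K_n, d_f) embeddings of the face tracks overlapping segment n.
    # Greedy approximation of eq. 4.3.6: update one assignment at a time, keeping
    # all others fixed, until Corr(FD) stops increasing.
    N = len(face_track_emb)
    assign = [np.random.randint(len(cands)) for cands in face_track_emb]  # random init

    def build_FD():
        faces = np.stack([face_track_emb[n][assign[n]] for n in range(N)])
        return identity_distance_matrix(faces)

    best = row_wise_correlation(SD, build_FD())
    for _ in range(max_epochs):
        improved = False
        for i in range(N):                      # one pass over all segments = one epoch
            scores = []
            for k in range(len(face_track_emb[i])):
                assign[i] = k                   # try candidate track k for segment i
                scores.append(row_wise_correlation(SD, build_FD()))
            assign[i] = int(np.argmax(scores))
            if scores[assign[i]] > best:
                best, improved = scores[assign[i]], True
        if not improved:                        # Corr(FD) converged
            break
    return assign, best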
The underlying hypothesis in maximizing the correlation between the speech-identity and the face-identity distance matrices is that, since the active speaker's faces and the speech segments must identify the same person, the relative identity patterns from the speech-identity distance matrix (SD) and the face-identity distance matrix (FD) should show a high resemblance. In contrast, in the case of speech segments with the speaker present off-screen, none of the face tracks temporally overlapping with the speech segment corresponds to the active speaker's face. Thus the speech segment represents an identity different from any of the concurrent face tracks. We hypothesize that this discrepancy in identity leads to a relatively lower similarity between the identity representations obtained using the underlying speech segment and any of the temporally overlapping faces for the speech segments with off-screen speakers. Formally, we classify a speech segment, s_i, as a speech segment with an off-screen speaker if it displays a low enough row-wise correlation between the identity representations obtained from the speech-identity distance matrix, SD[i], and any of the temporally overlapping face tracks, FD[i], denoted in Eq. 4.3.7. Furthermore, we remove the face tracks earlier incorrectly classified as active speaker faces corresponding to such speech segments with off-screen speakers.

S_{off} \equiv \left\{ s_i \;\middle|\; \frac{\sum \big(SD[i] - \overline{SD[i]}\big)\big(FD[i] - \overline{FD[i]}\big)}{\sqrt{\sum \big(SD[i] - \overline{SD[i]}\big)^2 \sum \big(FD[i] - \overline{FD[i]}\big)^2}} < \tau \right\} \tag{4.3.7}

4.4 Performance Evaluation

4.4.1 Implementation details

We first obtain the active voice regions in a video using an off-the-shelf speech segmentation tool called pyannote (Bredin et al., 2020). As the proposed system requires speaker-homogeneous speech segments, we use naive heuristics to further segment the obtained voice-active regions instead of employing a sophisticated speaker change detection system. We partition the voiced regions at the shot boundaries, which we gather using PyScenedetect (http://scenedetect.com/en/latest/). The partitioning at the shot boundary is motivated by speaker change being a prominent movie cut attribute (Pardo et al., 2021), thus decreasing the likelihood of observing a speaker change in the obtained segments. We further partition the obtained segments to have a maximum duration of 1 sec, effectively decreasing the fraction of speech segments consisting of a speaker change. We represent the speech segments using a ResNet-34 (J. S. Chung, Huh, Mun, et al., 2020) model pretrained on the VoxCeleb2 (J. S. Chung et al., 2018) dataset for the speaker recognition task.

Within the gathered shot boundaries, we use RetinaFace (Deng et al., 2020) to obtain face detections and track them using the SORT tracker (Bewley et al., 2016) to construct face tracks. We extract identity representations for each face detection box using a SENet-50 (Hu et al., 2018), pretrained on MS-Celeb-1M (Guo et al., 2016) for the face recognition task, and average them over all the constituting faces to construct a representation for a face track. Once the representations for all the speaker-homogeneous speech segments and all the faces are gathered, we initialize the active speaker face for each speech segment with a face randomly selected from the set of temporally overlapping faces. We partition the set of speech segments to contain at most 500 speech segments (L = 500), grouped in temporal order, which helps reduce the optimization process's time complexity.
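The speaker-homogeneous proxy described above (VAD regions split at shot boundaries and capped at 1 sec) can be sketched as a small interval-splitting routine. This is a minimal sketch with illustrative names; the voiced regions and shot boundaries are assumed to be given as start/end times and timestamps produced by the respective off-the-shelf tools.

def speaker_homogeneous_proxy(voiced_regions, shot_boundaries, max_dur=1.0):
    # voiced_regions: list of (start, end) times from the VAD system.
    # shot_boundaries: sorted list of shot-change times from the shot detector.
    # Returns short segments: each voiced region is split at shot changes and then
    # capped at max_dur seconds, approximating speaker-homogeneous speech segments.
    segments = []
    for start, end in voiced_regions:
        cuts = [t for t in shot_boundaries if start < t < end]
        for s, e in zip([start] + cuts, cuts + [end]):
            t = s
            while t < e:
                segments.append((t, min(t + max_dur, e)))
                t += max_dur
    return segments

# Example with illustrative numbers:
# speaker_homogeneous_proxy([(0.0, 3.2)], [1.5])
# -> [(0.0, 1.0), (1.0, 1.5), (1.5, 2.5), (2.5, 3.2)]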
We then employ Algorithm 2 to assign an appropriate active speaker face to each of the speech segments in each partition (stage1) and further improve the predictions by discovering the speech segments with off-screen speakers (stage2).

To evaluate the performance of the proposed system, we use three benchmark datasets: i) the Visual Person Clustering Dataset (VPCD) (Brown et al., 2021), ii) the AVA-active speaker dataset (Roth et al., 2020), and iii) the Columbia dataset (Chakravarty & Tuytelaars, 2016). The AVA-active speaker dataset consists of face-wise active speaker annotations for 161 international movies, freely available on YouTube. We use the publicly available validation split, consisting of 33 movies, to report and compare the performance of the proposed strategy against the state-of-the-art systems. VPCD is more exhaustive, consisting of character identity information in addition to the active speaker annotations for videos from several widely watched episodes of TV shows (Friends, Sherlock, and TBBT) and 2 Hollywood feature films (Hidden Figures and About Last Night). It accompanies bounding boxes for the appearing faces and their corresponding character identities, along with information about which character is speaking at any time. One of the challenges of working with VPCD is that the video files are not freely available. For the purposes of this research, we acquired the video files using the DVDs and aligned the annotations in VPCD with the acquired videos. The Columbia dataset consists of manually obtained active speaker annotations for an 85-minute-long video of a panel discussion with six speakers. The panel discussion scenario is relatively more controlled than the movies in terms of frontal face poses and noise-free speech.

Table 4.2: F1-scores at various stages for the videos from VPCD. Reported F1 scores are averaged over all the episodes for the TV shows (Friends, TBBT, and Sherlock).
Videos                  Random initialization   Stage 1 speech-face   Stage 2 off-screen speakers   Stage 2 (mAP)
Friends (25 episodes)   0.54                    0.78                  0.80                          0.82
TBBT (6 episodes)       0.60                    0.83                  0.84                          0.86
Sherlock (3 episodes)   0.55                    0.64                  0.66                          0.70
Hidden Figures          0.45                    0.62                  0.67                          0.66
About Last Night        0.51                    0.59                  0.62                          0.57

Figure 4.4.1: Distance matrices for the movie Hidden Figures at various stages of the system, along with the value of the objective function, Corr(FD). a) Speech-identity distance matrix (SD). b) Random: face-identity distance matrix (FD) for randomly initialized ASD, Corr(FD) = 0.1. c) Stage1: FD with speech-face assignation maximizing Corr(FD), Corr(FD) = 0.23. d) Stage2: FD post removing the speech segments with off-screen speakers, Corr(FD) = 0.49. e) Ground truth: FD obtained using ground truth active speaker faces, Corr(FD) = 0.58.

4.4.2 Stage1: Speech-face assignation

The hypothesis underlying Algorithm 2 is that the relative identity patterns captured by the face-identity distance matrix FD, constructed using the active speaker faces, must show a high resemblance with the speech-identity distance matrix SD. In this section, we present the qualitative and quantitative validation of the hypothesis using videos from the VPCD. We compare the predictions for each face bounding box against the ground truth and report the F1-score in Table 4.2. In Figure 4.4.1 we show the evolution of the FD at various stages for the movie Hidden Figures.
Using ground truth speaker identities, we order the speech segments in SD, and the corresponding predicted active speaker faces in FD, to bring the speech segments from each speaker together. This makes the matrices interpretable and brings out the visible square patterns along the diagonal in SD in Figure 4.4.1a. Figure 4.4.1b shows the FD for the random initialization case with a low value of Corr(FD), displayed in the form of a minutely visible similarity with SD. However small, the visible similarity signifies the correct predictions for the trivial case of speech segments with just the speaker's face visible in the frames. The performance of the random chance system is quantified in terms of F1-scores in Table 4.2.

Figure 4.4.1c shows the FD post stage1 speech-face assignments, maximizing Corr(FD) employing Algorithm 2, and displays a significantly higher resemblance with SD, quantified by the higher value of Corr(FD). Figure 4.4.2 compares the objective function value (Corr(FD)) at various stages for the videos in VPCD, and shows a higher value for stage1 than the random initialization case, consistently for all the videos. The higher resemblance translates to enhanced active speaker detection performance, reported in terms of a higher F1-score in Table 4.2, compared to the random chance system. We consistently observe a substantially higher F1-score than the random chance baseline for all the videos in the VPCD. Although post optimization the FD is visibly similar to SD (comparing Figure 4.4.1a & c), the amount of noise is still notable when compared against the FD obtained using the ground truth active speaker faces, shown in Figure 4.4.1e. The same is quantitatively visible in the difference between respective values of Corr(FD) and is due to the false positives introduced by the system for the speech segments with off-screen speakers.

Figure 4.4.2: Evolution of the objective function Corr(FD) at different stages for the videos in VPCD. [Values (Random / Stage1 / Stage2): Friends 0.16 / 0.41 / 0.51, TBBT 0.21 / 0.47 / 0.60, Sherlock 0.08 / 0.17 / 0.35, Hidden Figures 0.10 / 0.23 / 0.47, About Last Night 0.11 / 0.20 / 0.34.]

4.4.3 Stage2: Off-screen speaker correction

The proposed strategy to segregate the speech segments with off-screen speakers relies on the hypothesis that for such speech segments, since the speaker's identity from the speech is different from any of the faces appearing in the frames, the row-wise correlations between SD and FD are relatively lower. To validate this hypothesis, we first divide the set of speech segments, using the ground truth, into S_on (on-screen) and S_off (off-screen) and study the distribution of row-wise correlations between SD and FD for the two groups. We assign the ground truth active speaker face for s_i ∈ S_on and a face randomly selected from the set of temporally overlapping face tracks for s_i ∈ S_off, and then construct the SD and FD using all speech segments s_i ∈ S_on ∪ S_off.

In Figure 4.4.3, we show the distributions of row-wise correlations between SD and FD for the two sets of speech segments: i) off-screen, S_off, and ii) on-screen, S_on, for the movie Hidden Figures. We observe a notable difference between the two distributions, which we quantify using the non-parametric Mann-Whitney U (Mann & Whitney, 1947) test, obtaining a p-value of 10^-70 and verifying that the difference between the two distributions is statistically significant. We performed the same test for all other videos in VPCD and found that the hypothesis held for all of them.
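The distributional comparison above can be reproduced, for instance, with SciPy's implementation of the Mann-Whitney U test. The snippet below is a minimal sketch assuming the row-wise correlations for the two groups of speech segments have already been computed; the variable and function names are illustrative.

import numpy as np
from scipy.stats import mannwhitneyu

def compare_onscreen_offscreen(on_screen_corrs, off_screen_corrs):
    # on_screen_corrs / off_screen_corrs: row-wise correlations between SD[i] and FD[i]
    # for speech segments with on-screen and off-screen speakers, respectively.
    stat, p_value = mannwhitneyu(on_screen_corrs, off_screen_corrs,
                                 alternative="two-sided")
    # A very small p-value indicates the two distributions differ significantly,
    # supporting the use of row-wise correlation to flag off-screen speakers.
    return stat, p_value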
In Table 4.3 we report the fractional duration of speech segments having off-screen speakers (S_off : S_all) for each video in the VPCD and observe that the fraction can be significant for some videos, thus validating the need to address the errors induced in active speaker detection by the stage1 speech-face assignations. Using various values of τ in equation 4.3.7, we compute an ROC curve to evaluate the efficacy of the proposed strategy to classify each speech segment into S_off and S_on. We report the area under the ROC curve (auROC) in Table 4.3 for all the videos and notice the good performance of the simple strategy to discover speech segments with off-screen speakers.

Figure 4.4.3: Plot showing the comparison between the distributions of row-wise correlations between SD and FD for speech segments with off-screen (S_off) and on-screen (S_on) speakers.

Table 4.3: Performance for the off-screen speech segment classification for the videos in VPCD, in terms of area under the ROC curve (auROC).
Videos             Off-screen speech segments (%)   Classification performance (auROC)
Friends            0.1                              0.8074
TBBT               0.1                              0.8057
Sherlock           0.18                             0.6927
Hidden Figures     0.21                             0.8259
About Last Night   0.17                             0.6897

For all the videos in VPCD, we use τ = 0.1 to gather the speech segments with off-screen speakers and classify the faces, earlier incorrectly marked as active speaker faces in stage1, as non-speaking faces. In Figure 4.4.1d, we show the obtained FD for the movie Hidden Figures and observe a visibly higher similarity with the speech-identity distance matrix (SD), further quantified by the higher value of Corr(FD) attained, compared to the FD post stage1. We further observe that the FD post stage2 looks even more similar to the ground truth FD. In Figure 4.4.2, we show that the objective function Corr(FD) attains a higher value after correcting the speech-face assignments for speech segments with off-screen speakers, compared to the stage1 assignments, for all the videos in VPCD.

We report the F1-score for active speaker detection using the corrected predictions in Table 4.2 (stage2) and compare them with the predictions from stage1. We observe that off-screen speaker correction increases the F1-score consistently for all the videos. In Figure 4.4.4, we compare the performance of the two stages on a precision-recall curve and observe that the system's precision improves significantly with a slight drop in the recall. The enhanced precision is associated with removing the false positives introduced by assigning an active speaker face to speech segments with off-screen speakers (stage1), while the decline in the recall is due to the incorrect classification of a speech segment as having an off-screen speaker in stage2.

4.4.4 Comparisons with state-of-the-art

We evaluate the performance of the proposed strategy, involving finding an active speaker face track for each speech segment and then removing the speech segments with off-screen speakers, on the videos from the publicly provided validation split of the AVA dataset. We score the face boxes of the predicted active speaker face tracks with the row-wise correlation (correlation(SD[i], FD[i])) values, where a higher correlation value signifies a better match in the identities represented by the face and the speech. All other face boxes are scored -1, the minimum possible correlation value.
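Taken together, the stage2 filter of Eq. 4.3.7 and the scoring used for evaluation reduce to the same row-wise correlation; a minimal sketch is given below under illustrative names (the ground-truth labels mentioned in the final comment are assumed to be available only for evaluation).

import numpy as np

def stage2_filter_and_scores(SD, FD, tau=0.1):
    # Row-wise Pearson correlation between SD[i] and FD[i] for every speech segment.
    row_corrs = np.array([np.corrcoef(SD[i], FD[i])[0, 1] for i in range(len(SD))])

    # Eq. 4.3.7: segments with a poor identity match are flagged as off-screen and
    # their stage1 face assignment is reverted to non-speaking.
    off_screen = row_corrs < tau

    # Evaluation scoring: face boxes of the track kept for segment i receive row_corrs[i];
    # segments flagged as off-screen (and all unassigned face boxes) are scored -1.
    scores = np.where(off_screen, -1.0, row_corrs)
    return off_screen, scores

# Sweeping tau traces an ROC curve for the off-screen classification; with ground
# truth labels gt_off_screen, e.g. sklearn.metrics.roc_auc_score(gt_off_screen, -row_corrs).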
We then use the AVA official implementation to evaluate the performance and report the mean average precision (mAP) values aggregated over all the videos in Table 4.4. We also report the performance of various recent state-of-the-art techniques in Table 4.4.

Figure 4.4.4: Increase in precision with a slight drop in recall, introduced by correcting the speech-face assignments for speech segments with off-screen speakers (stage2).

Table 4.4: Comparison with the state-of-the-art methods on the AVA active speaker validation set in terms of mean average precision.
Methods                                     Strategy       mAP (%)
Roth et al. (Roth et al., 2020)             Supervised     79.2
Zhang et al. (Y.-H. Zhang et al., 2019)     Supervised     84.0
Alcázar et al. (Alcázar et al., 2020)       Supervised     87.1
Chung et al. (J. S. Chung, 2019)            Supervised     87.8
MAAS-TAN (León-Alcázar et al., 2021)        Supervised     88.8
Proposed                                    Unsupervised   71.40

We observe that the proposed unsupervised strategy of using the identity information performs well, although not as well as other fully supervised methods using videos from the AVA train set to model the visual activity in the frames. The primary reason for the observed lower performance lies with the nature of the videos in the AVA dataset, which consist of adverse audio conditions. Nearly 65% of the speech segments with a visible speaker overlap with either noise or music (Roth et al., 2020). The presence of such noise conditions hinders the ability of the employed speaker recognition models to capture the identity information from speech segments, which naturally translates into poor active speaker detection performance. We further elaborate on the effect of speaker recognition performance on active speaker detection in section 4.4.5.

Using the same evaluation method as for AVA, we report mAP values for the videos in the VPCD dataset (averaged over episodes for TV shows) in Table 4.2. We observe that the performance for the videos of TV shows, specifically Friends and TBBT, is notably higher than for the movies (Hidden Figures and About Last Night). The scenes in Friends and TBBT predominantly consist of just primary characters' faces (characters who actively speak in the video), unlike the movies, which have a significant portion of scenes consisting of faces of background characters (i.e., characters who do not speak at any time in the video). The presence of primary characters' faces provides useful disambiguating information since we can gather their audio-visual identity information from the parts of the videos where they speak. The background characters' faces, in contrast, contribute to increased confusion since there is no way to acquire their identity information. This leads to the displayed difference in the performances for the TV shows and the movies. Additionally, a higher fraction of off-screen speakers for the movies than the TV shows further contributes to the lower performance for movies (Table 4.3). Selected video clips presenting the system's active speaker predictions for several videos from VPCD are provided in the supplementary material.

4.4.5 Ablation studies

Effects of VAD performance

The availability of the speaker-homogeneous speech segments is fundamental to the proposed approach. To obtain the speech segments, we use an off-the-shelf voice activity detector (Bredin et al., 2020). To make the obtained speech segments speaker-homogeneous, we use a simple approximation involving partitioning the speech segments by the scene boundaries and ensuring a maximum duration of 1 sec.
Since the number of speaker changes is constant in a video, splitting the speech segments into a larger number of partitions (obtained by making the partition length smaller) effectively decreases the fraction of speech segments containing a speaker change. Furthermore, the errors introduced by a speech segment containing a speaker change are limited to at most its 1 sec duration. To evaluate the effect of the proposed simple proxy for speaker-homogeneous speech segments, we compare the system's performance, on videos from VPCD, with the ideal case scenario.

Figure 4.4.5: Comparison of the system's performance against the scenario with ideal-case speaker-homogeneous speech segments, reported in terms of mAP for videos from VPCD. The system's performance relative to the ideal case scenario is shown beneath each video. [System / Oracle mAP: Friends 0.82 / 0.87 (94%), TBBT 0.86 / 0.93 (92%), Sherlock 0.70 / 0.79 (89%), Hidden Figures 0.66 / 0.66 (100%), About Last Night 0.58 / 0.71 (82%).]

The VPCD annotations consist of character-wise speech segments, enabling us to obtain ideal-case speaker-homogeneous speech segments. We further segment these speech segments using the scene boundaries and ensure a maximum duration of 1 sec. As the speaker recognition methods' performance is susceptible to the duration of segments, splitting to a maximum duration of 1 sec eliminates the impact of the difference in durations of speech segments in the two cases. In Figure 4.4.5 we compare the performance in terms of mAP for all the videos in VPCD for the two cases: i) System VAD, using the proposed speaker-homogeneous proxy, and ii) Oracle VAD, using the ground truth speaker-homogeneous speech segments.

We observe that the proposed setup using the VAD system and the speaker-homogeneous proxy approximates the ideal case scenario quite well, reflected in nearly equivalent mAP values for the two cases for most videos. On average, the system's performance attains nearly 91% of the performance of the ideal case.

Time complexity vs performance

In section 4.3.2 we proposed to split the video into p partitions and maximize the objective function, Corr(FD), for each partition independently to reduce the time complexity of the system to O(EkN^3/p^2). It assumes that the obtained partitions have enough information to disambiguate the active speakers. For implementation purposes, we partition a video to have at maximum L speech segments in each partition. Using p = N/L, we represent the time complexity of the system as O(EkNL^2), making it a linear function of the video's length, N, and dependent on the parameter L. The length of a partition, L, signifies the context available to the video split to solve the speech-face assignments, implying that larger values of L likely provide more disambiguating information and are ideal for the setup. In contrast, selecting a larger L increases the time complexity of the system, thus trading off time for performance.

In Figure 4.4.6 we compare the performance of the system (reported in mAP) and the computational time for various values of L = [200, 500, 800] and for using the total video length at once (L = N). For visualization purposes, we plot the computational time on a logarithmic scale. We observe a significant performance increase for L = 500 compared with L = 200, consistently for all the videos in VPCD. Further increasing L, we see marginal improvement in the performance along with a significant increase in computational time.
Therefore we use L = 500 for all the videos, providing enough context to the video partitions with reasonable computational time. Interestingly, we also observe a marginal decrease in performance for the movies Hidden Figures and About Last Night, and the TV show Sherlock, when using the total video length (L = N). This drop in performance is due to a more significant amount of noise in the form of off-screen speakers (Table 4.3) compared to other videos, which accumulates while using a larger context.

Figure 4.4.6: Computational time (on a logarithmic scale) for the system against the performance (reported in mAP) for various values of the partition length L for the videos in VPCD.

Effects of speech and face recognition quality

The central idea of this work relies on the fact that speech and face both possess identity information and harnesses the certitude that the speaker's face and the underlying speech identify the same character. This makes capturing the identity information from speech and faces fundamental to the system. Here we study the effect of the quality of the captured speech and face identity information on the system's performance. We use two state-of-the-art speaker recognition systems to construct different speech identity representations: i) ResNet-34 (default to the system), pretrained on the VoxCeleb2 dataset, displaying a 1.2% equal error rate (EER) for the speaker recognition task on the VoxCeleb1 test set; ii) x-vector embeddings extracted from a TDNN model trained on VoxCeleb1+VoxCeleb2 training data, displaying an EER of 3.2% on the same VoxCeleb1 test set. Note that the lower the EER, the better the speaker recognition performance; thus, ResNet-34 performs better for capturing speaker identity information from speech than x-vectors.

In Figure 4.4.7, we show the performance of the active speaker detection system for videos from VPCD, utilizing the two speaker recognition systems. We observe that the performance displayed by the sub-optimal x-vector embeddings is consistently lower for all the videos than that of the ResNet-derived embeddings. Using a non-parametric Mann-Whitney U test, we noticed that the difference between the performance of the two systems is statistically significant. This implies that sub-optimal speaker recognition features lead to degraded active speaker detection performance. It further explains that the low performance for the AVA validation set (reported in Table 4.4) is due to the sub-optimal speaker identity embeddings attributed to the noisy speech conditions.

Similarly, to capture the impact of face recognition quality, we use two state-of-the-art face recognition systems to construct the face identity representations: i) ResNet-50 trained on VGGFace2 (default to the system) and ii) FaceNet trained on VGGFace2, compared on IJB-C, displaying 0.95 and 0.66 true acceptance rate (TAR), respectively.

Figure 4.4.7: Comparison of active speaker detection performance for systems using different speech/face recognition architectures, on videos from the VPCD dataset. For speech embeddings, ResNet is superior to x-vectors, and for faces, ResNet is superior to FaceNet. [mAP (x-vectors/ResNet, ResNet/ResNet, ResNet/FaceNet): Friends 0.75, 0.82, 0.82; TBBT 0.81, 0.86, 0.86; Sherlock 0.62, 0.70, 0.69; Hidden Figures 0.55, 0.66, 0.69; About Last Night 0.49, 0.58, 0.58.]
Higher TAR implies better face recognition performance, pointing out that ResNet-50 is a superior face recognition system to FaceNet. In Figure 4.4.7 we present the active speaker detection performance comparison of systems employing the two different face recognition systems and the same speaker recognition system (ResNet-34). We observe that the difference in performance is marginal and shows no statistical significance. It suggests that the system is more tolerant of degradation in face recognition performance.

4.4.6 Discussion

The proposed framework relies on the assumption that the video under consideration has enough information to disambiguate the characters in the video. The presence of scenes with just the speaking character in the frames is one of the simple examples of such character-disambiguating information, as it provides an instant speech-face association. Videos intending to capture the speaking characters will likely have relevant information to derive the speech-face association based on the co-occurrence of the speech and faces through the video. On the contrary, one of the adverse limiting cases for the system is a video with a fixed set of faces appearing in the frames at all times. A simple example can be a panel discussion shot with a static camera, capturing all the panelists at all times. In such a video, the system will have no way to correctly associate the identity from the speech with one of the faces on the screen. The Columbia dataset (Chakravarty & Tuytelaars, 2016), a video of a panel discussion, is very similar to such a scenario. Even though the video is 85 min long, the annotations are available for a 35 min duration, comprising six speakers.

We evaluate the performance of the proposed system on the Columbia dataset and report the speaker-wise weighted F1-score, a widely used metric for evaluating this dataset, in Table 4.5, along with the performance reported by various other methods. We observe that the F1 score is quite low for three speakers (Lieb, Long, and Sick), while for the other three it is at par with other systems. We further investigated and observed that the face distance matrix FD obtained using the predicted active speaker faces attains a high value of the objective function Corr(FD). The speech (SD) and the face distance (FD) matrices are shown in Figure 4.4.8a & c. The obtained Corr(FD) is close to the one displayed by the ground truth active speaker faces, shown in Figure 4.4.8b, suggesting the intended working of the proposed system.

Table 4.5: Comparison of the speaker-wise weighted F1 scores for all the speakers in the Columbia dataset.
Methods                                               Abbas   Bell   Boll   Lieb   Long   Sick   Avg
Chakravarty et al. (Chakravarty & Tuytelaars, 2016)   -       82.9   65.8   73.6   86.9   81.8   78.2
Shahid et al. (Shahid et al., 2019)                   -       89.2   88.8   85.8   81.4   86     86.2
Chung et al. (J. S. Chung & Zisserman, 2016)          -       93.7   83.4   86.8   97.7   86.1   89.5
Afouras et al. (Afouras et al., 2020)                 -       92.6   82.4   88.7   94.4   95.9   90.8
S-VVAD (Shahid et al., 2021)                          -       92.4   97.2   92.3   95.5   92.5   94
RealVAD (Beyan, Shahid, & Murino, 2021)               -       92     98.9   94.1   89.1   92.8   93.4
TalkNet (R. Tao et al., 2021)                         -       97.1   90.0   99.1   96.6   98.1   96.2
Proposed                                              97.0    95.0   91.0   65.3   0.1    34.6   65.0
Proposed (assisted)                                   96.7    95.3   90.5   98.2   93.2   96.1   95.0

On visual inspection of the predicted active speaker faces, we observed that the system incorrectly associates Long's speech with Sick's face and Lieb's with Long's face. The system selects Sick's face as the active speaker whenever Long speaks, and similarly for the other two speakers.
Relevant frames from the video with the system's predictions (green box: active speaker) and manually marked speakers' identities are shown in Figure 4.4.9a. We further observed that Sick always appears with Long throughout the video, providing the system with insufficient information to resolve the speech-face association. Similarly, Lieb always appears with Long, limiting the system's ability to resolve the discrepancy.

Figure 4.4.8: Distance matrices for the Columbia dataset with Corr(FD). a) Speech distance matrix SD. b) FD with ground truth active speaker faces (Corr(FD) = 0.79). c) FD post stage2 for the proposed system (Corr(FD) = 0.75). d) FD post stage2 for the system assisted with 15% ground truth active speaker faces (Corr(FD) = 0.79).

To close the loop on the system's performance on the Columbia dataset, we simulate a setup by manually forcing disambiguating information into the video, which assists the system in solving the speech-face association. We randomly select 15% of the speech segments of each speaker and initialize the corresponding active speaker faces using the ground truth active speaker faces. While maximizing the objective function Corr(FD) (using Algorithm 2 in stage1), we forbid updating the active speaker face assignment for the selected speech segments. This replicates the trivial case of a scene with just the speaker's face visible in the frames, thus synthetically injecting information to resolve speech-face associations. We show the face distance matrix (FD) for the obtained active speaker faces in Figure 4.4.8d and note that the objective function Corr(FD) is nearly equivalent to the one shown by the ground truth active speaker faces (Figure 4.4.8b). We report the speaker-wise F1-score in Table 4.5 and observe that the assisted system performs similarly to state-of-the-art methods for all the speakers. In Figure 4.4.9b we show the frames with corrected speech-face associations. Video clips highlighting the system's initial incorrect active speaker predictions and the later obtained corrected predictions using the assistance from the ground truth are provided in the supplementary material. We point out that this easy way of incorporating external information can be extended to integrate complementary information from visual-activity-based systems to generate a comprehensive solution to active speaker detection.

Figure 4.4.9: Sample frames from the Columbia dataset with manually marked identities of the faces and the name of the current speaker. a) Frames showing the incorrect speech-face associations by the proposed system (confused speech-face assignments for Lieb, Long, and Sick). b) Frames showing the corrected speech-face associations when assisted with 15% ground truth faces.

4.5 Summary and Conclusion

We presented a cross-modal unsupervised framework for active speaker detection in videos, harnessing the character identity information in the speech and the faces of the speakers. We evaluated the proposed system on three benchmark datasets, consisting of various videos from entertainment media (TV shows and movies) and broadcast media (panel discussion), and showed competitive performance compared to state-of-the-art fully supervised methods. The framework utilizes off-the-shelf speech and face identity representation systems and shares their limitations when speech is accompanied by noise and music.

The framework provides an easy way to inject external information, enabling it to collaborate
with other methods. It can incorporate high-confidence predictions of other state-of-the-art methods and even inputs from expert humans in the loop, enabling a provision to provide supervision at various degrees. One of the future directions can be to include the complementary set of information from visual-activity-based methods, providing more disambiguating information to the system and thus further enhancing the performance.

Chapter 5: Audio-visual activity guided cross-modal identity association for active speaker detection

5.1 Introduction

Active speaker detection (ASD) aims at finding a source face corresponding to the speech activity present in the audio modality. It targets selecting a face from the set of possible candidate faces appearing in the video frames, such that the selected face and the foreground speech belong to the same person. Active speaker detection is a fundamental building block in our target domain of computational media intelligence (Somandepalli et al., 2021), which includes the study of character portrayals in media and their impact on society. The tools include speaker/character diarization (Min, 2022; Sharma & Narayanan, 2022c), audio-visual speech recognition (Braga & Siohan, 2021), background character detection (Sharma & Narayanan, 2022a), character role detection (Haq, Muhammad, Ullah, & Baik, 2019), speaker behavior understanding (Sharma, Guha, & Sharma, 2018), etc. In this work, we address ASD in the context of entertainment media videos. These videos pose additional challenges to ASD due to the presence of an unknown and varying number of speakers and their highly dynamic interactions, leading to faces in extreme poses. Speech present with noise and music further adds to the challenges. (This work has been submitted to the IEEE Open Journal of Signal Processing and is under review. It has been made available on arXiv: Sharma, Rahul, and Shrikanth Narayanan. "Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection." arXiv preprint arXiv:2212.00539 (2022).)

Focused on finding an association between speech and faces, ASD is inherently a multimodal task, requiring integration of the audio and visual information. The majority of previous research is situated around modeling the activity in the visual frames, in the form of visual actions, and the activity in the audio modality, in the form of speech, and studying their interactions to establish speech-face associations (Arandjelovic & Zisserman, 2018; Bendris et al., 2010; J. S. Chung, 2019; J. S. Chung & Zisserman, 2016; Owens & Efros, 2018). Earlier approaches have modeled the visual activity in the explicitly extracted mouth regions of faces and analyzed their correlations with the underlying speech waveforms to detect the active speaker faces (Bendris et al., 2010). More recently, researchers have proposed audio-visual synchronization as a proxy task to jointly model audio-visual activity in a self-supervised manner (J. S. Chung & Zisserman, 2016).
They showed its applicability to sound-source localization (Owens & Efros, 2018), which directly extends to ASD, as active speakers are essentially the sound sources for the underlying speech sound event. This led to various self-supervised methods, introducing several proxy tasks in multimodal (Afouras et al., 2020; Arandjelovic & Zisserman, 2018; Owens & Efros, 2018; Zhao et al., 2018) and cross-modal (Chakravarty et al., 2015; Chakravarty & Tuytelaars, 2016; Sharma et al., 2019, 2022) scenarios targeted at active speaker detection and, more generally, sound source localization. Recently, several large-scale datasets consisting of face-box-wise active speaker annotations for videos from movies (Y. J. Kim et al., 2021; Roth et al., 2020) and TV shows (Brown et al., 2021) have emerged. These datasets have led to various fully-supervised methods, where researchers model the audio-visual activity using sophisticated neural networks consisting of 3D CNNs and RNNs (J. S. Chung, 2019; C. Huang & Koishida, 2020). With further advancements in deep neural networks, researchers have modeled the audio-visual activity's short-term and long-term temporal dynamics using graph-convolutional neural networks and self-attention neural networks (Alcázar et al., 2020; León-Alcázar et al., 2021; R. Tao et al., 2021). Researchers have also proposed to utilize context in the form of the visual activity of the co-occurring speakers to further enhance the performance of ASD (Min, Roy, Tripathi, Guha, & Majumdar, 2022).

The recent advancements in face recognition (Guo et al., 2016; Hu et al., 2018) and speaker recognition (using speech in the audio modality) systems (J. S. Chung, Huh, Mun, et al., 2020; J. S. Chung et al., 2018) have enabled the use of the speaker's identity present in the speech and the speaker's face to establish speech-face associations (Hoover et al., 2018; Sharma & Narayanan, 2022b). The idea of analyzing the co-occurrences of speakers' identities in the audio and the visual modalities to establish speech-face associations is not as well explored as modeling the visual activity for active speaker detection. Hoover et al. (Hoover et al., 2018) proposed to cluster the speech and faces independently and then used the temporal co-occurrence of the obtained clusters to establish speech-face correspondences. In one of our previous works (Sharma & Narayanan, 2022b), we constructed the temporal identity structure by observing the speech segments through the video. We proposed to select active speaker faces for each speech segment from the set of faces concurrently appearing in the frames, such that the chosen active speaker faces display a similar identity structure to the one obtained by observing the speech segments.

The approaches to ASD mentioned above can be categorized based on the source of information into: i) audio-visual activity: the ASD-relevant information is derived from modeling the audio and visual activities and their interactions; ii) speakers' identity co-occurrence: the speech-face correspondences are solved by studying the cross-modal co-occurrences of the speakers' identity (speech and faces). The audio-visual activity-based methods model the visual actions relevant to the speaking activity. In daily life scenarios, which are often present in entertainment media videos, these methods get confused by other actions similar to the speaking activity, such as eating, grinning, etc. They often miss the cases when the speaker's face appears in extreme poses and the lip region is not visible.
On the other hand, the speakers' identity-based methods are independent of the observed visual actions while being limited to videos having enough information to disambiguate speech-face associations. A simple scene of a panel discussion where all the speakers are visible at all times is an example where speakers' identity co-occurrences do not have adequate information to establish speech-face correspondences. The speakers' identity-based methods also struggle when speakers are present off-screen. However, the audio-visual activity can help overcome these limitations.

In this work, we propose to use the two sources of information, i) visual-activity-based information and ii) speakers' identity co-occurrence information, in collaboration with each other, for active speaker detection in entertainment media videos. We build this work on top of our previous work (Sharma & Narayanan, 2022b), where we proposed an unsupervised strategy for ASD using the speakers' identity co-occurrences across modalities. To gather audio-visual activity information, we use TalkNet (R. Tao et al., 2021), which shows state-of-the-art performance on the benchmark AVA-active speaker dataset (Roth et al., 2020) for ASD. We propose a novel framework that acquires assistance from the audio-visual activity to associate speakers' identities across the audio and visual modalities for active speaker detection. We show that the acquired av-activity assistance helps overcome the limitations of the speakers' identity systems. Furthermore, we show that a simple late fusion of the scores from the two methods enhances ASD performance. We evaluate the proposed approach on three datasets: the AVA-active speaker dataset (Roth et al., 2020), and Friends and TBBT (part of the Visual Person Clustering Dataset (Brown et al., 2021)), consisting of movies and videos from TV shows. The contributions of this work are as follows:

1. We perform an error analysis for two ASD systems: i) audio-visual activity-based, and ii) speakers' identity-based. We show that a significant fraction of errors is exclusive to each system, highlighting the complementary nature of the two sources of information and thus supporting the need to integrate the two approaches for ASD.

2. We propose a novel unsupervised framework to guide the speakers' identity-based speech-face associations using audio-visual activity information. The framework is generalized to ASD predictions from any external source.

3. We propose a late fusion of the posterior scores from the audio-visual activity system and the improved speakers' cross-modal identity association system and show an enhanced ASD performance on entertainment media videos: the AVA active speaker dataset (movies) and VPCD (TV shows).

5.2 Methodology

5.2.1 Problem Setup

We use the contiguous speech segments belonging to one speaker as the fundamental speech unit for our analysis and call them speaker-homogeneous speech segments. We denote the set of all the speaker-homogeneous speech segments as S_all ≡ {s_1, ..., s_n, ..., s_N}, where N is the total number of such segments. From the video frames, we use the face tracks (contiguous instances of faces belonging to one person) as the fundamental unit. For each speaker-homogeneous speech segment, s_n, we collect the set of face tracks that temporally overlap with s_n and denote them as F_n ≡ {f_1, ..., f_k, ..., f_{K_n}}. If f_k ∈ F_n is the active speaker's face for the underlying speech segment, s_n, we denote it as (s_n ←→ f_k). For some speech segments, there may be no temporally overlapping face tracks.
In the context of active speaker face detection, such speech segments are irrelevant; thus, we remove the speech segments where F_n = ϕ from the set S_all. We only use the speech segments with at least one face track concurrently appearing with the speech segment in the video frames. These speech segments include cases where the active speaker's face is off-screen, even though some face tracks (none belonging to the speaker) are visible in the video frames. We set up the task of active speaker detection as associating a face track, a_n, if there exists one, with each speaker-homogeneous speech segment, s_n, from the set of temporally overlapping face tracks, F_n, such that a_n is the source face for the underlying speech segment.

A \equiv \{ a_n \mid a_n \in F_n \ \text{and} \ (s_n \longleftrightarrow a_n) \} \quad \forall s_n \in S_{all} \tag{5.2.1}

5.2.2 Speakers' cross-modal identity association (SCMIA)

We use the unsupervised framework proposed in our previous work (Sharma & Narayanan, 2022b), which utilizes the speakers' cross-modal identity association for ASD. We initialize the set of active speaker faces, A, by randomly selecting a face track as the active speaker face for each speech segment (s_n) from the collection of temporally overlapping face tracks (F_n). To capture the identity information, we represent the speaker-homogeneous speech segments, s_n, using speaker recognition embeddings obtained using a ResNet (He et al., 2016) pretrained on VoxCeleb2 (J. S. Chung et al., 2018) with the angular objective function (J. S. Chung, Huh, Mun, et al., 2020). Similarly, we represent the faces, f_k, using face recognition embeddings obtained using SENet-50 (Hu et al., 2018), pretrained on the MS-Celeb-1M dataset (Guo et al., 2016).

We construct a speech-identity distance matrix, SD, using all the speaker-homogeneous speech segments s_n ∈ S_all. We compute the distance between two speech segments, s_i and s_j, using the cosine distance measure between their corresponding speaker recognition embeddings, as shown in Eqn. 5.2.2. The constructed matrix, SD, represents each speech segment in terms of its identity-based distance from all other speech segments, thus capturing the temporal identity structure existing through the video.

Figure 5.2.1: Overview of the audio-visual activity guided speaker identity association across modalities, GSCMIA. a) Construction of positive and negative guides from audio-visual activity. b) Guiding SCMIA using positive and negative guides.

Using the corresponding active speaker faces, we construct a face-identity distance matrix, FD, similar to SD, by representing each active speaker face in terms of its distance from all other active speaker faces. We compute the distance between two active speaker faces using the cosine distance between their face recognition embeddings, as shown in Eqn. 5.2.2. The matrix FD again captures the temporal identity structure but utilizes the active speaker faces. Since the speech and the corresponding active speaker's face identify the same person, the speaker identity structure captured by the two matrices must show a high resemblance.
SD[i,j] = \frac{s_i \cdot s_j}{\lVert s_i \rVert \lVert s_j \rVert} \qquad FD[i,j] = \frac{a_i \cdot a_j}{\lVert a_i \rVert \lVert a_j \rVert} \tag{5.2.2}

We compute the resemblance between the two distance matrices, SD and FD, using Pearson's correlation, computed row-wise, as shown in Eqn. 5.2.3. We establish the speech-face correspondence by iteratively selecting an active speaker face, a_n, for each speech segment, s_n, from the set of temporally overlapping faces, F_n, such that the resemblance between SD and FD, computed using Eqn. 5.2.3, is maximized. We update the set of active speaker faces iteratively until the objective function, O(FD), converges.

O(FD) = \frac{1}{N} \sum_{i} \frac{\sum \big(SD[i] - \overline{SD[i]}\big)\big(FD[i] - \overline{FD[i]}\big)}{\sqrt{\sum \big(SD[i] - \overline{SD[i]}\big)^2 \sum \big(FD[i] - \overline{FD[i]}\big)^2}} \tag{5.2.3}

A \equiv \underset{\{a_n \,\mid\, a_n = f_k \ \text{and} \ f_k \in F_n\}}{\arg\max} \; Corr(FD) \tag{5.2.4}

Post convergence, this framework provides an active speaker face track for each speaker-homogeneous speech segment. It even provides an active speaker face for the speech segments where the speakers are present off-screen, which adds to the false positives of the system. Moreover, this framework is limited to videos having enough information to disambiguate the characters in the video. One of the failure cases is a single-scene video where all speakers are visible in the frames at all times. In such a scenario, the system will have no way to associate the speech and faces correctly. For brevity, we will refer to this system, using speakers' cross-modal identity associations, as SCMIA.

5.2.3 Audio-visual activity guidance (GSCMIA)

To address the limitations of SCMIA, we propose incorporating the audio-visual activity information, which can complement the speakers' identity information. To gather the audio-visual activity information, we use a state-of-the-art audio-visual model, TalkNet (R. Tao et al., 2021). TalkNet observes a face track and the concurrent audio waveform and models the short-term and long-term temporal context to provide face-box-wise active speaker predictions. Using the obtained face box scores, we compute an active speaker score, P̂_a(f_k), for each face track, f_k, as the mean score of the constituent face boxes.

We use the obtained TalkNet predictions to inject disambiguating information into SCMIA. To minimize the amount of noise introduced by the audio-visual TalkNet, we utilize only the highly confident predictions, whether positive or negative. We use the predictions in two ways:

i) Positive-label guidance: We intend to collect the speech segments for which the audio-visual activity predictions provide a highly confident speech-face association. For each speech segment s_n, we select the face track from the set of temporally overlapping face tracks, t_n ∈ F_n, that has the maximum AV-activity-based active speaker posterior, P̂_a(t_n). The speech segments whose selected face tracks, t_n, display a highly confident positive posterior are collected; we call this the positive guidance set (PG), denoted in Eqn. 5.2.5. For all these speech segments, we use the AV-activity predictions as the ground truth and assign the selected face track, t_n, as the active speaker face (Eqn. 5.2.6). We keep these speech-face assignments unaltered while iteratively maximizing O(FD) in Eqns. 5.2.3 & 5.2.4. Explicitly assigning the selected face track to the speech segment replicates the case when only the speaker's face is visible in the video frames. Such cases are especially disambiguating as they provide a trivial signal for the speech-face association.
5.2.3 Audio-visual activity guidance (GSCMIA)

To address the limitations of SCMIA, we propose incorporating audio-visual activity information, which can complement the speaker identity information. To gather the audio-visual activity information, we use a state-of-the-art audio-visual model, TalkNet (R. Tao et al., 2021). TalkNet observes a face track and the concurrent audio waveform and models the short-term and long-term temporal context to provide face-box-wise active speaker predictions. Using the obtained face box scores, we compute an active speaker score, P̂_a(f_k), for each face track, f_k, as the mean score of the constituent face boxes.

We use the obtained TalkNet predictions to inject disambiguating information into SCMIA. To minimize the amount of noise introduced by the audio-visual TalkNet, we utilize only the highly confident predictions, whether positive or negative. We use the predictions in two ways:

i) Positive-label guidance: We intend to collect the speech segments for which the audio-visual activity predictions provide a highly confident speech-face association. For each speech segment s_n, we select the face track from the set of temporally overlapping face tracks, t_n ∈ F_n, that has the maximum AV-activity-based active speaker posterior, P̂_a(t_n). The speech segments whose selected face tracks, t_n, display a highly confident positive posterior are collected; we call this the positive guidance set (PG), denoted in Eqn. 5.2.5. For all these speech segments, we use the AV-activity predictions as the ground truth and assign the selected face track, t_n, as the active speaker face (Eqn. 5.2.6). We keep these speech-face assignments unaltered while iteratively maximizing O(FD) in Eqns. 5.2.3 & 5.2.4. Explicitly assigning the selected face track to the speech segment replicates the case when only the speaker's face is visible in the video frames. Such cases are especially disambiguating as they provide a trivial signal for the speech-face association.

t_n = argmax_{f_k ∈ F_n} P̂_a(f_k),    PG ≡ { s_n | P̂_a(t_n) > τ_p }    (5.2.5)

(s_n ↔ t_n)  ∀ s_n ∈ PG    (5.2.6)

ii) Negative-label guidance: We intend to take advantage of the highly confident negative predictions by the AV-activity system. For each speech segment, s_n, we collect the face tracks having highly confident negative AV-activity predictions and call them the negative guidance set (NG), denoted in Eqn. 5.2.7. Then we remove the face tracks in NG_n from the candidate active speaker faces, F_n, as they correspond to non-speaking faces. This makes the set F_n smaller and makes it easier to find the active speaker face for the underlying speech segment. It also addresses the speech segments with off-screen speakers: removing all the non-speaking faces leaves no faces in the candidate active speaker faces, F_n. We call this audio-visual activity guided speaker identity association GSCMIA. The overview of GSCMIA is shown in Figure 5.2.1 and the pseudo-code in Algorithm 3.

NG_n = { f_k | f_k ∈ F_n, P̂_a(f_k) < τ_n },    F_n = F_n − NG_n    (5.2.7)

Algorithm 3: GSCMIA
1   Positive guidance: PG                                 // using Eqn. 5.2.5
2   a_n = t_n  ∀ s_n ∈ PG
3   Negative guidance: NG_n                               // using Eqn. 5.2.7
4   F_n = F_n − NG_n  ∀ s_n ∈ S_all
5   S_all = S_all − { s_n | F_n = ϕ }
6   Randomly initialize: A ≡ { a_n | a_n ∈ F_n }  ∀ s_n ∈ S_all
7   Compute SD and FD                                     // using Eqn. 5.2.2
8   objective ← Corr(FD)                                  // using Eqn. 5.2.3
9   while objective increases do
10      for each a_i ∈ A do
11          if s_i ∈ PG then
12              continue                                  // remains unaltered
13          end
14          a_i = argmax_{f_k ∈ F_n} Corr(FD)             // i-th row
15          Update FD
16          Update A
17      end
18      objective ← Corr(FD)
19  end
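The following sketch illustrates how the positive and negative guides of Eqns. 5.2.5–5.2.7 (Algorithm 3, lines 1–5) could be derived from face-track-level TalkNet scores. The data structures, helper name, and default thresholds are illustrative assumptions rather than the released implementation.

def build_guides(segments, tau_p=0.9, tau_n=0.2):
    # segments: list of dicts, one per speech segment s_n, with
    #   'faces':  candidate face-track ids F_n
    #   'scores': {face_id: mean TalkNet score P_a(f_k) over its face boxes}
    positive_guide = {}                       # s_n -> t_n  (Eqn. 5.2.6)
    for n, seg in enumerate(segments):
        if not seg['faces']:
            continue
        # most confident face track under the AV-activity model (Eqn. 5.2.5)
        t_n = max(seg['faces'], key=lambda f: seg['scores'][f])
        if seg['scores'][t_n] > tau_p:
            positive_guide[n] = t_n
        # prune highly confident non-speaking faces (Eqn. 5.2.7)
        ng = {f for f in seg['faces'] if seg['scores'][f] < tau_n}
        seg['faces'] = [f for f in seg['faces'] if f not in ng]
    # segments left with no candidate faces are treated as off-screen speech
    offscreen = [n for n, seg in enumerate(segments) if not seg['faces']]
    return positive_guide, offscreen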
5.3 Experiments and implementation details

5.3.1 Implementation details

We start by extracting the active voice regions for a video under consideration using an open-source tool, pyannote (Bredin et al., 2020). We then segment the obtained voiced parts into approximately speaker-homogeneous speech segments, s_n, by partitioning at the shot boundaries (Pardo et al., 2021) and keeping the maximum duration to 1 sec each (Sharma & Narayanan, 2022b). We gather the set of temporally overlapping face tracks, F_n, for all the obtained speech segments using the RetinaFace (Deng et al., 2020) face detector and the SORT (Bewley et al., 2016) tracker. As a trade-off between the computational time and the context length (Sharma & Narayanan, 2022b), we partition the set of speech segments, S_all, to contain L segments each. Next, we compute each partition's speech-identity and face-identity distance matrices and solve the speech-face association by separately maximizing each partition's objective function O(FD). For SCMIA we use L = 500.

For GSCMIA we employ TalkNet (R. Tao et al., 2021) to gather AV-activity ASD predictions. We construct the positive guidance set of speech segments, PG, using τ_p > 0.9 (Eqn. 5.2.5) for the AVA dataset and τ_p > 0.5 for VPCD videos, where TalkNet prediction scores lie in the range [0,1]. While constructing the negative guidance set of face tracks, NG_n, we use τ_n < 0.2 (Eqn. 5.2.7) for both datasets. These thresholds are derived using a subset of the training sets of both datasets. In contrast to SCMIA, we use L = 50 for GSCMIA.

For evaluation purposes we use two benchmark datasets: i) the AVA-active speaker dataset (Roth et al., 2020) and ii) the Visual Person Clustering Dataset (VPCD) (Brown et al., 2021). The AVA-active speaker dataset consists of 160 publicly available international movies, with face-box-wise annotations for a 15 min duration of each film. We report performance on the publicly available and widely used validation split of the AVA-active speaker dataset, consisting of 33 movies. To incorporate diversity, we use a part of VPCD consisting of active speaker annotations and character identities for several Hollywood TV shows and movies. We observed that the available annotations in VPCD are exhaustive only for the TV shows Friends and TBBT, primarily due to the limited number of characters in the videos. So we restrict our evaluation to TV shows: 25 episodes of Friends from season 3 and 6 episodes of TBBT from season 1. We use the widely used mean average precision (mAP) metric to report performances.

5.3.2 Performance Evaluation

In Table 5.1, we report the performance of the various strategies described in Section 5.2 for the three datasets. We present the performance of SCMIA and TalkNet (R. Tao et al., 2021), computed as an average over the mAP of the constituent videos for each dataset. We compute the mAP using the official tool provided by AVA-active speaker. We observe that TalkNet performs significantly better than SCMIA for the AVA dataset, while the two strategies perform comparably for the TV shows Friends and TBBT. A reason for this observed behavior can be the fully supervised setup of TalkNet, trained on the AVA training set. Table 5.1 also shows the performance of GSCMIA, following Algorithm 3. The AV-activity guidance enhances the performance of SCMIA across all datasets. We note that the performance enhancement for the AVA dataset is more significant than for the others.

Table 5.1: Performance of various strategies for ASD when guided with TalkNet (R. Tao et al., 2021). The reporting metric is mean average precision (%).

Datasets               | SCMIA | TalkNet | GSCMIA | Late-fusion
AVA active speaker     | 68.84 | 91.96   | 80.90  | 92.86
Friends (25 episodes)  | 81.45 | 85.42   | 84.13  | 87.54
TBBT (6 episodes)      | 85.74 | 83.84   | 86.39  | 87.19

Since the sources of information for the audio-visual activity in TalkNet and for SCMIA are independent, we expect them to have some exclusive components. Here we investigate the validity of this hypothesis by combining the two methods at the score level. We score each face box as the weighted aggregate of the individual scores from the two methods, i) GSCMIA and ii) TalkNet, in a ratio of α : (1 − α). We report the obtained mAP for the three datasets in Table 5.1. We observe that a simple late fusion of the two methods improves the performance for all three datasets, exceeding the individual performances of the involved methods.

Table 5.2: Performance of various strategies for ASD when guided with SyncNet (J. S. Chung & Zisserman, 2016). The reporting metric is mean average precision (%).

Datasets               | SCMIA | SyncNet | GSCMIA | Late-fusion
AVA active speaker     | 68.84 | 58.70   | 71.74  | 73.74
Friends (25 episodes)  | 81.45 | 77.14   | 81.98  | 83.90
TBBT (6 episodes)      | 85.74 | 80.00   | 85.87  | 86.14

We further evaluate GSCMIA guided with another AV-activity model: SyncNet (J. S. Chung & Zisserman, 2016). SyncNet is trained for audio-visual synchronization in a self-supervised framework, unlike the fully supervised TalkNet. In Table 5.2, we show the performance of SyncNet, and of GSCMIA assisted with SyncNet, for the three datasets. We note the weaker performance of SyncNet compared to TalkNet. We observe that the assistance from SyncNet enhances the performance of GSCMIA, although only marginally for the TV shows, which we attribute to the weaker performance of SyncNet. The late fusion of the two methods' posterior scores again improves the overall performance. The observed trends in performance enhancement with SyncNet and TalkNet guidance are consistent.
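As a concrete illustration of the score-level fusion just described, the snippet below combines the two face-box scores with weight α. It assumes both scores have already been mapped to comparable ranges, and the default α is only a placeholder (the ratio is tuned, not fixed here).

def late_fusion(identity_scores, activity_scores, alpha=0.5):
    # identity_scores, activity_scores: {face_box_id: score} from GSCMIA and
    # from the AV-activity model (TalkNet or SyncNet), respectively.
    return {
        box: alpha * identity_scores[box] + (1.0 - alpha) * activity_scores[box]
        for box in identity_scores
    }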
5.3.3 Ablation studies

Error Analysis

In this work, we considered two sources of information concerning active speaker detection in videos: i) audio-visual activity modeling and ii) speaker identity co-occurrences across modalities. The fundamental hypothesis underlying this work is that the information captured by the two sources may have complementary components. Here we investigate this further by comparing the predictions of the two systems.

We generate face-box-wise scores using i) SCMIA and ii) TalkNet for the three datasets. We then generate the corresponding predictions by thresholding the obtained scores such that the systems' precision and recall are nearly equivalent. We divide the samples into two groups, i) positive samples and ii) negative samples, using the ground truth. For both groups, we construct confusion matrices comparing the predictions of the two systems. In Figure 5.3.1 we show the confusion matrices for the positive and negative samples for the three datasets (AVA-active speaker, Friends, and TBBT).

Figure 5.3.1: Comparison of SCMIA and TalkNet (R. Tao et al., 2021) predictions for positive (top row) and negative samples (bottom row), for AVA-active speaker, Friends (25 episodes), and TBBT (6 episodes).

The off-diagonal cells in the confusion matrices shown in Figure 5.3.1 represent the exclusive correctness of one of the methods. The cells marked in green show the fraction of samples where SCMIA is exclusively correct. We note that the fraction of samples where one method is solely accurate is significant, especially for the positive samples. From Table 5.1, we note that the performance of TalkNet on AVA is significantly higher than that of SCMIA, precisely a 92.0 mAP score, and has marginal room for further improvement. Even then, the fraction of exclusive correctness of SCMIA is significant. This verifies the hypothesis that the audio-visual activity information and SCMIA have some exclusive components, which motivates us to explore the use of the two sources of information in a complementary fashion toward enhancing ASD performance.

Role of Positive guidance (PG) in GSCMIA

The positive guides intend to provide explicit disambiguating information to SCMIA by acquiring high-confidence positive-class (speaking) predictions from the audio-visual activity information (TalkNet). The speech segments in PG and their corresponding active speaker face tracks readily provide the speech-face associations to SCMIA. To quantify the role of PG, we report, in Table 5.3, the fraction of all speech segments which provide positive guidance, |PG| / |S_all|, where |.| denotes the cardinality of a set. We also report the accuracy of the positive guides toward predicting the positive class (speaking).

Table 5.3: Constituents of positive and effective positive guides and their exactness. All values are shown in %.

Videos  | PG Fraction | PG Accuracy | Effective PG Fraction | Effective PG Accuracy
AVA     | 24.65       | 91.67       | 4.41                  | 82.03
Friends | 73.42       | 79.69       | 12.09                 | 47.73
TBBT    | 66.59       | 83.13       | 7.81                  | 65.82

We observe that the accuracy of PG is high for all the datasets. This validates the usage of PG as ground truth, adding a marginal amount of noise in the form of a small number of false positives. We note that the PG fraction is larger, and the accuracy smaller, for the TV shows compared to AVA.
This is due to the lower value of τ_p for VPCD. For some of the speech segments in PG, the speech-face associations are correctly deduced even by SCMIA (no guidance). A simple example is the case when only the speaker's face is visible in the frames. Effectively, the speech segments in PG for which SCMIA can establish the same speech-face association do not help beyond reducing the computational load. Removing such speech segments from PG, we report the fraction of S_all that presents effectively helpful information in Table 5.3 as effective positive guides. We observe that a much smaller fraction of S_all constitutes the effective positive guides. We also tabulate the accuracy of the effective positive guides and observe that they are less accurate than the bigger set PG. The smaller accuracy is notable for the TV shows, attributed to the lower value of τ_p for VPCD.

Positive guides at work

Here we demonstrate the advantage of the positive guidance from the audio-visual activity in solving the speech-face association using GSCMIA. We consider an adverse case where the speaker identity-based method fails due to insufficient disambiguating information. One such scenario is when all the speaker faces are always visible in the frames; a panel discussion is a simple example. The Columbia dataset, consisting of an 85 min long panel discussion video, is the closest such case; it provides active speaker annotations for a 35 min duration of the video, comprising six speakers.

Table 5.4: Comparison of the speaker-wise weighted F1 scores for all the speakers in the Columbia dataset.

Methods                                              | Abbas | Bell | Boll | Lieb | Long | Sick | Avg
Chakravarty et al. (Chakravarty & Tuytelaars, 2016)  | -     | 82.9 | 65.8 | 73.6 | 86.9 | 81.8 | 78.2
Shahid et al. (Shahid et al., 2019)                  | -     | 89.2 | 88.8 | 85.8 | 81.4 | 86.0 | 86.2
Chung et al. (J. S. Chung & Zisserman, 2016)         | -     | 93.7 | 83.4 | 86.8 | 97.7 | 86.1 | 89.5
Afouras et al. (Afouras et al., 2020)                | -     | 92.6 | 82.4 | 88.7 | 94.4 | 95.9 | 90.8
S-VVAD (Shahid et al., 2021)                         | -     | 92.4 | 97.2 | 92.3 | 95.5 | 92.5 | 94.0
RealVAD (Beyan et al., 2021)                         | -     | 92.0 | 98.9 | 94.1 | 89.1 | 92.8 | 93.4
TalkNet (R. Tao et al., 2021)                        | -     | 97.1 | 90.0 | 99.1 | 96.6 | 98.1 | 96.2
SCMIA                                                | 97.0  | 95.0 | 91.0 | 65.3 | 0.1  | 34.6 | 65.0
GSCMIA                                               | 97.3  | 96.3 | 89.4 | 98.7 | 98.7 | 96.8 | 96.2

We evaluate the performance of SCMIA (no guidance) on the Columbia dataset and report the widely used speaker-wise weighted F1 score in Table 5.4. We also report the performance of several state-of-the-art methods for comparison. We observe that the performance is poor, particularly for three speakers, Lieb, Long, and Sick, while it is comparable to other methods for the remaining speakers. By visually inspecting the system's active speaker predictions, we noted that the system incorrectly associates Long's speech with Sick's face and Lieb's with Long's face. The relevant fragments of the video output are present in the supplementary material. These erroneous associations can be attributed to the fact that Lieb and Sick always appear together in the video, providing the system with insufficient information to resolve the speech-face association.

We inject the explicit disambiguating information in the form of positive guides from TalkNet. We use just the positive guides, no negative guides, while employing Algorithm 3, to quantify the impact of the positive guides. We report the performance of the positively guided GSCMIA for all the speakers in Table 5.4. We observe that the guided system restores the earlier incorrect speech-face associations and performs on par with other methods for all the speakers.
The relevant video fragments with the enhanced predictions are present in the supplementary material. The improved performance highlights the importance of positive guides in providing the disambiguating information required to establish accurate speech-face associations.

Role of Negative guides (NG_n)

The visual information modeling the activity in the lip region and its coordination with the underlying audio waveform can reliably predict which of the visible faces are not speaking. The negative guides are introduced to incorporate this information concerning non-speaking faces to help the speech-face association using the speaker identity contextual information. The face tracks classified by TalkNet as not speaking with high enough confidence are used as ground truth. We remove these faces from the set of candidate active speaker faces while establishing the speech-face association using GSCMIA.

Although not as explicitly as in the case of positive guides, the negative guides also introduce disambiguating information by eliminating some of the candidate active speaker faces. An example is a scenario with two faces visible in the frames, where one of them is eliminated by the negative guides as not speaking, establishing a direct speech-face correspondence.

Table 5.5: Constituents of negative and effective negative guides and their exactness. All values are shown in %.

Videos  | NG Fraction | NG Accuracy | Effective NG Fraction | Effective NG Accuracy
AVA     | 46.69       | 96.06       | 20.54                 | 89.57
Friends | 22.00       | 98.43       | 2.98                  | 86.72
TBBT    | 17.87       | 93.87       | 3.61                  | 56.59

In Table 5.5, we quantify the fraction of all face tracks that constitute the negative guides, Σ_n |NG_n| / Σ_n |F_n|, along with their accuracy in predicting the negative class (not speaking). We observe high accuracy values for all datasets, attributed to selecting only high-confidence predictions from TalkNet. We also note that the fraction of face tracks that the negative guides constitute is much higher for AVA than for the others. The reason can be that TalkNet performs better for AVA than for the others, primarily due to training on the AVA train set.

As in the case of PG, only some of the face tracks in the negative guides provide new information to GSCMIA. Even with no guidance, SCMIA can correctly deduce some of the face tracks in the negative guides as not speaking. Discarding such face tracks, we report the remaining fraction of all face tracks in Table 5.5 as effective negative guides, along with their precision for the negative class. We observe that only a small fraction of the face tracks in NG_n effectively add value to GSCMIA for the TV shows, while a notable fraction does so for the AVA dataset. Surprisingly, we note that the accuracy of the effective negative guides is low for TBBT, adding noisy information. The noisy guides lead to only a marginal increase in the overall performance with the AV guidance, GSCMIA, for TBBT compared to the others, as shown in Table 5.1.

Table 5.6: Effect of using the negative guides: performance enhancement for speech segments with off-screen speakers.

Videos  | off-screen (%) | SCMIA | GSCMIA
AVA     | 33.30          | 16.67 | 9.51
Friends | 13.31          | 19.64 | 17.28
TBBT    | 12.21          | 10.32 | 10.02

Effects on off-screen speakers

One of the limitations of SCMIA is its inability to explicitly address the speech segments where faces are visible in the frames but the speaker is off-screen. However, audio-visual activity methods can handle such cases reliably. The negative guides are intended to fuse this ability of the AV-activity information with SCMIA.
Here we explore the effect of the AV-activity guidance on the speech segments with off-screen speakers. We tabulate the fraction of speech segments with off-screen speakers in Table 5.6 for all three datasets. We observe that AVA has a much higher fraction of speech segments with off-screen speakers than the others. This is due to the nature of the videos in AVA, which are more in the wild compared to the structured TV shows, where the video primarily captures the speaker. The higher fraction of off-screen speakers in AVA is one of the reasons for the lower performance of SCMIA on AVA compared to the others (Table 5.1). We compare the errors of SCMIA and GSCMIA particular to speech segments with off-screen speakers. We report the fraction of speech segments with off-screen speakers for which the systems incorrectly assign an active speaker face in Table 5.6.

We observe that the errors in the form of false positives, from assigning active speaker faces to speech segments with off-screen speakers, reduce consistently for all datasets when guided with AV activity. We point out that the improvement on the AVA dataset is greater than for the TV shows. The larger fraction of effective negative guides for the AVA dataset, observed in Table 5.5, which in turn is due to the large fraction of off-screen speakers, is the reason for the observed substantial improvement.

Figure 5.3.2: Variation in performance, reported in mAP, of SCMIA and GSCMIA with the context length (L).

Effects on required temporal context

SCMIA studies the contextual information concerning the speakers' identities appearing in the speech and faces and establishes the speech-face associations. By formulation, it relies on the speakers' identities observed in the temporal context having enough disambiguating information to solve the required speech-face associations. Using all the available speech segments, i.e., the entire video length, is ideal. However, as described in (Sharma & Narayanan, 2022b), the complexity of finding the speech-face association using Algorithm 3 is a quadratic function of the context length, O(L²); thus, a lower value of L is desirable. The value of L acts as a trade-off between the system's performance and its time complexity.

As mentioned in the implementation details, we use L = 500 for SCMIA. The system observes the context of identities from the speech and faces appearing during the 500 speech segments and assumes that it consists of enough disambiguating information. At the same time, we use a context of only 50 segments (L = 50) when employing GSCMIA. Here we explore the role of the required context of speech segments in solving the speech-face associations.

In Figure 5.3.2, we show the variation in the performance of each system (SCMIA & GSCMIA) with the length of the context (L). We observe a notable increase in the performance (reported in mAP) of SCMIA with the increase in L. This increased performance follows from the fact that the system observes more disambiguating information with a larger context length (L) and thus better solves the speech-face association. In contrast, the system's performance with AV-activity guidance (GSCMIA) is robust to variations in the context length. This validates the hypothesis that the AV-activity guides, positive and negative, contribute to the disambiguating information, making even the smaller context length sufficient to solve the speech-face associations. We also observe that the performance of the guided system with just 25 speech segments is significantly better than that of the unguided system, even with a context of 500 speech segments. Thus, guiding the speaker identity-based system with AV-activity information yields a 400-times improvement in time complexity along with the noted improvement in performance.
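As a small illustration of the context-length trade-off discussed above, the sketch below partitions S_all into windows of L segments and solves each window independently, which is where the per-partition O(L²) cost arises. The helper `solve_fn` stands in for one SCMIA/GSCMIA run and is an assumed placeholder.

def solve_in_partitions(segments, context_len, solve_fn):
    # segments: ordered list of speaker-homogeneous speech segments (S_all)
    # solve_fn: callable returning {segment: assigned face track} for one window
    assignments = {}
    for start in range(0, len(segments), context_len):
        window = segments[start:start + context_len]
        assignments.update(solve_fn(window))   # per-window cost grows as O(context_len^2)
    return assignments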
5.4 Conclusion

This work showed that the errors of the two independent systems for ASD, using i) audio-visual activity and ii) the co-occurrence of speakers' identities, have exclusive components, validating the need to develop methods in which the two approaches collaborate. We proposed a generalized framework that can incorporate external information from any source while studying the co-occurrences of speakers' identities across modalities to establish speech-face associations. We have shown that its integration with fully supervised (TalkNet) and self-supervised (SyncNet) AV-activity systems improves performance. As the framework is unsupervised, its integration with self-supervised AV activity eliminates the need for ASD labels. One immediate future direction is to enable speech-body association, which can further enhance ASD, particularly helping cases where faces are in extreme poses or not visible.

Chapter 6: Using Active Speaker Faces for Diarization in TV shows

6.1 Introduction

We use digital media in our daily lives to create, consume, and experience stories in various forms: books, movies, TV shows, broadcast TV, etc. It impacts how we perceive, form, and communicate opinions and ideas. Recently there have been multiple efforts in the emerging field of computational media intelligence (CMI) to characterize and understand the impact of media portrayals on society. CMI focuses on developing computational tools to analyze media content to obtain a holistic understanding of the stories being told and their impact on the experiences and behavior of individuals (Somandepalli et al., 2021).

One of the central components of CMI is understanding the portrayals of characters in media along with their representation along dimensions of character attributes such as age, gender, race, and appearance. It also aims to analyze the interactions between the characters. To address these aspects of CMI, knowing who speaks when in a media session is crucial. In this work, we address the broad problem of speaker diarization in dialogues of TV shows, determining the active speaker by leveraging visual information.

(This work is being prepared for submission to Interspeech 2023. An initial version has been published on arXiv: Sharma, Rahul, and Shrikanth Narayanan. "Using Active Speaker Faces for Diarization in TV shows." arXiv preprint arXiv:2203.15961 (2022).)

Speaker diarization refers to assigning all the speech segments in an audio session to their respective speakers without using prior knowledge of the involved speakers, be it their real identity or the number of participating speakers. There has been a plethora of work on diarization using the audio modality (Park et al., 2022). The conventional speaker diarization system involves voice activity detection and speaker turn detection to obtain speaker-homogeneous speech segments, followed by clustering these segments in a speaker embedding space. Such solely audio-driven speaker diarization systems are prone to errors when applied to the media content domain, consisting of narrative films, due to diverse and heterogeneous acoustic conditions and context: speech in background music, sound effects, overlapping speech, and wide variation in characters and their portrayals.
Unlike the read speech of broadcast news, character interactions in media such as TV shows and movies are more spontaneous, leading to more frequent turn changes and smaller speech segments, which makes audio-based diarization systems further prone to errors. At the same time, the visual modality contains substantial cues for speaker identification, especially in the dialogue deliveries of entertainment media. In this work, we investigate the applicability of active speaker face clustering for supporting speaker diarization in media content; our exemplary use case is TV shows.

We use a self-supervised framework followed by an audio-visual profile matching strategy, introduced in our previous works (Sharma & Narayanan, 2022a; Sharma et al., 2019, 2022), to obtain active speaker faces in the videos of TV episodes (the Friends show). We take advantage of established face verification (Schroff, Kalenichenko, & Philbin, n.d.) and speaker recognition (L. Wan, Wang, Papir, & Moreno, n.d.) features to represent the active speaker faces and the corresponding speech segments. To obtain diarization results, we cluster the face features using DBSCAN (Hahsler, Piekenbrock, & Doran, 2019), providing no information about the number of participating characters. In this work, we present experimental results on specific episodes of Friends, a part of the Video Person Clustering Dataset (VPCD) (Brown et al., 2021). We are interested in exploring the use of active speaker faces for diarization and in comparing them against the speaker embeddings derived from speech segments.

In this paper, our contributions are twofold: i) We show that active speaker face representations outperform state-of-the-art audio-based embeddings for the task of speaker diarization in TV shows. The face representations are coupled with a simple DBSCAN-based clustering, thus avoiding the sophisticated refinement procedures explicitly designed in recently proposed speaker diarization clustering schemes (Q. Wang et al., n.d.). ii) We report a systematic analysis of the impact of the quality of active speaker detection systems on speaker diarization performance. We show that even a moderately well-performing active speaker detection system (~60% accurate) can exceed the performance of audio-only systems.

6.2 Related Work

6.2.1 Active speaker detection

Earlier works on active speaker localization in visual frames focused on detecting activity in the lip regions of the faces appearing in the frames (Everingham et al., 2006). More recently, (Roth et al., 2020) released a large-scale dataset consisting of active speaker annotations for parts of movies, which was followed by the introduction of a wide variety of fully supervised methods for active speaker localization (Alcázar et al., 2020; J. S. Chung, 2019; Y.-H. Zhang et al., 2019). In parallel, there have been approaches posing active speaker detection as a sound source localization task, proposing self-supervised systems trained for audio-visual synchrony in videos (J. S. Chung & Zisserman, 2016; Y. J. Kim et al., 2021; Owens & Efros, 2018). In our previous approaches, we proposed a weakly supervised cross-modal system trained for the presence of speech in videos and used the class activation maps for active speaker localization (Sharma et al., 2019, 2022). We further proposed audio-visual character profile matching to improve the performance of active speaker localization (Sharma & Narayanan, 2022a).
6.2.2 Face and speaker recognition

Face recognition: Approaches in face recognition can be divided into two categories: i) the face verification task (G. B. Huang et al., 2007; Lu & Tang, 2015) and ii) the face identification task (Guillaumin et al., n.d.; Guo et al., 2016; Parkhi et al., 2015). A common approach to face recognition is metric learning (Guillaumin et al., n.d.), where a deep neural network (Hu et al., 2018) is trained for the task of face verification or identification (Cao et al., 2018; Guo et al., 2016). Speaker recognition: Speaker recognition is a well-explored field, with earlier approaches focused on learning speaker embeddings using variants of the softmax classification loss (J. S. Chung, Huh, Mun, et al., 2020; Z. Huang et al., 2018; Kenny et al., 2013; Nagrani et al., 2020; Yu et al., n.d.). Recent efforts developing metric learning objectives to learn an embedding space with small intra-class and large inter-class distances have shown promising results for speaker recognition (J. S. Chung, Huh, Mun, et al., 2020; Q. Wang et al., n.d.; C. Zhang et al., 2018).

6.2.3 Speaker diarization

Speaker diarization in the audio modality has been extensively addressed and remains an active area of research (Park et al., 2022). In general, diarization frameworks consist of multistage paradigms involving voice activity detection, speaker embedding extraction, and then clustering the speech regions in embedding space (Pal et al., n.d.; Park et al., 2020; Q. Wang et al., n.d.). Recently there has been an increase in end-to-end neural speaker diarization systems (Z. Huang et al., n.d.). Specific to diarization in TV shows, previous methods have used visual patterns among the shots (Bost et al., 2015), face clustering, and talking face detection (Bredin & Gelly, 2016; J. S. Chung, Huh, Nagrani, et al., 2020) to complement audio-driven speaker embeddings. Large-scale audio-visual diarization datasets around TV shows (J. S. Chung, Huh, Nagrani, et al., 2020) and feature films (E. Z. Xu et al., 2021) have also emerged, involving a semi-automatic annotation process.

6.3 Methodology

6.3.1 Problem formulation

Given a video, we acquire the set of speaker-homogeneous speech segments {s_n}, ∀n ∈ [1,N], and for each s_n, the corresponding temporally overlapping set of face tracks, {f_k^n}, ∀k ∈ [1,K]. We define active speaker detection (ASD) as the task of finding a source face track (if any) for each speech segment s_n from its respective set of overlapping face tracks {f_k^n}. We formulate speaker diarization (SD) as the task of clustering all the speech segments s_n into C clusters, {s_l^c}, ∀l ∈ [1,L_c] & c ∈ [1,C], where L_c is the number of speech segments in cluster c and C is the number of characters in the video, which in general is not known. We use oracle speaker-homogeneous speech segments for all our experiments.

6.3.2 Active speaker detection (ASD)

We use a two-stage method for ASD. We first derive active speaker information from the visual cues using a weakly supervised cross-modal framework (Sharma et al., 2022). We next construct audio-visual character profiles and impose the constraint that for a given active speaker instance {s_n, f_k}, the source face track f_k for the speech segment s_n belongs to the same character (Sharma & Narayanan, 2022a).
Visual score

In our previous work, we proposed the Hierarchical Context Aware (HiCA) architecture (Sharma et al., 2022), trained for the task of detecting the presence of speech in a video segment, and showed that the system can localize active speakers in video frames. We used class activation maps (CAMs) to localize the salient regions. Formally, for a video segment v_i, we compute a spatio-temporal (3D) feature map F^m at the last layer of the HiCA network, with m filters, and the softmax prediction score for the presence of speech, ŷ_i. We then compute the CAM, M, following Eqn. 6.3.1, with Z being the averaging factor. An overview of the system is shown in Figure 6.3.1.

Figure 6.3.1: Active speaker localization using the HiCA architecture and GradCAMs (Sharma et al., 2022).

α_m = (1/Z) Σ_i Σ_j Σ_k ∂ŷ_i / ∂F^m_{ijk},    M = ReLU( Σ_m α_m F^m )    (6.3.1)

For a face track, f_k, temporally overlapping with a speech segment, s_n, we compute a visual active speaker score, VAS_k, by ROI pooling the relevant CAMs. We ROI pool the CAMs for each frame, t, of the face track that overlaps with s_n in time and compute an aggregate mean over the frames, with M_t denoting the CAMs for frame t and {(x1^t, y1^t), (x2^t, y2^t)} denoting the face bounding box in frame t.

VAS_k = (1/Z) Σ_t Σ_{x,y} M_t[y1^t : y2^t, x1^t : x2^t]    (6.3.2)
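For concreteness, a minimal sketch of the VAS computation of Eqn. 6.3.2 is given below: it averages the CAM values inside each face box of the track over the frames that overlap the speech segment. The array shapes and the helper name are illustrative assumptions.

import numpy as np

def visual_active_speaker_score(cams, boxes):
    # cams:  list of 2D class activation maps M_t, one per overlapping frame
    # boxes: list of face bounding boxes (x1, y1, x2, y2) in CAM coordinates,
    #        aligned with `cams`
    pooled = []
    for M_t, (x1, y1, x2, y2) in zip(cams, boxes):
        roi = M_t[y1:y2, x1:x2]            # ROI-pool the CAM over the face box
        if roi.size:
            pooled.append(roi.mean())
    return float(np.mean(pooled)) if pooled else 0.0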
Iterative profile matching

In (Sharma & Narayanan, 2022a) we introduced a strategy for constructing audio-visual character profiles to complement the visual-only active speaker detection, which takes advantage of the fact that, for a potential active speaker instance {s_n, f_k}, if s_n is the speech of one character, f_k should be the face of the same character. We initiate with a set of high-confidence instances (HCI) of face-speech associations and cluster them to construct audio-visual character profiles {F_c, S_c}. F_c ≡ {f_k^c} denotes the face instances clustered into character c (the real identity and number of characters are unknown), and S_c ≡ {s_k^c} denotes the corresponding speech instances for cluster c.

Using the constructed audio-visual character profiles, we compute a profile matching score (PMS) (Sharma & Narayanan, 2022a) for all other instances as P(s_n → f_k), signifying the confidence of the speech segment s_n and face track f_k belonging to the same character profile. We update the set of HCI with the highly scored active speaker instances (scored using a combination of VAS and PMS) and repeat the process until no more instances are scored high enough to be added to HCI. At inference time we score any potential active speaker instance {s_n, f_k} as VAS_k + PMS_nk.

6.3.3 Speaker diarization

The diarization objective is to cluster all instances of speech segments s in the video such that speech segments coming from one character are grouped together. Subsequent to active speaker detection, we have pairs of speech and associated faces, {s_n, f_n}, along with speech instances s_b which have speakers in the background and thus no face associated. For the audio-based diarization baseline, we use all the speech instances s_n ∪ s_b and cluster them using spectral clustering, after applying a diarization-specific refinement sequence and estimating the number of speakers as suggested in (Q. Wang et al., n.d.). For vision-based clustering, we first filter the obtained set of speech-face associations {s_n, f_n} based on the confidence of the association to remove the noisy instances.

A ≡ { (s_n, f_n) : VAS_n + PMS_n > τ }    (6.3.3)

We employ a density-based clustering algorithm, DBSCAN (Hahsler et al., 2019), which does not require knowledge of the number of clusters, to cluster all the face-track instances f_n ∈ A to obtain F_c ≡ {f_k^c}, c ∈ [1,C], where C is the number of clusters. DBSCAN also provides a list of instances which were not clustered and are marked as noisy samples. Next we form clusters among the speech segments s_n ∈ A, using the available associations {s_n, f_n}:

S_c ≡ { s_n^c } : ∃ {s_n^c, f_n^c} ∈ A    (6.3.4)

In order to assign cluster labels to the remaining speech segments (segments with no face, noisy speech-face associations, or those marked noisy by DBSCAN), we compute their cosine distances in speaker embedding space from all the clustered points. We assign to each such instance s_q the cluster label that shows the least average distance, as shown in Eqn. 6.3.5, where (·) represents the dot product and ‖.‖ represents the l2 norm.

argmin_c (1/|S_c|) Σ_n (s_q · s_n^c) / (‖s_q‖ ‖s_n^c‖)    (6.3.5)
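The sketch below illustrates the vision-based clustering step just described: DBSCAN over face-track embeddings, with leftover speech segments assigned to the speech cluster at the smallest average cosine distance (one reading of Eqn. 6.3.5). The parameter values and helper names are illustrative assumptions, not the settings used in the experiments.

import numpy as np
from sklearn.cluster import DBSCAN

def diarize(face_emb, speech_emb, leftover_emb, eps=0.5, min_samples=5):
    # face_emb[n], speech_emb[n]: embeddings of the filtered associations in A
    # leftover_emb[q]: speaker embeddings of segments without a reliable face
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine').fit_predict(face_emb)

    # speech clusters S_c inherit the face-cluster labels (Eqn. 6.3.4); -1 marks noise
    clusters = {c: speech_emb[labels == c] for c in set(labels) if c != -1}

    def closest_cluster(s_q):
        def avg_dist(S_c):
            sims = S_c @ s_q / (np.linalg.norm(S_c, axis=1) * np.linalg.norm(s_q) + 1e-8)
            return float(np.mean(1.0 - sims))
        return min(clusters, key=lambda c: avg_dist(clusters[c]))

    return labels, [closest_cluster(s_q) for s_q in leftover_emb]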
6.4 Experiments

We use a subset of VPCD (Brown et al., 2021), consisting of 7 episodes from season 3 of the Friends TV show, for evaluation purposes. VPCD provides all the face tracks appearing in the video frames and the character identities, and it also provides the speech segments for each character. It additionally consists of extracted face embeddings for all the annotated faces and speaker embeddings for all the speech segments. We use the speaker-homogeneous voiced segments directly from VPCD and use the provided face and speaker embeddings in our setup.

Table 6.1: Active speaker detection performance improvement with the iterative profile matching strategy.

Video Name      | CAMs only (accuracy) | CAMs + Iterative Profile Matching (accuracy)
Friends s03e17  | 0.69                 | 0.76
Friends s03e01  | 0.69                 | 0.79
Friends s03e02  | 0.65                 | 0.75
Friends s03e03  | 0.76                 | 0.81
Friends s03e04  | 0.77                 | 0.81
Friends s03e05  | 0.76                 | 0.83
Friends s03e06  | 0.77                 | 0.84

Figure 6.4.1: Distance matrices for the speech-face associations, using cosine distance among face-track embeddings, for different sets of speech-face instances: (a) ground truth, (b) all samples, and the top (c) 75%, (d) 50%, and (e) 25% of samples in order of ASD scores.

6.4.1 Active speaker detection performance

The employed active speaker detection system predicts a corresponding face for each ground truth speech segment. We compute the accuracy of this system as the ratio of the duration of the correctly predicted faces to the total duration of the available speech segments. For the CAMs-only system, for each speech segment we assign the face track with the maximum VAS score, among all the temporally overlapping face tracks, as the active speaker. When combined with the iterative profile matching strategy, we use the VAS + PMS score as the criterion. In Table 6.1 we compare the performance under these two scenarios and show that the profile matching strategy consistently improved the ASD performance for all the videos.

Table 6.2: Speaker diarization performance, DER (lower is better), using face tracks compared against the audio-only system.

Video Name     | 100% face-tracks | 75% face-tracks | 50% face-tracks | 25% face-tracks | 100% speech segments
Friends s03e17 | 0.199            | 0.17            | 0.13            | 0.24            | 0.17
Friends s03e01 | 0.27             | 0.16            | 0.20            | 0.27            | 0.27
Friends s03e02 | 0.219            | 0.19            | 0.21            | 0.19            | 0.23
Friends s03e03 | 0.18             | 0.18            | 0.19            | 0.23            | 0.26
Friends s03e04 | 0.31             | 0.30            | 0.13            | 0.40            | 0.21
Friends s03e05 | 0.16             | 0.20            | 0.17            | 0.36            | 0.30
Friends s03e06 | 0.21             | 0.24            | 0.17            | 0.25            | 0.26

6.4.2 Speaker diarization performance

As mentioned in Section 6.3.3, before performing the face clustering for speaker diarization, we filter the obtained speech-face associations based on the confidence of the active speaker detection system. We present speaker diarization performance for four such sets: three filtered sets containing the top 25%, 50%, and 75% of all the instances, ranked by ASD score (VAS + PMS), and one set containing all the samples (100%). We compare the performance against audio-based speaker diarization (Q. Wang et al., n.d.). We report the diarization error rate (DER) as the metric to compare SD performance, tabulated in Table 6.2.

We observe that the face-clustering-based diarization outperforms the audio-based framework for all the videos, except when we use just 25% of the samples for clustering. The relatively lower performance in this case can be attributed to the fact that 25% of the samples may not carry enough information about all the characters. Most of the videos perform relatively worse when using 100% of the samples than with a smaller number of high-confidence points. This suggests that filtering the ASD results is a crucial step to remove noisy classifications before clustering them.

To further understand the effects and dynamics of filtering the ASD results, in Figure 6.4.1 we show the distance matrices for the selected speech-face associations (for episode 17, chosen arbitrarily), post-filtering, in the set A (as described in Section 6.3.3) for different setups. We compute the distances in the face domain, using the cosine distances between the face-track embeddings. In Figure 6.4.1a, we show the ideal-case scenario, with perfect ASD output (obtained using the ground truth). For visualization purposes, we grouped the instances for each character (using ground truth identities for the speech segments) and used the same grouping for the rest of the cases for easier comparison. In Figure 6.4.1b, we show the distance matrix for all the speech-face associations obtained from the ASD system. The incorrect predictions are evident in the form of visible noise in the distance matrix compared against the ideal-case scenario in Figure 6.4.1a. In Figures 6.4.1c, d, and e, we show the distance matrices of the filtered sets and observe that the amount of noise decreases and the cluster patterns become more apparent. This results in improved SD performance with the filtered sets, as shown in Table 6.2, for the majority of videos. Although in Figure 6.4.1e, showing the distance matrix for the smaller set of 25% of all the samples, we notice prominent clusters, the number of data points drastically decreases, impacting the distribution of characters among them. The inadequate distribution of characters among the data points explains the drop in performance for the set containing 25% of the samples, as reported in Table 6.2.

Figure 6.4.2: Distance matrices for simulated ASD output with 90% and 50% accurate ASD. a) and c): simulated output with all samples. b) and d): output with just the correct samples.
Table 6.3: Variation in speaker diarization performance (using face clustering) with the quality of active speaker detection.

Video name / simulated accuracy | 100% | 90%  | 80%  | 70%  | 60%  | 50%
s03e17 (all samples)            | 0.06 | 0.13 | 0.26 | 0.43 | 0.60 | 0.71
s03e17 (correct samples)        | 0.06 | 0.06 | 0.07 | 0.08 | 0.11 | 0.16
s03e01 (all samples)            | 0.14 | 0.21 | 0.33 | 0.46 | 0.61 | 0.72
s03e01 (correct samples)        | 0.14 | 0.13 | 0.15 | 0.12 | 0.15 | 0.26
s03e02 (all samples)            | 0.16 | 0.23 | 0.33 | 0.46 | 0.61 | 0.72
s03e02 (correct samples)        | 0.16 | 0.15 | 0.16 | 0.15 | 0.12 | 0.17

We note from the above discussion that the performance of ASD has a direct impact on the performance of SD. To study this further, we simulated k% accurate ASD outputs, where k ∈ {100, 90, 80, 70, 60, 50}, using the available ground truth. We randomly select (100 − k)% of all the ground truth speech-face associations and shuffle their face instances among them, thus creating incorrect speech-face pairs. We perform speaker diarization using the simulated ASD outputs in two ways: i) using all the samples, including the incorrect ASD predictions, and ii) using just the correct ASD samples (we keep track of the correct samples while simulating).

In Table 6.3, we report the DER for 3 randomly selected videos, using the two strategies, averaged over 5 runs. It is clear that when using all the samples, the speaker diarization performance drastically degrades as the accuracy of the ASD decreases. This can be attributed to the introduction of more noisy speech-face associations as the ASD accuracy falls. In Figure 6.4.2 we show the distance matrices (simulated speech-face associations in terms of face-track embeddings) for two extreme values of k: 90% and 50%. The introduction of noise is visible in the 50% accuracy case (Figure 6.4.2c).

In the case of speaker diarization using just the correct ASD instances, the DER remains nearly stagnant for the higher accuracies but degrades for lower accuracy (the 50% case). This can be attributed to the inadequate distribution of the samples across the characters in the video. The change in distribution can be noted in Figures 6.4.2b and d, as the number of samples decreases for the lower accuracy case. Comparing the performance (while using just the correct samples) against the audio-based diarization performance (Table 6.2), we observe that even the lower accuracy cases outperform the audio-only systems.
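A minimal sketch of the simulation protocol described above: corrupt (100 − k)% of the ground-truth speech-face pairs by shuffling their face instances among themselves. The data layout is an illustrative assumption, and the result is only approximately k% accurate since a shuffle can occasionally restore a correct pair.

import random

def simulate_asd(pairs, k, seed=0):
    # pairs: list of (speech_id, face_id) ground-truth associations
    rng = random.Random(seed)
    n_wrong = round(len(pairs) * (100 - k) / 100)
    wrong_idx = rng.sample(range(len(pairs)), n_wrong)
    faces = [pairs[i][1] for i in wrong_idx]
    rng.shuffle(faces)
    simulated = list(pairs)
    for i, f in zip(wrong_idx, faces):
        simulated[i] = (pairs[i][0], f)
    correct_idx = [i for i in range(len(pairs)) if i not in set(wrong_idx)]
    return simulated, correct_idx   # diarize with all pairs, or with correct_idx only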
6.5 Conclusion

This work compared the use of active speaker faces for speaker diarization in media content against conventional audio-only methods. We used an off-the-shelf active speaker face detection system and performed DBSCAN clustering, which does not require information about the number of characters. We demonstrated that this system outperforms the state-of-the-art speaker-embedding-based diarization system. We further performed a detailed analysis of the impact of active speaker detection quality on speaker diarization performance. We observed that even moderately performing (~60% accurate) active speaker systems can carry sufficient information to outperform speaker-embedding-based clustering for speaker diarization. Future work will develop methods to effectively combine face and speech information from talking characters for diarization. Another critical step is further improving the active speaker detection systems, since face clustering showed high potential for speaker diarization.

Chapter 7: Audio-visual character profiles for detecting background characters in entertainment media

7.1 Introduction

Technological advances have made it easier than ever to produce and experience media content across platforms and genres, be it news, movies, TV shows, digital shorts, or user-generated videos. Media consumption has become an integral part of our lives in this modern world. This calls for a pressing need to develop methods to analyze and understand the impact of media content on human life, be it societal or economic. There has been a recent surge of research under the theme of computational media intelligence (CMI), which focuses on providing tools to analyze the people, places, and context in multimedia content to gain a holistic understanding of the stories being told, and their impact on the experiences and behavior of individuals and society at large (Somandepalli et al., 2021).

(This work is being prepared for submission to the International Conference on Image Processing (ICIP) 2023. A version of this work has been published on arXiv: Sharma, Rahul, and Shrikanth Narayanan. "Audio visual character profiles for detecting background characters in entertainment media." arXiv preprint arXiv:2203.11368 (2022).)

One of the goals of CMI relates to the representation and identity of the people (characters) depicted in the media, and understanding their portrayals in the stories. Characters in media stories can be broadly divided into i) primary or lead characters and ii) background characters, who appear in the background of a scene and participate minimally in an ongoing interaction. There have been numerous works dealing with detecting and localizing instances of primary characters in entertainment media, primarily TV shows and movies (Bojanowski et al., 2013; Haurilet, Tapaswi, Al-Halah, & Stiefelhagen, 2016). In contrast, learning representations for background characters and studying their portrayals is relatively unexplored (Bojanowski et al., 2013; Cour, Sapp, & Taskar, 2011; Parkhi, Rahtu, Cao, & Zisserman, 2018). In this work, we address the problem of detecting background characters in media content, particularly movies and TV shows. In entertainment media, primary characters appear most of the time and speak very often. Background characters, on the other hand, may appear for a small fraction of time and may not speak during the entire movie. In this work, we operationally define background characters as those who do not speak throughout the movie.

This paper presents an unsupervised framework to detect background character face tracks in videos, enlisting the face tracks that do not show speech activity throughout the video. Since there has been little past research on computationally studying the characteristics of background characters, there are no readily available large-scale datasets for this purpose. Hence, in this work we leverage an existing dataset (Brown et al., 2021), consisting of primary characters annotated for selected TV shows and movies, to derive a complementary set of background character face tracks for evaluation. To validate the automatically obtained set of background character face tracks, we post-processed this dataset by verifying with manual annotations (to create the final evaluation set). The proposed framework's performance is evaluated on this newly curated background character dataset.

We build on recent advances in active speaker localization in unconstrained videos (Roth et al., 2020; Sharma et al., 2019, 2022) and audio speaker verification strategies (J. S. Chung et al., 2018; L. Wan et al., n.d.) to develop audio-visual character profiles for the talking characters involved in a movie.
Rather than collecting all the instances of characters speaking, we construct generic audio-visual representations for the characters in a movie. Further, using face verification techniques (Deng et al., 2020; Schroff et al., n.d.), we collect all the face tracks that do not match the created audio-visual character profiles and call them background characters.

The main contributions of this paper include developing i) a method assisting the visual-signal-based active speaker localization system with speaker recognition information, showing enhanced performance, ii) an iterative clustering paradigm to construct audio-visual character profiles for primary characters in movies, and iii) a background character dataset, which can be used to understand the representation and characteristics of background characters in media content.

7.2 Background Character Dataset

This paper introduces a pilot dataset consisting of background characters' face tracks for episodes from the sitcom Friends. We aimed to obtain the background character face tracks as a complementary set to the primary character face tracks. We use the Visual Person Clustering Dataset (VPCD) (Brown et al., 2021), which provides annotations for all the characters involved in conversations (speaking at some point in time) and the timestamps of their corresponding voice activity along with the character IDs. We use RetinaFace (Deng et al., 2020) and MMTracking (Contributors, 2020) to get the face tracks in the video. We then purge all the primary character face tracks (obtained from VPCD) which overlap with the obtained face tracks in space and time. To further refine the dataset, we remove the face tracks which match a primary character with high confidence, using VGGFace2 (Cao et al., 2018) to obtain face-track embeddings. As a final refinement step, we obtain manual annotations. Figure 7.2.1 shows details about the curated dataset.

Figure 7.2.1: Left: Sample frames with background characters marked in green. Right: Number of tracks in each episode.

Friends episode # | All face-tracks | Background face-tracks
s0301             | 835             | 171
s0302             | 998             | 100
s0303             | 996             | 223
s0304             | 1150            | 275
s0305             | 1188            | 314
s0306             | 1017            | 264

7.3 Proposed Approach

7.3.1 Problem formulation

Given a speaker-homogeneous voiced segment v and a corresponding set of temporally overlapping face tracks {f_k} (k ∈ [1,K]), we aim to find a score for the mapping v → f_k, signifying that face track f_k is the source of the speech segment v. We obtain the active voice regions in a movie using the speech segmentation tool pyannote (Bredin et al., 2020) and partition them at the shot boundaries, which we gather using PySceneDetect. The partitioning at the shot boundaries is motivated by the fact that speaker change is a prominent movie-cut attribute (Pardo et al., 2021), thus decreasing the likelihood of observing a speaker change within the obtained segments. We further partition the voice segments to have a maximum duration of 1 s, which further increases the probability of the acquired voiced segments v_n (n ∈ [1,N]) being speaker homogeneous. We use RetinaFace (Deng et al., 2020) to obtain the set of face tracks {f_k} from visual frames that coincide with the speech segment v_n in time. Now, for each speech segment v_n, we have K potential active speaker instances, {v_n, f_k}, where k ∈ [1,K] and n ∈ [1,N], signifying that face track f_k is a potential source for speech segment v_n.
We describe the methodology to compute a score for each instance using the audio-visual features in the following sections.

7.3.2 Visual active speaker score

In recent work, we introduced the Hierarchically Context-Aware (HiCA) cross-modal network (Sharma et al., 2022), trained to detect the presence of speech in a video segment. It was demonstrated, using class activation maps (CAMs) (Zhou et al., 2016) for the positive class, that such a system can reliably localize active speaker faces in a video. We use the HiCA network, pretrained on movies, to obtain CAMs for a video. We further compute a visual active speaker (VAS) score for each face track as the aggregated mean of the CAMs over the region of interest, as denoted below, where CAM_f represents the CAMs for frame f and Z denotes the averaging factor.

VAS_k = (1/Z) Σ_f Σ_{x,y} CAM_f[y1 : y2, x1 : x2]    (7.3.1)

We select the speech segments with only one face track associated with the visual stream and a high VAS score, and call them high-confidence instances (HCI). Essentially,

HCI ≡ { v_n, f_k }  ∀ n ∈ [1,N] : VAS_k > τ, K = 1    (7.3.2)

7.3.3 Profile matching score

We intend to generate individual speech and face profiles for the speaking characters. We impose the constraint that, given an active speaker instance {v_n, f_k}, if f_k matches one character profile, the speech segment v_n should align with the corresponding speech profile of that character. We represent a face track, f_k, using the average of the involved face embeddings, obtained using a ResNet50 trained on VGGFace2 (Cao et al., 2018). To generate the speech segment representation, we use the speaker recognition model from pyannote (Bredin et al., 2020).

To generate the character speech-face profiles, we cluster the instances in HCI using the visual modality. We use Hierarchical DBSCAN (hDBSCAN) (McInnes, Healy, & Astels, 2017) for clustering, as it requires no pre-specified number of clusters and offers a way to compute soft cluster membership at inference time. We cluster the instances in HCI to obtain L clusters F_l, where l ∈ [1,L]. We refer to the set of corresponding speaker embeddings for the points in F_l as V_l. We call F_l and V_l the face and speech profiles, respectively.

For a speech-face instance {v_n, f_k}, we compute the profile matching score (PMS) as P(v_n → f_k), the probability of f_k being the sound source for speech segment v_n:

P(v_n → f_k) = Σ_l P(v_n ∈ V_l, f_k ∈ F_l)    (7.3.3)

P(v_n → f_k) = Σ_l P(v_n ∈ V_l) P(f_k ∈ F_l)    (7.3.4)

which follows from the fact that v_n and f_k independently share membership with cluster l ∈ [1,L] in their corresponding modalities. We estimate P(f_k ∈ F_l) using the outlier-based membership score from hDBSCAN, which uses the GLOSH (Campello, Moulavi, Zimek, & Sander, 2015) algorithm to compute the outlier score. The outlier-based membership score facilitates inference in the case of unseen characters. To estimate the membership for v_n, we first cluster the v_n ∀n ∈ HCI using hDBSCAN to obtain B clusters Ṽ_b, where b ∈ [1,B].
Using the outlier-based membership score for P(v_n ∈ Ṽ_b) from hDBSCAN:

P(v_n ∈ V_l) = Σ_b P(v_n ∈ V_l | v_n ∈ Ṽ_b) P(v_n ∈ Ṽ_b)    (7.3.5)

P(v_n ∈ V_l | v_n ∈ Ṽ_b) = |V_l ∩ Ṽ_b| / |Ṽ_b|    (7.3.6)

which follows from the total probability rule, where |.| denotes the cardinality of a set. Thus we estimate

PMS = Σ_l Σ_b P(v_n ∈ V_l | v_n ∈ Ṽ_b) P(v_n ∈ Ṽ_b) P(f_k ∈ F_l)    (7.3.7)

7.3.4 Iterative profile matching

Using the VAS score of Eqn. 7.3.1, we start with an initial set of high-confidence speech-face associations, HCI (Eqn. 7.3.2), derived from the visual signal. In each iteration, we compute a confidence score, a linear combination of VAS and PMS, for each instance {v_n, f_k}. We add the instances with a high enough confidence score to the set HCI and repeat the algorithm. The paradigm is shown in Algorithm 4.

In the initial iterations of the algorithm, since the number of high-confidence instances is small, it is likely that some characters do not have any instances in HCI. Since cluster membership is fundamental to the PMS, having no instances in HCI gives unreliable PMS scores. To alleviate this issue, we decrease the weight of the PMS scores in the initial iterations, as shown in Algorithm 4, line 9.

Algorithm 4: Iterative profile matching algorithm
1   HCI ← { s_n, f_k }  ∀ n ∈ [1,N] : VAS_k > τ, K = 1
2   iter = 0
3   while |HCI| increases do
4       iter = iter + 1
5       (F_l, V_l) ← HCI                     // clustering
6       for each instance {s_n, f_k} ∉ HCI do
7           Compute VAS_k                    // using Eqn. 7.3.1
8           Compute PMS                      // using Eqn. 7.3.7
9           α = 1 − 0.95^iter
10          score = α · PMS + (1 − α) · VAS_k
11          if score > δ then
12              add instance {s_n, f_k} to HCI
13          end
14      end
15  end
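The sketch below illustrates how the profile matching score of Eqn. 7.3.7 could be assembled from hDBSCAN soft-membership vectors and the speech-cluster overlap of Eqn. 7.3.6. The data layout and helper name are illustrative assumptions rather than the exact released implementation.

import numpy as np

def profile_matching_score(p_face, p_speech_aux, overlap):
    # p_face:       length-L array, P(f_k in F_l), hDBSCAN soft membership over face profiles
    # p_speech_aux: length-B array, P(v_n in V~_b), hDBSCAN soft membership over speech clusters
    # overlap:      L x B matrix, overlap[l, b] = |V_l ∩ V~_b| / |V~_b|   (Eqn. 7.3.6)
    p_speech = overlap @ p_speech_aux        # P(v_n in V_l), Eqn. 7.3.5
    return float(np.dot(p_speech, p_face))   # PMS, Eqns. 7.3.4 and 7.3.7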
7.3.5 Background character detection

Under the premise that all primary characters speak for a considerable amount of time in a movie, it is highly likely that, after enough iterations of profile matching, all primary characters have a significant number of instances in the set of high-confidence instances (HCI). After the iterative matching algorithm, we obtain character profiles F_l, V_l for l ∈ [1,L], which we represent by the mean of the instances they contain. We classify a face-track f_k as a background character track if its minimum distance from any character profile is larger than a threshold β.

7.4 Performance Evaluation

7.4.1 Active speaker localization

We use 6 episodes of the 3rd season of the TV show Friends for evaluation. We obtain the active speaker ground truth from the VPCD corpus (Brown et al., 2021), which provides voice activity details and face-tracks for all the characters. We use Algorithm 4 to obtain a score (the weighted sum of VAS and PMS) for the face-tracks that coincide with voiced segments. Face-tracks that do not overlap with any active voice region are scored with the corresponding VAS/10, scaled down to account for the absence of voice activity. We extend each face-track score to all of its constituent face boxes and compare the performance against the ground truth. We report the area under the ROC curve (auROC) for the two extreme steps of the algorithm: i) the first iteration, using just the VAS score, and ii) the last iteration. The results are tabulated in Table 7.1. We note a clear improvement in active speaker performance with character profile matching, consistent across all the episodes.

Table 7.1: Performance for active speaker localization using audio-visual character profiles.
Video Name        CAMs (auROC)    Audio-visual (auROC)
Friends s03e01    0.77            0.87
Friends s03e02    0.75            0.77
Friends s03e03    0.70            0.77
Friends s03e04    0.74            0.80
Friends s03e05    0.74            0.86
Friends s03e06    0.79            0.84

To further understand the effect of the iterative algorithm, we study the performance of active speaker localization across iterations. In Figure 7.4.1 we show the ROC performance for Friends s03e01 for every other iteration. We observe a significant jump in the initial iterations, which saturates as we iterate further. This can be attributed to the saturation in the number of high-confidence instances with iterations, as shown in Figure 7.4.2. We show the distribution of primary characters among the instances in HCI for the initial and terminating iterations in Figure 7.4.3. The notable increase in instances across all the characters explains the gained robustness of the character profiles, further supporting the performance improvement.

Figure 7.4.1: Performance for active speaker localization (ROC) for different iterations of the profile matching algorithm on Friends s03e01: CAMs only (auROC 0.77), iteration 3 (0.86), iteration 5 (0.86), iteration 7 (0.86), and iteration 9 (0.87).

Figure 7.4.2: The increase in high-confidence instances saturates with iterations.

Figure 7.4.3: Distribution of characters (Monica, Janice, Phoebe, Chandler, Mr. Heckles, Eric, Joey, Rachel, Ross) among high-confidence instances (HCI), for the two extreme steps of iterative profile matching (iteration 0 vs. iteration 10).

7.4.2 Background character detection

We use the newly curated background character dataset, described in section 7.2, for validation. As described in section 7.3.5, we obtain a score for each face-track in the video: the minimum distance from the acquired character profiles. In Table 7.2 we show the performance of the character profiles for detecting background characters, in terms of area under the ROC curve. We further compute ground-truth character profiles by accumulating the face-track embeddings across the ground-truth face-tracks provided by VPCD for all the primary characters. Using these ground-truth character profiles, we also compute the performance for detecting background characters, shown in Table 7.2. Compared against the ground-truth character profiles, the character profiles estimated by the iterative profile matching algorithm perform well in detecting background characters.

Table 7.2: Performance for background character detection.
Video Name        Character profiles (auROC)    GT representation (auROC, max)
Friends s03e01    0.82                          0.88
Friends s03e02    0.63                          0.87
Friends s03e03    0.79                          0.90
Friends s03e04    0.78                          0.89
Friends s03e05    0.78                          0.88
Friends s03e06    0.74                          0.86
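A minimal sketch of the background character scoring of Section 7.3.5, as evaluated above, might look as follows; the use of Euclidean distance and all names are illustrative assumptions, and beta corresponds to the threshold β.

    import numpy as np

    def background_character_scores(track_embeddings, profile_means):
        """Score each face-track by its minimum distance to any character profile.

        track_embeddings: (T, D) array, one averaged embedding per face-track.
        profile_means:    (L, D) array, mean embedding of each character profile F_l.
        Returns a (T,) array; larger scores indicate likely background characters.
        """
        dists = np.linalg.norm(track_embeddings[:, None, :] - profile_means[None, :, :],
                               axis=-1)
        return dists.min(axis=1)

    def is_background(track_embeddings, profile_means, beta=1.0):
        # A track is flagged as background if it is far from every character profile.
        return background_character_scores(track_embeddings, profile_means) > beta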
7.5 Conclusions

This work introduces a pilot dataset enlisting background characters in the TV show Friends. The dataset offers a significant advantage over existing datasets by virtue of the substantially higher number of face-tracks obtained from state-of-the-art face detectors. We introduced a strategy to construct robust audio-visual character profiles, imposing the constraint that the speech and face from an active speaker instance match the same character profile. We used the character profiles to improve active speaker localization and background character detection performance. We showed that the proposed background character detection strategy is limited by an upper bound on performance (Table 7.2). Furthermore, it is indifferent to the source of faces, whether they are characters' faces or images of faces present in the video frames. A major next step for this work is attribute analysis for background characters in entertainment media.

Chapter 8: Conclusion and Future Work

While introducing the dissertation, I presented my thesis statement:

Cross-modal activity modeling and cross-modal identity association provide complementary sets of information toward establishing speech-face correspondence for holistic media understanding.

This dissertation first presents a weakly supervised framework that models the video signal to capture the visual activity pertaining to speech in the audio modality, without explicitly extracting the lip region in the visual frames. I qualitatively showed that the learned visual representations can localize speaking faces, and quantitatively verified the same by evaluating on various benchmark datasets consisting primarily of entertainment media videos. I then presented a novel unsupervised framework that harnesses the readily available character identity information in faces and speech, using state-of-the-art face and speech recognition embeddings, to establish speech-face correspondences. The presented generalized framework for establishing cross-modal correspondences provides an easy way to incorporate external active speaker information, whether from human annotations or from the predictions of other state-of-the-art systems.

I performed a systematic error analysis of the two primary sources of information for active speaker detection, i) audio-visual activity modeling and ii) cross-modal identity association, and established that the two sets of information each carry an exclusive component. The complementary nature of the two methods, and the ability to incorporate external information while associating characters' cross-modal identities, motivated a late fusion of the two. This led to a comprehensive unsupervised framework for active speaker detection that showed state-of-the-art performance on benchmark datasets. The presented system does not require manual annotations and is thus adaptable to various domains. Toward better media understanding, I showed the application of the developed active speaker detection system to character diarization in TV shows, yielding improved face clustering and speaker clustering. I have also presented background character detection as an application of ASD.

One future direction is to use human body recognition features to identify people. Using the fact that, at least over a small temporal context, human body features or attire can distinguish the characters in a scene, human body attributes can complement face recognition features in the visual domain. This can particularly help in scenarios where faces are in profile or other extreme poses that lead to failures in face recognition. For holistic media understanding, one can use the improved character diarization framework to develop dynamic character graphs that capture interactions of characters evolving over time, which can support better story understanding and representations.

References

Afouras, T., Asano, Y. M., Fagan, F., Vedaldi, A., & Metze, F. (2021). Self-supervised object detection from audio-visual correspondence. arXiv preprint arXiv:2104.06401.
Afouras, T., Owens, A., Chung, J. S., & Zisserman, A. (2020). Self-supervised learning of audio-visual objects from video. In European conference on computer vision (pp. 208–224).
Alcázar, J. L., Caba, F., Mai, L., Perazzi, F., Lee, J.-Y., Arbeláez, P., & Ghanem, B. (2020). Active speakers in context. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 12465–12474).
Alcázar, J. L., Cordes, M., Zhao, C., & Ghanem, B. (2022). End-to-end active speaker detection.
arXiv preprint arXiv:2203.14250. Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In Proceedings of the ieee international conference on computer vision (pp. 609–617). Arandjelovic, R., & Zisserman, A. (2018). Objects that sound. In Proceedings of the european conference on computer vision (eccv) (pp. 435–451). Aubrey, A., Rivet, B., Hicks, Y., Girin, L., Chambers, J., & Jutten, C. (2007). Two novel visual voiceactivitydetectorsbasedonappearancemodelsandretinalfiltering. In Signal processing conference, 2007 15th european (pp. 2409–2413). Barzelay, Z., & Schechner, Y. Y. (2007). Harmony in motion. In 2007 ieee conference on computer vision and pattern recognition (p. 1-8). doi: 10.1109/CVPR.2007.383344 Bazzani, L., Bergamo, A., Anguelov, D., & Torresani, L. (2016). Self-taught object localization with deepnetworks. In 2016 ieee winter conference on applications of computer vision (wacv) (pp. 1–9). Bendris, M., Charlet, D.,&Chollet, G. (2010). Lipactivitydetectionfortalkingfacesclassification in tv-content. In International conference on machine vision (pp. 187–190). Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In2016 ieee international conference on image processing (icip) (p.3464-3468). doi: 10.1109/ ICIP.2016.7533003 Beyan, C., Shahid, M., & Murino, V. (2021). Realvad: A real-world dataset and a method for 112 voice activity detection by body motion analysis. IEEE Transactions on Multimedia, 23, 2071-2085. doi: 10.1109/TMM.2020.3007350 Bilen,H.,&Vedaldi,A. (2016). Weaklysuperviseddeepdetectionnetworks. In2016ieeeconference on computer vision and pattern recognition (cvpr) (p. 2846-2854). doi: 10.1109/CVPR.2016 .311 Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2013). Finding actors and actions in movies. In Proceedings of the ieee international conference on computer vision (pp. 2280–2287). Bost, X., Linares, G., & Gueye, S. (2015). Audiovisual speaker diarization of tv series. In 2015 ieee international conference on acoustics, speech and signal processing (icassp). Braga, O., & Siohan, O. (2021). A closer look at audio-visual multi-person speech recognition and active speaker selection. In Icassp 2021 - 2021 ieee international conference on acous- tics, speech and signal processing (icassp) (p. 6863-6867). doi: 10.1109/ICASSP39728.2021 .9414160 Bredin, H., & Gelly, G. (2016). Improving speaker diarization of tv series using talking-face detection and clustering. In Acm international conference on multimedia. Bredin, H., Yin, R., Coria, J. M., Gelly, G., Korshunov, P., Lavechin, M., ... Gill, M.-P. (2020). pyannote.audio: neural building blocks for speaker diarization. In Icassp 2020, ieee interna- tional conference on acoustics, speech, and signal processing. Brown, A., Kalogeiton, V., & Zisserman, A. (2021). Face, body, voice: Video person-clustering with multiple modalities. In Proceedings of the ieee/cvf international conference on computer vision (pp. 3184–3194). Campello, R. J., Moulavi, D., Zimek, A., & Sander, J. (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(1), 1–51. Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). Vggface2: A dataset for recognisingfacesacrossposeandage. In2018 13th ieee international conference on automatic face & gesture recognition (fg 2018). Chakravarty, P., Mirzaei, S., Tuytelaars, T., & Van hamme, H. (2015). Who’s speaking? 
audio- supervised classification of active speakers in video. In Proceedings of the 2015 acm on international conference on multimodal interaction (pp. 87–90). Chakravarty, P., & Tuytelaars, T. (2016). Cross-modal supervision for learning active speaker detection in video. In European conference on computer vision (pp. 285–301). Chakravarty, P., Zegers, J., Tuytelaars, T., & Van hamme, H. (2016). Active speaker detection with audio-visual co-training. In Proceedings of the 18th acm international conference on multimodal interaction (pp. 312–316). 113 Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March). Grad- cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 ieee winter conference on applications of computer vision (wacv) (p. 839-847). doi: 10.1109/WACV.2018.00097 Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Chung, J. S. (2019). Naver at activitynet challenge 2019–task b active speaker detection (ava). arXiv preprint arXiv:1906.10555. Chung, J. S., Huh, J., Mun, S., Lee, M., Heo, H.-S., Choe, S., ... Han, I. (2020, Oct). In defence of metric learning for speaker recognition. Interspeech 2020. doi: 10.21437/interspeech.2020 -1064 Chung, J. S., Huh, J., Nagrani, A., Afouras, T., & Zisserman, A. (2020). Spot the conversation: speaker diarisation in the wild. arXiv preprint arXiv:2007.01216. Chung, J. S., Nagrani, A., & Zisserman, A. (2018, Sep). Voxceleb2: Deep speaker recognition. Interspeech 2018. Retrieved from http://dx.doi.org/10.21437/Interspeech.2018-1929 doi: 10.21437/interspeech.2018-1929 Chung, J. S., & Zisserman, A. (2016). Out of time: automated lip sync in the wild. In Asian conference on computer vision (pp. 251–263). Contributors, M. (2020). MMTracking: OpenMMLab video perception toolbox and benchmark. https://github.com/open-mmlab/mmtracking. Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. The Journal of Machine Learning Research, 12, 1501–1536. Deng, J., Guo, J., Ververas, E., Kotsia, I., & Zafeiriou, S. (2020). Retinaface: Single-shot multi- levelfacelocalisationinthewild. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 5203–5212). Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 2625–2634). Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! my name is... buffy”–automatic naming of characters in tv video. In Bmvc (Vol. 2, p. 6). Fisher III, J. W., Darrell, T., Freeman, W. T., & Viola, P. A. (2001). Learning joint statistical models for audio-visual fusion and segregation. In Advances in neural information processing systems (pp. 772–778). Gao, Y., Liu, B., Guo, N., Ye, X., Wan, F., You, H., & Fan, D. (2019). C-midn: Coupled 114 multipleinstancedetectionnetworkwithsegmentationguidanceforweaklysupervisedobject detection. In Proceedings of the ieee/cvf international conference on computer vision (pp. 9834–9843). Gemmeke,J.F.,Ellis,D.P.W.,Freedman,D.,Jansen,A.,Lawrence,W.,Moore,R.C.,... Ritter, M. (2017). Audio set: An ontology and human-labeled dataset for audio events. In Proc. ieee icassp 2017. New Orleans, LA. G´ orriz, J. M., Ram´ ırez, J., Lang, E. W., Puntonet, C. G., & Turias, I. 
(2010). Improved likelihood ratiotestbasedvoiceactivitydetectorappliedtospeechrecognition. Speech Communication, 52(7-8), 664–677. Guillaumin, M., Verbeek, J., & Schmid, C. (n.d.). Is that you? metric learning approaches for face identification. In 2009 ieee international conference on computer vision. Guo, Y., Zhang, L., Hu, Y., He, X., & Gao, J. (2016). Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European conference on computer vision (pp. 87–102). Hahsler, M., Piekenbrock, M., & Doran, D. (2019). dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91(1). doi: 10.18637/jss.v091.i01 Haq, I. U., Muhammad, K., Ullah, A., & Baik, S. W. (2019). Deepstar: Detecting starring characters in movies. IEEE Access, 7, 9265-9272. doi: 10.1109/ACCESS.2018.2890560 Har´ ar, P., Burget, R., & Dutta, M. K. (2017). Speech emotion recognition with deep learning. In Signal processing and integrated networks (spin), 2017 4th international conference on (pp. 137–140). Haurilet, M.-L., Tapaswi, M., Al-Halah, Z., & Stiefelhagen, R. (2016). Naming tv characters by watching and analyzing dialogs. In 2016 ieee winter conference on applications of computer vision (wacv) (pp. 1–9). He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Ieee conference on computer vision and pattern recognition. Hermans, M., & Schrauwen, B. (2013). Training and analysing deep recurrent neural networks. In Advances in neural information processing systems (pp. 190–198). Hershey, J., & Movellan, J. (2000). Audio vision: Using audio-visual synchrony to locate sounds. In S. Solla, T. Leen, & K. M¨ uller (Eds.), Advances in neural information processing systems (Vol. 12). MIT Press. Retrieved from https://proceedings.neurips.cc/paper/1999/ file/b618c3210e934362ac261db280128c22-Paper.pdf Hoover, K., Chaudhuri, S., Pantofaru, C., Sturdy, I., & Slaney, M. (2018). Using audio-visual information to understand speaker activity: Tracking active speakers on and off screen. In 2018 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 6558–6562). 115 Hu, J., Shen, L., & Sun, G. (2018, Jun). Squeeze-and-excitation networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Retrieved from http://dx.doi .org/10.1109/CVPR.2018.00745 doi: 10.1109/cvpr.2018.00745 Huang, C., & Koishida, K. (2020). Improved active speaker detection based on optical flow. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition workshops (pp. 950–951). Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007, October). Labeled faces in the wild: A database for studying face recognition in unconstrained environments (Tech.Rep.No. 07-49). University of Massachusetts, Amherst. Huang, Z., Wang, S., & Yu, K. (2018). Angular softmax for short-duration text-independent speaker verification. In Interspeech. Huang, Z., Watanabe, S., Fujita, Y., Garc´ ıa, P., Shao, Y., Povey, D., & Khudanpur, S. (n.d.). Speaker diarization with region proposal network. In 2020 ieee international conference on acoustics, speech and signal processing. Joosten, B., Postma, E., & Krahmer, E. (2013). Visual voice activity detection at different speeds. In Auditory-visual speech processing (avsp) 2013. Kapsouras, I., Tefas, A., Nikolaidis, N.,&Pitas, I. (2015). Multimodalspeakerdiarizationutilizing faceclusteringinformation. InInternationalconferenceon imageand graphics (pp.547–554). 
Kenny, P., Stafylakis, T., Ouellet, P., Alam, M. J., & Dumouchel, P. (2013). Plda for speaker verification with utterances of arbitrary duration. In 2013 ieee international conference on acoustics, speech and signal processing. Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., et al. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning (pp. 2673–2682). Kim, Y. J., Heo, H.-S., Choe, S., Chung, S.-W., Kwon, Y., Lee, B.-J., ... Chung, J. S. (2021, Aug). Look who’s talking: Active speaker detection in the wild. Interspeech 2021. doi: 10.21437/interspeech.2021-2041 Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Klemen, J., & Chambers, C. D. (2012). Current perspectives and methods in studying neural mechanisms of multisensory interactions. Neuroscience & Biobehavioral Reviews, 111–133. Le´ on-Alc´ azar, J., Heilbron, F. C., Thabet, A., & Ghanem, B. (2021). Maas: Multi-modal assigna- tion for active speaker detection. arXiv preprint arXiv:2101.03682. Liu, P., & Wang, Z. (2004). Voice activity detection using visual information. In Acoustics, speech, and signal processing, 2004. proceedings.(icassp’04). ieee international conference on (Vol. 1, 116 pp. I–609). Lu, C., & Tang, X. (2015). Surpassing human-level face verification performance on lfw with gaussianface. In Twenty-ninth aaai conference on artificial intelligence. Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, 50–60. McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2(11), 205. Min, K. (2022). Intel labs at ego4d challenge 2022: A better baseline for audio-visual diarization. arXiv. Retrieved from https://arxiv.org/abs/2210.07764 doi: 10.48550/ARXIV.2210 .07764 Min, K., Roy, S., Tripathi, S., Guha, T., & Majumdar, S. (2022). Learning long-term spatial- temporal graphs for active speaker detection. arXiv preprint arXiv:2207.07783. Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60. Nagrani, A., & Zisserman, A. (2018). From benedict cumberbatch to sherlock holmes: Character identification in tv series without a script. arXiv preprint arXiv:1801.10442. Navarathna, R., Dean, D., Sridharan, S., Fookes, C., & Lucey, P. (2011). Visual voice activ- ity detection using frontal versus profile views. In Digital image computing techniques and applications (dicta), 2011 international conference on (pp. 134–139). Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In Proceedings of the european conference on computer vision (eccv) (pp. 631–648). Pal, M., Kumar, M., Peri, R., Park, T. J., Hyun Kim, S., Lord, C., ... Narayanan, S. (n.d.). Speaker diarization using latent space clustering in generative adversarial network. In 2020 ieee international conference on acoustics, speech and signal processing. doi: 10.1109/ ICASSP40776.2020.9053952 Pardo, A., Heilbron, F. C., Alc´ azar, J. L., Thabet, A., & Ghanem, B. (2021). Moviecuts: A new dataset and benchmark for cut type recognition. arXiv preprint arXiv:2109.05569. Park, T. J., Han, K. J., Kumar, M., & Narayanan, S. (2020). 
Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap. IEEE Signal Processing Letters, 27. doi: 10.1109/LSP.2019.2961071 Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S. (2022). A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 72, 101317. Parkhi, O. M., Rahtu, E., Cao, Q., & Zisserman, A. (2018). Automated video face labelling for 117 filmsandtvmaterial. IEEE transactions on pattern analysis and machine intelligence,42(4), 780–792. Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. Patrona, F., Iosifidis, A., Tefas, A., Nikolaidis, N., & Pitas, I. (2016, June). Visual voice activity detection in the wild. IEEE Transactions on Multimedia, 18(6), 967-977. doi: 10.1109/ TMM.2016.2535357 Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). Cuave: A new audio-visual database for multimodal human-computer interface research. In Proceedings of international conference on acoustics, speech and signal processing (cassp (pp. II–2017). Ram´ ırez, J., Segura, J. C., G´ orriz, J. M., & Garc´ ıa, L. (2007). Improved voice activity detection usingcontextualmultiplehypothesistestingforrobustspeechrecognition. IEEETransactions on Audio, Speech, and Language Processing, 15(8), 2177–2189. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99). Ren, Z., Yu, Z., Yang, X., Liu, M.-Y., Lee, Y. J., Schwing, A. G., & Kautz, J. (2020). Instance- aware, context-focused, and memory-efficient weakly supervised object detection. In Proceed- ings of the ieee/cvf conference on computer vision and pattern recognition (pp.10598–10607). Roth, J., Chaudhuri, S., Klejch, O., Marvin, R., Gallagher, A., Kaver, L., ... others (2020). Ava active speaker: An audio-visual dataset for active speaker detection. In Icassp 2020-2020 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 4492–4496). Sainath, T. N., Vinyals, O., Senior, A., & Sak, H. (2015). Convolutional, long short-term memory, fully connected deep neural networks. In Acoustics, speech and signal processing (icassp), 2015 ieee international conference on (pp. 4580–4584). Schmiedchen, K., Freigang, C., Nitsche, I., & R¨ ubsamen, R. (2012). Crossmodal interactions and multisensory integration in the perception of audio-visual motion—a free-field study. Brain research, 99–111. Schroff,F.,Kalenichenko,D.,&Philbin,J.(n.d.).Facenet: Aunifiedembeddingforfacerecognition andclustering. 2015IEEEConferenceonComputerVisionandPatternRecognition(CVPR). doi: 10.1109/cvpr.2015.7298682 Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017, Oct). Grad- cam: Visual explanations from deep networks via gradient-based localization. In 2017 ieee international conference on computer vision (iccv) (p.618-626). doi: 10.1109/ICCV.2017.74 Shahid, M., Beyan, C., & Murino, V. (2019, 09). Comparisons of visual activity primitives for voice activity detection. In (p. 48-59). doi: 10.1007/978-3-030-30642-7 5 118 Shahid, M., Beyan, C., & Murino, V. (2021). S-vvad: Visual voice activity detection by mo- tion segmentation. In 2021 ieee winter conference on applications of computer vision (wacv) (p. 2331-2340). doi: 10.1109/WACV48630.2021.00238 Shams, L., & Kim, R. (2010). Crossmodal influences on visual perception. Physics of life reviews, 269–284. 
Sharma, R., Guha, T., & Sharma, G. (2018). Multichannel attention network for analyzing visual behaviorinpublicspeaking. In2018 ieee winter conference on applications of computer vision (wacv) (p. 476-484). doi: 10.1109/WACV.2018.00058 Sharma, R., & Narayanan, S. (2022a). Audio visual character profiles for detecting background characters in entertainment media. arXiv preprint arXiv:2203.11368. Sharma, R., & Narayanan, S. (2022b). Unsupervised active speaker detection in media content using cross-modal information. arXiv preprint arXiv:2209.11896. Sharma, R., & Narayanan, S. (2022c). Using active speaker faces for diarization in tv shows. Sharma, R., Somandepalli, K., & Narayanan, S. (2019). Toward visual voice activity detection for unconstrained videos. In 2019 ieee international conference on image processing (icip) (p. 2991-2995). doi: 10.1109/ICIP.2019.8803248 Sharma, R., Somandepalli, K., & Narayanan, S. (2022). Cross modal video representations for weakly supervised active speaker localization. IEEE Transactions on Multimedia, 1-12. doi: 10.1109/TMM.2022.3229975 Somandepalli, K., Guha, T., Martinez, V. R., Kumar, N., Adam, H., & Narayanan, S. (2021). Computational media intelligence: Human-centered machine analysis of media. Proceedings of the IEEE, 1-20. doi: 10.1109/JPROC.2020.3047978 Somandepalli, K., Martinez, V., Kumar, N.,&Narayanan, S. (2018). Multimodalrepresentationof advertisementsusingsegment-levelautoencoders.InProceedingsofthe20thacminternational conference on multimodal interaction (pp. 418–422). Song, C., Ning, N., Zhang, Y.,&Wu, B. (2021). Amultimodalfakenewsdetectionmodelbasedon crossmodal attention residual and multichannel convolutional neural networks. Information Processing and Management, 58(1), 102437. Retrieved from https://www.sciencedirect .com/science/article/pii/S0306457320309304 doi: https://doi.org/10.1016/j.ipm.2020 .102437 Tang, P., Wang, X., Bai, X., & Liu, W. (2017). Multiple instance detection network with online instance classifier refinement. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 2843–2851). Tao, F., & Busso, C. (2017). Bimodal recurrent neural network for audiovisual voice activity detection. In Proc. annu. conf. int. speech commun. assoc. (pp. 1938–1942). 119 Tao,R.,Pan,Z.,Das,R.K.,Qian,X.,Shou,M.Z.,&Li,H. (2021). Issomeonespeaking? exploring long-term temporal features for audio-visual active speaker detection. In Proceedings of the 29th acm international conference on multimedia (p. 3927–3935). Verteletskaya, E., & Sakhnov, K. (2010). Voice activity detection for speech enhancement applica- tions. Acta Polytechnica, 50(4). Wan, F., Liu, C., Ke, W., Ji, X., Jiao, J., & Ye, Q. (2019). C-mil: Continuation multiple instance learning for weakly supervised object detection. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 2199–2208). Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (n.d.). Generalized end-to-end loss for speaker verification. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. doi: 10.1109/icassp.2018.8462665 Wang, Q., Downey, C., Wan, L., Mansfield, P.A., & Moreno, I.L. (n.d.). Speaker Diarizationwith LSTM. In 2018 ieee international conference on acoustics, speech and signal processing. Wang,W.,Tran,D.,&Feiszli,M. (2020). Whatmakestrainingmulti-modalclassificationnetworks hard? In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 12695–12705). Wang,X.,Shrivastava,A.,&Gupta,A. 
(2017). A-fast-rcnn: Hard positive generation via adversary for object detection. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 2606–2615).
Wang, Y., Li, J., & Metze, F. (2019). A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In Icassp 2019-2019 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 31–35).
Xu, E. Z., Song, Z., Feng, C., Ye, M., & Shou, M. Z. (2021). Ava-avd: Audio-visual speaker diarization in the wild. arXiv preprint arXiv:2111.14448.
Xu, H., Zeng, R., Wu, Q., Tan, M., & Gan, C. (2020). Cross-modal relation-aware networks for audio-visual event localization. In Proceedings of the 28th acm international conference on multimedia (pp. 3893–3901). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/3394171.3413581 doi: 10.1145/3394171.3413581
Yu, Y.-Q., Fan, L., & Li, W.-J. (n.d.). Ensemble additive margin softmax for speaker verification. In Icassp 2019-2019 ieee international conference on acoustics, speech and signal processing.
Zhang, C., Koishida, K., & Hansen, J. H. (2018). Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(9).
Zhang, Y.-H., Xiao, J., Yang, S., & Shan, S. (2019). Multi-task learning for audio-visual active speaker detection. The ActivityNet Large-Scale Activity Recognition Challenge, 1–4.
Zhao, H., Gan, C., Ma, W.-C., & Torralba, A. (2019). The sound of motions. In Proceedings of the ieee/cvf international conference on computer vision (pp. 1735–1744).
Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In Proceedings of the european conference on computer vision (eccv) (pp. 570–586).
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016, June). Learning deep features for discriminative localization. In Proceedings of the ieee conference on computer vision and pattern recognition (cvpr).
Zitnick, C. L., & Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European conference on computer vision (pp. 391–405).

Appendices

Appendix A: Toward visual voice activity detection for unconstrained videos

A.1 Introduction

A key processing step in most speech technology systems, whether the target application is automatic speech recognition, speech enhancement, or emotion recognition, is Voice Activity Detection (VAD) (Górriz, Ramírez, Lang, Puntonet, & Turias, 2010; Harár, Burget, & Dutta, 2017; Ramírez, Segura, Górriz, & García, 2007; Verteletskaya & Sakhnov, 2010). VAD is an audio segmentation problem, targeted at segregating speech from non-speech regions, which often contain noise and other interfering sources. While a simple classification task, VAD is severely challenged by the variety and variability of noise seen in real-world conditions (F. Tao & Busso, 2017). Since VAD sits at the initial stages of a speech processing pipeline, its performance degradation or failure severely affects downstream processing blocks such as speech recognition (Górriz et al., 2010).

As an alternative, researchers have proposed complementing the audio-based system with information from the visual modality, i.e., information from the talking face. The recent work of (F. Tao & Busso, 2017) reports the fusion of the two modalities using a bimodal RNN, modeling each modality using LSTMs.
To handle the video modality, they proposed 2D convolutional representations of the lip region. Previous methods such as (Liu & Wang, 2004) incorporated the visual information using handcrafted features describing the lip region, and Navarathna et al. (Navarathna, Dean, Sridharan, Fookes, & Lucey, 2011) used DCT features around the mouth region. Furthermore, most of these methods (Aubrey et al., 2007; Joosten, Postma, & Krahmer, 2013; Liu & Wang, 2004; Navarathna et al., 2011) have focused on constrained domains such as news broadcasts or meetings, where the video is recorded in controlled settings: controlled lighting, fixed camera, fixed background, etc. (Patterson, Gurbuz, Tufekci, & Gowdy, 2002). This limits their application to videos in which the face region is clearly visible and localized. Critically, explicit information about the mouth/lip landmark regions would be needed for feature extraction (Patrona, Iosifidis, Tefas, Nikolaidis, & Pitas, 2016). In contrast, videos in domains such as movies or street cameras are not constrained, and methods optimized for domains such as broadcast news do not generalize.

Figure A.1.1: Complete architecture of the proposed hierarchically context-aware (HiCA) framework: one-second segments are processed by a shared ResNet-style 3D CNN (3x3x3 convolutions with 64-256 filters) followed by global average pooling, an unrolled bi-directional LSTM, and a shared sigmoid output per segment; Grad-CAM heatmaps are computed from the last convolutional layer.

Our work addresses the problem of VAD using visual information for unconstrained videos. Here, we propose an end-to-end trainable Hierarchical Context-Aware (HiCA) deep neural network to predict coarse VAD labels using just the visual information. In order to enable the network to learn from a longer context, which is a necessity in the case of videos, we decentralize the temporal context in the form of local 3D convolutions and a global LSTM. We do not explicitly detect the face of a speaker or extract facial features, neither for training nor for inference. We evaluate the proposed architecture on videos from Hollywood movies, a challenging domain due to its relatively uncontrolled settings, in the form of frequent shot changes and varying camera dynamics, and the variety and variability in the depiction of speaking characters.

In addition to evaluating the framework for VAD performance, we perform a formal analysis of the learned representations. Recently, with the proliferation of DNN architectures, there is an increased interest in developing tools to probe what a network learns (B. Kim et al., 2018). Zhou et al. (Zhou et al., 2016) proposed an approach called class activation maps (CAMs), which uses the global average pooling of the convolutional output to visualize the activations learned by a CNN corresponding to a particular class. Their technique gave a CNN the ability to localize objects in an image pertaining to a particular class, using a classification network trained with only image-level labels. Selvaraju et al. (Selvaraju et al., 2017) proposed an efficient generalization of CAMs, in which any network employed ahead of the CNNs is linearly approximated, so that the idea of CAMs can be extended to any non-linear network on top of CNNs.
More recently, (Chattopadhay et al., 2018) introduced a more accurate version of Grad-CAM, further generalizing it by computing the importance of each pixel in the feature map toward the decision of interest. We use Grad-CAMs to visualize the representations learned by the network, as further described in Sec. A.2.3. Our performance analysis shows that the proposed network can robustly localize human beings in videos.

The contributions of this work are as follows: i) We investigate the problem of visual-VAD for unconstrained videos and propose a cross-modal learning approach where the input is the visual modality and the output is audio-derived speech labels. ii) We propose the HiCA deep architecture to learn from longer contexts; the architecture incorporates temporal context at two hierarchical levels. iii) Furthermore, taking advantage of the interpretability of the architecture, we present a detailed and systematic analysis of the learned representations. Our analysis shows that the visual-VAD network localizes to humans in videos.

A.2 Proposed Approach

In this section, we formalize the cross-modal learning problem. Next, we introduce the HiCA deep neural network architecture designed specifically for the cross-modal VAD objective. Subsequently, we detail the procedure for visualizing the optimized 3D CNNs using a modified version of Grad-CAMs.

A.2.1 Problem Formulation

Let V be a video segment of duration t seconds, and v_i, i ∈ 1,...,N, be N partitions of the video segment, of equal duration s seconds, such that N·s = t. Formally,

    V = {v_1; v_2; ...; v_N}    (A.2.1)

For each v_i, we are given a binary VAD label y_i, where y_i = 1 indicates the presence of speech and y_i = 0 otherwise. Thus,

    Y = {y_1; y_2; ...; y_N},  y_i ∈ {0,1}    (A.2.2)

We frame this task as the supervised learning problem of mapping from V to Y. In all our experiments, we set s = 1 s and N = 10.

A.2.2 Network Architecture

Inspired by the recent success of CNN-LSTM architectures in modeling contextual information (J. Chung, Gulcehre, Cho, & Bengio, 2014; Donahue et al., 2015; Sainath, Vinyals, Senior, & Sak, 2015), we propose a combination of a 3D convolutional network and a bi-directional LSTM. The 3D CNNs and LSTMs enable us to model the local (v_i in eq. A.2.1) and the longer, global temporal context ([v_1 ... v_N] in eq. A.2.1), respectively. Due to the nature of this multi-scale context modeling, we refer to our proposed architecture as the Hierarchical Context Aware (HiCA) architecture. The schematic of the HiCA architecture is shown in Fig. A.1.1. The network can be visualized as a two-stage pipeline, i) local spatiotemporal convolutions and ii) a global bi-directional LSTM, stitched together by a global average pooling (GAP) layer.

i) Local spatiotemporal convolutions (3D conv): The smaller video segments v_i from V are input to a ResNet-like (He et al., 2016) convolutional network with 3-dimensional convolutions, i.e., the convolution operations are performed along the height, width, and time of the video frames (see Fig. A.1.1). The weights of the convolution layers are shared among the v_n for n ∈ {1,2,...,N}. These 3D convolutions account for the local temporal context within the smaller segments. The output of the 3D convolutions is passed through a 3D GAP layer, which is capable of learning class-discriminative localizations (Zhou et al., 2016). The average pooling is performed spatially as well as temporally.

ii) Global bi-directional LSTM (B-LSTM): The output obtained from the GAP layer is input to the B-LSTM. The N outputs corresponding to each of v_1, v_2, ..., v_N are given as input to the N nodes of the bi-directional LSTM. This B-LSTM accounts for the temporal context over the complete N-second-long video segment V. The output at each node of the B-LSTM is passed through a sigmoid layer to obtain the final outputs P̂ = {p̂_1, p̂_2, ..., p̂_N}. The weights of the sigmoid layer are shared among all the output nodes of the B-LSTM. The network is optimized to minimize the cross-entropy loss between P̂ and Y.
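To make the two-stage pipeline concrete, a compact PyTorch sketch is given below. The layer configuration is a simplified stand-in for the ResNet-like 3D CNN of Fig. A.1.1 (kernel sizes, strides, and channel widths are illustrative), but it mirrors the structure described above: a 3D CNN shared across the one-second segments, 3D global average pooling, a bi-directional LSTM over the N segments, and a shared sigmoid output per segment.

    import torch
    import torch.nn as nn

    class HiCASketch(nn.Module):
        def __init__(self, hidden=512):
            super().__init__()
            # Local spatiotemporal encoder, shared across the N one-second segments.
            self.conv3d = nn.Sequential(
                nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(2, 2, 2), padding=(2, 3, 3)),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
                nn.Conv3d(64, 128, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
                nn.Conv3d(128, 256, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.ReLU(inplace=True),
            )
            self.gap = nn.AdaptiveAvgPool3d(1)       # 3D global average pooling
            # Global context over the N segments.
            self.blstm = nn.LSTM(256, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)     # shared sigmoid output per segment

        def forward(self, x):
            # x: (batch, N, channels, frames, height, width), one entry per 1 s segment
            b, n = x.shape[:2]
            feats = self.conv3d(x.flatten(0, 1))     # shared weights across segments
            feats = self.gap(feats).flatten(1)       # (b*n, 256)
            ctx, _ = self.blstm(feats.view(b, n, -1))        # global temporal context
            return torch.sigmoid(self.head(ctx)).squeeze(-1)  # (b, N) speech posteriors

    # Example: 10 one-second segments of 8 RGB frames at 112x112 resolution.
    model = HiCASketch()
    clip = torch.randn(1, 10, 3, 8, 112, 112)
    probs = model(clip)      # shape (1, 10); trained with binary cross-entropy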
A.2.3 Visualizations

In order to understand the visual constructs captured by the 3D convolutions, we modified Grad-CAM (Selvaraju et al., 2017) to accommodate 3D convolutions. Because we apply a sigmoid activation to the final layer of our network, we get only one class score as output, which represents the confidence for the output class 'speech' for the video segment v_n. Thus, in order to obtain the class-discriminative localization map G, we first compute the gradients of the posterior score p̂_i, of the prediction ŷ_i, with respect to each feature map F^m of the last convolutional layer. The computed gradients are averaged along the spatial as well as temporal dimensions to obtain the weight of feature map m toward the output class 'speech':

    α_m = (1/Z) Σ_i Σ_j Σ_k ∂p̂ / ∂F^m_{ijk}    (A.2.3)

where Z is the dimension of the vectorized F^m. Then we perform a weighted sum of all the feature maps of the last convolutional layer and apply a ReLU to obtain the required localization map G. The magnitude at each pixel of the map G is proportional to the attention of the network and can be visualized as a heatmap:

    G = ReLU( Σ_m α_m F^m )    (A.2.4)

A.3 Performance Evaluation

A.3.1 Implementation Details

In order to train the proposed network, we compiled Hollywood movies with labels derived from the timestamps of the associated subtitles. It is important to note that the scale at which we obtain VAD labels is coarse: typically, in audio experiments, VAD labels are acquired at 25-100 ms precision, but here our precision is limited by the subtitle timestamps. As Hollywood movies do not have a fixed structure with respect to the appearance of the characters, they are a good representation of videos "in the wild", albeit of higher quality compared with, for example, surveillance videos. Additionally, the subtitles are available as a separate, time-synchronized stream in the videos, so we were able to obtain the VAD labels automatically.

Figure A.3.1: Examples of missed, matched, and extra boxes in a frame (ground-truth vs. predicted boxes).

Our dataset consists of video clips from 96 movies released during the period 2014 to 2016. It comprises about 60k video segments, 10 seconds each. We obtain the coarse VAD labels for each one-second video clip v_n using the subtitle timestamps. For the duration of each dialogue in the subtitles, the corresponding video segment is labeled as speaking. For edge cases, we use the label associated with the majority of the 1-second segment. The network is trained using an 80:20 train:validation split. The network is optimized to minimize the cross-entropy loss using the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 10^-5. The BLSTM consists of 2 single-layered LSTM cells with a state size of 512 each. Because of the size of the model and computational limitations, we trained the network on a single 12 GB GPU with a mini-batch size of 2. The network is trained for 180,000 iterations, and each epoch took nearly 25 GPU hours.
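Returning to the visualization procedure of Sec. A.2.3, a minimal sketch of the modified Grad-CAM computation for a 3D convolutional encoder is given below, written against the HiCASketch module from the earlier sketch. The choice of hook point, the single-clip batch, and the segment indexing are illustrative assumptions rather than the exact implementation.

    import torch
    import torch.nn.functional as F

    def grad_cam_3d(model, clip, segment_idx=0):
        """clip: (1, N, C, T, H, W) tensor; returns the localization map G for one segment."""
        feats_store, grads_store = {}, {}

        def fwd_hook(module, inp, out):
            feats_store["F"] = out                     # feature maps F^m of the last conv layer
            out.register_hook(lambda g: grads_store.update({"dF": g}))

        handle = model.conv3d[-2].register_forward_hook(fwd_hook)   # last Conv3d layer
        probs = model(clip)                            # (1, N) speech posteriors
        probs[0, segment_idx].backward()               # gradient of p_hat for one segment
        handle.remove()

        F_m, dF = feats_store["F"], grads_store["dF"]  # (N, M, T', H', W') with batch size 1
        alpha = dF.mean(dim=(2, 3, 4), keepdim=True)   # Eq. A.2.3: average over space and time
        cam = F.relu((alpha * F_m).sum(dim=1))         # Eq. A.2.4: weighted sum followed by ReLU
        return cam[segment_idx]                        # (T', H', W') localization map G

    # Usage on a random clip; assumes a batch containing a single 10-segment clip.
    cam = grad_cam_3d(HiCASketch(), torch.randn(1, 10, 3, 8, 112, 112))

The resulting map G can then be upsampled to the frame resolution and overlaid on the frames as a heatmap.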
A.3.2 VAD Performance

In order to evaluate the performance of the proposed network for VAD, we use video segments from 19 Hollywood movies not included in the training or validation set. The segments are distributed as 51:49 for speech:non-speech labels. The network attains an accuracy of 66.10%. The evolution of VAD accuracy and loss over training epochs is shown in Fig. A.3.2.

Figure A.3.2: Training statistics (training and validation accuracy and loss) for the HiCA network, averaged over each epoch.

As mentioned before, our approach to visual-VAD is novel in the sense that we can learn the model end-to-end with the entire video frame as input. We do not need additional face detection or specific feature extraction methods, which could introduce additional errors. To the best of our knowledge, all methods in the literature involve explicitly extracting facial features (specifically around the lip region), thus requiring the presence of a frontal face, and at good resolution. Additionally, existing visual-VAD approaches do not handle cases with multiple faces in a frame. Hence it is not feasible to compare our approach to existing visual-VAD models.

A.3.3 Visualization Analysis of Learned Representations

In order to have a detailed understanding of the representations learned by the proposed deep network in a cross-modal scenario, we present a formal analysis of the localization capabilities of the network. We evaluated the learned representations over a set of 113 video clips chosen from six Hollywood movies: About Last Night, How to be Single, Keanu, Krampus, Max, and Tomorrowland, none included in the training set. Each clip is nearly 30 sec long (32.5 ± 7.5 sec, a total of about 62 minutes). The clips were chosen arbitrarily from the movies, ensuring that each clip had sufficient speech/non-speech parts.

Using the visualization method described in Sec. A.2.3, we obtained the heatmaps corresponding to discriminative image regions for all the videos in the test set. For five videos, we show the qualitative localization performance in Fig. A.3.3, using 5 different frames in chronological order from each video. These 5 videos, along with more test videos, can be found at http://bit.ly/vvad icip.

Figure A.3.3: Qualitative localization performance of the proposed HiCA network for various test videos (five sequences, five frames each).

As shown in Fig. A.3.3, it is evident that the proposed network can robustly localize human faces in videos. Seq1(i) and seq2(v) highlight the capability of the network to locate multiple faces present in a frame, irrespective of the view of the faces (frontal or profile). Seq2(iii), seq4(v), and seq4(iii) show that the network can localize not just human faces but the human body too.

To further scrutinize the above observations, we present a quantitative study to evaluate the network's capability to localize human faces and bodies. For the quantitative evaluation, for each test clip, we first obtain the bounding boxes (further referred to as pbox) corresponding to the heatmaps generated using the procedure described in Sec. A.2.3. We then compare the obtained pbox against a state-of-the-art face detector and a human body detector using the following measures.

i) Face detection: We first obtain the ground-truth bounding boxes, gbox, for the detected faces in each frame of the video segment using Google's face detection API (Neven Vision fR™ API).
Next, we classify each pbox into one of 3 categories based on the overlap of the predicted boxes with the ground-truth boxes, as described below. The schematic in Fig. A.3.1 shows examples of all three cases.

Matched boxes, φ: Predicted boxes that match the ground-truth boxes at a given IoU (Zitnick & Dollár, 2014) threshold ε. The i-th pbox in frame f, pbox^f_i, is said to be matched if IoU(pbox^f_i, gbox^f_j) ≥ ε for some j-th gbox in frame f and a particular threshold ε.

Missed boxes, θ: Ground-truth boxes that were missed by the predictions, either because they fail the IoU-threshold matching criterion or because there are fewer predicted boxes. Formally, the j-th gbox in frame f is said to be missed if IoU(pbox^f_i, gbox^f_j) < ε for all i-th pbox in frame f and a particular threshold ε.

Extra boxes, γ: Predicted boxes that do not match any of the ground-truth boxes at a given IoU threshold ε. The i-th pbox in frame f, pbox^f_i, is said to be extra if IoU(pbox^f_i, gbox^f_j) < ε for all j-th gbox in frame f and a particular threshold ε.

We use the F-score to quantify the efficiency of localization, with precision and recall computed as follows:

    recall = |φ| / (|φ| + |θ|)  and  precision = |φ| / (|φ| + |γ|)    (A.3.1)

where |·| represents the cardinality of a set. Here, φ, θ, and γ can be seen as true positives, false negatives, and false positives, respectively. We observe that the proposed network can detect human faces with an average F-score of 0.79 for a liberal IoU threshold choice of 0.1. Although there are no existing cross-modal DNN models proposed for visual-VAD tasks, a recent work (Owens & Efros, 2018) has developed methods for sound localization using a cross-modal approach. Hence we use (Owens & Efros, 2018) as a baseline for quantifying localization ability. The model proposed in (Owens & Efros, 2018) detects synchrony between audio and visual data, and the authors showed that it can localize sound sources in a video, which in the current setup amounts to localizing human faces. It localizes faces with an average F-score of 0.62 for the same IoU threshold of 0.1. More importantly, we note that their model requires both modalities, visual as well as audio. Fig. A.4.1 shows the variation of the average F-score for various IoU thresholds.

ii) Human body detection: This setup aims to validate the extent to which the predicted localizations conform to a human body detector. Here we use a state-of-the-art Faster R-CNN (S. Ren, He, Girshick, & Sun, 2015) model to obtain bounding boxes containing human bodies. Once we have the bounding boxes, we follow the same procedure as in the previous setup to obtain the F-score. We observe that the proposed network can localize human bodies with an average F-score of 0.59 for an IoU threshold of 0.1. For the same IoU threshold, (Owens & Efros, 2018) attains an F-score of 0.63.

iii) Person detection: In this experiment we focus on understanding the significance of the extra boxes γ in the face detection setup. We computed the percentage overlap of all the boxes in γ with the available human body boxes in the corresponding frames, with a conservative matching threshold of 90% overlap. Fig. A.4.1 shows the trend of the matching percentage for various IoU thresholds in the face detection experiment. This suggests that the majority of the predicted boxes that do not attend to faces attend to human bodies instead.
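The box-matching protocol above can be summarized in a short sketch; the (x1, y1, x2, y2) box format and the default threshold are illustrative, and the counts phi, theta, and gamma correspond to the matched, missed, and extra boxes of Eq. A.3.1.

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def match_frame(pred_boxes, gt_boxes, eps=0.1):
        """Return counts of matched (phi), missed (theta), and extra (gamma) boxes."""
        matched = sum(any(iou(p, g) >= eps for g in gt_boxes) for p in pred_boxes)
        extra = len(pred_boxes) - matched
        missed = sum(all(iou(p, g) < eps for p in pred_boxes) for g in gt_boxes)
        return matched, missed, extra

    def f_score(frames, eps=0.1):
        """frames: iterable of (pred_boxes, gt_boxes) pairs, one per video frame."""
        phi = theta = gamma = 0
        for pred, gt in frames:
            m, s, e = match_frame(pred, gt, eps)
            phi, theta, gamma = phi + m, theta + s, gamma + e
        recall = phi / (phi + theta + 1e-9)          # Eq. A.3.1
        precision = phi / (phi + gamma + 1e-9)       # Eq. A.3.1
        return 2 * precision * recall / (precision + recall + 1e-9)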
A.4 Discussion and Conclusion

Why does the network localize to faces? It can be speculated that the network is learning to recognize salient motions or actions rather than just localizing faces in each frame independently. A face with moving lips is one such salient action, present in the majority of cases in the current scenario of movies. The localization to human body parts can be explained as body gestures or actions that the network considers salient. We hypothesize that, in the ideal scenario, in order to decide on the speech regions in a video, the network should consider talking faces as the most salient action. Since the VAD performance attained by the network is just 66%, it is still not close to the ideal case. One reason is that visual information relevant to human speech activity (such as talking faces) may not be available in the video.

Figure A.4.1: Trend of the F-score for the different experiments (face and human body detection, for the proposed method and for Owens & Efros (2018)), and the matching percentage for experiment 3, across IoU thresholds.

Our work is an initial effort toward incorporating visual information from videos about potential speech activity, with no explicit assumptions or steps about talking faces, for the VAD task. We proposed a network that decentralizes the temporal context and is thus able to learn from longer contexts. The learned network maps can be easily visualized for a better understanding of the representations. We analyzed the learned representations and showed that the network indeed automatically captures human faces as salient for deciding on the presence of speech activity. This work opens up a new approach to the challenging problem of speaking face detection in videos. One immediate next step for this work is to include audio to enhance the VAD performance.
Abstract
An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, and when, how, and where they are talking. Speaker activity can be discerned automatically from the rich multimodal information present in media content. It is, however, a challenging problem due to the wide variety and contextual variability of media content and the need for labeled data.
In this dissertation, I present two strategies utilizing the cross-modal information in media videos to establish a correspondence between the speech in the audio modality and the faces in the video modality, such that the face is the source of the underlying speech in the audio. First, I present a cross-modal neural network to model the observed audio-visual activity, which carries implicit information about the speaker's spatial location in the visual frames. Avoiding the need for manual active speaker annotations in visual frames, which are very expensive to acquire, I formulated a weakly supervised system for localizing active speakers in movie content. Second, I present a strategy that leverages speaker identity information from speech and faces and formulates active speaker detection as a speech-face assignment task, such that the active speaker's face and the underlying speech identify the same person (character). I express the speech segments in terms of their associated speaker identity distances from all other speech segments to capture a relative identity structure for the video. Then I assign an active speaker's face to each speech segment from the concurrently appearing faces, such that the obtained set of active speaker faces displays a similar relative identity structure.
The two approaches have their limitations: the audio-visual activity models are confused by other frequently occurring vocal activities, such as laughing and chewing, while the speaker identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, I further investigate their ability to complement each other. I propose a novel unsupervised framework to guide the speakers' cross-modal identity association with the audio-visual activity for active speaker detection. It enables a comprehensive active speaker detection framework relying on no manual annotations. I evaluate all the proposed frameworks on three benchmark datasets (the Visual Person Clustering dataset, the AVA active speaker dataset, and the Columbia dataset), consisting of videos from entertainment and broadcast media, and show performance competitive with state-of-the-art fully supervised methods.