Learning Shared Subspaces across Multiple Views and Modalities

by

Krishna Somandepalli

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

August 2021

Copyright 2021 Krishna Somandepalli

Dedication

To my Father. Without his sacrifice, I dared not dream for a better life, nor hope for letters after my name.

Acknowledgements

Better than a thousand days of diligent study is one day with a great teacher. Such was my privilege to be mentored and advised by Prof. Shrikanth Narayanan, for teaching me about what it is to be a researcher, a leader and a kind human being. I would like to thank Prof. Narayanan for his unwavering support and unrelenting faith in me throughout my doctoral program, leading up to this dissertation. I would like to thank my committee members, Prof. Mahdi Soltanolkotabi, Prof. Kimon Drakopoulos, Hartwig Adam, Prof. Krishna Nayak, and Prof. Jay Kuo for their insightful feedback which pushed me to ask the harder questions for the betterment of this dissertation.

I would like to thank the many collaborators I had the privilege to work with during my doctoral program, who inspired several research questions I pursued in this thesis: Dr. Caroline Heldman, Madeline Di Nonno, Elizabeth Kilpatrick and others at the Geena Davis Institute for Gender in Media; Dr. Yalda Uhls at the Center for Scholars and Storytellers; Dr. Amit Kochhar and Tymon Tai at the Keck School of Medicine; Prof. Jed Brubaker at the University of Colorado at Boulder; Prof. Tanaya Guha at the University of Warwick; and Dr. Naveen Kumar at Disney. I would like to acknowledge many colleagues and mentors from my internships at Google: Brendan, Florian, Gautam, Kree, Marco, Huisheng, Tom, Alan et alia. Our discussions helped me better define the scope of this dissertation.

This dissertation would not have been possible if not for the support of my department advisors that I could always count on: Tanya Acevedo-Lam, Diane Demetras, and Tracy Charles in particular. Finally, I would like to thank the entire SAIL family who are my friends, peers, and colleagues. I could always rely on their support, a crucial component to the preparation of this dissertation.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Background
    1.1.1 Computational media intelligence (CMI)
    1.1.2 Multiview and multimodal modeling for CMI
  1.2 Thesis Statement
  1.3 Summary of Contributions
  1.4 Structure
2 Prior Work
  2.1 Views vs. Modalities
  2.2 Multiview and multimodal learning
  2.3 Subspace alignment methods
  2.4 Generative models
  2.5 Fusion Methods
I Multiview shared subspace learning
3 Deep Multiview Correlation
  3.1 Introduction
  3.2 Background
    3.2.1 CCA, GCCA and deep learning variants
    3.2.2 Multiset CCA
  3.3 Deep Multiset Canonical Correlation Analysis (dMCCA)
  3.4 Experiments
    3.4.1 Simulation Experiments
    3.4.2 Noisy digit classification experiments
    3.4.3 Performance Evaluation
  3.5 Results
  3.6 Discussion
4 Generalized Multiview Correlation
  4.1 Introduction
  4.2 Related Work
    4.2.1 Subspace alignment for more than two views
    4.2.2 View-agnostic multiview learning
    4.2.3 Multiview paradigm for domain adaptation
  4.3 Problem Formulation and Optimization
    4.3.1 Multiview correlation (MV-CORR)
    4.3.2 Implementation and practical considerations
  4.4 View Bootstrapping: A Theoretical Analysis
  4.5 Benchmarking Experiments
    4.5.1 Multi-channel audio activity classification
    4.5.2 View-invariant object classification
    4.5.3 Pose-invariant face recognition
  4.6 Discussion
5 Learning Speaker and Speech Command Representations
  5.1 Introduction
  5.2 Methods
  5.3 Experiments
    5.3.1 Baseline Experiments
    5.3.2 Generalized Multiview Correlation
    5.3.3 Performance Evaluation
  5.4 Results
  5.5 Discussion
6 Robust face clustering in movie videos
  6.1 Introduction
  6.2 Related Work
    6.2.1 Deep face embeddings and data resources
    6.2.2 Character labeling in videos
    6.2.3 Self-supervised feature adaptation
    6.2.4 Benchmark datasets for movie face clustering
  6.3 Data Resources
    6.3.1 SAIL-MCB: SAIL Movie Character Benchmark Dataset
    6.3.2 SAIL-MultiFace: Harvesting Weakly Labeled Face Tracks
    6.3.3 Mining Hard-positive and Hard-negative Tracklets
  6.4 Self-supervised Feature Adaptation
    6.4.1 ImpTriplet: Improved Triplet Loss
    6.4.2 MvCorr: Multiview Correlation
  6.5 Experiments and Results
    6.5.1 Face Verification for video data
    6.5.2 Self-supervised Feature Adaptation
    6.5.3 Face clustering experiments
  6.6 Discussion
II Multimodal shared subspace learning
7 Tied Crossmodal Autoencoders
  7.1 Introduction
  7.2 Advertisement dataset
  7.3 Methods
    7.3.1 Unimodal representations
    7.3.2 Segment level autoencoders
    7.3.3 Classification of ads as funny or exciting
  7.4 Experiments and Results
    7.4.1 Multimodal autoencoders
    7.4.2 Results and Discussion
  7.5 Discussion
8 Modeling Multimodal Event Streams as Temporal Point Processes
  8.1 Introduction
  8.2 Related Work
    8.2.1 Self-supervised multimodal learning
    8.2.2 Temporal point process modeling
  8.3 Recurrent Marked Temporal Point Process
  8.4 Experiments
    8.4.1 Evaluation dataset and baselines
    8.4.2 Creating multimodal event streams
    8.4.3 Multimodal events as point processes
    8.4.4 Downstream task classification
  8.5 Results and Discussion
9 Conclusion and Future Work
References
Appendices
A Proof for total covariance
B Upper bound for multiview correlation
C Upper bound for bootstrapped multiview correlation
  C.1 Upper Bound for Bootstrapped Within-View Covariance
  C.2 Upper Bound for Bootstrapped Total-View Covariance
  C.3 Theorem: Error of the Bootstrapped Multi-view Correlation
D SAIL MultiFace dataset details
E SAIL Movie Character Benchmark: Additional results

List of Tables

3.1 Comparison of the affinity measures for the dMCCA system with the supervised system
4.1 Summary of datasets used for benchmarking our proposed method.
4.2 Performance evaluation of multi-channel acoustic activity classification on the DCASE dataset.
4.3 Accuracy of clustering for seen and unseen views. SD computed from ten trials. Bold indicates significantly higher accuracy.
4.4 3D object recognition and retrieval comparison with other methods. Bold indicates state-of-the-art results.
4.5 1-NN matching accuracy comparison for pose-invariant face recognition. Bold indicates the best performing model
5.1 Speaker and utterance (utt.) characteristics
5.2 Performance evaluation with clustering and classification; No. of classes for the two tasks are 30 and 146 respectively
5.3 Comparison of MV-CORR framework with domain adversarial methods
6.1 Details of Movie Character Benchmark (SAIL-MCB) dataset. The number of characters were chosen to label at least 99% of the detected faces. The range of number of tracks-per-character shows that we label both prominent and minor characters.
6.2 Count statistics of the tracks mined at each stage of the harvesting process. Sample size of movies = 240
6.3 Comparison of face verification performance for standard video datasets using the TPR @ 0.1% FPR metric
6.4 Face verification performance with adaptation averaged across all videos in the SAIL-MCB benchmark dataset.
6.5 V-measure for hierarchical agglomerative clustering (HAC) and affinity propagation (AP) with Over-clustering Index (OCI)
6.6 Comparison of average clustering accuracy with state-of-the-art methods based on self-supervision.
7.1 Description of the unlabeled and labeled advertisements in our dataset
7.2 Unimodal performance: autoencoder representation vs. features (Acc., accuracy (%); F1, F1 score (%))
7.3 Performance evaluation of unimodal vs. joint representations from the autoencoders
8.1 Description of the unlabeled and labeled advertisements in our dataset
8.2 Performance evaluation of self-supervised representations from modeling multimodal event streams as temporal point processes
D.1 Movie titles - Part 1
D.2 Movie titles - Part 2
E.1 No adaptation: VggFace2 clustering performance for the benchmarking dataset; Measures: Homogeneity (homog), Completeness (comp), V-measure (v-meas), over-clustering index (OCI), Number of clusters (K)
E.2 ImpTriplet adaptation: VggFace2+ImpTriplet clustering performance for the benchmarking dataset
E.3 MvCorr adaptation: VggFace2+MvCorr clustering performance for the benchmarking dataset

List of Figures

1.3.1 Schematic of multimodal and multiview learning paradigms. Starting with unlabeled data, the inherent correspondences between multiple views and modalities can be used to mine weak labels in such data. Next, unimodal and shared subspace representations can be learned using self-supervision methods such that these embeddings are naturally discriminative of the underlying semantic class of events.
2.1.1 Taxonomy of different methodologies for multiview and multimodal learning
3.5.1 Affinity measures for synthetic data. Number of correlated components in the generated data is 10 (boxed)
3.5.2 Performance comparison of dMCCA with baseline methods
3.5.3 Performance evaluation of dMCCA on handwritten digits for three views; (A) noisy-MNIST digits; AWGN: additive white Gaussian noise; (B) noisy-Bangla digits; (C) Comparison of dMCCA with supervised empirical upper-bound and other CCA-based methods for classifying n-MNIST; (D) Comparison of dMCCA with supervised upper-bound for classifying noisy-Bangla digits; (E) Cross-data generalizability: Train on n-MNIST, test on n-Bangla. In all cases, the performance of dMCCA representations for classification is comparable to the supervised end-to-end method.
4.1.1 Multiview data includes observations acquired by recording an underlying event or object in its various presentations. For example, images acquired from different angles can be used as multiview data to learn a holistic model of the observed object (vis-www.cs.umass.edu/mvcnn).
4.2.1 Schematic of view bootstrapping for multiview shared subspace learning. Inset: example sub-network architecture. The central idea of our proposal is that multiview representations from a large number of views M can be modeled by randomly subsampling a small number of views m at each training iteration. Data from each of the subsampled views is used to train m identical networks to maximize the multiview correlation objective. After network convergence, the shared subspace representations are captured by the last layer embeddings.
4.5.1 t-SNE visualization of the embeddings learnt on the multi-channel audio data. The nine classes are watching-TV, absence, working, vacuum, dishwashing, eating, social-activity, cooking and others. See demo for an interactive version
4.5.2 Distribution of the magnitude of eigenvalues during MV-CORR training. As training progresses more eigenvalues approach the maximum value of 1, thereby discovering more directions of separability in the data.
4.5.3 Clustering accuracy of unseen views for different choices of embedding-dimension d and number of views subsampled m. Notice that the best clustering performance is achieved for d = 40 and a subsample of m = 5 views, consistent with the bounds from our theoretical analysis.
5.1.1 (A) Representing the 'signal' from multiple 'views' (B) speakers as views (C) words as views
5.2.1 CNN architecture for the view branches in dMCCA
5.3.1 The effect of the number of views m on multiview correlation and classification performance for command-ID
5.3.2 t-SNE visualization of x-vectors and generalized MvCorr on the test set (A) command representations: centroids are indicated by solid squares and the points belonging to a class are shaded. Notice the proximity of similar sounding words (circled in dotted lines) (B) speaker representations: markers+colors represent different speakers. 54 speakers with the most utterances in the test set are shown
5.4.1 Performance of MV-CORR for spoken word recognition in SCD
6.1.1 Challenging instances of faces detected in a movie for the character labeling task. The example shows the prominent characters from a 2016 Academy award winning movie, Hidden Figures. Face quality labels associated with each track are also shown to tag some of the visual distractors. The images in the first column are character label exemplars taken from the IMDb page: www.imdb.com/title/tt4846340
6.3.1 Distribution of face quality labels in SAIL-MCB at the track-level. Over 50% of the face tracks were labeled as "Profile", which means that at least one face in the track was shown posing sideways.
6.3.2 Distribution of the genres for the 240 movies used for harvesting weakly labeled face tracks in SAIL-MultiFace. Movies may have multiple genres associated with them as listed on IMDb.
6.3.3 Example of hard-positive tracklets resulting from our proposed method with the nearest neighbor parameter k = 3, 5. Each color indicates one tracklet. Notice that they differ from each other with respect to face orientation (e.g., Row 1: Tracklet 1 vs. Tracklet 2) or with eyes open/closed (e.g., Row 2: Tracklet 2 vs. Tracklet 3). A single hard-positive tracklet can be formed from faces that may not appear in a sequential order, allowing us to mine harder positives (see Tracklet 1). The pair of cannot-link face tracks shown here are from the movie Hidden Figures (2016) at time 00:11:05.
6.3.4 Mining Hard-positive and Hard-negative tracklets using Nearest Neighbor Search
6.5.1 Effect of feature adaptation with weakly labeled data. Feature adaptation is expected to bring positive samples closer to each other and pull apart negative samples further in the transformed embedding space. Qualitative comparison of the distributions of hard-positive and hard-negative distances in the SAIL-MultiFace development set shows the benefit of adaptation with (b) ImpTriplet and (c) MvCorr over the (a) original embeddings without adaptation.
6.5.2 Face quality error analysis in SAIL-MCB. For tracks with faces either wearing glasses (Glasses) or always facing the camera (frontal), MvCorr adaptation (+MvCorr) performed on-par with VGGFace2 pre-adaptation. In all other cases, +MvCorr significantly improved clustering accuracy.
7.3.1 Schematic diagram of segment-level autoencoders for (A) joint representation (B) audio: a-to-a, and (C) video: v-to-v. Input video and audio features of dimensions d_v, d_a for an ad of duration t. Video and audio segments of fixed length for each modality.
7.4.1 Similarity of representations with each other and across time in a sample
8.1.1 Illustration of multimodal event streams S = {(t_i, k_i), ...} in media content. The set of predictions k_i and associated timing t_i from pre-trained models can be used to create a sequence of timing-event pairs (t_i, k_i). A marked temporal point process can then be used to learn self-supervised representations of the multimodal event streams that capture the underlying content. The example shown here is from the ad campaign by Nike, 2017 available at youtu.be/WYP9AGtLvRg.
8.2.1 Architecture of recurrent marked temporal point processes (RMTPP) proposed in (Du et al., 2016). BCE denotes the binary cross-entropy loss for predicting the event type and f*(t) denotes the conditional density function; the remaining symbols are system parameters specific to estimating f*(t).
8.4.1 Qualitative visualization of the self-supervised representations from audio, visual and text multimodal event streams. A few sentiment classes are highlighted.
E.0.1 ROC curves comparing the performance of face embeddings trained on images and videos for (a) IJB-B dataset without face alignment (b) IJB-B dataset with model-specific face alignment and (c) YTFaces dataset. VGGFace2 and FaceNet embeddings not only performed well in IJB-B regardless of face alignment but performed consistently well in YTFaces, which is typically more challenging as it consists of videos in-the-wild, mined from YouTube.

Abstract

Human perception involves reconciling different sources of information that may appear conflicting because of the multisensory nature of our experiences and interactions. Similarly, machine perception can benefit by learning from multiple sources of measurements to develop a comprehensive model of an observed entity or event. This thesis focuses on addressing three open research problems in the area of multiview and multimodal machine learning: (1) how to learn robust representations from unlabeled or weakly labeled data, (2) how to model many (more than two) views/modalities, and (3) how to handle absent views/modalities.
I begin by presenting a unified framework to delineate views and modalities as allied but distinct concepts in learning paradigms to facilitate subsequent modeling and analysis. For multiview learning, I propose deep multiview correlation to learn embeddings that capture the information shared across corresponding views such that they are discriminative of the underlying semantic class of events. Experiments on a diverse set of audio and visual tasks (multi-channel acoustic activity classification, spoken word recognition, 3D object classification and pose-invariant face recognition) demonstrate the ability of deep multiview correlation to model a large number of views. This method also shows excellent capacity to generalize for view-agnostic settings and when data from certain views is not available.

For multimodal learning, I explore self-supervision between images, audio and text that naturally co-occur in data. I propose cross-modal autoencoders to learn joint audio-visual embeddings and model the arrival times of multimodal events with marked temporal point processes. Experiments on a variety of tasks including sentiment and topic classification in videos show the benefit of cross-modal autoencoder embeddings to capture information complementary to individual modalities. Results underscore that point process modeling not only offers a scalable solution to model a large number of modalities but also captures the variable rate of change of information in individual modalities. The methods developed in this thesis have been successfully applied for large-scale media analytic tasks such as robust face clustering in movies and automatic understanding of video advertisements.

Chapter 1
Introduction

Human expressions, encounters, and interactions are experienced in a multisensory fashion. We perceive our surroundings through the seemingly distinct and independent senses of sight, sound, smell, taste and touch. In reality, the five senses collaborate closely, enabling us to better understand our surroundings and interactions. Humans also communicate with the aid of these multiple senses. For example, when we are happy, we may express it through our facial expression as well as our voice and body language. Shi and Mueller perhaps best articulated these concepts of perception and expression (Z. Shi & Mueller, 2013): "Surrounded by multiple objects and events, receiving multisensory stimulation, our brain must sort through relevant and irrelevant multimodal signals to correctly decode and represent the information from the same and different objects and, respectively, events in the physical world."

The neural underpinnings of human experience have also been shown to rely on multiple sensory areas of the brain. Due to the recent advances in brain imaging, neuroscientific studies have been reporting that the connections previously thought to be strictly dependent on individual sensory stimulation are in fact multisensory (Klemen & Chambers, 2012). Evidence shows that for a wide range of cognitive tasks, neural connections exist in the brain between motor and multiple sensory areas at different temporal and spatial system levels.

Machine perception, i.e., building machines which have the ability to perceive and learn like humans, is one of the central tenets of machine learning research. Such technology is often deployed towards automated understanding of human perceptual processes and the enriching of human experience.
Analogous to the multisensory nature of human perception, robust machine perception can be developed by collecting data from different sources and various modes of measurement of an underlying event to incorporate the information from them. For example, consider the task of automatically recognizing a chair. To build this model, we may need to collect images of different types of chairs from a variety of angles and extract the shared information to robustly represent and visually discriminate the object-of-interest from other objects. Similarly, to identify human-centered constructs like emotion, we need to be able to incorporate not only facial expressions but also voice patterns or physiological information (for example, heart-rate measurements) whenever available.

Much like how human perception and experience is multisensory in nature, we can develop a holistic understanding of the underlying object (for example, robust recognition of a chair) or a phenomenon (identifying expressed emotions) by factoring in the different aspects of its representation. Two important machine learning questions arise in this context:

1. How to model information about entities and events of interest, filtering through relevant and irrelevant data collected from multiple sources of measurement?

2. How to develop models that can scale to a large number of measurement sources, particularly when measurements from some sources may be absent?

These are some of the questions central to the present dissertation.

1.1 Background

Automatically understanding multimedia content such as movies, TV shows, or advertisements, and its impact on individuals is the area of interest explored in this dissertation. In this section, I will first discuss the diverse nature of multimedia content, the challenges it presents for machine learning, and how they can be addressed by learning from different aspects of the media content such as sound design, visuals and dialogue through multiview/multimodal learning.

1.1.1 Computational media intelligence (CMI)

Media is created by humans, for humans: to tell stories that educate, entertain, sell, and create general awareness. There is an imminent need for creating human-centered media analytics to illuminate the stories being told and understand their human impact. Objective, rich media content analysis has numerous applications to different stakeholders: from creators and decision/policy makers to content curators and consumers. Advances in machine learning can enable detailed and nuanced characterization of media content: of who, what and why, to understand and predict impact, both commercial and social. In recent years, this has emerged as an area for multidisciplinary research across social sciences, media studies, and psychology. This area of research is referred to as "Computational Media Intelligence" (CMI, (Somandepalli et al., 2021)). Its overarching goal is to automatically understand media stories through the portrayal of people, places and topics in the content and their impact on individuals and the society at large.

When we watch TV or a movie, we concurrently experience different aspects of the presented media content (visuals, sound design, and dialogue) at the same time. We are able to process these different aspects to develop a holistic understanding of the scene we are watching in terms of who the characters are, where they are shown, how they are portrayed and how they interact with each other.
We can also comprehend the overall message from the content that may be spread across many scenes. CMI deals with modeling and measuring the what-why-how-where-when attributes at scale to quantify their portrayal in media content. Creating this machine intelligence needs the ability to process and understand different aspects of media content (for example, images, audio, and language), each with its strengths and limitations, to understand the story being told. These different aspects are presented with a vast variability both within and across different media content types. For example, consider modeling imagery from the live-action genre versus animated content, or sounds in short-form content such as ads versus long-form content such as TV shows or movies, and broadcast news. Additionally, owing to the creative process underlying the production of the media content, individual constituent aspects may only convey partial information about the stories being told.

1.1.2 Multiview and multimodal modeling for CMI

The different aspects of media content, such as visual imagery, sound design, and language use, can be treated as the multiple views or modalities associated with the stories told through the content. In general, a view or modality can be defined as data that is sampled from observing a semantic class of events or phenomena using different sensors or instruments at different states to capture its various manifestations. In many science and engineering applications, we often rely on multiview and multimodal data to learn a reliable and comprehensive representation of an observed event. This class of (machine) learning problems is referred to as multiview and multimodal learning, which is a fundamental building block towards developing CMI.

Humans tend to learn by reconciling different views of information that might appear conflicting (Klemen & Chambers, 2012): to augment the knowledge from individual measurements to better understand the underlying event. Similarly, in multiview and multimodal learning, the objective is to learn vector representations (subspaces) that capture both the information shared across the multiple views and information distinct to a modality to assist downstream learning tasks.

Let us consider two exemplar applications to highlight the scope of CMI applications:

Automatic character labeling in movie videos: The objective here is to automatically understand who appears when in a movie video to quantify how characters are portrayed in movie videos or how they interact with each other. This system requires robust face recognition that can identify characters through their faces no matter what the pose of the face is or what kind of background visual conditions they are presented in. We need to incorporate information across different appearances of a person's face (e.g., variations in pose, illumination or occlusions). In this case, multiview learning can help capture the shared information from different views of a person's face to build models for robust face recognition.

Understanding of video advertisements: The ability to automatically describe the topic discussed in an advertisement (ad), or the sentiment evoked by it, is essential to build computational understanding of the content. In this case, different modalities (images, audio and language from, say, closed captions) may offer different cues to understanding the content.
Here, multimodal learning can help incorporate information from multiple modalities to learn a complete and robust model of what is being conveyed by the advertisement. Since ads are mostly created to sell products or create awareness, the learnt multimodal representations can help in predicting the impact of ad content.

A fundamental challenge in applying the existing multiview/multimodal frameworks to such CMI tasks is the lack of large-scale labeled media data. Creating such resources can be expensive, both in terms of manual effort and time. For example, manually labeling characters in a movie video of about 90 minutes takes up to 15 hours depending on the number of characters and style of the movie. While we cannot label the data at scale, we can certainly exploit the inherent correspondences between different aspects of media content such as visuals, sound and language. These correspondences are a result of the way in which data is collected or the content is created. The focus of this dissertation is to leverage these correspondences in media content to create weakly labeled data. The weakly labeled data can then be considered within a multiview/multimodal framework. In this context, three vital challenges remain open research questions:

1. How to learn robust representations from unlabeled or weakly-labeled multiview and multimodal data?

2. How to model an underlying semantic class of events when a large number of views and modalities are available?

3. How to handle missing or absent modalities or views?

These technical challenges form the core of this dissertation.

1.2 Thesis Statement

Semi-supervised neural-based methods that leverage the inherent correspondence between multiple views and modalities can learn comprehensive representations from unlabeled data to develop robust machine perception.

The following themes are central to this dissertation and serve as common threads across all experiments in this thesis:

1. Semi-supervised methods: Due to the sheer scale of multimodal data, it is often not practical to label all dimensions-of-interest in a dataset exhaustively. Thus, this thesis explores, in large part, unsupervised or self-supervised methods on large unlabeled datasets followed by minimal supervision on smaller related and labeled datasets to learn robust representations for downstream machine learning tasks.

2. Inherent correspondence: Inherent to the way in which the data is collected or how the content is created, we can identify correspondences between different views and modalities in media content. For example, in a video advertisement, we have implicit time alignment between the closed-caption text, video and audio streams. Similarly, if two faces appear in a movie frame, they cannot belong to the same person. We exploit such correspondences to mine weakly labeled data for self-supervised learning.

3. Comprehensive representation: The overarching objective in this dissertation is to learn the factors or representations associated with individual modalities and views and those which can capture information shared across the modalities. The central claim of this thesis is that one can learn a holistic representation of the underlying content by leveraging the inherent correspondence between multiple views and modalities. I hope that the different methodologies explored in this dissertation provide evidence to support this claim.
Figure 1.3.1: Schematic of multimodal and multiview learning paradigms. Starting with unlabeled data, the inherent correspondences between multiple views and modalities can be used to mine weak labels in such data. Next, unimodal and shared subspace representations can be learned using self-supervision methods such that these embeddings are naturally discriminative of the underlying semantic class of events.

1.3 Summary of Contributions

A schematic describing the overview of this dissertation is shown in Fig. 1.3.1. The primary contributions of this thesis are:

1. A unified framework for views and modalities: In much of the existing multiview and multimodal learning literature, the concepts of views and modalities are used interchangeably. However, within the context of application areas explored here, this thesis presents a unified framework to delineate views and modalities as allied but distinct concepts within learning paradigms to facilitate subsequent modeling and analysis.

2. Multi-view shared subspace learning: The objective here is to capture the information shared across multiple views, in the presence of a large number of views. In this thesis, deep multiview correlation is proposed to learn embeddings that capture the information shared across corresponding views such that they are discriminative of the underlying semantic class of events. This method can incorporate information from a large number of views by subsampling a smaller number of views when training the models. It also shows excellent capacity to generalize for view-agnostic settings and when data from certain views is not available.

3. Cross-modal autoencoders: To learn from the correspondence between co-occurring modalities such as audio and images in media content, cross-modal autoencoders are proposed to learn joint audio-visual representations by co-training encoder-decoder models to reconstruct one modality from another. The joint audio-visual embeddings are then obtained at the bottleneck. Analysis shows that these joint embeddings capture information complementary to unimodal autoencoder representations and improve the performance of downstream tasks such as classification and clustering.

4. Temporal point processes to model multimodal event streams: A fundamental limitation of self-supervised autoencoder-like frameworks is that they are pairwise in construction and do not scale well with a large number of modalities. Additionally, these frameworks do not account for the variable rate of change of information presented by individual modalities. To address these drawbacks, this thesis proposes to model the arrival times of events across multiple modalities as marked temporal point processes. Models pre-trained for specific tasks in individual modalities are used to generate event streams for point process modeling. This approach not only scales to a large number of modalities, but can also handle instances where some modalities may be absent or missing.

1.4 Structure

• Chapter 2 presents the literature related to multimodal and multiview machine learning along with presenting a unified framework in which views and modalities can be conceptualized as distinct concepts.

PART I: Learning shared subspaces across multiple views

• Chapter 3 presents deep multiset canonical correlation analysis as a measure of multiview correlation to model more than two views in parallel.

• Chapter 4 presents a generalized multiview correlation formulation to be able to incorporate information from hundreds of views. Detailed benchmarking experiments and theoretical analysis are presented.
• Chapter 5 presents an application of generalized multiview correlation in the speech domain for learning speech and speaker representations.

• Chapter 6 presents an application of generalized multiview correlation for robust face clustering in movie videos. Two open-sourced data resources, a large-scale weakly labeled dataset and a movie character labeling benchmark, are developed as part of this effort.

PART II: Learning shared subspaces across multiple modalities

• Chapter 7 presents a tied crossmodal autoencoder approach to learn self-supervised representations between audio and visual modalities in videos.

• Chapter 8 presents a novel direction for modeling more than two modalities in a self-supervised setup using temporal point processes to model multimodal event streams.

• Chapter 9 presents a summary of the methods developed in this thesis and future research directions.

Chapter 2
Prior Work

2.1 Views vs. Modalities

Multiple views and modalities in data are instances sampled from observing a phenomenon, an entity or an event with different instruments, at different states, to capture its various manifestations. To further contextualize the scope of problems we consider for experiments and analysis, we delineate two kinds of allied but distinct learning formulations: multi-view and multi-modal problems, following the ideas proposed in a recent survey (Ding, Shao, & Fu, 2018). Although in related work in these domains the two terms, views and modalities, are used interchangeably (see surveys (Ding et al., 2018; Y. Li et al., 2018; Zhao, Xie, Xu, & Sun, 2017)), they present distinct modeling challenges and opportunities if we consider the source from which they are generally acquired.

Multiple views of an event are samples drawn from identically or similarly distributed random processes; for example, a chair photographed from different angles. The semantic class (here, chair) and the different view angles can be treated as different modes of the same underlying distribution. Depending on which parameter of the distribution is modified, we either get the same object class in different angles or different objects imaged in the same angle. In practice, multiple views of an event often share similar characteristics of domain and feature space. In our example of a chair imaged in multiple angles, the different view samples are all digital images. In contrast, samples from different modalities need not arise from similarly distributed processes. For example, consider a person's expressed emotions such as happiness, measured by their facial expression, voice, and body language. The individual modalities may not be similarly distributed but when considered together help understand the expressed emotion. Another example is automatically recognizing a person from their face and voice.

This functional distinction between views and modalities is interesting to highlight for several reasons:

1. Systematize learning paradigms: Treating views and modalities distinctly helps to systematize the objective of the learning tasks in such data. For example, learning from multiple views focuses on representing the event being observed by capturing the shared information. Thus, one may conceptualize the task as learning a common or shared subspace representation across multiple corresponding views. For example, consider recognizing an object as a chair from different angles (Su, Maji, Kalogerakis, & Learned-Miller, 2015). However, when dealing with multiple modalities such as images, audio and language, we are often interested in learning the joint subspace representation that captures different aspects of the observed entity or event: consider multimodal emotion recognition (Pham, Liang, Manzini, Morency, & Poczos, n.d.).
2. Adopt self-supervision methodologies: The age of representation learning has given rise to many ideas that require minimal or no manual labeling. Self-supervised learning deals with learning the underlying semantic concepts by leveraging different signals that may naturally co-occur in data. Perhaps one of the earliest large-scale applications of this idea was to learn word representations, called word2vec (Mikolov, Chen, Corrado, & Dean, 2013), using the context inherent to language use. Self-supervision has also been successfully applied in computer vision tasks by setting up pretext or proxy tasks to learn rich representations from unlabeled images. Some examples of proxy tasks include solving a jigsaw puzzle (Doersch, Gupta, & Efros, 2015), image colorization (R. Zhang, Isola, & Efros, 2016) and time-aligning video frames (X. Wang & Gupta, 2015). Similar ideas can also be adopted when multiple views and modalities are available. In multi-view data, contrastive-loss or triplet-loss methods can be readily applied since the correspondences between multiple views may be readily available in domains such as videos (Somandepalli & Narayanan, 2019). Here the anchoring signal is whether a given set of views belong to the same semantic class. In contrast, for multimodal data, the supervision is directly between the co-occurring modalities. Some popular examples include visual ambience learning from audio supervision (Owens, Wu, McDermott, Freeman, & Torralba, 2016) and cross-modal similarity learning (Jansen et al., 2019).

3. Connections with domain adaptation: The functional distinction of views as distinct from modalities helps us to formulate domain adaptation problems in a multiview paradigm. For example, consider the case of spoken word recognition where we need to recognize a wake-word such as "Alexa". The words spoken by different speakers or repeated by the same speaker can be assumed to be drawn from similarly/identically distributed processes. The signal component common to these processes gives a robust representation of the wake-word (Somandepalli, Kumar, Jati, Georgiou, & Narayanan, 2019). The premise that multiple views can be modeled as samples from identically distributed processes allows us to adapt our framework for applications that need to scale for hundreds of views (e.g., wake-word recognition invariant to the speakers).

[Figure 2.1.1: taxonomy diagram grouping multiview and multimodal subspace learning methods under subspace alignment (correlation-based, metric learning), generative models (graphical models, deep generative), and fusion methods (early/late fusion, attention, autoencoders), with representative examples including CCA [Hotelling, 1936], kernel-CCA [Kettenring, 1971], MDBP [Li '16], Deep Gen-CCA [Benton '17], correspondence LDA [Blei '03], multiview transition HMM [Ji '16], DBM [Srivastava '14], GAN [Vukotic '17], m-VAE [Wu '18], mv-CNN [Su '15], correlational RNN [Yang '17], recursive attention [Beard '18], mv-attention [Zadeh '18], CCA-AE [Wang '15] and cross-modal AE [Dumpala '19].]

Figure 2.1.1: Taxonomy of different methodologies for multiview and multimodal learning

2.2 Multiview and multimodal learning

We define multiview machine learning as the task of learning a vector (low-dimensional) representation that captures the information common to multiple views.
Similarly, multimodal machine learning aims to build models that can process and relate information from multiple modalities to predict the entity or event being observed by the modalities. These representations are used to explore the underlying structure in the data in an unsupervised fashion (Parra, Haufe, & Dmochowski, 2018) or to provide robust features for downstream tasks such as classifying the event being observed (W. Wang, Arora, Livescu, & Bilmes, 2015). As described in Figure 2.1.1, the methods proposed in the literature can be generally classified into three broad categories: (1) subspace alignment, (2) generative models and (3) fusion methods. Note that many proposed methods in the literature may belong in more than one of these categories. The following sections give a brief overview of these categories, their strengths and limitations for addressing some of the open research problems in this area.

2.3 Subspace alignment methods

Subspace alignment deals with learning a projection space between two or more views in such a way that a measure of correlation between them is maximized (correlation-based methods). One of the earliest works in this class of methods is called canonical correlation analysis (CCA), proposed by Hotelling (Hotelling, 1936). CCA finds a linear transformation between two sets of variables such that their projections are maximally correlated. By definition, CCA is constrained to two sets of variables. Five different methods were proposed by Kettenring (Kettenring, 1971) to extend CCA for multiple sets of variables by considering the sums of pairwise correlations. Subsequently, kernel versions (S.-Y. Huang, Lee, & Hsiao, 2009) and deep learning based methods (Andrew, Arora, Bilmes, & Livescu, 2013; Dumpala, Sheikh, Chakraborty, & Kopparapu, 2018) were proposed to learn possibly non-linear projections to learn maximally correlated subspaces.

Alignment between the data from multiple views can also be achieved by learning a metric subspace to maximize the similarity between corresponding sets of variables (metric-learning). Generalized CCA (GCCA) was proposed by Horst et al. (Horst, 1961) to extend CCA to multiple sets of variables by learning an orthonormal representation that is close (in terms of angle between the subspaces) to the multiview data matrices up to a linear transformation that is estimated by a view-specific matrix. A deep learning extension to GCCA was proposed in (Benton et al., 2017) to learn non-linear transformations of high-dimensional datasets. Another prominent metric-learning approach called multi-view discriminative bilinear projection (MDBP (S. Li, Li, & Fu, 2016)) was proposed to learn shared subspaces between multiview time series. A few prominent applications of these methods are for learning correlated components from brain signals across individuals (Parra et al., 2018) and phoneme recognition using speech features and articulatory information (Andrew et al., 2013).

The central premise of this class of methods is that a shared subspace can be learnt because of the correspondence between the multiple views in the data. Such a subspace representation is inherently discriminative (Livescu & Stoehr, 2009) or can be additionally supervised by using the label space as a different view (Benton et al., 2017). Many of these methods, in their linear version, have a global optimum which can be solved using SVD (Benton et al., 2017) or eigenvalue decomposition (Andrew et al., 2013; S. Li et al., 2016; Livescu & Stoehr, 2009).
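To make the closed-form linear solution concrete, the sketch below computes two-view CCA by whitening the within-view covariances and taking an SVD of the whitened cross-covariance, which is the standard textbook construction referenced above. It is an illustrative sketch, not code from any of the cited works; the function name, the ridge term reg and the dimension k are assumptions for clarity and numerical stability.

```python
# Minimal two-view linear CCA sketch (illustration only, not the thesis code).
import numpy as np

def linear_cca(X, Y, k=2, reg=1e-4):
    """Return projections (Wx, Wy) and the top-k canonical correlations."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)                    # center each view
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])      # within-view covariances,
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])      # regularized for stability
    Sxy = Xc.T @ Yc / (n - 1)                                 # between-view covariance

    def inv_sqrt(S):                                          # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)                   # whitened cross-covariance
    U, s, Vt = np.linalg.svd(T)                               # singular values = canonical corrs
    Wx = inv_sqrt(Sxx) @ U[:, :k]
    Wy = inv_sqrt(Syy) @ Vt[:k].T
    return Wx, Wy, s[:k]

# Usage with synthetic correlated views (hypothetical data):
# rng = np.random.default_rng(0); Z = rng.normal(size=(500, 3))
# X = Z @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
# Y = Z @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
# Wx, Wy, rho = linear_cca(X, Y)
```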
Local optima often arise because of the assumption of a graph Laplacian kernel (e.g., multiview clustering (Kumar, Rai, & Daume, 2011)) or through the use of a deep learning framework (Benton et al., 2017). In most of these methods, the number of learnable parameters scales at least linearly with the number of views (Kumar et al., 2011). Additionally, in the multimodal case, no mechanisms have been proposed to explicitly dissociate the factors corresponding to the shared information and view-specific information in such data.

2.4 Generative models

The multiple views or modalities are different manifestations of an intrinsic process which can be parameterized by latent factors associated with the dataset being examined. The goal of generative modeling is to learn a feature space associated with the generative process that is modeled with a set of latent factors that parameterize the distribution of the multiple views in a given dataset (Y. Li et al., 2018). Simple probabilistic graphical approaches include Gaussian mixture models (GMM) (Taskar, Segal, & Koller, 2001), which use a single discrete latent variable to represent the joint clustering of two modalities in a dataset, hidden Markov models (HMM), and latent Dirichlet allocation (Blei & Jordan, 2003). Prominent examples include the application of multinomial GMM to jointly model images and captions (Barnard et al., 2003), HMM to model the view-transition matrix for action recognition where different feature sets were treated as views (Ji, Ju, Wang, & Wang, 2016), and correspondence latent Dirichlet analysis, which allows for representation of spatial and temporal changes in multimodal factors (Blei & Jordan, 2003).

Along similar lines, factor analysis methods (Harman, 1960) have been efficiently used by treating different modes of variability as multiple views to learn a global low-dimensional representation of the attribute to be classified in a dataset (Lian, Rai, Salazar, & Carin, 2015). One prominent use case is to obtain speech representations robust to speaker or channel variability with total variability modeling (TVM) (Dehak, Kenny, Dehak, Dumouchel, & Ouellet, 2011), which has been shown to perform effectively for speaker and language identification (Dehak et al., 2009; Travadi, Segbroeck, & Narayanan, 2014) among other speech processing applications.

With the advent of deep learning, approaches such as generative adversarial networks (GAN) have also been successfully deployed for multimodal learning. A bi-modal deep Boltzmann machine (BM) was proposed in (Srivastava & Salakhutdinov, 2012) by viewing multiple restricted BMs as unimodal undirected pathways to learn joint image-text representations for classification and information retrieval. A cross-modal GAN was introduced in (Vukotic, Raymond, & Gravier, 2017) for the task of video hyperlinking with visuals and text as the two modalities. One of the primary benefits of generative modeling is the ability to handle missing modalities at inference time by learning efficient joint distributions over the multiple modalities. This was explored in (M. Wu & Goodman, 2018) with a multimodal variational autoencoder (VAE) with a product-of-experts inference framework for machine translation, among other tasks. Multimodal VAEs were also effectively applied for fake news detection (Khattar, Goud, Gupta, & Varma, 2019). Some of the recent approaches have also focused on multiview data generation by considering pairwise samples in the dataset as belonging to different views (M. Chen, Denoyer, & Artieres, 2017), circumventing the need for specific view labels.
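One concrete reason such generative models can tolerate absent modalities is the product-of-experts (PoE) posterior used in the multimodal VAE of (M. Wu & Goodman, 2018): for diagonal Gaussian experts the product has a closed form, and missing modalities are simply left out of the product. The sketch below illustrates only that fusion step, not the full VAE; the function name and the unit-Gaussian prior expert are assumptions made for the example.

```python
# Hedged sketch of product-of-experts fusion over diagonal Gaussian posteriors.
import numpy as np

def poe_fusion(mus, logvars):
    """mus, logvars: lists of (d,) arrays, one per *available* modality."""
    # Start from the unit-Gaussian prior expert N(0, I), then multiply in each
    # modality expert; the joint precision is the sum of expert precisions and
    # the joint mean is the precision-weighted sum of expert means.
    precisions = [np.ones_like(mus[0])] + [np.exp(-lv) for lv in logvars]
    weighted = [np.zeros_like(mus[0])] + [m * np.exp(-lv) for m, lv in zip(mus, logvars)]
    joint_precision = np.sum(precisions, axis=0)
    joint_mu = np.sum(weighted, axis=0) / joint_precision
    joint_logvar = -np.log(joint_precision)
    return joint_mu, joint_logvar

# If, say, only the audio and text encoders produce posteriors for a sample,
# the image expert is dropped from the lists and the rest of the model is unchanged.
```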
A major limitation in (conditional) generative models is that, without additional supervision, much like the total variability modeling approaches, these generative models pick out joint distributions along the dominant modes of variability in a given dataset. As such, their generalizability across datasets, especially with limited amounts of data, remains to be proven.

2.5 Fusion Methods

Fusion methods are perhaps the most versatile and widely used group of methods because they can be discriminatively trained for a task at hand. There is a long history of using early fusion (concatenating the features of different modalities) or late fusion (making a decision based on unimodal predictions) with machine learning approaches such as SVM (Molina et al., 2014). Such fusion methods can be scaled up using convolutional neural networks (CNN) for applications such as multiview 3D object detection (Su et al., 2015) by max-pooling the embeddings from different views to, say, classify a 3D object. One of our initial contributions was within the scope of the fusion methods, where we showed that late fusion of features from audio, visual and physiological data using Kalman filters can efficiently track the dimensional measures of affect (arousal and valence) in time (Somandepalli et al., 2016) for conversational data. These methods can also be extended to temporal modeling: correlational recurrent neural networks (Co-RNN) (Yang et al., 2017) were proposed to fuse temporal features from multiple modalities. Here, a pairwise correlation loss between the intermediate states generated by an RNN was used to ensure that the joint representations were uncorrelated in time.

One of the central challenges in building discriminative multimodal models is the ability to focus on different modalities or features to a variable extent. This can be achieved using the attention mechanism introduced for deep learning methods (Bahdanau, Chorowski, Serdyuk, Brakel, & Bengio, 2016), which learns to weigh different parts of an embedding with different importance as it relates to the output. Attention can also be thought of as a subspace alignment method if one of the subspaces is the label space of the data (Y. Li et al., 2018). A prominent application of attention is for multimodal emotion recognition. A recursive temporal attention mechanism for fusing face, speech and language was proposed in (Beard et al., 2018). Another approach proposed distinct attention modules to learn intra- and inter-view interactions (Zadeh et al., 2018) for classifying emotion in videos.

In the proposed taxonomy, we have also classified autoencoder (AE) models under fusion methods because the extensions of AE, with deep learning, to multimodal applications often include concatenating the intermediate layer representations. Our own work in (Somandepalli, Martinez, Kumar, & Narayanan, 2018) showed that tied cross-modal AEs learn uncorrelated features useful for multimodal tasks. More recently, AE with CCA to obtain correlated intermediate-level features has also been proposed (W. Wang et al., 2015) for letter recognition. Furthermore, cross-modal AE with CCA was recently used for emotion recognition (Dumpala et al., 2018) to learn correlated cross-modal features for unlabeled data. AE approaches are indeed promising toward the task of learning factorized representations in multimodal data, i.e., learning to disentangle modality-specific and shared representations in an unsupervised fashion. Two recent works (Hsu & Glass, 2018; Tsai, Liang, Zadeh, Morency, & Salakhutdinov, 2018) have explored this line of research, for tasks such as digit recognition. However, none of these methods have been developed for multimodal data, especially for unsupervised tasks.
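To make the cross-modal autoencoder idea concrete before it is developed in Chapter 7, the sketch below pairs two encoders and two decoders and trains them with a cross-reconstruction loss, taking the bottleneck activations as joint audio-visual embeddings. It is a minimal illustration of the general idea only: the layer sizes, feature dimensions and plain MSE objective are assumptions, and the segment-level construction and weight tying used in Chapter 7 differ in detail.

```python
# Minimal cross-modal autoencoder sketch (illustrative, not the Chapter 7 model).
import torch
import torch.nn as nn

class CrossModalAE(nn.Module):
    def __init__(self, d_audio=128, d_video=512, d_shared=64):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_audio, 256), nn.ReLU(), nn.Linear(256, d_shared))
        self.enc_v = nn.Sequential(nn.Linear(d_video, 256), nn.ReLU(), nn.Linear(256, d_shared))
        self.dec_a = nn.Sequential(nn.Linear(d_shared, 256), nn.ReLU(), nn.Linear(256, d_audio))
        self.dec_v = nn.Sequential(nn.Linear(d_shared, 256), nn.ReLU(), nn.Linear(256, d_video))

    def forward(self, a, v):
        za, zv = self.enc_a(a), self.enc_v(v)
        # cross-reconstruction: the audio embedding decodes video and vice versa
        return self.dec_v(za), self.dec_a(zv), za, zv

model = CrossModalAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
a, v = torch.randn(32, 128), torch.randn(32, 512)   # stand-in co-occurring segments
opt.zero_grad()
v_hat, a_hat, za, zv = model(a, v)
loss = mse(v_hat, v) + mse(a_hat, a)                 # cross-modal reconstruction loss
loss.backward()
opt.step()
# After training, za and zv (or their concatenation) serve as joint embeddings
# for downstream classification or clustering.
```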
Two recent works (Hsu & Glass, 2018; Tsai, Liang, Zadeh, Morency, & Salakhutdinov, 2018) have explored this line of research for tasks such as digit recognition. However, none of these methods have been developed for multimodal data, especially for unsupervised tasks.

Part I
Multiview shared subspace learning

Chapter 3
Deep Multiview Correlation

In this chapter, we start exploring canonical correlation analysis (CCA) based methods to develop multiview shared subspace learning methods. We propose deep multiset CCA as a measure of multiview correlation to model more than two views of data in parallel. We formulate this measure as a loss function in a neural network framework to learn potentially non-linear transformations of the input data and obtain low-dimensional representations that are correlated across the multiple views. We conduct simulations and experiments on synthetic data and simple tasks such as noisy digit classification to study the applicability of the proposed method. We have released the code and more results at github.com/usc-sail/mica-deep-mcca. (The work presented in this chapter has been uploaded to arXiv while being prepared for a journal publication: Somandepalli et al., "Multimodal Representation Learning Using Deep Multiset Canonical Correlation Analysis," arXiv, 2019.)

3.1 Introduction

In many signal processing applications, we often observe the underlying phenomenon or mechanism through different modes of signal measurement, or views or modalities. For example, a person's facial identity can be captured by taking their photographs under various lighting conditions, at different angles, and with or without makeup. These observations are often corrupted by sensor noise, as well as variability of the signal itself, which can be influenced by interactions with other nuisance factors such as pose and illumination. The question then becomes: given that the different views are observing the same phenomenon, how do we learn to represent the information that is common across them? Multiview representation learning refers to this task of learning a subspace that captures the information shared across the views. Multiview representations have been shown to capture the variability of the underlying signal better than a single view (Sridharan & Kakade, 2008), and to generalize better to unseen data. Recently, deep-learning-based methods in this domain have been successful for downstream tasks such as clustering or classification. Some examples of related works are multimodal autoencoders (Ngiam et al., 2011), multimodal restricted Boltzmann machines (Srivastava & Salakhutdinov, 2012), and our own work on shared representations using segment-level autoencoders (Somandepalli et al., 2018). Most of these methods optimize some form of reconstruction error. In contrast, methods like canonical correlation analysis (Anderson, 1958; Hotelling, 1936) and kernel CCA (KCCA (Akaho, 2006)) are widely used to find projections that maximally correlate two spaces. However, by definition, these methods are limited to two views. CCA and KCCA have been applied to address a broad range of problems and applications, from unsupervised learning for fMRI data (e.g., (Hardoon, Szedmak, & Shawe-Taylor, 2004)) to decreasing sample complexity for regression (Kakade & Foster, 2007). See (Andrew et al., 2013) for other application areas. One of the drawbacks of CCA is that it can only learn a linear projection that maximizes the correlation of the signal between the views.
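To make the classical two-view formulation concrete, the linear CCA projections can be computed in closed form from the sample covariances. The following NumPy sketch is only an illustration of this textbook whitening-based solution (it is not code from the released repository):

import numpy as np

def linear_cca(X1, X2, k, reg=1e-4):
    # X1: (n, d1), X2: (n, d2) paired observations of the two views.
    # Returns the projection matrices and the top-k canonical correlations.
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / (n - 1) + reg * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / (n - 1) + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Whiten each view and take the SVD of the whitened cross-covariance;
    # the singular values are the canonical correlations.
    K11, K22 = inv_sqrt(S11), inv_sqrt(S22)
    U, corr, Vt = np.linalg.svd(K11 @ S12 @ K22)
    return K11 @ U[:, :k], K22 @ Vt[:k, :].T, corr[:k]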
This can be somewhat alleviated by KCCA, where a non-parametric kernel is used to learn nonlinear representations. However, the ability of KCCA to generalize is limited by the fixed kernel. Deep CCA (DCCA (Andrew et al., 2013)) addresses some of the drawbacks of CCA and KCCA by learning possibly nonlinear transformations of the data that can maximize the correlation between the two views. DCCA has been successfully applied to several applications where the dimensionality of the inputs is high and the relationship between them may be complex and nonlinear. Some prominent examples are unsupervised learning of acoustic features from both acoustic and articulatory data (X. Wang & Gupta, 2015), matching images with captions (F. Yan & Mikolajczyk, 2015), multilingual word embeddings (Lu, Wang, Bansal, Gimpel, & Livescu, 2015) and audio-visual speech recognition (Mroueh, Marcheret, & Goel, 2015). In spite of the success of DCCA, by definition of the problem it can only operate on two views. Hence, in problems with arbitrarily many views, this approach does not scale. One of the first works to extend CCA for several 'sets of variables' was proposed in (Kettenring, 1971). Following this work, there have been several versions of multiset CCA. A seminal work in this direction is by Nielsen (Nielsen, 2002), which examines five different formulations to optimize multiset correlations, including maximizing the sum of correlations, referred to as SUMCOR (Nielsen, 2002). The common element in all these formulations is to examine the eigenvectors (projections) of the sample covariance matrices of the datasets. Another popular method for multiset CCA is generalized CCA (GCCA (Horst, 1961)), which finds a low-dimensional orthonormal representation of the data by whitening the data matrix across all views. This formulation is similar to the MAXVAR approach proposed in (Kettenring, 1971). GCCA, along with its kernel variant, has been successfully applied to multiview problems such as multiview latent semantic analysis (Rastogi, Van Durme, & Arora, 2015) and to model a discriminative multiview latent space (Sharma, Kumar, Daume, & Jacobs, 2012). Similar to CCA, GCCA methods are also limited to learning linear projections or to a fixed kernel. An extension of GCCA with deep learning, DGCCA (Benton et al., 2017), has been proposed to generalize GCCA to nonlinear transformations. It is important to note that DGCCA only considers the sum of within-view covariances, which is similar to principal component analysis (PCA), in that it captures components with maximum variance within a view. In contrast, the multiset CCA (Kettenring, 1971; McKeon, 1967) formulation that extends CCA to multiple datasets considers both the inter-set (between views) and intra-set (within view) covariances. The crux of the MCCA formulation is to find a linear projection space that maximizes some form of the ratio between the sum of inter-set and intra-set covariances. A formal characterization and theoretical results for MCCA were published in (Parra et al., 2018), showing that maximizing the ratio of inter-set and intra-set covariances is akin to maximizing the 'reliability' of repeated measurements. It is important to note that this ratio of covariances is similar to the ratio of scatter matrices used in linear discriminant analysis (LDA). However, LDA and MCCA are conceptually different in that LDA needs class information but MCCA only needs the view-correspondence information.
The relation between LDA and MCCA has also been shown in (Parra et al., 2018). In this work, we propose deep MCCA (dMCCA) to be able to leverage the nonlinear relationships between the views using deep learning. Our work also enables a CCA-like formulation for arbitrarily many views, and shows that such a network can be optimized with stochastic gradient descent (SGD) in a mini-batch fashion. While we address the same problem as in (Benton et al., 2017), we use the MCCA formulation in (Kettenring, 1971; Parra et al., 2018) that can model both the between-view and within-view covariance to obtain maximally correlated components. We first analyze our dMCCA network with synthetic data. We also evaluate our model on noisy handwritten digit data with more than two views, and compare our performance with that of MCCA and DGCCA, as well as PCA and supervised methods, for a downstream task of classification.

3.2 Background

In this section, we briefly review CCA, GCCA, their deep-learning variants, and MCCA. We discuss how this framework relates to deep multimodal representation learning in general, and end with a brief note on the similarities to LDA.

3.2.1 CCA, GCCA and deep learning variants

Let $(X_1, X_2) \in \mathbb{R}^{n_1} \times \mathbb{R}^{n_2}$ denote vectors corresponding to the two views, with covariances $(\Sigma_{11}, \Sigma_{22})$ and cross-covariance $\Sigma_{12}$. CCA finds linear projections in the direction that maximizes the correlation between them:

$(v_1, v_2) = \arg\max_{v_1 \in \mathbb{R}^{d_1},\, v_2 \in \mathbb{R}^{d_2}} \dfrac{v_1^\top \Sigma_{12} v_2}{\sqrt{v_1^\top \Sigma_{11} v_1 \; v_2^\top \Sigma_{22} v_2}}$    (3.2.1)

In the deep CCA framework, the data matrices above are instead $X_1 \to H_1$, $X_2 \to H_2$, where $H_1 = f(X_1)$ and $H_2 = g(X_2)$ are outputs from the top-most fully connected layers of the two (possibly deep) neural network transformations $f$ and $g$, respectively. The loss function used to optimize this network is the sum of the top $k$ singular values of the matrix $T = \hat{\Sigma}_{11}^{-1/2} \hat{\Sigma}_{12} \hat{\Sigma}_{22}^{-1/2}$, where the covariance matrices are estimated on $H_1$ and $H_2$. These networks can be either full-batch optimized (Andrew et al., 2013) or optimized using mini-batches in a stochastic method (F. Yan & Mikolajczyk, 2015).

GCCA and MCCA extend the CCA formulation to arbitrarily many views. They work with an $N \times D \times M$ tensor, where $N$ is the number of data samples and $M$ is the number of views, each with dimension $D$ (notation consistent with (Parra et al., 2018)). Let $X^l \in \mathbb{R}^{d_l \times N}$, $l = 1, \ldots, M$, where the feature dimension $d_l$ can vary with the view. GCCA seeks to find a low ($k$-)dimensional shared representation $G \in \mathbb{R}^{k \times N}$ and a view-specific rotation matrix $U^l \in \mathbb{R}^{d_l \times k}$ by solving:

$\min_{U^l, G} \sum_{l=1}^{M} \| G - U^{l\top} X^l \|_F^2 \quad \text{s.t.} \quad GG^\top = I_k$    (3.2.2)

In other words, GCCA finds a low-dimensional orthonormal space $G$ that minimizes the reconstruction error for all views up to a rotation matrix $U^l$. In the deep variant DGCCA, the $M$ networks are optimized using SGD on mini-batches with the following objective:

$\min_{U^l, G} \sum_{l=1}^{M} \| G - U^{l\top} H^l \|_F^2 \;\Longleftrightarrow\; \max_{G \in \mathbb{R}^{k \times N},\, GG^\top = I_k} \mathrm{Tr}(G \mathbf{M} G^\top) = \sum_{i=1}^{k} \lambda_i(\mathbf{M})$    (3.2.3)

where $\mathbf{M} = \sum_{l=1}^{M} P^l$ is the sum of the projection matrices that whiten the data, with $P^l = H^{l\top} (R_W^l)^{-1} H^l$ and $R_W^l = H^l H^{l\top}$. Notice that $R_W^l$ is akin to the covariance of the $l$-th view if the hidden layer representations are mean-centered. Thus, the (D)GCCA method maximizes the top $k$ eigenvalues of the sum of whitening matrices for all views. See (Benton et al., 2017) for more details.

3.2.2 Multiset CCA

As mentioned earlier, whereas the GCCA method only examines the within-view covariances, MCCA examines both the between- and within-view covariances.
Let $X^l \in \mathbb{R}^{D \times N}$, $l = 1, \ldots, M$ be the $M$ views with $N$ samples of $D$ dimensions each. The inter-set correlation (ISC) is defined as:

$\rho_d = \dfrac{1}{M-1} \dfrac{v_d^\top R_B v_d}{v_d^\top R_W v_d}, \quad d = 1, \ldots, D$    (3.2.4)

where $R_B$ and $R_W$ are referred to as the between-set and within-set covariance matrices, defined as:

$R_B = \sum_{l=1}^{M} \sum_{k=1,\, k \neq l}^{M} \bar{X}^l (\bar{X}^k)^\top$    (3.2.5)

$R_W = \sum_{l=1}^{M} \bar{X}^l (\bar{X}^l)^\top$    (3.2.6)

where $\bar{X} = X - E(X)$ are the centered datasets. We omit the common scaling factor $\frac{1}{M(N-1)}$ here. MCCA finds the set of projection vectors $v_d$, $d = 1, \ldots, D$ that maximizes the ratio of sums of between-set and within-set covariances by solving the following generalized eigenvalue (GEV) problem:

$R_B V = R_W V \Lambda, \quad \Lambda \text{ diagonal with } \Lambda_{dd} = \lambda_d$    (3.2.7)

In summary, MCCA finds the projection of the data that maximizes the ISC in Eq. (3.2.4) by finding the principal eigenvector of the between-set over within-set covariance. In simpler terms, MCCA examines the eigenvalue decomposition of $R_W^{-1} R_B$ when $R_W$ is invertible.

3.3 Deep Multiset Canonical Correlation Analysis (dMCCA)

In this section, we describe dMCCA, which leverages the deep learning framework to learn possibly nonlinear transformations of data from many views such that the representations learnt are maximally correlated (as defined by the ISC). We describe our loss function and provide a sketch for deriving the gradients for backpropagation, following related work (Andrew et al., 2013; Benton et al., 2017; Dorfer, Kelz, & Widmer, 2015). In practice, computing $R_B$ can be computationally expensive because all pairs of views need to be considered. Instead we estimate the total covariance $R_T$, and estimate $R_B = R_T - R_W$ as follows:

$R_T = M^2 (\bar{H} - \mu \mathbf{1}^\top)(\bar{H} - \mu \mathbf{1}^\top)^\top$    (3.3.1)

where $\bar{H} = \frac{1}{M}\sum_{l=1}^{M} H^l$ is the average across all views, and $\mu = \frac{1}{N}\sum_{t=1}^{N} \big(\frac{1}{M}\sum_{l=1}^{M} h_t^l\big)$ is the grand mean across all samples of all views. The pseudocode of our algorithm is given in Alg. 1. We first initialize the weights of the $M$ $d$-layer networks and perform a forward pass of the input data to obtain the top-most layer activations $H^l$. We then estimate the between-set and within-set covariances with Eqs. (3.2.5)-(3.2.6) and solve the GEV problem in Eq. (3.2.7) using the Cholesky decomposition of $R_W$ (line 6, Alg. 1), as described in (Parra et al., 2018). We recompute Eq. (3.2.4) using the eigenvectors $V$ to estimate the ISC (line 7, Alg. 1). Our objective is to maximize the average ISC. Notice that the GEV solution $V$ simultaneously diagonalizes $R_B$ and $R_W$; hence the loss function, which is a ratio of matrix traces, is equivalent to the ratio of the corresponding diagonal elements. In order for backpropagation to work, we must be able to derive gradients of the loss function. We begin with the GEV problem in Eq. (3.2.7):

$\lambda_d R_W v_d = R_B v_d = (R_T - R_W) v_d$    (3.3.2)

$\implies R_T v_d = (\lambda_d + 1) R_W v_d$    (3.3.3)

The partial derivative of the eigenvalue $\lambda_d$ with respect to each hidden layer representation $H^l$ can be written as follows (de Leeuw, 2011):

$\dfrac{\partial \lambda_d}{\partial H^l} = v_d^\top \left[ \dfrac{\partial R_T}{\partial H^l} - (\lambda_d + 1) \dfrac{\partial R_W}{\partial H^l} \right] v_d$    (3.3.4)

WLOG, assume $\mu = 0$ in Eq. (3.3.1) and, with the simpler notation $\bar{H} \to H$,
the partial derivative of $R_T$ over a mini-batch of size $M$ is:

$\dfrac{\partial [R_T]_{ab}}{\partial [H]_{ij}} = \begin{cases} \frac{2}{M-1}\left([H]_{ij} - \frac{1}{M}\sum_k [H]_{ik}\right) & \text{if } a = b = i \\ \frac{1}{M-1}\left([H]_{bj} - \frac{1}{M}\sum_k [H]_{bk}\right) & \text{if } a = i,\ b \neq i \\ \frac{1}{M-1}\left([H]_{aj} - \frac{1}{M}\sum_k [H]_{ak}\right) & \text{if } a \neq i,\ b = i \\ 0 & \text{otherwise} \end{cases}$    (3.3.5)

We omit the derivation of the within-set covariance gradients $\partial R_W$, as it is identical to the derivation in (Andrew et al., 2013; Dorfer et al., 2015). The gradients needed for backpropagation can then be calculated by substituting Eqs. (3.2.5) and (3.2.6) into the loss function (Line 7, Alg. 1).

Algorithm 1: Deep Multiset Canonical Correlation Analysis
Input: $M$ views with batch size $B$: $[X^1, \ldots, X^M]$, $X^l \in \mathbb{R}^{B \times D}$; learning rate $\eta$
Output: $K$-dim representations from a $d$-layer network: $[H^1_d, \ldots, H^M_d]$
Initialize: $M$ network weights $[W^1, \ldots, W^M]$
1  while not converged do
2      for l = 1, 2, ..., M do
3          $H^l \leftarrow$ forward pass of $X^l$ with $W^l$
4      end
5      Estimate $R_B$ and $R_W$ from $H^l$, $l = 1, \ldots, M$
6      Solve for $V$ in Eq. (3.2.7) by factorizing $R_W = LL^\top$
7      Compute $\mathcal{L} = \mathrm{Tr}(V^\top R_B V) / \mathrm{Tr}(V^\top R_W V) = \frac{1}{D}\sum_{d=1}^{D} \lambda_d$
8      for l = 1, 2, ..., M do
9          $\Delta W^l \leftarrow \mathrm{backprop}(\partial \mathcal{L}/\partial H^l, W^l)$
10         $W^l \leftarrow W^l - \eta\, \Delta W^l$
11     end
12 end

3.4 Experiments

3.4.1 Simulation Experiments

To evaluate whether dMCCA is learning highly correlated components, we generate synthetic observations as detailed in (Parra et al., 2018), where the number of common signal components across the different views is known. Because the source signal is given, we can build a supervised deep learning model to reconstruct the source signal, which provides an empirical upper bound on the performance in our experiments.

Data Generation

Consider $N$ samples of signal and noise components for $M$ views, $s_n^l \in \mathbb{R}^K$ and $b_n^l \in \mathbb{R}^D$, $n = 1, \ldots, N$, $l = 1, \ldots, M$, $K < D$, respectively, both drawn from a standard normal distribution. Because our objective is to obtain correlated components across the views, we fixed the same signal component across the $M$ views, i.e., $s_n^l \equiv s_n$, but corrupted it with a view-specific noise $\epsilon^l$. Thus, the signals were mapped to the measurement space as $x_{s,n}^l = A_s^l s_n + \epsilon^l$ and $x_{b,n}^l = A_b^l b_n^l$, and were z-normalized. The multiplicative noise matrices were generated as $A_s^l = O_s^l D_s^l \in \mathbb{R}^{D \times K}$ and $A_b^l = O_b^l D_b^l \in \mathbb{R}^{D \times D}$. The two matrices $O_s^l \in \mathbb{R}^{D \times K}$ and $O_b^l \in \mathbb{R}^{D \times D}$ are composed of orthonormal columns. The non-zero eigenvalues of the signal and noise covariance matrices were set with $D_s^l \in \mathbb{R}^{K \times K}$ and $D_b^l \in \mathbb{R}^{D \times D}$ by constructing $D_{ii} = \exp(d_i)$, $d_i \sim \mathcal{N}(0, 1)$. We used different matrices $A_s^l$ and $A_b^l$ to simulate a case where different views of the underlying signal are corrupted by different noise. As is the case with many real-world datasets, the noise in the measurement signal is further correlated between the views. We simulated this by mixing the noise components across views, $x_{b,n}^l \leftarrow \gamma\, x_{b,n}^l + (1-\gamma)\, x_{b,n}^{k}$ for $k \neq l$, with $\gamma \in [0, 1]$. Finally, the SNR of the measurements is controlled by $\alpha$ to generate the multiview data as $y_n^l = \alpha\, x_{s,n}^l + (1-\alpha)\, x_{b,n}^l$, $\alpha \in [0, 1]$, resulting in a data matrix of size $N \times D \times M$ with $N$ samples of $D$-dimensional data from $M$ views. For all our experiments, we generated data with $N = 100{,}000$, $D = 1024$, $K = 10$, $M = 3$, $\alpha = 0.7$ and spatial noise correlation $\gamma = 0.5$.

Supervised learning baseline

In order to obtain an empirical upper bound on the performance of estimating the correlated components, we train $M$ neural networks where the input is $[X^1, \ldots, X^M]$, supervised with the corresponding signal $[s^1, \ldots, s^M]$. In our experiments, with $M = 3$, we use an identical architecture for the three networks. Each network has an input layer of 1024 nodes, followed by 512 nodes and 10 nodes.
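A minimal tf.keras sketch of such a supervised baseline is given below. The layer sizes and optimizer settings follow the description in this section; the mean-squared-error reconstruction loss and the exact branch wiring are assumptions made for illustration and may differ from the code used in the experiments.

import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

def supervised_baseline(n_views=3, in_dim=1024, out_dim=10, act="tanh"):
    # One branch per view; each branch regresses the K-dim source signal.
    reg = regularizers.l2(1e-5)
    inputs, outputs = [], []
    for _ in range(n_views):
        x_in = layers.Input(shape=(in_dim,))
        h = layers.Dense(512, activation=act, kernel_regularizer=reg)(x_in)
        if act == "tanh":
            h = layers.Dropout(0.2)(h)  # dropout only for the tanh variant
        outputs.append(layers.Dense(out_dim, kernel_regularizer=reg)(h))
        inputs.append(x_in)
    model = Model(inputs, outputs)
    opt = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, nesterov=True)
    model.compile(optimizer=opt, loss="mse")  # assumed reconstruction loss
    return model

Each of the three branches is trained to reconstruct the shared 10-dimensional source signal from its own noisy view.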
We denote this model as Supervised-512-10. We also tested linear and tanh activation functions for the baselines with varying mini-batch sizes. A dropout of 0.2 was included for the networks with tanh activation to prevent overfitting. All models were trained using SGD with Nesterov momentum, with 10% of the data for testing and 20% for validation. The learning rate and momentum were fixed at 1e-3 and 0.9, respectively. The weights of all layers were l2-regularized with a parameter of 1e-5.

Deep multiset CCA

The network for the dMCCA model trained on the synthetic data is similar to the architecture described in Sec. 3.4.1, with two modifications: 1) there was no supervision; instead, the mean ISC between the views was maximized as described in Alg. 1, and 2) the number of dimensions of the output embedding, i.e., the number of correlated components, was varied as d = [5, 10, 15, 20, 40, 50, 64, 128]. Consistent with our notation for the baseline model, we denote this model as dMCCA-512-d. The number of dimensions of the resulting embedding was varied to examine the affinity of the representations with the source signal. This is important, since in real-world applications the number of correlated components is not known a priori. In our preliminary experiments, we noted that RMSProp, instead of SGD with momentum, yielded stable results in terms of the loss at each epoch for the train and validation data. For all models, the learning rate and decay rate of RMSProp were set to 1e-3 and 0.9, respectively. Early stopping criteria were employed for both the supervised and dMCCA models (i.e., stop training when the validation loss improves by less than 1e-6 for at least 5 consecutive epochs). All experiments were repeated ten times with different train, testing and validation (val) partitions of the synthetic data.

3.4.2 Noisy digit classification experiments

We use the noisy-MNIST (n-MNIST) and noisy-Bangla (n-Bangla) datasets released in (Basu et al., 2017) to demonstrate that the dMCCA algorithm can be used to learn the correlated signal components using observations from different views. The n-MNIST and n-Bangla datasets were created from the handwritten digit datasets MNIST (LeCun, 1998) and Bangla digits (Basu et al., 2017), respectively. For each sample, multiple views were synthetically generated by adding 1) additive white Gaussian noise (AWGN), 2) motion blur, and 3) reduced contrast + AWGN. The noise parameters were different for the n-MNIST and n-Bangla datasets. In all experiments, we used 50000 samples for training, 10000 for validation and 10000 for testing. Both datasets have ten digit classes and are nearly class balanced. We used two datasets in order to test the cross-dataset generalizability of the proposed model.

dMCCA for n-MNIST

The network architecture for dMCCA is similar to the one described in Sec. 3.4.1, with two modifications: 1) the d-dimensional embedding layers are not concatenated; instead, the mean ISC is optimized as described in Algorithm 1, and 2) after convergence, as determined by the early stopping criteria, we concatenate the multimodal representations and use them as features for a linear SVM for 10-class classification. We use the same train-val-test partitions as above, where we only use the train set to train the dMCCA network, using the val set to determine early stopping. The trained model is then applied to the test partition for SVM classification. The penalty parameter of the SVM was tuned on the val set, and the accuracy results from 10-fold cross-validation on the test set are reported.
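A scikit-learn sketch of this evaluation step is shown below; the array names are hypothetical placeholders for the per-view embeddings extracted from the converged dMCCA network.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score

def evaluate_embeddings(H_val, y_val, H_test, y_test, n_folds=10):
    # H_val, H_test: lists of (n_samples, d) per-view embedding arrays.
    Z_val = np.concatenate(H_val, axis=1)
    Z_test = np.concatenate(H_test, axis=1)
    # Tune the SVM penalty parameter on the validation embeddings.
    grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
    grid.fit(Z_val, y_val)
    # Report n-fold cross-validation accuracy on the test partition.
    svm = LinearSVC(C=grid.best_params_["C"])
    scores = cross_val_score(svm, Z_test, y_test, cv=n_folds)
    return scores.mean(), scores.std()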
The experiments are repeated by varying the mini-batch size and the embedding dimension d. We also assessed linear separability using t-SNE to visualize the embeddings qualitatively, and the normalized mutual information (NMI) score from a k-means clustering algorithm.

Baseline experiments

We set up five different baseline experiments to compare the performance of dMCCA representations on downstream tasks. For all experiments, we report the 10-fold CV results on the test set, after tuning the system (SVM or DNN) parameters on the val set. We also varied the feature or embedding size K and the mini-batch sizes (see Fig. 3.5.2 for ranges) where applicable. We used permutation testing to test for differences in performance between the models.

1. Supervised baseline: Similar to the baseline experiment described in Sec. 3.4.1, we use a DNN architecture to capture the shared information of the three views for the task of classifying the digits into ten classes. This is the most competitive baseline for our dMCCA experiments because it is supervised and trained in an end-to-end fashion. The network has three branches, one for each view. Each branch has an input layer of 784 nodes and an intermediate layer of 1024 nodes, followed by a d-dimensional layer, all with a tanh activation function (on this architecture, a model with a linear activation function does not learn to classify). The three d-dimensional layers are concatenated as common input to the final classification layer of 10 nodes with softmax activation. The concatenation at the penultimate layer ensures that the representation is shared across the three views. The model was optimized with categorical cross-entropy using the early stopping criteria described in Sec. 3.4.1. Arguably this may not be the best architecture for classifying the n-MNIST dataset, but our objective here was to obtain an upper bound in performance using a supervised learning approach.

2. PCA: We compare with PCA because both PCA and CCA methods perform dimensionality reduction. We concatenated the 784-dim vectors of the three views of n-MNIST and performed PCA on the test set by estimating the covariance on the train set, varying the number of principal components K.

3. DCCA: We chose two of the three views randomly and applied DCCA (Andrew et al., 2013) to obtain embeddings that are then concatenated for input to the SVM.

4. DGCCA: We used the publicly released DGCCA code (Benton et al., 2017) to learn embeddings, using the same network architecture as our dMCCA model (Sec. 3.3) for different embedding dimensions. The embeddings were concatenated to obtain the final features.

5. MCCA: The projection vector space $V \in \mathbb{R}^{D \times K}$ was estimated on the training partition of the data. The correlated components were estimated as $Y^l = X^l V$, where the test set $X^l \in \mathbb{R}^{T \times D}$ has $T$ samples and $D = 784$ dimensions. The components are concatenated to obtain features.

3.4.3 Performance Evaluation

The benefit of using synthetic data is that we can examine what the network learns when the generative process is known. The affinity measures we use enable us to compare the similarity of the embedding subspaces to that of the source signal. The objective of our simulations is to measure whether the correlated signal components are correctly identified from the measurements. Because components with equal ISC can be produced by an arbitrary linear combination of the vectors in the corresponding subspace, we examined the normalized affinity measure between two subspaces as defined in (Soltanolkotabi, Elhamifar, Candes, et al., 2014) to compare the representations with the source signal.
Let $\hat{X}_s^l \in \mathbb{R}^{T \times K'}$ be the reconstructed signal or the representation learnt by optimizing our objective, corresponding to the source signal $X_s^l \in \mathbb{R}^{T \times K}$. The affinity between $\hat{X}$ and $X$ can be estimated using the principal angles $\theta$ as:

$a(X, \hat{X}) = \sqrt{\dfrac{\cos^2\theta_1 + \ldots + \cos^2\theta_{K \wedge K'}}{K \wedge K'}}$    (3.4.1)

The cosines of the principal angles are the singular values of the matrix $U^\top V$, where $U$ and $V$ are the orthonormal bases for $X$ and $\hat{X}$, respectively. The affinity is a measure of correlation between subspaces and has been extensively used to compare the distance between subspaces in the subspace clustering literature (Soltanolkotabi et al., 2014). This measure is low when the principal angles are nearly orthogonal, and has a maximum value equal to one when one of the subspaces is contained in the other. We estimate two affinity measures: 1) reconstruction affinity $R_a$: the average affinity between the reconstructed signal and the source signal across the $M$ views, and 2) inter-set affinity $R_s$: the average affinity between the different views of the reconstructed signal. Formally,

$R_a = \dfrac{1}{M} \sum_{l=1}^{M} a(X_s^l, \hat{X}_s^l)$    (3.4.2)

$R_s = \dfrac{2}{M(M-1)} \sum_{l=1}^{M} \sum_{k=1,\, k \neq l}^{M} a(\hat{X}_s^l, \hat{X}_s^k)$    (3.4.3)

One of the benefits of using this affinity measure is that it allows us to compare two subspaces of different dimensions. For all experiments with synthetic data, we estimate $R_a$ and $R_s$ and report the performance across different repetitions with varying embedding dimensions and mini-batch size.

3.5 Results

Figure 3.5.1: Affinity measures for synthetic data. The number of correlated components in the generated data is 10 (boxed).

Synthetic data results: We first analyzed the performance of the dMCCA algorithm by varying the embedding dimension and mini-batch size. Figure 3.5.1 shows the reconstruction affinity measure ($R_a$) and the inter-set affinity measure ($R_s$) for these parameters. Notice that the maximum $R_a$ is achieved for an embedding dimension of 10 (which is the number of correlated components used to generate the data), indicating that dMCCA retains some notion of the ambient dimension for maximizing the correlation between views. The $R_s$ measure consistently decreased with increasing embedding dimension. Because we estimate the covariances in the loss function and use SGD with mini-batches for optimization, we also examine the performance with varying batch sizes. As shown in Fig. 3.5.1, a mini-batch size greater than 400 gives consistent results. Additionally, these measures were comparable when we used a tanh activation function for the layers. Next, we compared the performance of our system with an empirical upper bound obtained by training a DNN to reconstruct the source signals from the input observations. As shown in Table 3.1, and as expected, the $R_a$ measure for our system is lower than for the supervised system. However, the affinity between the views of the embeddings from dMCCA is higher than that of the supervised system. This is perhaps the benefit of modelling the between-set covariance over just minimizing the reconstruction error that is common to many deep representation learning methods, as well as DGCCA.

Table 3.1: Comparison of the affinity measures for the dMCCA system with the supervised system.
Method              Activation   R_a            R_s
Supervised-512-10   linear       0.85 ± 0.02    0.60 ± 0.03
                    tanh         0.84 ± 0.01    0.45 ± 0.01
dMCCA-512-10        linear       0.73 ± 0.02    0.82 ± 0.01
                    tanh         0.76 ± 0.01    0.78 ± 0.03
Random baseline     --           0.05 ± 0.01    0.07 ± 0.03

n-MNIST results: We evaluated our model on classifying the n-MNIST dataset to assess the use of these embeddings for the classification task.
Note that classifying noisy MNIST data is significantly more difficult than the original MNIST database (Basu et al., 2017). As described in the previous section, we vary the dimension of the embedding of each view and evaluate the performance. Because the ten classes in the test set were very nearly balanced, we only report the accuracy averaged over the 10 folds (solid line in Fig. 3.5.2) with varying batch size. The standard deviation of the accuracy is reflected by the error bars in Fig. 3.5.2. While a supervised DNN trained end-to-end performed slightly better than our method, the accuracy was not statistically different. PCA on the raw features yielded a best accuracy of about 87% (red in Fig. 3.5.2), which was significantly greater than that of DCCA, which only looked at two views. CCA-based methods that used all three views outperformed the DCCA method, suggesting that multiple views can be leveraged effectively to improve the performance on classification tasks. Finally, our model outperformed the linear features from MCCA and other CCA methods, suggesting the benefit of using deep learning to model the nonlinear relationships.

Figure 3.5.2: Performance comparison of dMCCA with baseline methods.

A key factor here is the embedding dimension K. In real-world data, this number is not known, but in practice it can be tuned on a validation set. Increasing the dimension size further did not improve the classification performance. We also performed clustering on the embeddings to assess linear separability (NMI = 0.72, completeness = 0.67). This suggests that the dMCCA method can also be used for unsupervised learning.

3.6 Discussion

With the advent of deep learning, we can use multiple parallel measurements to model the underlying signal. However, it is unclear how these models may scale with an increasing number of parallel observations. In this context, CCA-like approaches in a DNN framework offer an appealing alternative to model the complex relationships between views. Unlike prior methods like GCCA, our method looks at both the between- and within-view covariance, which results in a reliable representation of the underlying signal. This ratio of variances is also referred to as the intraclass correlation coefficient (Bartko, 1966) and is a popular measure of reliability in the test-retest literature. Hence, multiple views can be viewed as analogous to repeated measurements of the underlying signal. For classifying noisy MNIST data, we show that our model not only outperforms other CCA-based approaches but competes well with supervised learning models. For future work, we plan to incorporate dMCCA into an autoencoder framework, and to explore the interpretability of features using MCCA.

Figure 3.5.3: Performance evaluation of dMCCA on handwritten digits with three views: (A) noisy-MNIST digits (AWGN: additive white Gaussian noise); (B) noisy-Bangla digits; (C) comparison of dMCCA with the supervised empirical upper bound and other CCA-based methods for classifying n-MNIST; (D) comparison of dMCCA with the supervised upper bound for classifying noisy-Bangla digits; (E) cross-dataset generalizability: train on n-MNIST, test on n-Bangla. In all cases, the performance of dMCCA representations for classification is comparable to the supervised end-to-end method.
Chapter 4
Generalized Multiview Correlation

In the previous chapter (Chapter 3), we presented deep multiset canonical correlation analysis (dMCCA) to model more than two views in parallel. However, two questions remain to be addressed: (1) does this method scale to hundreds of views? and (2) how does this method work for real-world data? To answer these questions, we propose a general version of dMCCA, called generalized multiview correlation, and explore its applicability on a variety of audio and visual tasks in different multiview settings. We also explore the theoretical and practical properties of the generalized multiview correlation. (The work presented in this chapter has been submitted to IEEE Transactions on Signal Processing, 2021, as "Generalized Multi-view Shared Subspace Learning using View Bootstrapping." A pre-print version has been uploaded to arXiv at https://arxiv.org/pdf/2005.06038.pdf.)

4.1 Introduction

In many technology applications, we often rely on data from multiple views of an object/event in order to learn its representation in a reliable and comprehensive manner. This class of machine learning problems is referred to as multiview learning. A distinguishing feature of this paradigm is that the different views of an object/event share an association or a correspondence that can be exploited to build more informed models to represent the underlying phenomenon (Xu, Tao, & Xu, 2013). We define a view as data sampled by observing an object/event at different states or with different instruments to capture its various presentations. For example, a chair photographed at different angles (see Fig. 4.1.1) or a sound event like laughing captured in different acoustic environments.

Figure 4.1.1: Multiview data includes observations acquired by recording an underlying event or object in its various presentations. For example, images acquired from different angles can be used as multiview data to learn a holistic model of the observed object (vis-www.cs.umass.edu/mvcnn).

The objective of multiview learning is to learn vector representations or embeddings that are discriminative of the underlying phenomena by explicitly factoring in or disentangling out the shared correspondence between many views. These embeddings can provide robust features for subsequent downstream learning; for example, supervised tasks such as text-to-image retrieval (Dorfer, Schlüter, Vall, Korzeniowski, & Widmer, 2018) and bilingual word embeddings for machine translation (W. Wang et al., 2015). They can also be used in an unsupervised fashion to uncover the inherent structure in multiview data; for example, learning common components from brain signals such as EEG (Parra et al., 2018) across individuals. Here, data from each person engaged in the same activity represents a different view of the underlying cognitive process. Another application is mapping individual differences in cortical brain structures from functional MRI acquired while a participant is resting or performing a task (Sellami et al., 2020). Multiview learning solutions have explored various ways to model the correspondence between views to leverage the knowledge across them. In a taxonomy proposed in a recent survey by Li et al. (Y. Li et al., 2018), three broad categories were identified: subspace alignment, generative modeling and fusion-based methods. The present work can be considered a subspace alignment method, which deals with learning projections between two or more views to estimate a subspace that maximizes their similarity.
Most of the methods in this category learn multiview representations by estimating at least one distinct projection per view; consequently, the model complexity scales at least linearly with the number of views. Many methods in this class often assume that the view information (for example, a face pose angle of 45°) is available for the samples in the training dataset, and sometimes for the probe samples during inference, thus requiring us to also collect information about how the different views were acquired. Considering the sheer scale of multiview datasets, both with respect to the data size and the number of views per event, two critical questions remain: how can we model hundreds of views of an event? and can we learn the multiview representations effectively in a view-agnostic fashion, i.e., when the view acquisition information is absent? In the previous chapter (Chapter 3), we proposed a multiview correlation objective called deep multiset canonical correlation analysis (dMCCA) to learn a shared subspace across multiple views in a dataset. Therein, data from different views are transformed using identical neural networks to obtain view-invariant embeddings, which were shown to be discriminative of the underlying event. In this chapter, we advance this framework along three directions:

1. We present a theoretical analysis of our method, called view bootstrapping, applied during training, which allows us to incorporate a large number of views within the multiview correlation objective. We derive an upper bound for the error of the bootstrapped objective that relates the number of views needed at each training iteration to the embedding dimension. This result is significant because it allows us to determine the overall network size required to learn an embedding of a certain dimension.

2. We conduct empirical analyses to understand why the embeddings learnt using this method are discriminative of the underlying phenomena, using the case study of identifying sound events from multi-channel audio.

3. We conduct experiments in the audio and visual domains to benchmark the performance of view bootstrapping for downstream learning tasks; specifically, to assess whether our framework generalizes to views and objects not seen during training, and in the presence of a variable number of views per event. We highlight its applicability for modeling a large number of views, across diverse settings, particularly in a view-agnostic fashion.

There are several practical benefits of the proposed framework. During training, we only need some basis to assume that the sample of views considered at each iteration has an implicit correspondence. Furthermore, it is view agnostic, i.e., we do not need to know how the various views were obtained, only that they generally observe the same set of events. Finally, we do not need to know the total number of semantic classes/categories of the events being observed. Thus, our framework can be used to learn discriminative embeddings by extracting the shared information between the views without the need for labels. This allows our framework to also be applied in a self-supervised fashion (Banville et al., 2019). To illustrate these benefits, consider the task of detecting audio events in a conference room with multiple microphones (mics) distributed across the space. The recordings from different mics serve as multiple views observing the events occurring in the room in the audio domain.
This is a "natural" example of our multiview paradigm, as a correspondence can be constructed using the timestamps of the recordings from each mic. We do not need to know the relative location of the mics with respect to the room or the type of the sensors used. The multiview embeddings of such data can then be used to detect audio events in a data-driven fashion; for example, discriminating between talking, laughing, silence and applause. The rest of the chapter is organized as follows. In Section 4.2, we discuss the relevant background for the multiview learning challenges addressed in this work. In Section 4.3, we review the multiview correlation objective we proposed originally in (Somandepalli, Kumar, Jati, et al., 2019) and discuss the view bootstrapping approach, followed by a theoretical analysis in Section 4.4. In Section 4.5, we present an empirical analysis to understand how the proposed objective evolves during training, as well as benchmark our framework for two audio tasks and two vision tasks posed in the multiview setup. Finally, we discuss some future work and conclusions of the proposed framework in Section 4.6.

Figure 4.2.1: Schematic of view bootstrapping for multiview shared subspace learning. Inset: example sub-network architecture. The central idea of our proposal is that multiview representations from a large number of views M can be modeled by randomly subsampling a small number of views m at each training iteration. Data from each of the subsampled views is used to train m identical networks to maximize the multiview correlation objective. After network convergence, the shared subspace representations are captured by the last-layer embeddings.

4.2 Related Work

In this section, we review relevant existing work to contextualize the two research questions central to this chapter: how do multiview learning models scale with an increasing number of views, and can they operate in a view-agnostic setting? We also discuss how multiview learning can be seen as analogous to domain adaptation and highlight the differences between multiview and multimodal problem formulations.

4.2.1 Subspace alignment for more than two views

As mentioned in Sec. 4.1, our proposed framework can be considered a subspace alignment method using the taxonomy presented in (Y. Li et al., 2018). Subspace alignment methods can be further classified into correlation-based and metric-learning methods (S. Li et al., 2016). In correlation-based methods, the objective is to learn a projection space between two or more views that aims to maximize a measure of correlation between them. One of the earliest works within this class of methods is canonical correlation analysis (CCA (Hotelling, 1936)). It finds a linear transformation between two sets of variables such that their projections are maximally correlated. By definition, CCA is constrained to two sets of variables. Later, five different variants were proposed in (Kettenring, 1971) to extend CCA for multiple variables by considering the sum of pairwise correlations. Subsequently, a kernel version (S.-Y.
Huang et al., 2009) and deep-learning-based methods (Andrew et al., 2013; Dumpala et al., 2018) were also proposed to estimate non-linear projections to learn maximally correlated subspaces. Metric learning in this domain involves learning a measurable subspace that maximizes the similarity between multiple views for a given object/event. Several metric-learning-based methods have been proposed to extend CCA to multiple views by learning view-specific or view-invariant transformations of the input data. Two prominent examples include generalized CCA (GCCA, (Horst, 1961)) and multiview CCA (Chaudhuri, Kakade, Livescu, & Sridharan, 2009). GCCA incorporates data from more than two views by learning an orthonormal transformation that is "close to" (in terms of the angle between the subspaces) all the individual views. The transformation is additionally constrained by a view-specific matrix. A deep learning extension to GCCA was proposed in (Benton et al., 2017) to learn nonlinear transformations of high-dimensional datasets. Another metric-learning approach, called multiview discriminative bilinear projection (MDBP (S. Li et al., 2016)), was proposed to learn shared subspaces between multiview time series. A few prominent applications of these methods include learning correlated components from brain signals across individuals (Parra et al., 2018) and phoneme recognition using speech features and articulatory information (Andrew et al., 2013). The central premise of this class of methods is that a shared subspace can be learnt because of the correspondence between the multiple views in the data, and that such a subspace representation is inherently discriminative (Livescu & Stoehr, 2009). Many of these methods, in their linear version, have a global optimum which can be solved using singular value decomposition (Benton et al., 2017) or eigenvalue decomposition (Andrew et al., 2013; S. Li et al., 2016; Livescu & Stoehr, 2009). Local optima often arise because of the assumption of a graph Laplacian kernel (e.g., multiview clustering, (Kumar et al., 2011)) or through the use of a deep learning framework (Benton et al., 2017). In most of these methods, the number of learnable parameters scales at least linearly with the number of views (Kumar et al., 2011). Thus, these methods do not scale well when working with a large number (> 10) of views. Subspace alignment methods were also extended to a supervised setting by learning a discriminative multiview subspace, treating the label space as an additional view (Benton et al., 2017). Prominent examples of this idea include generalized multiview analysis (GMA, (Sharma et al., 2012)), partial least squares regression based methods (Cai, Wang, Xiao, Chen, & Zhou, 2013) and multiview discriminant analysis (MvDA, (Kan, Shan, Zhang, Lao, & Chen, 2015)). They were effectively used for applications such as image captioning and pose-invariant face recognition, where labeled data is available for training. However, the generalizability of these methods to hundreds of views remains to be explored.

4.2.2 View-agnostic multiview learning

The subspace methods discussed thus far assume that the view information is readily available during training and testing. For instance, GMA and MvDA estimate a within-class scatter matrix specific to each view. In practice, view information may not be available for the probe data (e.g., the pose angle of a face during testing), making it difficult to choose the view-specific transformation to apply during inference.
A promising direction to address this problem was proposed in (Ding & Fu, 2014, 2017). Here, the need for view information of the probe sample was eliminated by estimating a low-rank subspace representation that can bridge the view-specific and view-invariant representations. Another recent study examined 3D shape reconstruction (Sridhar, Rempe, Valentin, Sofien, & Guibas, 2019) using multiview 2D images, invariant to the order in which the 2D images are acquired. A permutation layer was used to promote information exchange between the variable views, thereby ensuring an order-invariant reconstruction. While both methods present interesting ideas to build view-agnostic frameworks, they still need a single transformation per view. As such, they scale linearly with the number of views and fail to incorporate information from a large number of views, particularly when the size of the training data is limited.

4.2.3 Multiview paradigm for domain adaptation

A recent survey (Ding et al., 2018) presented a unified learning framework mapping out the similarities between multiview learning and domain adaptation. Typical domain adaptation methods seek domain-invariant representations, which are analogous to view-invariant representations if we treat different domains as views. The benefit of the multiview paradigm in this context is that the variabilities associated with multiple views can be "washed out" to represent the shared information across the views and discriminate the underlying classes. This analogy is particularly useful in the domain of speech/audio processing for applications such as wake-word recognition (Këpuska & Klein, 2009). Here, a conversational assistant needs to recognize a keyword (e.g., "Alexa", "OK Google", "Siri") no matter who says it or where it is said (i.e., the background acoustic conditions). Methods based on joint factor analysis (Dehak et al., 2009) and total variability modeling (Dehak, Kenny, et al., 2011) were used to obtain speaker-dependent and speaker-independent factors to build robust models in such domain adaptation settings. In our recent work, we showed that a more robust speech representation can be obtained by explicitly modeling different utterances of a word as multiple views (Somandepalli, Kumar, Jati, et al., 2019).

4.3 Problem Formulation and Optimization

We first review the multiview correlation (MV-CORR) objective we developed in (Somandepalli, Kumar, Jati, et al., 2019). Next, we consider practical aspects of using this objective in a deep learning framework, followed by view bootstrapping. Then, we develop a theoretical analysis of the error of the bootstrapped MV-CORR objective.

4.3.1 Multiview correlation (MV-CORR)

Consider $N$ samples of $d$-dimensional features sampled by observing an object/event from $M$ different views. Let $X^l \in \mathbb{R}^{d \times N}$, $l = 1, \ldots, M$, be the data matrix for the $l$-th view, with columns as mean-zero features. We can use the same feature dimension $d$ across all views because we assume that the multiple views are sampled from identical distributions (see Sec. 2.1). We describe the MV-CORR objective in the context of CCA. The premise of applying CCA-like approaches to multiview learning is that the inherent variability associated with a semantic class is uncorrelated across the multiple views, so that what is captured is the signal shared across the views. For $M = 2$, CCA finds projections of the same dimension, $v_1$ and $v_2$, in the direction that maximizes the correlation between them.
Formally,

$(v_1, v_2) = \arg\max_{v_1, v_2 \in \mathbb{R}^d} \dfrac{v_1^\top \Sigma_{12} v_2}{\sqrt{v_1^\top \Sigma_{11} v_1 \; v_2^\top \Sigma_{22} v_2}}$    (4.3.1)

where $\Sigma_{12}$ is the cross-covariance and $\Sigma_{11}, \Sigma_{22}$ are the covariance terms for the two views. To extend the CCA formulation to more than two views, we consider the sum of all pairwise covariance terms. That is, find an (orthonormal) projection matrix $W \in \mathbb{R}^{d \times k}$ with $k \leq d$ that maximizes the ratio of the sum of between-view over within-view covariances:

$W = \arg\max_{W} \dfrac{W^\top \left( X^1 X^{2\top} + \ldots + X^{M-1} X^{M\top} \right) W}{W^\top \left( X^1 X^{1\top} + \ldots + X^M X^{M\top} \right) W}$    (4.3.2)

We refer to the numerator and denominator covariance sums in Eq. (4.3.2) as the between-view covariance $R_b$ and the within-view covariance $R_w$, which are sums of $M(M-1)$ and $M$ covariance terms, respectively. The scaling factor $1/\big(M(N-1)\big)$ common to the covariance terms is omitted from the ratio. We can estimate the covariance as a cross product without loss of generality because we assume the feature columns in $X^l$ to be mean-zero. We now define a multiview correlation matrix as the normalized ratio of the between- and within-view covariance matrices (Somandepalli, Kumar, Jati, et al., 2019):

$\boldsymbol{\rho} = \max_{W} \dfrac{1}{M-1} \dfrac{W^\top R_b W}{W^\top R_w W} \quad \text{s.t.} \quad W^\top W = I$    (4.3.3)

A version of this ratio of covariances has been considered in several related multiview learning methods. One of the earliest works by (Hotelling, 1992) presented a similar formulation for scalars, also referred to as multiset CCA in a recent work (Parra et al., 2018). For a different application, a version of this ratio known as the intraclass correlation coefficient (Bartko, 1966) has been extensively used to quantify test-retest repeatability (e.g., studying fMRI measures (Somandepalli et al., 2015)). Notice that this ratio is similar to the use of between-class and within-class scatter matrices in linear discriminant analysis (LDA, (Fisher, 1936)) and, more recently, in multiview methods such as GMA and MvDA. The primary difference of our formulation from subspace alignment methods such as CCA and LDA is that MV-CORR does not consider the class information explicitly while estimating the covariance matrices. We only assume that samples from different views are correlated, i.e., they share a correspondence because they observe the same underlying event. Additionally, we consider the sum of covariances for all pairs of views, eliminating the need for a view-specific transformation (used in methods such as GCCA), which enables us to learn the shared subspace $W$ in a view-agnostic manner. On the downside, we can only represent the shared information common to multiple views and discard view-specific information, which may be of interest for some multiview applications, for example, predicting face pose angles.

4.3.2 Implementation and practical considerations

Using ideas similar to the deep variants of CCA (Andrew et al., 2013) and LDA (Dorfer et al., 2015), we can use deep neural networks (DNN) to learn nonlinear transformations of the multiview data to obtain (possibly) low-dimensional representations. In Eq. (4.3.3), the solution $W$ jointly diagonalizes the two covariances $R_b$ and $R_w$ because $W$ is their common eigenspace. Thus, we use the trace (Tr) form of Eq. (4.3.3) to fashion a loss function $\rho_M$ for batch optimization of data from $M$ views:

$\rho_M = \max_{W} \dfrac{1}{d(M-1)} \dfrac{\mathrm{Tr}\left(W^\top R_b W\right)}{\mathrm{Tr}\left(W^\top R_w W\right)} \quad \text{s.t.} \quad W^\top W = I$    (4.3.4)

The DNN framework for MV-CORR consists of one network per view $l$, referred to as a sub-network and denoted by $f^l$, $l = 1, \ldots, M$.
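For concreteness, a sub-network of the kind shown in the inset of Fig. 4.2.1 (3x3 convolutions with batch normalization, 2x2 max-pooling, filter sizes 16, 16, 32, 32, 64, 64, global average pooling and a d-dimensional embedding layer) could be written in tf.keras roughly as follows. The exact arrangement of the pooling layers and the unit-norm constraint on the embedding (used in the experiments of Sec. 4.5) are assumptions of this sketch, not a definitive implementation.

import tensorflow as tf
from tensorflow.keras import layers, Model

def sub_network(input_shape, d=64):
    # One sub-network f_l mapping a single view to a d-dimensional embedding.
    x_in = layers.Input(shape=input_shape)
    h = x_in
    for i, n_filters in enumerate([16, 16, 32, 32, 64, 64]):
        h = layers.Conv2D(n_filters, (3, 3), padding="same", activation="sigmoid")(h)
        h = layers.BatchNormalization()(h)
        if i % 2 == 1:
            h = layers.MaxPooling2D((2, 2))(h)  # pooling placement is assumed
    h = layers.GlobalAveragePooling2D()(h)
    emb = layers.Dense(d)(h)
    # Constrain the embedding to unit l2 norm.
    emb = layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=-1))(emb)
    return Model(x_in, emb)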
The architecture of the sub-network is the same for the multiple views, as the nature of the input data is the same across views (e.g., speech from different mics in a room). Because we do not assume to have the relation between the views, we model the data from each view separately to learn a shared subspace across them. As such, the weights of any layer are not shared across the sub-networks. The output from the top-most layer of each sub-network is passed to a fully connected layer of $d$ neurons, which is the embedding layer (see the green blocks in Fig. 4.2.1). Let $H^l = f^l(X^l) \in \mathbb{R}^{d \times N}$ be the activations from the embedding layer, where $N$ is now the batch size. For optimization, in each batch we estimate the between- and within-view covariances $R_b$ and $R_w$ using $H^l$, $l = 1, \ldots, M$, and compute the loss in Eq. (4.3.4) by estimating $W$ at each iteration. The transformation $W$ is obtained by solving the generalized eigenvalue (GEV) problem using Cholesky decomposition. Below, we discuss a few other considerations for implementing this objective in a deep learning framework.

Total view covariance: For a large number of views $M$, estimating $R_b$ in each batch is expensive, as it is $O(M^2)$. We instead compute a total-view covariance term $R_t$, which only involves estimating a single covariance for the sum of all views and is $O(M)$, and then estimate $R_b = R_t - R_w$. See Appendix A for the complete proof.

$R_t = R_b + R_w = \dfrac{1}{M} \left( \sum_{l=1}^{M} X^l \right) \left( \sum_{l=1}^{M} X^l \right)^\top$    (4.3.5)

Table 4.1: Summary of datasets used for benchmarking our proposed method.
Dataset          Task                                                       No. classes    No. views       View distribution   Subsampling?
DCASE (2018)     Multi-channel audio activity classification (Sec. 4.5.1)   9 activities   7 mics          uniform             no
ModelNet (2015)  View-invariant object classification (Sec. 4.5.2)          40 objects     12, 80 angles   uniform             yes
MultiPIE (2010)  Pose-invariant face recognition (Sec. 4.5.3)               129 subjects   15 poses        uniform             yes

Choosing batch size: A sample size of $O(d \log d)$ is sufficient to approximate the sample covariance matrix of a general distribution in $\mathbb{R}^d$ (Vershynin, 2010). Thus we choose a batch size of $N = \lceil d \log d \rceil$ for a network with $d$-dimensional embeddings. In practice, choosing $N \geq 2d$ was sufficient in general to achieve a healthy model convergence.

Regularize $R_w$: Maximizing $\rho_M$ (Eq. (4.3.4)) corresponds to maximizing the mean of the eigenvalues of $R_w^{-1} R_b$. Estimating $R_w$ with rank-deficient $H^l$ may lead to spuriously high $\rho$. One solution is to truncate the eigenspace $W$. However, this would reduce the number of directions of separability in the data. To retain the full dimensionality of the covariance matrix, we use "shrinkage" regularization (Ledoit & Wolf, 2004) for $R_w$ with a parameter $\alpha = 0.2$ and normalized trace parameter $\nu = \mathrm{Tr}(R_w)$ as $\tilde{R}_w = (1-\alpha) R_w + \alpha \nu I_d / d$. The choice of $\alpha$ is consistent with the experiments in our previous work (Somandepalli, Kumar, Jati, et al., 2019).

Loss function is bounded: The objective $\rho_M$ is the average of the $d$ eigenvalues obtained by solving the GEV problem. We can analytically show that this objective is bounded above by 1. Thus, during training, we minimize the loss $1 - \rho_M$ to avoid trivial solutions. See Appendix B for the complete proof.

Inference: Maximizing $\rho_M$ leads to maximally correlated embeddings. Thus, during inference we only need to extract embeddings from one of the sub-networks. The proposed loss ensures that the different embeddings are maximally correlated. We conducted simulation experiments to validate this assumption (see the simulation experiments in Chapter 3, Section 3.4.3).
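To illustrate the computation described above, the following NumPy/SciPy sketch evaluates the MV-CORR loss for one batch of embedding activations: it estimates the between- and within-view covariances, applies the shrinkage regularization, solves the generalized eigenvalue problem, and returns 1 − ρ_M. It is an offline illustration only; in training, the same computation is carried out on the sub-network activations inside TensorFlow.

import numpy as np
from scipy.linalg import eigh

def mv_corr_loss(H, alpha=0.2):
    # H: (M, d, N) activations from the embedding layers of M sub-networks
    # for a batch of N samples; features are mean-centered per view.
    M, d, N = H.shape
    H = H - H.mean(axis=2, keepdims=True)
    R_w = sum(H[l] @ H[l].T for l in range(M))
    # Between-view covariance: sum over all pairs of distinct views.
    # (For a large M this can instead be obtained via the total-view covariance, Eq. 4.3.5.)
    R_b = sum(H[l] @ H[k].T for l in range(M) for k in range(M) if k != l)
    # Shrinkage regularization of R_w (Ledoit & Wolf).
    nu = np.trace(R_w)
    R_w = (1 - alpha) * R_w + alpha * nu * np.eye(d) / d
    # Generalized eigenvalue problem R_b W = R_w W Lambda.
    eigvals, _ = eigh(R_b, R_w)
    rho = eigvals.mean() / (M - 1)  # average inter-set correlation
    return 1.0 - rho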
4.4 View Bootstrapping: A Theoretical Analysis

In the MV-CORR model described thus far, incorporating a large number of views would need one sub-network per view, which is not practical for scaling up to hundreds of views. To address this issue, we propose view bootstrapping. The schematic of the overall method is shown in Figure 4.2.1. Here, we construct a network with $m$ sub-networks and sample, with replacement, a small number of views $m \ll M$ to model data with $M$ views. View-agnostic training can be ensured by not keeping track of which views were sampled for a specific sub-network. Formally, the bootstrapped objective is the expectation of the MV-CORR objective over the subsampled views:

$\mathbb{E}_{m \sim U(1, M)} \left[ \rho_m \right]$    (4.4.1)

The intuition behind our view bootstrapping lies in the law of large numbers applied to the covariance matrices in Eq. (4.3.4). Let $\Sigma_b$ and $\Sigma_w$ denote the true between- and within-view covariance matrices, which are estimated by $R_b$ and $R_w$, respectively, for $M$ views as in the numerator and denominator of Eq. (4.3.2). Let $R_b^{(m)}$ and $R_w^{(m)}$ denote the between- and within-view covariances estimated from a subset of $m$ views. Asymptotically, with a large $M$ and as $m \to M$, using the law of large numbers we have $\mathbb{E}_m \big[ R_b^{(m)} \big] \to \Sigma_b$ and $\mathbb{E}_m \big[ R_w^{(m)} \big] \to \Sigma_w$. In practice, the number of available view samples is finite and the total number of views possible is often unknown. Thus, we are interested in understanding how the covariances behave in a non-asymptotic setting. Toward this end, we analyze the estimate $\rho_m$ with respect to the optimum $\rho = d^{-1}\, \mathrm{Tr}\left(W^\top \Sigma_b W\right) / \mathrm{Tr}\left(W^\top \Sigma_w W\right)$, where we assume the total number of views used to estimate the $\Sigma$'s is unknown.

Theorem 4.4.1. Let $X = [A^{(1)}, \ldots, A^{(N)}]$, where each $A^{(i)}$ is the $m \times d$ matrix of $m$ views subsampled from an unknown number of views $M$. Let the rows $A_l$ of the view matrices $A$ be independent subgaussian vectors in $\mathbb{R}^d$ with $\|A_l\|_2 = 1$, $l = 1, \ldots, m$. Then for any $t \geq 0$, with probability at least $1 - 2\exp(-ct^2)$, we have

$\rho_m - \rho \;\leq\; \max\left(1,\; \dfrac{C\, m^2}{(\sqrt{d} + t)^2}\right)$

Here, $\rho_m$ and $\rho$ are the MV-CORR objectives for the $m$ subsampled views and the total number of views $M$, respectively. The constant $C$ depends only on the subgaussian norm $K$ of the view space, with $K = \max_{i,l} \|A^{(i)}_l\|_{\psi_2}$.

Proof sketch. Here we provide a sketch of the main elements in the proof. See Appendix C for the detailed proof. We begin with the trace ratio in the expression for the subsampled MV-CORR objective $\rho_m$ (see Eq. (4.3.4)). The terms $R_b$ and $R_w$ now denote covariance matrices for $m$ views. Using the properties of the trace and the spectral norm, we can rewrite the expression for the corresponding $\rho_m$ as:

$\rho_m = \dfrac{\mathrm{Tr}\left(W^\top R_b W\right)}{\mathrm{Tr}\left(W^\top R_w W\right)} = \dfrac{\langle R_b + \Sigma_b - \Sigma_b,\; WW^\top \rangle}{\langle R_w + \Sigma_w - \Sigma_w,\; WW^\top \rangle} \;\leq\; \dfrac{\langle \Sigma_b, WW^\top \rangle + \|R_t - \Sigma_t\| + \|R_w - \Sigma_w\|}{\langle \Sigma_w, WW^\top \rangle - \|R_w - \Sigma_w\|}$

where $\Sigma_b$ and $\Sigma_w$ are the between- and within-view covariances, respectively, across all views. From Eq. (4.3.5), recall the result $R_b = R_t - R_w$. The rest follows through triangle inequalities. Observe that the ratio $\langle \Sigma_b, WW^\top \rangle / \langle \Sigma_w, WW^\top \rangle$ is the optimal $\rho$ estimated from the unknown total number of views $M$. Moreover, note that the two trace terms are sums of normalized eigenvalues and hence positive; thus $\langle \Sigma_b, WW^\top \rangle, \langle \Sigma_w, WW^\top \rangle \in [1, d]$. Next, we need to bound the two norms $\delta_w = \|R_w - \Sigma_w\|$ and $\delta_t = \|R_t - \Sigma_t\|$. In the statement of the theorem, note that the multiview data matrix $X$ was rearranged as $[A^{(1)}, \ldots, A^{(N)}]$ using the features as rows in the view matrices $A$; this allows us to express the covariance of zero-mean feature matrices as $A^\top A$ instead of a cross product.
Furthermore, using the assumption that the dierent view matrices A are identically distributed, we have: w = N X i=1 1 m A (i)> A (i) EA (i)> A (i) N X i=1 1 m A (i)> A (i) (i) w N 1 m A > A w The term 1 m A > A w has been extensively studied for the case of isotropic distributions i.e., w = I in matrix concentration theory (Vershynin, 2010). Here, we obtain a bound for the general case of w and show that w =kR w w k isO(d=m); See Appendix C for complete proof. Similarly, we can show that t =kR t t km. The intuition here is that R t is sum of m view vectors, hence it concentrates asO(m); See Appendix C for complete proof. Using these results and the fact that we always choose an embedding dimension d greater than m, we can prove that 48 m isO(m 2 =d). This result is signicant because we can now show that, to obtain d-dimensional multiview embeddings, we only need to subsample m p d number of views from the larger set of views. For example, to estimate a 64-dimensional embedding, we would need to sample at most 8 views. In other words, the DNN architecture in this case would have 8 sub-networks or less. Another factor to consider that is not expressed in the theorem is the trade-o between the choice of the embedding dimension d and the discriminative power of the embeddings. Similar to the related concept in LDA (Fisher, 1936), a small d would only discriminate between classes that are already easily separable in the data. In contrast, a larger d would require a greater m which in turn in ates the number of parameters in the DNN. To understand this trade-o with respect to overall performance, we conduct an empirical analysis by varying the embedding dimension d and number of subsampled views m in Sec. 4.5.2. 4.5 Benchmarking Experiments We consider four datasets to evaluate our framework for audio and visual applications; the details of the task, views and classes are summarized in Table 4.1. We chose these datasets to assess our method in two distinct multiclass settings: (1) uniform distribution and (2) variable number of views per class. Our performance evaluation highlights that our model can be trained in a view- agnostic manner. The results show that MV-CORR models are not only robust to views not seen during training but also the subsampling approach generalizes to a large number of views. For all experiments, the embedding layer (see green block in Fig. 5.1.1) was constrained to have a unit l 2 norm consistent with our assumption in theorem 4.4.1. The DNN models were trained at a batch level by minimizing the loss 1 M (see Eq. B.0.1) using SGD (learning rate=0:01, momentum=0:9 and decay=10 6 ). We experimented with both ReLU and sigmoid functions for the convolution and fully-connected layers and found that sigmoid activation yielded a smaller loss at convergence. To determine model convergence, we applied early stopping criteria, i.e., stop training if the loss at the end of a training epoch did not decrease by 10 3 for 5 consecutive epochs. In each experiment, the sub-network architecture was chosen to ensure a fair comparison with the corresponding baselines. The dimension of the embedding layer was tuned on a held-out 49 development set. After convergence, we extract embeddings (for evaluation) from only one of the sub-networks (which is randomly chosen). We did not observe signicant changes in performance when a dierent sub-network was chosen for inference or when the embeddings from all sub-networks were averaged. 
This behavior is expected as the MV-CORR loss ensures that the embeddings from the last layer are highly correlated with each other (see simulations in chapter 3, section 3.4.3). All models were implemented in TensorFlow (Module: tf | TensorFlow Core v2.0.0, n.d.) and trained on a GeForce GTX 1080 Ti GPU. The related code and resources have been made publicly available 1.

4.5.1 Multi-channel audio activity classification

As discussed in Sec. 5.1, multi-channel audio recordings collected in a constrained space can be used to learn embeddings of the underlying audio activities in a self-supervised fashion. To this end, we experimented with the SINS audio database (Dekkers et al., 2017) for detection of daily activities in a home. Specifically, we used the dataset and baseline models released as part of Task-5 in the DCASE 2018 Challenge (Dekkers et al., 2018). It contains audio recordings, segmented at 10-second intervals, of one person living in a vacation home over a period of one week. Data from 7 microphone arrays in the combined living room and kitchen area were used. We refer the reader to the DCASE website 2 for the floor-plan of the recorded environment along with the positions of the sensors. In total, the train and test sets each contained about 73,000 samples of 10-second audio segments from 7 mics. The training set included data from 4 mics. Along with held-out data from the 4 mics, the test set also included data from the remaining 3 mics, not seen during training. This data partition provided a good test-bed to evaluate whether our framework generalizes to unseen views (mics). The organizers of DCASE retained only audio segments that recorded a single activity. As such, this task is multi-class (see Fig. 4.5.1 for the nine activity labels) and not multilabel classification.

1 MV-CORR code/resources: sail.usc.edu/~somandep/projects/multiview
2 DCASE-2018 Task 5: Monitoring Domestic Activities

Figure 4.5.1: t-SNE visualization of the embeddings learnt on the multi-channel audio data. The nine classes are watching-TV, absence, working, vacuum, dishwashing, eating, social-activity, cooking and others. See demo for an interactive version.

Model and baselines

As shown in Fig. 5.1.1, we use identical sub-networks to model data from each view. As the number of views in the training set was only four, we do not require subsampling of the views. The overall model included four sub-networks of identical 1D convolutional networks, each with the same architecture as the baseline model (Dekkers et al., 2018) with one notable change: no softmax layer at the end (as our model is not supervised with the available labels). Thus, the result is a 64-dimensional embedding from the last layer. This choice of architecture ensured a fair comparison with the baseline.

Table 4.2: Performance evaluation of multi-channel acoustic activity classification on the DCASE dataset.
Microphone type | Clustering task | Accuracy (%)
Seen during training | Activity | 77.4
Seen during training | Channel | 5.12
Unseen during training | Activity | 76.2
Unseen during training | Channel | 6.01

The models were trained as follows: First, the train set was split into a 60-40 partition, where the 40% split was held out for additional supervised adaptation of the embeddings learnt on the 60% split. At each training iteration, four audio segments with the same time-stamp were sampled at random, i.e., the samples in a batch fed to a certain sub-network may come from different mics.
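A minimal sketch of this sampling step is shown below, assuming the 10-second segments are indexed by time-stamp and microphone; the data structure, function name, and the use of sampling with replacement across mics are illustrative assumptions rather than the exact released pipeline.

```python
import numpy as np

def sample_view_agnostic_batch(segments_by_timestamp, batch_size, num_subnetworks=4, rng=None):
    """Build a MV-CORR batch without tracking which mic feeds which sub-network.

    segments_by_timestamp: dict mapping a time-stamp to the list of feature arrays
                           recorded at that time-stamp, one per microphone.
    Returns a list of `num_subnetworks` arrays, each of shape (batch_size, ...).
    """
    rng = rng or np.random.default_rng()
    views = [[] for _ in range(num_subnetworks)]
    timestamps = rng.choice(list(segments_by_timestamp), size=batch_size, replace=False)
    for ts in timestamps:
        mics = segments_by_timestamp[ts]
        # One co-occurring segment per sub-network, drawn at random so that a given
        # sub-network sees different mics across iterations (view-agnostic training).
        picks = rng.choice(len(mics), size=num_subnetworks, replace=True)
        for j, p in enumerate(picks):
            views[j].append(mics[p])
    return [np.stack(v) for v in views]
```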
Thus, the training was view-agnostic. On the 40% split, a linear multiclass SVM was trained with the resulting embeddings as input. Evaluation was performed on the test set, consistent with the baseline (Dekkers et al., 2018). Performance evaluation We assess the discriminability of the multiview embeddings, both qualitatively using t-SNE (Maaten & Hinton, 2008) and quantitatively with unsupervised clustering. As shown in Fig. 4.5.1, the multiview embeddings without any additional supervision uncover the structure of the underlying acoustic activities. Activities such as watching TV, vacuum, social activities (chatter), cooking, and dishwashing cluster well as they have distinct acoustic features. Other activities such as absence, working and eating have a relatively low acoustic energy and are harder to distinguish from each other without additional supervision. We expect the shared subspace estimated from our multiview embeddings to `wash out' the variability associated with the dierent mics but retain the information about the audio events. To assess this quantitatively, we apply k-means clustering (Pedregosa et al., 2011) on the embeddings extracted on the test-set in two settings: (1) cluster into 9 groups and evaluate with activity labels (2) cluster into 7 groups and evaluate with mic labels. The former would show how eective our embeddings are at classifying the acoustic activities and the latter quanties how much information 52 about the mic type is still retained. As shown in the Table 4.2, the clustering performance 3 for identifying the acoustic activity is signicantly higher 4 than for discriminating between the mics. This result supports the central idea of our framework: MV-CORR learns embeddings discriminative of the underlying semantic class by successfully uncorrelating the view information (mic type) from the class information (acoustic activity). Additionally, there was no signicant dierence in performance for activity classication (See bold numbers in Table 4.2) between seen and unseen mics 5 . This shows that our method is robust to views not seen during training. Finally, supervised adaptation of the multiview embeddings using SVM yielded a macro-average F score of 87.6% which was signicantly greater than that of the baseline 84.5% (permutation test n = 10 5 ;p < 0:01) reported in (Dekkers et al., 2018). This suggests that the multiview embeddings can robustly capture the shared information across dierent mics compared to end-to-end supervised models. Why are the embeddings discriminative? Recall the loss function from Eq. B.0.1. Notice that the orthonormal matrix W jointly diagonalizes the two covariance matrices, R b and R w . Thus, if R w is invertible, the generalized eigenvalues (i) correspond to the eigenvalues of R 1 w R b . As a result, we can rewrite the objective M as the average over the d eigenvalues: M = 1 d d X i=1 (i) s.t. 1 (i) 1 (4.5.1) To interpret these eigenvalues, we look at two related formulations: the original work of Fisher's linear discriminant (Fisher, 1936) used in LDA and inter-set correlation (Parra et al., 2018). Both approaches solve a GEV problem for the ratio of variance matrices and analyze the eigenvalues to understand the separability of the semantic classes in the data. Thus, eigenvalues closer to 1 represent directions-of-separability in the data and contribute to a higher multiview correlation objective. 
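To monitor this behavior during training (as visualized in Fig. 4.5.2), one can log the sorted generalized eigenvalue spectrum at each iteration. A minimal sketch follows, assuming R_b and R_w are the batch covariance estimates described above; the surrounding training-loop names are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh

def eigenvalue_spectrum(R_b, R_w):
    """Sorted generalized eigenvalues of R_w^{-1} R_b for one training batch.

    Values close to 1 correspond to directions of separability in the data;
    their average is the MV-CORR objective rho_M (Eq. 4.5.1).
    """
    evals = eigh(R_b, R_w, eigvals_only=True)   # ascending order
    return evals[::-1]                          # descending: leading directions first

# Hypothetical usage inside a training loop:
# spectrum_history.append(eigenvalue_spectrum(R_b, R_w))
# rho = spectrum_history[-1].mean()   # approaches (but does not exceed) 1 as training evolves
```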
As more eigenvalues approach 1, the lower-dimensional subspace can discriminate better between a larger set of the underlying semantic classes. We visualize the magnitude of the generalized eigenvalues $\rho^{(i)}$ at each training iteration as shown in Fig. 4.5.2. As the model training evolves, the number of positive eigenvalues increases and a larger number of eigenvalues approach 1, resulting in embeddings discriminative of the underlying classes in the data.

3 Clustering accuracy estimated with Kuhn's Hungarian method (Kuhn, 1955).
4 Permutation test n = 10^5, p < 0.001
5 Permutation test n = 10^5, p = 0.09

Figure 4.5.2: Distribution of the magnitude of eigenvalues during MV-CORR training. As training progresses, more eigenvalues approach the maximum value of 1, thereby discovering more directions of separability in the data.

4.5.2 View-invariant object classification

In the previous experiment, we showed that time-synchronized audio recordings can be used to learn a robust representation of the underlying events. In a similar vein, we consider view-invariant object recognition using 2D images, but acquired at more than ten angles, requiring subsampling of the views during training. We use the Princeton ModelNet dataset (Z. Wu et al., 2015) to classify the object type from 2D images acquired at multiple view points. We use the train/test splits for the 40-class subset provided on their website 6. Each class has 100 CAD models (80/20 for train/test) with 2D images (100 x 100 px) rendered in two settings by (Su et al., 2015): V-12: 12 views obtained by placing virtual cameras at 30-degree intervals around the consistent upright position of an object, and V-80: 80 views rendered by placing 20 cameras pointed towards the object centroid and rotating at 0, 90, 180 and 270 degrees along the axis passing through the camera and the object centroid.

6 3D object dataset and leader-board: modelnet.cs.princeton.edu

Figure 4.5.3: Clustering accuracy of unseen views for different choices of embedding dimension d and number of views subsampled m. Notice that the best clustering performance is achieved for d = 40 and a subsample of m = 5 views, consistent with the bounds from our theoretical analysis.

Model and baselines

As shown in Fig. 5.1.1, we use identical sub-networks to model the data from each view. The number of sub-networks is equal to m, which denotes the number of views subsampled. We use a 3-block VGG architecture (Chatfield, Simonyan, Vedaldi, & Zisserman, 2014) as illustrated in the inset in Fig. 5.1.1. To reduce the number of trainable parameters, we use global average pooling after the last layer instead of vectorizing its activations before passing them to an embedding layer of d neurons. The result from our theoretical analysis in Theorem 4.4.1 establishes an upper bound on the relation between d and m in the case of view bootstrapping.

Table 4.3: Accuracy of clustering for seen and unseen views. SD computed from ten trials. Bold indicates significantly higher accuracy.
Dataset | Views | Ours | Supervised
V-12 | seen | 82.9 ± 0.5 | 88.7 ± 1.2
V-12 | unseen | 82.1 ± 0.7 | 81.5 ± 0.9
V-80 | seen | 84.2 ± 0.4 | 89.2 ± 1.4
V-80 | unseen | 85.7 ± 1.1 | 80.3 ± 1.5

In order to evaluate this relation empirically, we trained the models over a range of hyper-parameters: m = [2, 3, 4, 5, 7, 9] and d = [16, 32, 40, 64].
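Clustering accuracy throughout these evaluations (footnote 3) maps k-means cluster indices to ground-truth labels with the Hungarian method before scoring. A minimal scikit-learn/SciPy sketch is given below; the function and array names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def clustering_accuracy(embeddings, labels, n_clusters):
    """Cluster embeddings with k-means and score against labels via optimal matching."""
    pred = KMeans(n_clusters=n_clusters).fit_predict(embeddings)

    # Contingency table between predicted clusters and true labels.
    classes = np.unique(labels)
    counts = np.zeros((n_clusters, classes.size))
    for i in range(n_clusters):
        for j, c in enumerate(classes):
            counts[i, j] = np.sum((pred == i) & (labels == c))

    # Hungarian method (Kuhn, 1955): one-to-one cluster-to-label assignment
    # that maximizes the number of correctly matched samples.
    row, col = linear_sum_assignment(-counts)
    return counts[row, col].sum() / labels.size
```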
During training, at each iteration, we randomly samplem images belonging to the same class (object type) without the knowledge of the angle at which the image was acquired or the CAD model from which the image was generated. The dierent samples in a batch may not necessarily belong to the same object type. Thus, the training process is semi-supervised and view-agnostic. To setup a view-agnostic evaluation from the ModelNet data, we create ten trials by repeating the following process: 1. Using a random seed, pick 50% of the views|i.e., 6 views for V-12 and 40 for V-80|from the 80 CAD models in the train set to create a train split. 2. From the 20 CAD models in the test set, create separate test-sets for views seen and unseen during training. 3. Apply k-means algorithm (number of clusters=40) on the embeddings obtained from a dif- ferent model for each trial and evaluate on the seen and unseen test splits. As a baseline, we train a fully supervised CNN in a view-agnostic fashion with same architecture as that of our sub-network. This baseline provides an upper bound for view-agnostic performance as it is fully supervised end-to-end. 56 Performance evaluation First, we examine the clustering accuracy 3 for dierent choices ofm andd on the test-set of unseen views in V-12. As shown in Fig. 4.5.3, we found that d = 40 with the number of sub-networks m = 5 gave the best performance. Consistent with our theory, m > p d did not improve the performance further. The dip in performance form 7 maybe due to the limited data for training a larger network. Then, we compare the clustering performance of the chosen model on the test set for the views seen and unseen during training, as well as with the supervised baseline. We observed no signicant 7 dierence between the accuracy for seen and unseen views across the ten trials (see `Ours' in Table 4.3). While the corresponding supervised models trained end- to-end generally performed better than our method, they performed signicantly poorer for views not seen during training. Particularly, in V-80, our model performed signicantly better than the supervised baseline for views not seen during training, re ecting that our multiview modeling is robust to unseen views, when a densely sampled and larger number of views are available. To evaluate our framework in a supervised setting, we trained a MV-CORR model (m = 5;d = 40) on data from 40 out of the 80 CAD models in the V-12 train set. Embeddings extracted from the remaining 40 CAD models were used to train a 2-layer fully connected DNN (a sigmoid layer of 80 neurons and a softmax layer of 40 neurons) to classify the object category. Classication accuracy and mean average precision (mAP) were used to evaluate the recognition and retrieval tasks. For baseline, we compare our method with the ModelNet leader-board 6 . We highlight our results in reference to the state-of-the-art (SoA) for this application (see bold numbers in Table 4.4) as well as examples from widely used methods such as domain-invariant applications of GAN (Khan, Guo, Hayat, & Barnes, 2019) and multiview CNN for object recognition (Sun, Liu, & Mao, 2019). As shown in Table 4.4, MV-CORR framework performs on par with these methods and is within 4% points of the SoA for recognition and retrieval tasks. In all experiments, we observed that the bound for the maximum number of sub-networks in practice is better than the theoretical bound, i.e. md 2=5 < p d. 
The choice of m only depends on d and not the larger set of views M: a useful property to have in practical settings where M 7 Signicance testing using Mann-Whitney U test at = 0:05 57 Table 4.4: 3D object recognition and retrieval comparison with other methods. Bold indicates state-of-the-art results. Method Acc. mAP Loop-view CNN (Jiang, Bao, Chen, Zhao, & Gao, 2019) 0.94 0.93 HyperGraph NN (Feng, You, Zhang, Ji, & Gao, 2019) 0.97 - Factor GAN (Khan et al., 2019) 0.86 - MVCNN (Su et al., 2015) 0.90 0.80 Ours + 2-layer DNN 0.94 0.89 maybe unknown. The embedding dimension d however, needs to be tuned as it depends on the intra- and inter-class variabilities associated with the inputs. The choice of d also depends on the complexity of the downstream task and the number of semantic classes in a dataset. 4.5.3 Pose-invariant face recognition In the previous experiment, we evaluated our framework with a focus on the performance of seen vs. unseen views during training, but all the object types were seen during training. In this ex- periment, we test the usefulness of our embeddings to recognize classes not seen in training. To this end, we apply our framework for face recognition where the goal is to learn a person's identity from face images, invariant to dierent presentations such as the pose, illumination, expression, and background. Multiview learning frameworks that are view-agnostic and generalizes to a large num- ber of views are attractive for this application as they can robustly capture the shared information across dierent variations of a person's face without knowing how the images were acquired. We use the Multi-PIE database (Gross et al., 2010) which includes face images of 337 subjects in 15 dierent poses, 20 lighting conditions and 6 expressions in 4 sessions. For comparison with baseline, we use a similar train/test split as in GMA (Sharma et al., 2012) of 129 subjects common to all four sessions as test data and remaining 120 subjects in session 01 for training. For performance evaluation, we use 1-NN matching with normalized euclidean distance similarity score as the metric. The gallery consisted of face images of 129 individuals in frontal pose and frontal lighting and the remaining images from all poses and lighting conditions were used as probes. All images were cropped to contain only the face and resized to 100 100 pixels. No face alignment was performed. 58 Table 4.5: 1-NN matching accuracy comparison for pose-invariant face recognition. Bold indicates the best performing model Method 15 30 45 60 75 Avg. GMLDA 92.6 80.9 64.4 32.3 28.4 59.7 GMMFA 92.7 81.1 64.7 32.6 28.6 59.9 DCCA 82.4 79.5 73.2 62.3 51.7 69.8 Ours 95.7 93.1 94.5 92.3 91.1 93.3 Model and baselines For our model architecture, we rst choosem = 2 sub-networks and examine the MV-CORR value at convergence for dierent embedding dimension d. Based on this, we pick d = 64. Following our observations in the object classication task, we choose m = 4 sub-networks. The sub-network architecture is the same as before (See inset Fig. 5.1.1) During training, we sample with replacement, m face images per individual agnostic to the pose or lighting condition. For matching experiments, we extract embeddings from a single randomly chosen sub-network. For baseline, we train deep CCA (DCCA (Andrew et al., 2013)) using its implementation 8 with the same sub-network architecture as ours. We trained separate DCCA models for ve poses: 15, 30, 45, 60 and 75 degrees. 
While training the two sub-networks in DCCA, we sample face images of subjects across all lighting conditions with a frontal pose for one sub-network and images of the specic pose for the second. This matches the testing conditions where we only have frontal pose images in the gallery. During testing, we use the pose-specic sub-network to extract embeddings. We also compare our method with two other variants of GMA: GMLDA and GMMFA by using the results reported in (Sharma et al., 2012). Performance evaluation As shown in Table 4.5, our model successfully matches at least 90% of the probe images to the frontal faces in the gallery, across all poses. The performance drop across dierent poses was also minimal compared to a pairwise method such as deep CCA, which assumes that the pose of a 8 Deep-CCA code: github.com/VahidooX/DeepCCA 59 probe image is available in testing conditions. However, the view-agnostic benet of our method and the Multi-PIE dataset need to be viewed in the context of the broader research domain of face recognition. Methods such as MvDA (Kan et al., 2015) which build view-specic transformation have shown nearly 100% face recognition rates on Multi-PIE when the pose information of the probe and gallery images was known. Furthermore, the face images in this dataset were acquired in strictly controlled conditions. While it serves as an eective test-bed for benchmarking, we must consider other sources of noise for robust face recognition besides pose and lighting (F. Wang et al., 2018). We have also explored MV-CORR for robust face recognition in-the-wild towards character diarization in movies i.e., automatically determining who appeared when. Our preliminary results are presented in (Somandepalli, Hebbar, & Narayanan, 2020) and discussed in chapter 6. 4.6 Discussion In this chapter, we explored a neural method based on multiview correlation (MV-CORR) to cap- ture the information shared across large number of views by bootstrapping the data from multiple views during training in a view-agnostic manner. We discussed the theoretical guarantees of view bootstrapping as applied to MV-CORR and derived an upper bound for the number of views to subsample for a given embedding dimension. Our experiments on multi-channel audio event classication, view-invariant object classication and retrieval, and pose-invariant face recognition showed that our approach performs on par with competitive methods in the respective domains. The results underscore the applicability of our framework for large-scale practical applications of multiview data, where we may not know how multiple corresponding views were acquired. 60 Chapter 5 Learning Speaker and Speech Command Representations In chapter 3, we presented deep multiset CCA as a measure of multiview correlation. To extend this framework to a large number of views, we proposed generalized multiview correlation (MvCorr) in chapter 4. The experiments thus far show that our framework is competitive with respect to existing benchmarks on data acquired in controlled conditions. The datasets considered thus far have the same number of views per class, i.e., uniform distribution of the views although the number of samples per class may be variable. In practical settings, we often have to deal with a variable number of views per class. To study our framework in this context, we study the performance of multiview correlation for learning speaker and speech command representation in this chapter. 
We have released the code and trained models at github.com/usc-sail/gen-dmcca. We also developed some interactive visualizations on how well our method clusters the speech commands data which can be found at the associated hyperlinks for multiview command representations and multiview speaker representations. The work presented in this chapter was presented at Somandepalli et. al., \Multiview Shared Subspace Learning Across Speakers and Speech Commands", Proceedings of Interspeech 2019 61 5.1 Introduction Consider a person interacting with a conversational virtual agent. A key research problem here is to recognize a speech command (e.g., \Okay Google, TV On") irrespective of who the speaker is or what device one is using. This is a challenging task because of several sources of variability in speech audio such as speaker or channel characteristics and background noise. A similar research problem, in computer vision is to identify an object imaged under dierent illumination and camera angles. In such problems, the dierent modes of variability are akin to multiple observations or views of the underlying phenomenon or the signal. Our objective is to learn a subspace from these multiple views to capture the information shared across them. Multiview learning has been studied in several elds (See (Zhao et al., 2017) for a survey). Two important research questions here are: 1) How do we handle multiple views simultaneously? and 2) How can we model a large (possibly thousands) number of views? In applications with only two views, Canonical correlation analysis (CCA (Hotelling, 1936)), and its extension, deep CCA (Andrew et al., 2013) have been successfully used. To model more than two views in parallel, we proposed deep multiset CCA (dMCCA (Somandepalli, Kumar, Travadi, & Narayanan, 2019a)) as a measure of multiview correlation. dMCCA models both the between-view and within-view variance to learn a shared subspace using an independent deep neural network (DNN) per view. However, it is unclear if dMCCA can model a large number of views in parallel, since the number of parameters to be trained increases with the number of views in the dataset. To address this limitation, a generalized multiview correlation (MvCorr) was proposed in our previous chapter 4 to incorporate a large number of views by subsampling a small set of views with replacement during training. In this chapter, we propose a novel direction to model speech in a multiview paradigm, where the variability across speakers or channels can be considered as multiple views; bridging the gap between domain adaptation and multiview learning. Figure 5.1.1 illustrates the analogies of multiview learning for speech tasks. The acoustic features of an utterance, e.g., \On" by dierent speakers is analogous to imaging an object from dierent angles. We can then constrain our views to either the speakers or utterances to capture the shared information, i.e., the signal. Related work in speech research has mostly applied multiview learning in a multimodal setup. 62 Figure 5.1.1: (A) Representing the `signal' from multiple `views' (B) speakers as views (C) words as views Typically, video (Livescu & Stoehr, 2009) or articulatory measurements (Bharadwaj, Arora, Livescu, & Hasegawa-Johnson, 2012; X. Wang & Gupta, 2015) are available along with speech. In contrast, for speech applications, we only need acoustic features, where we consider the dierent modes of variability as multiple views. 
Additionally, we propose to stochastically sample a small number of views during training to apply dMCCA for a larger set of views. For this purpose, we conduct experiments on the Speech Commands Dataset (Warden, 2018). This publicly available corpus provides an excellent testbed for us because it consists of over 1800 speakers with one or more of thirty speech commands each, thus providing a large number of views to test our approach. Specif- ically, we test our approach for two tasks as illustrated in Fig. 5.1.1B{C: command and speaker identication. Because we introduce our approach within the framework of speaker and command identica- tion, we also explore existing methods for these tasks. Obtaining utterance representations robust to speaker or channel variability has been extensively studied with total variability modeling (TVM) (Dehak, Kenny, et al., 2011). It is shown to perform eectively for speaker and language identi- cation (Dehak et al., 2009; Travadi et al., 2014). TVM estimates a single low-rank representation (called i-vector) for all sources of variability in the data. As such, the dominant modes of vari- 63 ability captured largely depend on the training data, and cannot be explicitly supervised. Because they are trained in an unsupervised manner, linear discriminant analysis (LDA) (Dehak, Torres- Carrasquillo, Reynolds, & Dehak, 2011) or probabilistic LDA (PLDA (Prince & Elder, 2007)) is applied to the i-vectors prior to an identication task. Alternatively, DNNs can be trained end- to-end for tasks such as speaker verication in a supervised fashion (Heigold, Moreno, Bengio, & Shazeer, 2016; Snyder et al., 2016). To make use of back-end technology developed for i-vectors, a two-part system: DNN to learn embeddings (x-vectors), and a separate classier trained for the task was proposed in (Snyder, Garcia-Romero, Povey, & Khudanpur, 2017). More recently, this method has been further improved using data augmentation (Snyder, Garcia-Romero, Sell, Povey, & Khudanpur, 2018) for robust speaker recognition. Our primary contribution in this work is to model domain adaptation problems in speech using a multiview learning framework using MvCorr. While the MvCorr model can be applied for any DNN architecture, in this work, we use convolutional neural networks (CNN) to obtain shared representations of the speech features of the utterances considering dierent speakers or words as views to classify the words or speakers respectively. 5.2 Methods ConsiderN samples ofD-dimensional features observing the same underlying phenomenon (signal) from M dierent views. Let X l 2 R DN l = 1;:::;M, be the data matrix for the l-th view with columns as features. We dene multi-view correlation matrix as the normalized ratio of the sum of between-view covariances R B and sum of within-view covariances R W for M views as follows: = 1 M 1 R B R W = P M l=1 P M k=1;k6=l X l ( X k ) > (M 1) P M l=1 X l ( X l ) > (5.2.1) where X = X E(X ) are mean centered data matrices. The common scaling factor (N1) 1 M 1 in the ratio is omitted. In the related work, the ratio in Eqn. 6.4.5 is maximized for MCCA (Hotelling, 1992; Parra, 2018). A variation of this measure called intraclass correlation coecient (Bartko, 1966) has been extensively used to quantify the repeatability of measurements in test-retest studies. 64 Figure 5.2.1: CNN architecture for the view branches in dMCCA Our objective is to estimate a shared subspace, V2R DD such that a measure of correlation is maximized. 
Formally: (M) = max V 1 D(M 1) tr(V > R B V) tr(V > R W V) (5.2.2) where tr() denotes the trace of a matrix. Note that is bounded above by 1. The subspace V can be estimated by solving the generalized eigenvalue problem to simultaneously diagonalize R B and R W . Hence, our objective function is the average of the ratio of eigenvalues of between- view and within-view covariances. In other words, if R W is invertible then we want to nd the a subspace direction that maximizes the spectral norm of R 1 W R B , thus accounting for the dierent modalities of variance (within-view) for a given signal-of-interest (between-view). We then compute the multiview representation Y l = V > X l where Y l 2R DN is a signal that is maximally correlated across views. To extend MCCA across large and complex datasets, deep MCCA has been proposed (Soman- depalli, Kumar, Travadi, & Narayanan, 2019a) where each view of the data is input to a neural networkg l ();l = 1;:::;M. The multiview representation, in this case is the output of the last layer H K l =g l (X l ). The covariance matrices, R B and R W in Eq. 6.4.5 are estimated using H l instead 65 of X l to maximize (M) (Eq. 6.4.6). In dMCCA, data from multiple views is modeled with identical and independent neural net- works that are trained jointly to maximize. We refer to these individual networks as sub-networks. In speech processing applications, where we often deal with a large number of speakers, the sub- networks would also be in the order of thousands. In this case, dMCCA cannot be applied directly for several reasons: 1) the number of parameters increase by order of M, and may not be com- putationally feasible to train, 2) limited amount of data compared to the increased number of parameters 3) it is computationally expensive to estimate the covariance matrices R B ; R W , and 4) may overt because of high variance from modeling all the views (speakers) simultaneously. In order to address these limitations, we use a stochastic approach where a small number of viewsm<<M are uniformly sampled from a larger set of views. Formally, the modied objective is: =E mU(1;M) (m) (M) (5.2.3) This method is called generalized multiview correlation (MvCorr) (Somandepalli & Narayanan, 2020), since it can model a much larger number of views. The training procedure is weakly super- vised in the domain adaptation setting. For example, in a classication problem, we only have the knowledge that certain samples measure the same event from dierent views, e.g., same word \On" said by dierent speakers. The number of classes, or the number of samples per class is not known during training. Another benet of this approach is that it is view-agnostic because during training, we do not keep track of the specic views. We randomly pick dierent views for the dierent branches. Thus, during inference we only use one of the m trained sub-networks, to obtain the representations. Because we learn a shared subspace across the views, we expect the representations from dierent branches to be maximally correlated as shown in (Somandepalli, Kumar, Travadi, & Narayanan, 2019a). 5.3 Experiments As described in Sec. 5.1, we wish to study two distinct aspects of variability in Speech Commands Dataset (Warden, 2018) in a multiview setup, by constraining one mode as view to learn dis- 66 Table 5.1: Speaker and utterance (utt.) characteristics Distribution (Min, Max) Mean SD Samples per utt. (1484, 2203) 1941.7 287.6 Utt. per speaker (1, 205) 31.2 25.6 No. samples (No. 
speakers) Task Train Dev Test Command-ID 35067 (1123) 9137 (300) 14048 (445) Speaker-ID 27314 (1428) 2092 (146) 2098 (146) criminative representations for the other mode. For this purpose, we setup two experiments: 1) command-ID: speakers as views to discriminate between words, and 2) speaker-ID: dierent words as views to represent speakers. To validate the generalizability of speaker-ID for unseen words, we leave out 15 of the 30 speech commands during training and development (dev) and test on the left-out words. All related code is made publicly available at github.com/usc-sail/gen-dmcca. The characteristics of the dataset are shown in Table 5.1. The number of views is 1123 and 15 and the number of classes is 30 and 1148 for command and speaker-ID tasks, respectively. This allows us to test both conditions with a large number of views, as well as distinct classes in the signal. In all experiments, the test set does not include subjects seen in either train or dev sets. Acoustic features: We used 30 dimensional MFCC features (extracted using Kaldi (Povey et al., 2011) with default parameters) from audio sampled at 16kHz. The frame length and frame shift were 25ms and 15ms, respectively. For MvCorr, since we use CNN architecture, we also experimented with log-mel features obtained with the same conguration. 5.3.1 Baseline Experiments Although the objective of our work is to analyze speech in a multiview paradigm, we compare it with other methods to identify its strengths. The i-vector and x-vector methods have been extensively used for language (Dehak, Torres-Carrasquillo, et al., 2011) and speaker-ID (Snyder et al., 2018), and as such are used here as baselines. Both are trained on the Speech Commands Dataset for fair comparison with our approach. One caveat is that these methods are typically trained on vast amounts of data, which is re ected in our performance evaluation on these features. i-vector: We separately obtain i-vectors for the command and speaker-ID tasks in the data partitions shown in Table 5.1. A GMM with 2048 Gaussians and a 400-dim i-vector extractor was 67 Figure 5.3.1: The eect of the number of views m on multiview correlation and classication performance for command-ID trained. In applications with i-vectors, it is typical to apply LDA (Dehak, Kenny, et al., 2011) to remove eects such as channel variability. Following this, for speaker-ID, the i-vectors were transformed using LDA using the ground-truth speaker identities in the training set. Similarly, for command-ID, the words were used as labels. Thus, we obtain 150-dim features after LDA. x-vector: We apply Kaldi's (Povey et al., 2011) standard x-vector recipe separately for speaker and command modeling tasks. LDA is applied to reduce the 512-dim x-vectors to 150-dim, following the standard conventions of (Snyder et al., 2018). Deep CCA: Because MvCorr, and its deep version are extensions of the original CCA (Hotelling, 1936) objective, we use deep CCA (Andrew et al., 2013), but trained with stochastic sampling of the views (See Eq. 5.2.3). The network architecture is the same for both deep CCA and MvCorr, shown in Fig. 5.2.1 but with only two sub-networks. 5.3.2 Generalized Multiview Correlation As described in Sec. 5.2, the dierent views of the data are transformed using DNNs with identical architectures, referred to as sub-networks. The number of sub-networks is the same as the number of subsampled views m. 
Although, we can use any DNN architecture, we chose CNNs because they have been successfully applied for tasks such as audio event classication (Hershey et al., 2017) and speech activity detection (Hebbar, Somandepalli, & Narayanan, 2019). The CNN of the 68 Table 5.2: Performance evaluation with clustering and classication; No. of classes for the two tasks are 30 and 146 respectively Method / performance command-ID speaker-ID purity macro F1 purity macro F1 i-vector 0.52 0.69 0.33 0.44 x-vector 0.85 0.89 0.48 0.65 dCCA { MFCC 0.73 0.82 0.50 0.71 MvCorr { MFCC 0.87 0.92 0.90 0.86 MvCorr { log-mel 0.89 0.94 0.92 0.90 sub-networks in our model is a smaller version of that in (Hebbar et al., 2019), and is shown in Fig. 5.2.1. The input to the network with m sub-networks is amTD tensor (either MFCC or log-mel) with T frames and D lter banks. Similar to the application of temporal pooling in x-vectors, we perform global average pooling (GAP: average across outputs from convolutional lters and time points) before input to the fully connected (FC) layer. The GAP layer allows us to model utterances of variable duration, because the activations in time are averaged out. The activations from the FC layers, H l are used to estimate the between and within-view covariances. They are additionally constrained to have a unit l 2 norm. We experimented with sigmoid and linear activations for the convolutional and FC layers and found that sigmoid activation achieved higher at convergence. The number of nodes in FC layer in each sub-network was set to 64. Thus, at inference time, we obtain 64-dim features. In all our experiments, we minimize the negative of the objective using mini-batch stochastic gradient descent with a learning rate of 0.01. Additional experiments with Nesterov momentum of 0.9 and a decay of 1e 6 improved the speed of convergence. To determine model convergence, we applied early stopping criteria (stop training if validation loss does not decrease by 1e-3 for 5 consecutive epochs). Choosingm: A central premise of our work is that MvCorr can model to thousands of views by subsampling the views. In order to empirically study the eect of number of views m, we created ten data partitions similar to that shown in Table 5.1. For each one, we obtained command-ID and speaker-ID representations by varyingm =f2; 3; 4; 5; 6; 8g. We also used this procedure to choosem by examining the multiview correlation of the model at convergence on the dev set. The number of trainable parameters corresponding to choices of m weref590K; 885K; 1:1M; 1:4M; 1:7M; 2:3Mg which were also a factor in choosing the number of views to subsample. 69 5.3.3 Performance Evaluation Figure 5.3.2: t-SNE visualization of x-vectors and generalized MvCorr on the test set (A) command- representations: centroids are indicated by solid squares and the points belonging to a class are shaded. Notice the proximity of similar sounding words (circled in dotted lines) (B) speaker representations: markers+colors represent dierent speakers. 54 speakers with the most utterances in test-set are shown We performed classication and clustering to assess the importance of multiview representations for downstream tasks, as well as compare with the baselines. We trained independent linear support vector machines (SVM) for the representations obtained in Sec. 5.3.1 and MvCorr for command and speaker-ID tasks. 
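The evaluation protocol described in the surrounding text can be summarized in a short scikit-learn sketch; the array names are illustrative, integer labels are assumed, and purity is computed as the fraction of samples carrying the majority label of their cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def evaluate_embeddings(dev_X, dev_y, test_X, test_y, n_clusters):
    """Supervised (linear SVM, macro-F1) and unsupervised (k-means purity) evaluation."""
    # Linear SVM with 10-fold cross-validation on the dev set for parameter selection.
    svm = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=10, scoring="f1_macro")
    svm.fit(dev_X, dev_y)
    macro_f1 = svm.score(test_X, test_y)

    # k-means (k-means++ initialization) fit on the dev set, evaluated on the test set.
    km = KMeans(n_clusters=n_clusters, init="k-means++").fit(dev_X)
    pred = km.predict(test_X)
    purity = sum(np.bincount(test_y[pred == c]).max()
                 for c in np.unique(pred)) / len(test_y)
    return macro_f1, purity
```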
Since the representations were learned on the training set, the SVMs were trained on the dev set with 10-fold cross validation for parameter selection. Because the classes were imbalanced, we used macro-averaged F1-score to compare the models. For unsupervised clustering, we used the implementation in sklearn (Pedregosa et al., 2011) with kmeans++ (Arthur & Vassilvitskii, 2007) to initialize the cluster centers. The models are trained on the development set and the number of clusters was set to 30 and 146 for command and speaker-ID respectively, during testing. We use cluster purity and V-measure (Rosenberg & Hirschberg, 2007) to assess the performance of clustering. 5.4 Results The eect of the number of subsampled views m on the multiview correlation and classication performance of these models for command-ID task is shown in Fig. 5.3.1. We observed an improve- ment in from 0:06 to 0:25 with varying m from 2 to 3. Further increasing m only improved by at most 0:02. Considering this and the fewer number of trainable parameters, we chose m = 3 for 70 all our subsequent experiments. The results were similar for the speaker-ID task: = 0:56 0:01 for m 3. Recall that is bounded above by 1. We also analyzed the eect of m on the classication performance using pairwise McNemar's (Dupont & Plummer, 1990) chi-squared test 2 with Bonferroni correction (n = 15 comparisons). We observed signicantly better performance for m 3 compared to m = 2. However the changes in performance form 3 were not signicant. These results provide empirical evidence for stochastic sampling of views to generalize MvCorr. We used t-SNE (van der Maaten & Hinton, 2008) to visualize the embeddings learned for com- mands and speakers from x-vectors as well as MvCorr. As shown in Fig. 8.4.1A, the multiview representations in the command manifold appear invariant to the speaker characteristics. Inter- estingly, similar `sounding' words are proximally located on the manifold. A few examples are f\down",\dog"g andf\on", \o",\up"g. The t-SNE manifold for speakers is visualized in Fig. 8.4.1B. Qualitatively, the representations from MvCorr appear to be highly separable for both ID tasks. Next, to quantitatively assess if the command and speaker-ID can be performed without super- vision, we compare the average purity of the clusters from kmeans approach as shown in Table 5.2. Similar trend in the V-measure was observed. Overall, for both tasks, MvCorr performs the best. Using log-mel features instead of MFCC improved the performance. This is consistent with the use of CNNs, reported in other works (C.-W. Huang & Narayanan, 2017). While deep CCA trained on stochastically sampled views performs comparable to other baselines, MvCorr performs better suggesting the benet of the modeling more than two views in parallel and subsampling the views. For command-ID, the x-vector performance is comparable to that of MvCorr, but i-vectors do not cluster as well. This is perhaps because the utterances were of small duration (about 1s). This is consistent with ndings in speaker recognition literature where i-vector system degrades as the utterance length decreases (Jati & Georgiou, 2018; Kanagasundaram, Vogt, Dean, Sridharan, & Mason, 2011). We observed similar results with i-vectors for speaker-ID task as well. However, compared to command-ID, x-vectors perform poorly for the speaker-ID task. 
This could be because 2 Reject H0: marginal probabilities of each outcome are the same at p< 0:01 71 dog go tree left on o up three cat bed seven eight four ve right two down stop six nine yes house no bird one marvin sheila zero wow happy 0:4 0:5 0:6 0:7 0:8 Average class acc. = 0.66 Clustering accuracy (acc.) per speech command Figure 5.4.1: Performance of MV-CORR for spoken word recognition in SCD the large variance in the number of utterances per speaker (See Table 5.1), introducing class imbal- ance during the supervised training of x-vectors. Similar trends in the performance were observed for classication (See Table 5.2). Finally, as described in Section 5.2, because we obtain view-agnostic features during inference using only one of the sub-networks, we evaluate the performance for the features from the other sub-networks. On average we noticed a variation of 0:24 in cluster purity and 0:11 in macro-F1, which was not substantial. Spoken word classication Next, we analyze spoken word recognition on the Speech Commands Dataset in greater detail. Specically, we are interested in the performance MV-CORR in the case of variable number of views per class. Thus, we analyze the performance of each word with respect to the number of unique speakers (views) available for that word. Of the 1868 speakers, we use 1000 speakers for training and the remaining for testing to ensure that we only test on speakers (views) not seen during training. To assess generalizability to unseen classes, we create three folds by including 20 words for training and the remaining 10 words for testing. We use the k-means algorithm to cluster the embeddings for the test splits with the number of clusters set to 10. 72 Table 5.3: Comparison of MV-CORR framework with domain adversarial methods Method DAN CrossGrad Ours + 2-layer DNN Acc (%) 77.9 89.7 92.4 The per-class accuracy 3 from the clustering task is shown in Fig. 5.4.1. The average number of speakers across the thirty commands was 400:3 52:5 which underscores that number of views available per class is highly variable. We observed a minimal association (Spearman rank correlation = 0:12) between the number of unique speakers per word and the per-class accuracy scores| suggesting that MV-CORR can work eectively with a variable number of views per class. However, it is dicult to disambiguate this result from the complexity of the downstream learning task. That is, we may need more number of speakers (views) for certain words to account for higher inter-class variability (e.g., similar sounding words such as \on" vs. \o" or \tree" vs. \three") and intra-class variability (e.g., dierent pronunciations of a word or the acoustic background). Finally, in the context of domain adaptation for experiments with SCD, we compare MV-CORR with two recent domain adversarial learning methods: domain adversarial networks (DAN, (Ganin et al., 2016)) and cross-gradient training (CrossGrad, (Shankar et al., 2018)). The central idea of these methods is to achieve domain invariance by training models to perform better at classifying a label (word) than at classifying the domain (view). Similar to our previous experiments, we adapt our embeddings for a supervised setting on a subset of 12 commands in SCD to compare with the results in (Sharma et al., 2012). We rst train the MV-CORR model of four sub-networks using 500 speakers from the training set. 
We then obtain 64-dimensional embeddings on the remaining 500 speakers and train a 2-layer fully connected DNN (A sigmoid layer followed by a softmax layer with 128 and 12 neurons respectively) to classify the 12 commands, and test on the remaining 868 speakers. For baseline, we replicate the experiments for DAN and CrossGrad using their publicly available code 1 . We use the same splits of 500 speakers each for training/development and 868 speakers for testing. The classication accuracy of our method and that of DAN and CrossGrad is shown in Table 5.3. We observed a signicant improvement 2 over CrossGrad, suggesting that 1 CrossGrad and DAN code: github.com/vihari/crossgrad 2 Permutation test n = 10 5 , p = 0:008 73 a multiview formulation can be eectively used for domain adaptation problems such as robust speech recognition. 5.5 Discussion In this chapter, we study speech representations in a multiview paradigm by treating the dierent modes of variability as multiple views using CNN and generalized multiview correlation. In the Speech Commands Dataset, we conduct two distinct experiments to identify speakers and com- mands, where we constrain the views to the dierent commands or speakers, respectively. We show that in multiview correlation, stochastically sampling a small number of views generalizes for thousands of views even in the presence of a variable number of views per class. Our per- formance evaluation and comparison with state-of-the-art methods used to obtain robust speech representations demonstrate the benet of explicitly modeling the dierent modes of variability using multiview learning. 74 Chapter 6 Robust face clustering in movie videos At the outset of this dissertation, in chapter 1, section 1.1.2 one of the application areas highlighted for multiview learning was automatic character labeling in a movie: i.e., automatically understand- ing who appears when in a movie video. In this chapter, we will explore the generalized multiview correlation to develop large-scale, movie domain matched, state-of-the-art face clustering method- ologies. Additionally, the diverse set of audio and visual experiments conducted in the last three chapters|chapter 3, 4 and 5|the settings were constrained, i.e., the dierent types of views were known and in some cases such as 3D object recognition and speech command representations, he multiview setting was simulated in a semi-supervised manner using the labels provided. In contrast, this chapter considers the case of robustly clustering character faces in movies, where the multiview setting is unconstrained. 6.1 Introduction Media is created by humans, for humans: to tell stories that educate, entertain, inform, market products or call us to action. When we watch an ad, TV series or a movie, the onscreen characters shape our point of view by providing a window into the narrative. These characters advance the plot and play a vital role in eective storytelling (Cohen, 2001). Advances in machine learning can help The work presented in this chapter has been submitted to IEEE Transactions on Multimedia, 2021 as Robust Character Labeling in Movie Videos: Data Resources and Self-supervised Feature Adaptation. A preprint version of this submission is available on ArXiv at https://arxiv.org/pdf/2008.11289.pdf 75 automatically identify who, where, and when to build a computational understanding of character representations, portrayal and behavior in media content (Somandepalli et al., 2021). 
This research can also directly in uence other technical elds such as understanding social interactions in video (Vicol, Tapaswi, Castrejon, & Fidler, 2018), automatic video captioning (Rohrbach, Rohrbach, Tang, Joon Oh, & Schiele, 2017) and developing computational narratology (Kim & Doh, 2017). Media character-level analysis oers numerous applications to an array of stakeholders: from content creators and curators to engineers, media scholars and consumers. Consider video streaming platforms which are able to tailor recommendations based on the cast of characters and the settings in which they appear (El Bolock, El Kady, Herbert, & Abdennadher, 2020). Another notable example, particularly for content curators is the X-ray feature (4 ways to use X-Ray in Prime Video, n.d.) on Amazon's Prime Video platform. Aimed at viewers who want to learn more about who they are watching, X-ray, among other things, identies the cast and characters in some of the movies and series on Prime Video, enriching the user experience. Social media scholars and content creators can easily conduct large-scale studies of TV and movie trends using character on-screen presence to shed light on a variety of relevant topics such as diversity and inclusion in casting (Guha, Huang, Kumar, Zhu, & Narayanan, 2015; Somandepalli et al., 2021). Such studies have a real impact on society and our everyday lives: for example, increased employment opportunities for women and underrepresented groups per a 2020 report on diversity in Hollywood (2020 Hollywood Diversity Report: A dierent story behind the scenes, n.d.). A rst step toward developing such technology is the ability to automatically identify the characters in the visual modality, i.e., the who in a video. This is a dicult task because of the rich variety in portrayal and design of dierent media forms (e.g., automatic character labeling in animated content (Somandepalli, Kumar, Guha, & Narayanan, 2017)). In this study, we focus on live-action content where a person's face is used as the primary signal of identication. This is typically achieved through a two-step process: i) face detection to localize the face of a person in a frame, ii) clustering the detected faces to recognize a person irrespective of where and how they appear in the content. Recent advances in face detection can localize faces with near-perfect precision, even in extreme conditions of illumination and pose (Deng, Guo, Zhou, et al., 2019). Similarly, advances in learning face representations (embeddings) and rich open-source face datasets (e.g., (Cao, Shen, Xie, Parkhi, & Zisserman, 2018; Schro, Kalenichenko, & Philbin, 2015)) have 76 (F, G) (F, G) (P, L) (F, L, O) (F) (B, F) (F) (B, P) (F, L) (F, L, O) (P) (B, F) (F) (P) (F, L) (B, F, O) (B, F) (B, P) Blurry (B) Frontal (F) Glasses (G) Occluded (O) Poor Lighting (L) Profile (P) Face quality labels Hidden Figures (2016) (B, F) (B, F) (P) Figure 6.1.1: Challenging instances of faces detected in a movie for character labeling task. The ex- ample shows the prominent characters from a 2016 Academy award winning movie Hidden Figures. Face quality labels associated with each track are also shown to tag some of the visual distrac- tors. The images in the rst column are character label exemplars taken from the IMDb page: www.imdb.com/title/tt4846340 provided powerful frameworks to identify a person by their face. However, face recognition in videos in the absence of domain-matched training data remains a challenging problem. 
We need to robustly identify the characters irrespective of changes in ap- pearance, background imagery, facial expression, size (resolution), view points (pose), illumination, partial detection (occlusion), and in some cases, age (Ghaleb, Tapaswi, Al-Halah, Ekenel, & Stiefel- hagen, 2015). Figure 6.1.1 highlights the variability in the appearance of characters that makes character labeling task challenging in the presence of some of these visual distractors. This task is further complicated in long-form content such as movies, where characters occur at varying frequen- cies and suitable exemplars of actors playing them are not always available. In this setup, eective face recognition not only requires face embeddings which remain robust to visual distractors, but possibly unsupervised methods such as clustering to accurately identify every character in a movie. This is the main focus of this work with Hollywood movie videos as our application domain of interest. Unlike photo albums or image datasets, the temporal nature of videos can be used to group time-consecutive faces detected from the same person into face tracks, using simple heuristics based on the detection overlap (Xiang, Alahi, & Savarese, 2015) agnostic to the face identity. Thus, face recognition in videos is generally performed at the track-level. Face tracking in video content such as movies|where there are typically multiple characters appearing in a scene|also oers self- 77 supervised means of mining pairs of faces that belong to the same person (all faces within a track) and faces that belong to dierent people (all faces co-occurring in a frame). This process exploits the co-occurring nature of faces, both spatially and temporally, inherent to video content such as movies and does not require additional supervision; hence the term self-supervised. In the context of video face clustering, several past studies have shown self-supervision frameworks to be eective in learning robust domain-specic face embeddings (Sharma, Tapaswi, Sarfraz, & Stiefelhagen, 2019; S. Zhang, Gong, & Wang, 2016) as well as improve face clustering (Cinbis, Verbeek, & Schmid, 2011; Somandepalli & Narayanan, 2019). In this work, we use the idea of self-supervision to contribute toward two critical aspects of face clustering in movies: addressing the lack of domain-matched training data and adapting deep face embeddings learned from static images to movie face tracks. First, we present a large-scale weakly labeled dataset that we created by mining instances of spatially and temporally co-occurring face tracks in movies, called SAIL-MultiFace. From a sample of 240 Hollywood movies released between 2014{2018, we curated over 169,000 face tracks including over 10 million face images with weak labels (i.e., whether two faces belong to the same person or not) obtained automatically. Second, we propose an oine method based on nearest-neighbor search to identify challenging cases in the weakly labeled data: that is, faces belonging to the same person which are \far apart" (hard- positives) and faces of dierent people that are \close together" (hard-negatives) in the embedding space. Then, in order to improve the overall face recognition in the movie domain using weakly labeled data, we explore triplet loss (Schro et al., 2015; S. Zhang et al., 2016) and multiview correlation (Somandepalli, Kumar, Travadi, & Narayanan, 2019b) based methods to adapt general- purpose face embeddings to the movies domain. 
Finally, we developed the SAIL-Movie Character Benchmark (SAIL-MCB), a resource to evaluate and compare the performance of face clustering methodologies for movies. We considered movie/TV videos evaluated in other studies (Everingham, Sivic, & Zisserman, 2006; S. Zhang et al., 2016), and two more recently produced movies we studied in our recent work (Somandepalli & Narayanan, 2019). In order to ensure that our benchmark dataset is representative and inclusive of the actors in movies, we included two other movies with a more racially diverse cast of characters. For all six videos, we annotated nearly 10,000 tracks with over 0.5 million faces for the character labels of both primary and secondary actors. We also annotated face quality labels (see Figure 6.1.1) to understand the performance of face recognition frameworks in the presence of visual distractors. Our experimental results with face verification and clustering, and the subsequent error analysis, demonstrate the benefit of using self-supervised adaptation techniques to improve character labeling in videos.

The rest of the chapter is organized as follows: In section 6.2, we discuss the relevant literature for automatic visual character labeling in movies. We then present in section 6.3 the benchmark and the large-scale weakly labeled face datasets created and shared as part of this work, followed by a discussion of the feature adaptation methods in section 6.4. In section 6.5, we consider four widely successful general-purpose face embedding frameworks to conduct character labeling experiments after feature adaptation using weakly labeled data harvested from movies.

6.2 Related Work

We review here relevant existing work to contextualize the two challenges of face clustering in movies central to our work: 1. the lack of movie domain-specific data resources for training/evaluation and 2. addressing the wide inherent variability in movie content through the ideas of self-supervision for domain adaptation. We discuss the impact of publicly available large-scale face (image) datasets on developing robust face embeddings, followed by automatic character labeling in videos. We then discuss the promise of self-supervision to create large-scale weakly labeled data and for feature adaptation to bridge image and video domains. Finally, we highlight the need for diverse domain-specific benchmarks for a robust evaluation of movie character labeling.

6.2.1 Deep face embeddings and data resources

One of the earliest large-scale open source datasets, called Labeled Faces in the Wild (LFW), was developed in 2008 (G. B. Huang, Mattar, Berg, & Learned-Miller, 2008). LFW consists of over 13,000 faces from 5,749 people detected using the Viola-Jones face detector (Viola & Jones, 2001). With the advent of deep learning methods for more robust face detection at scale, more datasets have been created by collecting face images from the Internet. For example, the CASIA WebFace dataset (Yi, Lei, Liao, & Li, 2014) contains nearly half a million images from over 10,000 people. The CelebFaces Attributes dataset (CelebA) is another large-scale resource with images from more than 20,000 celebrities and over 200,000 images. More recently, the VGGFace2 dataset (Cao et al., 2018) was released with nearly 3 million faces from over 9,000 celebrities. Another notable effort in this space was led by Microsoft in creating MSCeleb-1M (Guo, Zhang, Hu, He, & Gao, 2016) with 10 million face images from nearly 100,000 individuals.
While this dataset is no longer publicly available, such efforts highlight the feasibility of curating large-scale face recognition resources (ethical and privacy concerns notwithstanding; see the discussion in (Van Noorden, 2020)). Unlike the image domain, there have been very few fully labeled face video datasets. One notable resource in the video domain is the YouTube Faces (YTFaces, (Wolf, Hassner, & Maoz, 2011)) dataset with about 3,500 face tracks from over 1,500 different people sampled from interview recordings on YouTube.

The availability of such resources has led to the development of several supervised deep representations (embeddings) of faces (e.g., (Deng, Guo, Xue, & Zafeiriou, 2019; Liu et al., 2017; Parkhi, Vedaldi, & Zisserman, 2015; Schroff et al., 2015; Y. Shi & Jain, 2019)) that have proven to be powerful and discriminative of a person's identity. Deep face embeddings are typically trained with static images mined from web search or photo albums. Two major challenges remain in applying these embeddings directly for character labeling of movie/TV content in the video domain. First, the unit of analysis in videos is a face track. The face embeddings are typically aggregated across all faces in the track using the mean operation (Sharma, Sarfraz, & Stiefelhagen, 2017). This may not be robust to dynamic changes of a person's face within a track as shown in (Sharma et al., 2017), resulting in unreliable track-level embeddings. Second, unlike web images, movies/TV show a person's face in a wide variety of situations (backgrounds and visual distractors), particularly in long-form content such as movies (Sharma et al., 2019), resulting in domain mismatch. Thus, image-domain embeddings do not often yield robust video face representations (S. Zhang et al., 2016). One approach to address this domain mismatch is to train or adapt face image embeddings using labeled data from the domain-of-interest: in our case, perhaps using the YTFaces dataset. However, in the context of training deep learning models, such datasets are relatively small in size (3,500 face tracks) and the source videos have fewer visual distractors (e.g., video interviews of celebrities with mostly frontal facing appearance). Although additional domain-specific resources may be collected by manually assigning character labels to video face tracks, this process is generally time-consuming and expensive (Cao et al., 2018).

6.2.2 Character labeling in videos

Fully automatic identification of characters in video using face information has been studied for over a decade. Some earlier works (Everingham et al., 2006; Ramanathan, Joulin, Liang, & Fei-Fei, 2014; Sivic, Everingham, & Zisserman, 2009) have explored aligning speaker names available in movie screenplays with subtitles to obtain character labels at a given timestamp. Character identification was then framed as a matching problem of the faces detected in a movie frame with the names extracted from screenplays. While screenplay-based methods are effective in identifying at least the speaking/named characters in a video, they fail to scale up or generalize due to the limited access to final production screenplays as well as inaccurate time-alignment with the subtitles (Cour, Sapp, Nagle, & Taskar, 2010; Haurilet, Tapaswi, Al-Halah, & Stiefelhagen, 2016).
Audio-visual character labeling methods, particularly for movies, have been somewhat less successful (e.g., (El Khoury, Sénac, & Joly, 2014; Kapsouras et al., 2017)), primarily because such efforts have mostly focused on multimodal active speaker labeling (Vallet, Essid, & Carrive, 2012) and fail to detect all characters appearing on screen. Using external metadata for supervised matching of characters was also explored. A prominent example is using IMDb images for TV series labeling (Aljundi, Chakravarty, & Tuytelaars, 2016). While effective for some TV series, these methods fail to generalize for movies. The appearance of an actor's face on sources such as IMDb or Wikipedia can be different from that of the characters they play on screen (for example, the effect of makeup (Kose, Apvrille, & Dugelay, 2015) or age (Ghaleb et al., 2015)). Figure 6.1.1 illustrates an example of the mismatch in one of the movies in our dataset. The differences between an actor's appearance in the IMDb-curated image vs. the actual appearance in the movie are very noticeable. In this work, instead of trying to resolve the mapping between characters and actors, we focus on accurately identifying all faces belonging to a character using unsupervised methods such as clustering in the embedding space. If necessary, the exemplars of the resulting clusters can be mapped to actors in casting lists from IMDb with minimal manual effort.

Thus, in order to robustly represent faces in videos, we explore the paradigm of self-supervision to adapt face image embeddings to the movie domain. Current research based on self-supervision in this domain can be broadly categorized along two directions: (1) mining weakly labeled data from video content and (2) adapting face image embeddings for the domain-of-interest.

Self-supervision to collect weakly labeled data. Local tracking of the face detections in a video acts as high precision clustering to identify faces within a shot that must belong to the same person and faces that cannot belong to the same person (when multiple faces co-occur in a frame), generating must-link and cannot-link face constraints respectively (B. Wu, Zhang, Hu, & Ji, 2013). Thus, without additional signals such as subtitles, character metadata, or speaker labels, we can easily mine weakly labeled faces agnostic to the character ID (Somandepalli & Narayanan, 2019). In the same spirit, here we curate over 169,000 face tracks with weak labels from 240 movies to generate domain-matched data for downstream feature adaptation. Recently, two methods, track-supervised Siamese network (TSiam) and self-supervised Siamese network (SSiam), were proposed in (Sharma et al., 2019) which, besides face tracking, relied on the Euclidean distances in the embedding space to generate similar and dissimilar face tracks. While we were motivated by the success of face-tracking to easily mine a large number of weakly labeled tracks from movies, we used distance based methods to further identify difficult samples for downstream feature adaptation tasks. To this end, we propose a nearest neighbor-based approach to further segment the curated tracks into smaller "tracklets". Hard-positives were formed between tracklets with maximal distance and hard-negatives between cannot-link tracklets with minimal distance in the embedding space. This method is entirely offline and avoids the need for complex online hard-example mining, common in triplet-loss based systems (Schroff et al., 2015).
Several novel clustering methods have been inspired by the must-link and cannot-link constraints obtained via self-supervision. The similarity matrix of face image embeddings was modified to satisfy the pairwise constraints in videos to improve clustering in a specific domain (Vretos, Solachidis, & Pitas, 2011). Hidden Markov random field models (HMRF (B. Wu, Lyu, Hu, & Ji, 2013)) were used to cluster faces by iteratively associating face tracks with the available constraints as the initial starting point. A key insight with respect to pairwise constraints is that, if all must-link and cannot-link face pairs in a video are known, the constraint matrix is low-rank. This property has been used to improve face clustering by learning low-rank subspace representations (Y.-X. Wang, Xu, & Leng, 2013; Xiao, Tan, & Xu, 2014). The low-rank structure was also effectively used to jointly detect and cluster faces in a video (Jin, Su, Stauffer, & Learned-Miller, 2017). In our previous work, we showed that jointly imputing unknown pairwise constraints using matrix completion and learning subspace representations (Somandepalli & Narayanan, 2019) improves overall face clustering in movies.
In a recent work, we developed a neural-based approach called multiview correlation (MvCorr (Somandepalli, Kumar, Jati, et al., 2019)) to generalize CCA to more than two views, where all we know is that a set of observations come from the same source. In the domain of speaker recognition, we showed that MvCorr offers state-of-the-art performance for speaker clustering (Somandepalli, Kumar, Jati, et al., 2019), capturing information regarding the person's identity irrespective of the spoken utterance. Analogous to speaker recognition, the hard-positive tracklets we extract from face tracks readily include faces of the same person in different views with respect to pose, illumination and occlusion. Thus, for our feature adaptation experiments with weakly labeled data, we explore both the ImpTriplet and MvCorr frameworks.

6.2.4 Benchmark datasets for movie face clustering

The overarching goal of face clustering in movies is to identify the characters wherever and whenever they appear throughout the video. In the domain of movie video analysis in particular, there are very few open-source datasets available to date to benchmark related tasks. This is in part because movie videos are longer in duration compared to trailers and other video clips (e.g., YTFaces (Wolf et al., 2011)). Labeling characters throughout the content requires expensive manual effort and can be time intensive. Benchmark datasets have been mostly created from episodes of TV series: Buffy the Vampire Slayer (Everingham et al., 2006; Z. Zhang, Luo, Loy, & Tang, 2016), Big Bang Theory (Tapaswi, Bäuml, & Stiefelhagen, 2012) and Sherlock Holmes (Nagrani & Zisserman, 2018). However, TV episodes are generally shorter compared to movies and do not have as much variability with respect to the appearance of characters and the backgrounds or situations in which they appear. In the movie domain, a few widely used examples include the movies Casablanca and American Beauty compiled in (Bojanowski et al., 2013), Notting Hill (Xiao et al., 2014) and, more recently, ACCIO (Ghaleb et al., 2015), which is a dataset of the Harry Potter movies collected with a focus on learning age-invariant face representations. In our recent work, we released labels for two other movies, adding to these resources (Somandepalli & Narayanan, 2019). These movies and TV episodes mostly include white actors in prominent roles and are not entirely representative of the growing and desired trend of diverse casting in Hollywood (see reports (2020 Film: Historic Gender Parity in Family Films, n.d.; 2020 Hollywood Diversity Report: A different story behind the scenes, n.d.)). In this work, we address this limitation by developing a benchmark dataset that includes two movies with a more racially diverse casting. We hope that these resources can enable a robust evaluation of automatic character labeling methods in the movie domain.
Table 6.1: Details of the Movie Character Benchmark (SAIL-MCB) dataset. The number of characters was chosen to label at least 99% of the detected faces. The range of the number of tracks-per-character shows that we label both prominent and minor characters.

Movie (year) | No. faces | No. tracks | No. faces-per-track (mean ± std) | No. characters | No. tracks-per-character (min, max) | % frontal face tracks
ALN  About Last Night (2014) | 70,210 | 1,880 | 38.2 ± 34.1 | 10 | (16, 491) | 41.1
BFF  Buffy the Vampire Slayer (2000) | 47,870 | 634 | 66.3 ± 69.4 | 12 | (17, 112) | 16.3
DD2  Dumb and Dumber To (2014) | 152,908 | 2,361 | 70.8 ± 66.3 | 10 | (12, 557) | 24.9
HF   Hidden Figures (2016) | 101,438 | 1,902 | 59.8 ± 61.1 | 24 | (8, 283) | 47.3
MT   Maleficent (2014) | 53,450 | 875 | 64.1 ± 57.9 | 10 | (14, 254) | 28.2
NH   Notting Hill (1999) | 154,625 | 2,121 | 79.2 ± 84.5 | 12 | (18, 585) | 21.9

6.3 Data Resources

In this section we describe the two data resources we have developed: (1) the SAIL Movie Character Benchmark (SAIL-MCB) and (2) SAIL-MultiFace: weakly labeled face tracks for training or adapting general purpose face embeddings. All movies were purchased in house.

6.3.1 SAIL-MCB: SAIL Movie Character Benchmark Dataset

We started with two widely used benchmark videos: a movie, Notting Hill (NH), and an episode from season 5 of the TV series Buffy the Vampire Slayer (BFF). NH and BFF have been used to evaluate video face clustering with both online (e.g., (Kulshreshtha & Guha, 2020)) and offline (e.g., (Bian, Mei, & Zhang, 2018; Haq, Muhammad, Ullah, & Baik, 2019; C. Zhou, Zhang, Li, Shi, & Cao, 2014)) methods. Although labels are publicly available for these videos, we repeated the annotation process to improve the overall coverage of the character labels: our system detected significantly more faces than those in recent reports. For example, Sharma et al. (Sharma et al., 2019) reported 39,263 faces detected for BFF; compare this with 8,000 more faces (see Table 6.1, row 2) in SAIL-MCB. We also included data from two movies which we made publicly available in our recent work (Somandepalli & Narayanan, 2019): Dumb and Dumber To (DD2) and Maleficent (MT). These two movies were chosen to include content produced more recently compared to NH. In addition, we included About Last Night (ALN) and Hidden Figures (HF) to include a more racially diverse cast of actors.

For all six videos, we first performed face detection and local tracking using Google's API obtained with an academic license. The face tracking used here was developed using simple heuristics based on analyzing the intersection-over-union of the successive bounding boxes of the detected faces (Bugeau & Pérez, 2008). A summary of the number of faces detected and face tracks, as well as the density of faces per track, is shown in Table 6.1. We used a face track as the unit of annotation for the movie videos. The process included two tasks: character labeling and face quality labeling.

Character labeling: We used the names of the actors listed on the IMDb casting page corresponding to each movie/TV series as character labels. These labels were manually assigned to each face track by two human annotators independently, and ties were resolved by a third person, thereby ensuring high annotation agreement. The number of characters labeled was not fixed across the movies and was chosen to cover at least 99% of all faces (not face tracks) detected in each video. This was another reason why we re-annotated NH and BFF in house. Recent studies only labeled 5 characters in NH (Sharma et al., 2017) and 6 characters in BFF (Sharma et al., 2017, 2019) (compare this with 12 characters for both videos in SAIL-MCB). As summarized in Table 6.1, most movies only needed 10 or 12 character labels to cover 99% of the faces detected. Hidden Figures (HF) had the largest number of characters at 24.
The number of face tracks per character varied widely within a movie. For example, in HF, the least frequent character has only eight face tracks while the most frequent character has 283 tracks (see Table 6.1). Thus, we also pick up on some characters in a minor role besides the commonly appearing characters (lead/co-leads).

Face quality labeling: In order to facilitate a detailed performance and error analysis of face clustering methods on SAIL-MCB, we also obtained face quality labels for different visual distractors while labeling face tracks with their character IDs. These qualitative labels were annotated along six dimensions at the track level: (1) whether all faces in the track are facing the viewer (frontal; F); (2) at least one face in the track is shown facing sideways (profile; P); (3) at least one face in the track is blurry (B); (4) at least one face in the track is shown wearing glasses (G); (5) at least one face in the track is only partially visible or occluded (O); and (6) at least one face in the track is poorly lit (L). These questions were presented sequentially to the annotators and they were instructed to tag with more than one label where applicable. A few exemplars of these labels are shown in Figure 6.1.1. The distribution of the quality labels across all videos in SAIL-MCB is shown in Fig. 6.3.1. In this figure, notice that the total does not sum to 100 as a single face track can have multiple distractor labels. Over 50% of the face tracks had at least one profile face. Additionally, we found that on average, only 30% of all face tracks were completely frontal facing in our dataset (see Table 6.1). These annotations along with the track-level character labels have been made publicly available.

Figure 6.3.1: Distribution of face quality labels in SAIL-MCB at the track-level (y-axis: percentage of total face tracks). Over 50% of the face tracks were labeled as "Profile", which means that at least one face in the track was shown posing sideways.

6.3.2 SAIL-MultiFace: Harvesting Weakly Labeled Face Tracks

Our methodology for gathering weakly labeled data from movies includes two steps: (1) face tracking and preprocessing to generate must-link and cannot-link face pairs and (2) hard-example mining in the embedding space to identify "difficult" tracklets. We call this dataset MultiFace as it includes faces corresponding to multiple views of the same/different person in multiple settings. The spatial and temporal co-occurrence patterns of people in a movie scene can be used to generate associations of must-link and cannot-link constraints: i.e., faces in a track must belong to the same person and multiple faces in a frame cannot belong to the same person, respectively.

Table 6.2: Count statistics of the tracks mined at each stage of the harvesting process. Sample size of movies = 240.

Statistic | Total No. faces | No. tracks | No. tracks/movie (mean ± std)
Track length ≥ 1s | 23.2M | 335,845 | 1389.9 ± 760.3
Co-occurring tracks | 10.2M | 169,201 | 726.2 ± 704.4

Figure 6.3.2: Distribution of the genres for the 240 movies used for harvesting weakly labeled face tracks in SAIL-MultiFace. Movies may have multiple genres associated with them as listed on IMDb.
To gather such data in movie videos, we considered a set of 240 movies (24 fps video frame rate, 1280x720 resolution) released between 2014-2018 that were purchased in house. These movies span a wide range of genres as shown in Figure 6.3.2, providing data from different movie styles. The movie titles and related details are provided in Appendix D. We performed the following preprocessing steps to mine must-link faces and cannot-link tracks from each movie (a minimal code sketch of this filtering step is given at the end of this subsection):

1. Face detection and local tracking (as explained in the previous section).
2. Filter out face tracks which have a duration of less than 1s, i.e., a minimum of 24 faces in a track. This limits the search space of tracks for the next step as well as provides opportunities to extract instances of the same person appearing in different conditions such as pose, expression and lighting.
3. Only retain tracks which co-occur with other tracks. We considered two face tracks to be co-occurring if they shared at least six frames (0.25s) in common. This threshold was chosen heuristically to minimize propagating errors from tracking.

The count statistics of the total number of tracks retained at each step of the process are shown in Table 6.2 for the 240 movies used for this task. Of the tracks identified in step (2), on average, 45.1 ± 21.2% of the face tracks had at least one co-occurring track. This statistic is consistent with the range of 35-70% reported by a previous work (Sharma et al., 2019) that also mined co-occurring face tracks in movies. It is important to note that the final number of 169,000 tracks after preprocessing only represents the must-link instances that provide positive samples (faces of the same person). In order to extract negative samples (faces from different persons), we need to look at all cannot-link pairs associated with them. While every face track overlapped with at least one other track, the number of overlapping tracks varied between 1-94 with an average of 5.2 ± 9.1 overlaps per track, thus resulting in a large set of negatives to choose from. In such cases, past face clustering studies have emphasized the need for hard-example mining to not only reduce the search space but also improve the robustness of face representations to visual distractors (e.g., (Y. Shi & Jain, 2019)). The goal here is to identify samples belonging to the same person that appear very different (hard-positives) and samples belonging to different persons that look similar to each other (hard-negatives). In the next section, we discuss the development of a hard-example mining approach for our use case.
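The following is the minimal sketch referenced above of the track filtering and co-occurrence test. It is purely illustrative: it assumes that a face track is available as a set of frame indices (the actual pipeline operates on the tracker output from the face detection API), and only the thresholds (24 faces per track, 6 shared frames) are taken from the text.

```python
# Illustrative sketch of steps (2)-(3) above; the data structures are assumptions.
MIN_TRACK_LEN = 24     # ~1 s of video at 24 fps (step 2)
MIN_SHARED_FRAMES = 6  # ~0.25 s of overlap (step 3)

def co_occurring(frames_a, frames_b, min_shared=MIN_SHARED_FRAMES):
    """Two tracks co-occur if they share at least `min_shared` frames."""
    return len(frames_a & frames_b) >= min_shared

def harvest_weak_labels(tracks):
    """tracks: list of sets of frame indices, one set per face track.
    Returns the indices of retained (must-link) tracks and the cannot-link pairs."""
    # Step 2: keep tracks spanning at least ~1 s of video.
    long_ids = [i for i, t in enumerate(tracks) if len(t) >= MIN_TRACK_LEN]

    # Step 3: keep only tracks that co-occur with at least one other track;
    # co-occurring pairs yield cannot-link constraints (different people).
    cannot_link, keep = [], set()
    for a in range(len(long_ids)):
        for b in range(a + 1, len(long_ids)):
            i, j = long_ids[a], long_ids[b]
            if co_occurring(tracks[i], tracks[j]):
                cannot_link.append((i, j))
                keep.update((i, j))

    must_link_tracks = sorted(keep)  # faces within each retained track are must-link
    return must_link_tracks, cannot_link
```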
6.3.3 Mining Hard-positive and Hard-negative Tracklets

Hard-example mining has been studied extensively in the computer vision literature for tasks such as object recognition (e.g., (Shrivastava, Gupta, & Girshick, 2016)). In the context of face clustering, mining hard-positive and hard-negative faces in an embedding space can be easily achieved by identifying the pairs of face samples that yield maximum and minimum distance respectively (Sharma et al., 2019). However, face clustering is performed at the track-level, where we typically average the embeddings of all the faces in a track to provide a robust representation. Thus, we propose a nearest-neighbor based approach to identify the hard-positive and hard-negative tracklets. The pseudocode for the proposed method is detailed in Algorithm 6.3.4.

Figure 6.3.4: Mining hard-positive and hard-negative tracklets using nearest neighbor search (pseudocode).

Let us assume that we have a semantic embedding space for the face tracks (e.g., VGGFace2) denoted by $H = [h_1, \ldots, h_N] \in R^{d \times N}$ with d-dimensional embeddings for the N faces in the track. As described in the function HardPositiveMining(), we segment each track into multiple tracklets to form a hard-positive set. First, we find the pair of embeddings in H that are maximally distant from each other and assign one of them to be the anchor face h_a (line 3). Then the k-nearest neighbors of h_a are accrued and averaged to form the corresponding anchor tracklet v_a (NNTracklet, line 4). After fixing the anchor h_a, we mine the hard-positive set $V_p : \{h_p^{(i)} \mid i \in [1, M-1]\}$ iteratively (lines 5-9) as follows: (1) find h_p farthest from h_a among the columns remaining in H after removing the nearest neighbors of h_a (line 7); (2) average the k-nearest neighbors of h_p to get its tracklet v_p (line 8); (3) remove these nearest neighbors from the matrix H. Repeat these steps M-1 times to obtain the set of hard-positive tracklets $\{v_p^{(i)} \mid i \in [1, M-1]\}$ for anchor v_a. The variable M is controlled by choosing an appropriate value for the parameter k in the nearest neighbor search (see Parameters in Algorithm 6.3.4). For example, if M = 5 for a track with N faces, we set $k = \lfloor N/5 \rfloor$. This choice of k ensures that all tracklets are of the same length and that most faces in a track are used up. We used the KDTree algorithm implementation in scikit-learn (Pedregosa et al., 2011) for nearest neighbor search. An example of the resulting hard-positive tracklets is shown in Figure 6.3.3. Notice how they differ from each other with respect to face orientation and eyes closed/open. An interesting consequence of using k-nearest neighbors in our hard-positive mining approach, particularly without time-contiguous constraints, is that we can form tracklets with faces that are similar in the embedding space but may not be sequential in a face track, as highlighted in Figure 6.3.3.

Figure 6.3.3: Example of hard-positive tracklets resulting from our proposed method with the nearest neighbor parameter k = 3, 5. Each color indicates one tracklet. Notice that they differ from each other with respect to face orientation (e.g., Row 1: Tracklet 1 vs. Tracklet 2) or with eyes open/closed (e.g., Row 2: Tracklet 2 vs. Tracklet 3). A single hard-positive tracklet can be formed from faces that may not appear in a sequential order, allowing us to mine harder positives (see Tracklet 1). The pair of cannot-link face tracks shown here are from the movie Hidden Figures (2016) at time 00:11:05.

Our proposed hard-negative mining is detailed in the function HardNegativeMining. It takes two inputs: the parent track H of anchor v_a and the set of all C ≥ 1 cannot-link tracks for H (see step 3 in Sec. 6.3.2), denoted by $\mathcal{C} : \{H^{(i)} \mid i \in [1, C]\}$. Each track in $\mathcal{C}$ is segmented into M tracklets (lines 15-16) to obtain a total of MC cannot-link tracklets $V_q$ (line 17). A hard-negative pair is then identified as the tracklet in $V_q$ that has the shortest distance to the anchor v_a (line 18). In Fig. 6.3.3, Row 1: Tracklet 3 was identified as the anchor and Row 2: Tracklet 2 as its hard-negative.
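A simplified Python sketch of this mining procedure is given below. It follows the spirit of Algorithm 6.3.4 but is not the released implementation: it uses plain Euclidean distances, and for hard-negative mining it segments the cannot-link tracks into contiguous chunks rather than repeating the nearest-neighbor segmentation; function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import KDTree

def nn_tracklet(H, idx, k):
    """Average the k nearest neighbors of face H[idx] to form one tracklet embedding."""
    nbr = KDTree(H).query(H[idx][None, :], k=k, return_distance=False)[0]
    return H[nbr].mean(axis=0), nbr

def hard_positive_mining(H, M=5):
    """H: (N, d) array of face embeddings of ONE track.
    Returns the anchor tracklet v_a and up to M-1 hard-positive tracklets."""
    N = H.shape[0]
    k = max(1, N // M)
    # Anchor face: one end of the most distant pair of faces in the track.
    D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    a = int(np.unravel_index(D.argmax(), D.shape)[0])
    v_a, used = nn_tracklet(H, a, k)

    remaining = np.setdiff1d(np.arange(N), used)
    positives = []
    for _ in range(M - 1):
        if len(remaining) < k:
            break
        # Farthest remaining face from the anchor face.
        p = remaining[np.argmax(np.linalg.norm(H[remaining] - H[a], axis=1))]
        nbr = KDTree(H[remaining]).query(H[p][None, :], k=k, return_distance=False)[0]
        positives.append(H[remaining][nbr].mean(axis=0))
        remaining = np.delete(remaining, nbr)  # remove used faces before the next pass
    return v_a, positives

def hard_negative_mining(v_a, cannot_link_tracks, M=5):
    """Return the cannot-link tracklet closest to the anchor tracklet v_a."""
    candidates = []
    for H_neg in cannot_link_tracks:           # each is an (N_i, d) array
        k = max(1, H_neg.shape[0] // M)
        for start in range(0, H_neg.shape[0] - k + 1, k):
            candidates.append(H_neg[start:start + k].mean(axis=0))
    dists = [np.linalg.norm(v_a - c) for c in candidates]
    return candidates[int(np.argmin(dists))]
```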
Notice that both tracklets show the actors with heads tilted similarly and eyes closed in a similar (dark) background. Throughout this algorithm, we use the normalized Euclidean distance metric to estimate distances in the embedding space. Notice also that both hard-positives and hard-negatives are mined with respect to an anchor, which can be directly used for feature adaptation with triplet-loss based frameworks. For all subsequent feature adaptation methods, we use the hard-positives $\{v_p^{(1)}, \ldots, v_p^{(M-1)}\}$ and the hard-negative $v_q$ corresponding to the anchor tracklet $v_a$.

6.4 Self-supervised Feature Adaptation

As discussed in Section 6.2, we explore improved triplet loss (ImpTriplet, (S. Zhang et al., 2016)) and multiview correlation (MvCorr, (Somandepalli, Kumar, Jati, et al., 2019)). In this section, we review these two methods in our context of adapting general-purpose face embeddings for video face tracks using the hard-positive and hard-negative tracklets.

6.4.1 ImpTriplet: Improved Triplet Loss

ImpTriplet (S. Zhang et al., 2016) is an advanced version of the popular triplet loss formulation. For one triplet, the original triplet loss function is defined as:

$L_o = \tfrac{1}{2}\max\big(0,\, D(v_a, v_p^{(1)}) - D(v_a, v_q) + \alpha\big)$    (6.4.1)

where $\{v_a, v_p^{(1)}\}$ and $\{v_a, v_q\}$ are the hard-positive and hard-negative tracklet pairs respectively, and $\alpha$ is the distance margin (typically, $\alpha = 1$). Minimizing this loss would cause the triplet embedding to push the negative tracklet $v_q$ away from the anchor $v_a$. However, there are two key limitations in this formulation per (S. Zhang et al., 2016): (1) $v_q$ is pushed away only from $v_a$ and not both $v_a$ and $v_p^{(1)}$, and (2) the distance margin of the positive pair $v_a$ and $v_p^{(1)}$ is not specified. ImpTriplet addresses these two issues by introducing inter-class constraints and intra-class constraints respectively, as follows:

$\psi(v_a, v_p^{(1)}, v_q) = \alpha + D(v_a, v_p^{(1)}) - \tfrac{1}{2}\big[D(v_a, v_q) + D(v_p^{(1)}, v_q)\big]$    (6.4.2)

$\phi(v_a, v_p^{(1)}) = D(v_a, v_p^{(1)}) - \hat{\alpha}$    (6.4.3)

Here, $\hat{\alpha}$ ensures that the anchor and positive pair lie within a margin ($\hat{\alpha} = 0.1$). Finally, the ImpTriplet loss is given as:

$L_s = \max\big(0,\, \psi(v_a, v_p^{(1)}, v_q)\big) + \lambda \max\big(0,\, \phi(v_a, v_p^{(1)})\big)$    (6.4.4)

where the parameter $\lambda$ balances the contribution of the intra-class constraints in the modified triplet formulation ($\lambda = 0.02$). The values of $\hat{\alpha}$ and $\lambda$ were identified in (S. Zhang et al., 2016).
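To make the formulation concrete, here is a small NumPy sketch of Eqs. (6.4.1)-(6.4.4). It is not the training code used in this chapter (that follows the public ImpTriplet implementation cited in Sec. 6.5.2); the distance function and the default hyperparameter values are simply the ones stated in the text.

```python
import numpy as np

def dist(x, y):
    # Normalized Euclidean distance between two tracklet embeddings
    # (approximated here as Euclidean distance after l2-normalization).
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
    return np.linalg.norm(x - y)

def triplet_loss(v_a, v_p, v_q, alpha=1.0):
    """Original triplet loss, Eq. (6.4.1)."""
    return 0.5 * max(0.0, dist(v_a, v_p) - dist(v_a, v_q) + alpha)

def imp_triplet_loss(v_a, v_p, v_q, alpha=1.0, alpha_hat=0.1, lam=0.02):
    """Improved triplet loss, Eqs. (6.4.2)-(6.4.4)."""
    # Inter-class constraint: push v_q away from both v_a and v_p.
    psi = alpha + dist(v_a, v_p) - 0.5 * (dist(v_a, v_q) + dist(v_p, v_q))
    # Intra-class constraint: keep the positive pair within the margin alpha_hat.
    phi = dist(v_a, v_p) - alpha_hat
    return max(0.0, psi) + lam * max(0.0, phi)
```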
6.4.2 MvCorr: Multiview Correlation

MvCorr can successfully incorporate information from more than two views without the need for negative samples or additional labels to learn discriminative embeddings (Somandepalli, Kumar, Travadi, & Narayanan, 2019b). For face clustering, the hard-positive tracklets containing different visual distractors can be treated as multiple views of a person's facial identity. We first discuss the loss formulation and then its application within a neural network framework.

Let T be the total number of available face tracks. Each track is segmented into M tracklets (as described in Section 6.3.3). We can collect the d-dimensional embeddings of each tracklet as columns to form a set of M hard-positive matrices $\{V_a, V_p^{(1)}, \ldots, V_p^{(M-1)}\}$ with $V \in R^{d \times T}$. The multiview correlation matrix $\Lambda$ is the normalized ratio of the sum of between-view covariances $R_b$ and the sum of within-view covariances $R_w$ for M views, as follows:

$\Lambda = \frac{1}{M-1}\,\frac{R_b}{R_w} = \frac{\sum_{l=1}^{M}\sum_{k=1,\,k\neq l}^{M} \bar{V}_l (\bar{V}_k)^{\top}}{(M-1)\sum_{l=1}^{M} \bar{V}_l (\bar{V}_l)^{\top}}$    (6.4.5)

where $\bar{V} = V - E(V)$ are mean-centered data matrices. The common scaling factor $(T-1)^{-1}$ in the ratio is omitted. Our objective is to estimate a shared subspace $W \in R^{d \times d}$ such that the multiview correlation is maximized. Thus, the loss function can be written as:

$\rho(M) = \max_{W}\; \frac{1}{d(M-1)}\,\frac{\mathrm{Tr}(W^{\top} R_b W)}{\mathrm{Tr}(W^{\top} R_w W)}$    (6.4.6)

where Tr(·) denotes the trace of a matrix. The subspace W can be estimated by solving the generalized eigenvalue problem to simultaneously diagonalize $R_b$ and $R_w$. Hence, the MvCorr objective is the average of the ratio of eigenvalues of the between-view and within-view covariances. In other words, if $R_w$ is invertible, then we wish to find a transformation matrix W that maximizes the spectral norm of $R_w^{-1} R_b$, thus maximizing the between-view variability while minimizing the within-view variability, and capturing the shared information across the views. This ratio-of-variances formulation is similar to that in linear discriminant analysis (LDA), but without the need for class labels.

We use neural networks to optimize MvCorr for large and complex datasets, similar to our past work (Somandepalli, Kumar, Jati, et al., 2019). Here, data from the M hard-positive matrices is input to M corresponding "sub-networks" with identical architecture but no shared weights across them. The embedding capturing the shared information across the views is the output of the last layer of the individual networks. The MvCorr model is trained using mini-batch SGD to minimize the loss $1/\rho(M)$. During inference, we only need to extract embeddings from one of the sub-networks, as the last-layer activations are maximally correlated across all the sub-networks by virtue of the loss, as demonstrated in (Somandepalli, Kumar, Travadi, & Narayanan, 2019b).
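As a sanity check of the objective, the between-view and within-view covariances of Eq. (6.4.5) and the eigenvalue-ratio form of Eq. (6.4.6) can be computed for a small batch of views as sketched below. This is only an illustration of the linear objective; the neural-network optimization used in this chapter is in the released gen-dmcca code, and the small regularization term added here for numerical stability is an assumption.

```python
import numpy as np
from scipy.linalg import eigh

def mvcorr_objective(views):
    """views: list of M arrays, each of shape (d, T); column t of every view is a
    tracklet embedding of the same identity observed under different conditions."""
    M, d = len(views), views[0].shape[0]
    centered = [V - V.mean(axis=1, keepdims=True) for V in views]

    R_w = sum(V @ V.T for V in centered)   # sum of within-view covariances
    S = sum(centered)                      # sum of the mean-centered views
    R_b = S @ S.T - R_w                    # sum of between-view covariances

    # Generalized eigenvalue problem R_b w = rho * R_w w; the small ridge keeps
    # R_w positive definite (an assumption, not part of Eq. 6.4.5).
    eigvals = eigh(R_b, R_w + 1e-6 * np.eye(d), eigvals_only=True)
    return eigvals.mean() / (M - 1)        # Eq. (6.4.6); training minimizes 1/rho
```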
6.5 Experiments and Results

One of our goals in this study is to empirically examine whether feature adaptation using weakly labeled data can lead to robust face clustering in the movie domain. For feature adaptation, we first need to choose an embedding space. To this end, we compare and evaluate four of the best open-source frameworks proposed in recent years (Cao et al., 2018; Liu et al., 2017; Schroff et al., 2015; Y. Shi & Jain, 2019). Specifically, we set up face verification experiments to evaluate which model performs the best for the video domain, followed by feature adaptation experiments on the harvested track data SAIL-MultiFace (see Sec. 6.3.2). Finally, using the SAIL-MCB dataset, we benchmark face clustering performance with and without adaptation, along with a detailed error analysis using the associated face quality labels (see Fig. 6.3.1).

6.5.1 Face Verification for video data

We refer to the face embedding frameworks we tested as baseline models since we do not additionally adapt them for the movie domain. We compared FaceNet (2015, (Schroff et al., 2015)), SphereFace (2017, (Liu et al., 2017)), VGGFace2 (2018, (Cao et al., 2018)) and Probabilistic Face Embeddings (PFE, 2019, (Y. Shi & Jain, 2019)). While these frameworks have been extensively benchmarked against image datasets such as LFW (G. B. Huang et al., 2008), a performance comparison in the video domain is lacking. Thus, we set up face verification experiments using two video face benchmark datasets: IJB-B (Whitelam et al., 2017) and YTFaces (Wolf et al., 2011). IJB-B is commonly used for benchmarking face-verification methods in videos. It consists of around 77,000 faces detected from 21,000 still images and 55,000 video frames. YTFaces is widely used for face recognition and verification in video, consisting of 3,425 videos of approximately 1,600 unique identities. While IJB-B, acquired in relatively controlled conditions, helps validate if face embeddings trained on images perform well for video, YTFaces (mined from YouTube) helps evaluate their use for videos-in-the-wild.

Baseline models. FaceNet (Schroff et al., 2015) was trained using triplet loss and the Inception-v1 architecture, for downstream tasks such as face verification and clustering. SphereFace (Liu et al., 2017) is a metric learning method that combines the ideas of cross-entropy and angular margin loss to improve classification performance. Notably, one of the benefits of the angular margin loss used here is its ease of training compared to triplet loss-based methods. Probabilistic face embeddings (PFE (Y. Shi & Jain, 2019)) model different faces of the same person as multivariate Gaussian distributions where the mean captures information about the identity. VGGFace2 is a ResNet-50 network trained on the large-scale dataset, also called VGGFace2 (Cao et al., 2018), with over 9,000 face identities mined from YouTube for the task of face classification. In related work, VGGFace2 showed state-of-the-art face verification and clustering performance for standard datasets such as IJB-B (F.-J. Chang et al., 2017). For all baseline models, we used publicly available code and models pretrained on the VGGFace2 dataset. See Appendix E for preprocessing and implementation details specific to each model.

Verification setup. Most face recognition methods typically perform face alignment based on facial landmarks as a pre-processing step to align faces in different poses. In movies, however, these landmarks can be difficult to detect due to occlusion, pose, etc. (see Fig. 6.1.1). Hence, we evaluate the sensitivity of the different methods to alignment by performing two sets of verification experiments: on raw face images and aligned face images. For details on alignment, please see Appendix E. Consistent with past video face experiments, we use the 1:1 verification setup described in (Cao et al., 2018) for IJB-B. For YTFaces, we create mean track-level embeddings (without alignment) and report results averaged over the 10-fold splits (Wolf et al., 2011; evaluation splits: www.cs.tau.ac.il/wolf/ytfaces). For all methods, we use the cosine distance between l2-normalized embeddings as the similarity metric.

Results. The ROC curves for all verification results in IJB-B with and without face alignment and YTFaces are shown in Appendix E. The results are summarized in Table 6.3.

Table 6.3: Comparison of face verification performance for standard video datasets using the TPR @ 0.1 FPR (%) metric.

Model / Dataset | IJB-B Aligned | IJB-B Unaligned | YTFaces
VGGFace2 (Cao et al., 2018) | 97.0 | 97.7 | 96.6
FaceNet (Schroff et al., 2015) | 98.5 | 98.7 | 95.7
SphereFace (Liu et al., 2017) | 94.4 | 52.2 | 69.2
PFE (Y. Shi & Jain, 2019) | 98.3 | 73.3 | 94.7
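For reference, the reported metric can be computed from pairwise scores as in the sketch below; this is a generic scikit-learn recipe, not the IJB-B/YTFaces evaluation protocol scripts themselves.

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(e1, e2):
    """Cosine similarity between two l2-normalized track embeddings."""
    e1, e2 = e1 / np.linalg.norm(e1), e2 / np.linalg.norm(e2)
    return float(np.dot(e1, e2))

def tpr_at_fpr(labels, scores, target_fpr=0.1):
    """labels: 1 for same-identity pairs, 0 otherwise; scores: similarity scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return float(np.interp(target_fpr, fpr, tpr))  # TPR @ 0.1 FPR when target_fpr=0.1
```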
PFE and SphereFace are heavily reliant on face alignment and perform poorly on unaligned face images (columns 1-2, Table 6.3). Without face alignment, SphereFace performed the poorest for YTFaces among the methods considered. On the other hand, both VGGFace2 and FaceNet performed consistently well on IJB-B and YTFaces. While FaceNet performed slightly better than VGGFace2 on IJB-B, overall, VGGFace2 performed the best considering that face verification is more challenging for videos in-the-wild as in YTFaces. Our findings are also comparable to the results reported elsewhere (e.g., (F.-J. Chang et al., 2017)). Based on these observations, we chose VGGFace2 as the embedding space to perform movie-domain adaptation using the weakly labeled data we harvested from 240 movies (see Sec. 6.3.2).

6.5.2 Self-supervised Feature Adaptation

The network architecture for ImpTriplet consists of three sub-networks: one each for the anchor v_a, the hard-positive v_p^(1) and the hard-negative v_q. Each sub-network is a fully-connected network (FCN) of identical architecture with shared weights across the sub-networks. We used publicly available code (github.com/manutdzou/Strong Person ReID Baseline) to implement the loss as described in Sec. 6.4.1. Similarly, for MvCorr adaptation, we also use three sub-networks but without the need for hard-negatives. We set the number of views to three and used the anchor v_a and two hard-positives, v_p^(1) and v_p^(2), as the three multiview inputs to the network, as described in Sec. 6.4.2. In MvCorr, the weights are not shared across the sub-networks, unlike the ImpTriplet model. We used our publicly released code (github.com/usc-sail/gen-dmcca) to train the MvCorr models. To choose the sub-network architecture, we explored three FCN configurations:

C1: INP[(512)] -> FC[512], DO(0.2) -> FC[256]
C2: INP[(512)] -> FC[1024], DO(0.2) -> FC[512]
C3: INP[(512)] -> FC[1024], DO(0.2) -> FC[512], DO(0.2) -> FC[256]

where INP = input, FC = fully connected layer with ReLU/sigmoid activation followed by batch normalization. The number of nodes in each layer is indicated inside []. A dropout (DO) of 0.2 was added for all intermediate FC layers. Dropout was tuned over the range {0.1, 0.2, 0.4}.

Training and model choice. All adaptation experiments were conducted on the 169,201 tracks in SAIL-MultiFace. For the training set, we used the data from 180 movies, resulting in 126,435 samples each for hard-positives and hard-negatives. The remaining 42,766 samples were used as the development set. Both adaptation networks were trained with a batch size of 1024 using SGD (momentum = 0.9, decay = 1e-6) with a learning rate of 0.001 for ImpTriplet and 0.01 for MvCorr. To determine model convergence, we applied an early stopping criterion (stop training if the loss on the development set at the end of a training epoch did not decrease by 10^-3 for 5 consecutive epochs). Of the C1-C3 configurations tested, we chose C2 for ImpTriplet and C1 for MvCorr as they showed the smallest loss on the development set at convergence. Increasing the model size beyond C3 with additional layers or changing the embedding size from 256 to 128 or 1024 did not appear to further improve the loss at convergence. All configurations showed slightly better performance with ReLU over sigmoid activation. All models were implemented in TensorFlow (TensorFlow 2.1, tensorflow.org) and trained on a GeForce GTX 1080 Ti GPU.

Adaptation results.
First, we examine the distribution of the normalized Euclidean distance for all hard-positive pairs $\|v_a - v_p^{(1)}\|_2$ and hard-negative pairs $\|v_a - v_q\|_2$ in our development set of 60 movies. For hard-positives, we expect a smaller pairwise distance and the distribution to skew left, and for hard-negatives, we expect larger distances and the distribution to skew right. The distributions for VGGFace2 embeddings without any adaptation are shown in Figure 6.5.1a. The distances generally skew right regardless of whether the samples were positives (similar) or negatives (dissimilar). The distribution of hard-positive distances is fairly uniform in the range 0.6-1, showing that face tracklets belonging to the same person can be far apart in the embedding space. It suggests that we indeed incur domain mismatch on direct application of VGGFace2 embeddings to face tracks in movies. This result also highlights the effectiveness of our nearest neighbor-based hard-example mining in identifying "difficult" samples in the embedding space.

Figure 6.5.1: Effect of feature adaptation with weakly labeled data (x-axis: normalized Euclidean distance; y-axis: number of samples). Feature adaptation is expected to bring positive samples closer to each other and pull negative samples further apart in the transformed embedding space. Qualitative comparison of the distributions of hard-positive and hard-negative distances in the SAIL-MultiFace development set shows the benefit of adaptation with (b) ImpTriplet and (c) MvCorr over (a) the original embeddings without adaptation.

Next, we examine the distribution of hard-positive and hard-negative distances for ImpTriplet and MvCorr adaptation. As shown in Figure 6.5.1b-c, both models skew the hard-positive distances further to the left compared to VGGFace2. This suggests that both methods help reduce the distance between the hard-positives as desired. Compared to the original embeddings, ImpTriplet adaptation appears to reduce the distance between dissimilar faces (Figure 6.5.1b), which could prove detrimental to downstream verification/clustering tasks. However, MvCorr skews the distribution of hard-negative distances further to the right than ImpTriplet. Although hard-negatives are not used in MvCorr adaptation, it appears to pull the cannot-link tracks (dissimilar faces) far apart from each other in the transformed embedding space, suggesting improved discriminability. This is consistent with our past multiview representation learning work, where multiview embeddings were able to robustly classify whether two speech segments belonged to the same person or not (Somandepalli, Kumar, Travadi, & Narayanan, 2019b).

Table 6.4: Face verification performance with adaptation, averaged across all videos in the SAIL-MCB benchmark dataset.

Metric / Model | FaceNet | VGGFace2 | +ImpTriplet | +MvCorr
TPR @ 0.1 FPR (%) | 90.2 ± 3.5 | 92.5 ± 1.7 | 91.4 ± 2.0 | 93.7 ± 1.2

Table 6.5: V-measure for hierarchical agglomerative clustering (HAC) and affinity propagation (AP) with the Over-clustering Index (OCI).

Method | ALN | BFF | DD2 | HF | MT | NH | Mean (OCI)
HAC  VGGFace2 | 83.2 | 95.3 | 82.6 | 76.7 | 88.6 | 79.5 | 84.3 (1.0)
HAC  +ImpTriplet | 81.2 | 96.6 | 83.9 | 78.0 | 89.7 | 81.1 | 85.1 (1.0)
HAC  +MvCorr | 86.8 | 97.6 | 85.7 | 82.2 | 92.1 | 84.0 | 88.1 (1.0)
AP   VGGFace2 | 56.3 | 76.9 | 57.1 | 65.9 | 67.5 | 59.5 | 63.9 (5.2)
AP   +ImpTriplet | 57.4 | 77.9 | 58.8 | 67.7 | 68.9 | 60.6 | 65.2 (6.0)
AP   +MvCorr | 60.4 | 84.3 | 60.1 | 70.1 | 69.6 | 62.0 | 67.7 (6.7)

Face verification results. A possible drawback of metric-learning based adaptation is that it may transform the embedding space to optimize only for the distances between hard examples while losing the discriminability of the input embeddings. In other words, it could overfit to the adaptation dataset. To assess overfitting, we repeat the face verification experiments for the SAIL-MCB benchmark dataset using the adapted VGGFace2 embeddings. We generate verification pairs by exhaustively mining all combinations of face tracks using the ground-truth character labels.
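A minimal sketch of this pair generation step (assuming the track-level character labels are available as a simple list) is:

```python
from itertools import combinations

def verification_pairs(track_labels):
    """Enumerate all track pairs; a pair is 'matching' (1) when both tracks
    carry the same ground-truth character label, and 0 otherwise."""
    return [(i, j, int(track_labels[i] == track_labels[j]))
            for i, j in combinations(range(len(track_labels)), 2)]
```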
This resulted in an average of about 993,000 ± 372,000 pairs (26 ± 9% matching pairs) across the six videos in SAIL-MCB. We report TPR at 0.1 FPR averaged across all videos as the performance metric. As shown in Table 6.4, the performance of ImpTriplet is comparable to that of VGGFace2. With MvCorr adaptation, we observed a small but significant (1.2%) improvement in the true positive rate at 0.1 FPR.

6.5.3 Face clustering experiments

To test the applicability of our system for unsupervised automatic character labeling in videos, we compare the unsupervised clustering performance for VGGFace2 embeddings with and without adaptation. For a fair comparison with past works (S. Zhang et al., 2016; Z. Zhang et al., 2016), we use hierarchical agglomerative clustering (HAC (Pedregosa et al., 2011)) and assume the number of desired clusters (unique characters) to be known. However, in practice, the number of unique characters in a movie is often not available. Hence, we also experiment with affinity propagation clustering (AP (Frey & Dueck, 2007)), which does not require the number of clusters before running the algorithm. Similar to the k-medoids algorithm, AP first finds representative exemplars to cluster all the points in the dataset. The exemplar count determines the number of clusters.

We evaluate the performance using several clustering metrics: homogeneity, completeness, V-measure, purity, and accuracy. In Table 6.5, we report the V-measure scores on the benchmark dataset. Performance evaluation with respect to the other metrics is presented in Appendix E. For both HAC and AP, we achieve nearly 3% improvement using ImpTriplet and 4% improvement using MvCorr (see Table 6.5). The V-measure scores for MvCorr were significantly better than VGGFace2 across all videos in our dataset (permutation test, n = 10^5, p = 0.007). In contrast to HAC, the V-measure scores for AP were low, but it yields clusters which are nearly 100% pure, where multiple clusters may belong to the same character. We quantify this by reporting the over-clustering index (Somandepalli et al., 2017), which is the average number of clusters assigned to a character (the mean OCI across all movies is shown in Table 6.5). The mean OCI for the HAC method is 1 because the number of clusters is provided to the clustering algorithm.

Comparison with State-of-the-art. Finally, we compare the HAC performance for clustering characters in BFF and NH to the results reported in existing works. It is important to note that although the number of characters in our dataset is greater, the comparison metric is mean clustering accuracy, which is generally robust to these differences. As shown in Table 6.6, our proposed approach is comparable to the state-of-the-art methods for the two videos.

Table 6.6: Comparison of average clustering accuracy with state-of-the-art methods based on self-supervision.

Method / Dataset | BFF | NH
ULDML (2011) (Cinbis et al., 2011) | 41.6 | 73.2
HMRF (2013) (B. Wu, Zhang, et al., 2013) | 50.3 | 84.4
WBSLRR (2014) (Xiao et al., 2014) | 62.7 | 96.3
Zhang et al. (2016) (Z. Zhang et al., 2016) | 92.1 | 99.0
CP-SSC (2019) (Somandepalli & Narayanan, 2019) | 65.2 | 54.3
TSiam (2019) (Sharma et al., 2019) | 92.5 | -
SSiam (2019) (Sharma et al., 2019) | 90.9 | -
+MvCorr (ours) | 97.7 | 96.3
Face clustering error analysis. As part of the SAIL-MCB benchmark dataset, we also obtained face quality labels along six dimensions, as described in Sec. 6.3.1. We perform error analysis of the clusters obtained using HAC along these dimensions. We use the percentage of total face tracks tagged with a particular quality label that were correctly classified as the metric of analysis. To determine if a face track is correctly classified, we applied the Hungarian algorithm on the HAC output using the ground-truth character labels.
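The clustering setup of this section and the cluster-to-character mapping used for this error analysis can be sketched as follows with scikit-learn and SciPy; the functions and parameter choices are illustrative, not the exact evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, AffinityPropagation
from sklearn.metrics import v_measure_score

def cluster_tracks(X, n_characters=None):
    """HAC when the number of characters is known; affinity propagation otherwise.
    X: (num_tracks, d) array of track-level embeddings."""
    if n_characters is not None:
        model = AgglomerativeClustering(n_clusters=n_characters)
    else:
        model = AffinityPropagation(random_state=0)
    return model.fit_predict(X)

def clustering_accuracy(true_labels, pred_clusters):
    """Map clusters to characters with the Hungarian algorithm and return the
    fraction of correctly classified face tracks."""
    true_ids, pred_ids = np.unique(true_labels), np.unique(pred_clusters)
    counts = np.zeros((len(pred_ids), len(true_ids)))
    for p, pid in enumerate(pred_ids):
        for t, tid in enumerate(true_ids):
            counts[p, t] = np.sum((pred_clusters == pid) & (true_labels == tid))
    row, col = linear_sum_assignment(-counts)  # maximize matched track counts
    return counts[row, col].sum() / len(true_labels)

# Example usage with embeddings X and character labels y (as numpy arrays):
# pred = cluster_tracks(X, n_characters=len(np.unique(y)))
# print(v_measure_score(y, pred), clustering_accuracy(y, pred))
```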
Finally, although SAIL-MCB included more racially diverse movies, face quality labels used in the error analysis only included dimensions related to visual distractors. These labels were used to evaluate the robustness of face clustering methods. However, these dimensions did not include 7 Signicance testing: Permutation test n = 10 4 ;p 0:01 102 demographic variables such as gender, age and race. We are currently working along this direction to contribute to a growing list of resources such as FairFace (Karkkainen & Joo, 2021) which is a face image dataset with demographic variables. These resources can help assess the fairness of algorithms along with their robustness. In this chapter, we study robust face clustering in the movie domain using ideas of self- supervision. First, we developed SAIL-Movie Character Benchmark (SAIL-MCB) with character labels for six movie/TV videos, and SAIL-MultiFace with weakly labeled data from 240 movies, to oer domain-specic resources for feature adaptation and benchmarking. Next, we proposed a nearest-neighbor approach to identify hard-positive and hard-negative tracklets from the must- link and cannot-link faces mined in SAIL-MultiFace. Finally, using these tracklets, we explored triplet-loss and multiview correlation based frameworks to adapt face embeddings learned from web images to long-form content such as movies. Our face verication/clustering experiments and error analysis highlight the benets of self-supervised feature adaption for robust automatic char- acter labeling in movies. The SAIL-MCB and SAIL-MultiFace datasets have been made publicly available. We hope that these resources will help advance the research in understanding character portrayals in media content. 103 Part II Multimodal shared subspace learning 104 Chapter 7 Tied Crossmodal Autoencoders This is the second part of the dissertation where I will focus on modeling multiple modalities. In this chaptr, we demonstrate that the audio-visual representations from multimodal autoencoders can improve the performance of classication tasks of ads compared to the unimodal features. Our experiments suggest that the representations from autoencoders trained on larger, unlabeled data not only improve classication accuracy on testing with unseen data, but also capture complemen- tary information across the modalities regardless of the task that they are trained for. All models were trained on TensorFlow 1.4.0 and Keras 2.1.5 (Chollet et al., 2015). We have made the trained models, features and related code publicly available github.com/usc-sail/mica-multimodal-ads. 7.1 Introduction Video advertisements (ads) or TV commercials have become an indispensable tool for marketing. Media advertising spending in the United States for the year 2017 was about 206 Billion USD 1 . Companies not only invest heavily in advertising, but several companies generate revenue from ads. For example, the ad revenue for Alphabet, Inc., rose from about 43 to 95 Billon USD from 2012{2017. Considering the sheer number of ads being produced, it has become crucial to develop The work presented in this chapter was published in the following article: Somandepalli, et al. \Multimodal Representation of Advertisements Using Segment-level Autoencoders" Proceedings of the International Conference on Multimodal Interaction. ICMI, 2018. 1 statista.com 105 tools for a scalable and automatic analysis of ads. 
In the seminal work of decoding advertisements, Williamson (Williamson, 1978) states, \we can only understand advertisements ... by analysing the way in which they work". Previous research in advertising has shown a link between humorous or exciting content and ad eectiveness (Kapferer, Laurent, et al., 1985). These studies have been limited in their ability to generalize their results, due to sampling methods or the sample sizes. An automatic and scalable analysis of ads might provide an understanding of the most valuable design choices that are needed to produce an eective ad. In this context, the multimodal nature of ads can be leveraged to learn robust representations that can relate them to impact. Ads often contain video, audio and text (spoken language) modal- ities. With the advent of deep learning and big data, it is possible to obtain semantic attributes for these modalities beyond tasks such as labeling objects in images, or detecting music in audio. For example, features from a network pre-trained to detect high-level attributes like smiling (Jaques, Chen, & Picard, 2015), actions (Tran, Bourdev, Fergus, Torresani, & Paluri, 2015) or quality of an athletic action (Pirsiavash, Vondrick, & Torralba, 2014) may form better input features for learning representations. There have been only a few studies that have leveraged audiovisual representations for under- standing ads, and fewer studies that looked at their multimodal aspects. Hussain et. al. (Hussain et al., 2017) analyzed the \visual rhetoric of ads" using image representations. This study has compiled and examined a dataset of over 64,000 image ads, and 3,477 video ads. This data also includes human annotations for whether an ad is funny or exciting, among other labels such as topics, sentiment etc. In this work, we focus on the two binary classication tasks of whether an ad is `funny' or `exciting`. These tasks were chosen because they have the most number of human-labeled samples. Research studies on ads prior to (Hussain et al., 2017) have mostly focused on understanding the impact on the consumers. For example, (Azimi et al., 2012; Cheng et al., 2012) used low-level image features such as intensity and color to predict click-through rates in ads. Similar work has been done in the audio domain. For example, the audio characteristics of an ad can be related to high-level concepts such clarity of the advertised message and its persuasiveness (Ebrahimi, Vahabi, Prockup, & Nieto, 2018). Audio-visual analysis of ads has historically focused on applications such as context-based 106 Table 7.1: Description of the unlabeled and labeled advertisements in our dataset Dataset No. videos Duration (s) CVPR-2017 ads (Hussain et al., 2017) 2720 50.8 27.9 Cannes Film Festival Ads 9740 90.3 80.1 video indexing (Tsekeridou & Pitas, 1999), high-speed detection of commercials in MPEG streams (Sadlier, 2002), and retrieval of video segments for editing (T. Zhang & Kuo, 2001). Although there has been some research in the eld of Media Studies analyzing the audio-visual components of commercials (Z. Li, 2017), automatic understanding of high-level attributes (e.g., humor) has not been explored. Multimodal deep learning (Ngiam et al., 2011) has shown promising results in obtaining robust representations across dierent modalities. 
A few prominent applications are audio-visual speech recognition in the wild (Chung & Zisserman, 2016), video hyperlinking (Vukoti c, Raymond, & Gravier, 2016), content indexing (Snoek & Worring, 2005) and multimodal interaction analysis in the elds of emotion recognition and aect tracking (Schuller et al., 2011). The objective of multimodal representations is to obtain a feature vector that projects dierent unimodal representations onto a common space. See (Baltru saitis, Ahuja, & Morency, 2018b) for a survey on multimodal machine learning and its taxonomy. While there are several neural network architectures to learn these representations (e.g., sequence-to-sequence learning (Bahdanau, Cho, & Bengio, 2014)), a popular approach is an autoencoder setup. An autoencoder with a linear decoder is akin to performing principal component analysis. Here, the model is trained to minimize the reconstruction error, and the features from the intermediate (middle) layer are used as common representations of the input modalities. Autoencoders can be pre-trained on large domain-matched data, to obtain robust representa- tions in an unsupervised fashion (Ngiam et al., 2011). Motivated by this, we trained unimodal, and multimodal autoencoders on a large, unlabeled dataset to obtain joint representations for ads. Our experiments demonstrate the benet of using such representations over audio-visual features. 107 7.2 Advertisement dataset We use two datasets in this work. The primary dataset we consider is the video ads dataset released in (Hussain et al., 2017). This dataset originally had 3,477 ads from YouTube. However at the time of this work, only 2,720 ads were available. We consider two binary classication tasks: whether an ad is `funny' or `exciting'. Of the annotations provided in (Hussain et al., 2017), 1,923 ads had the labels for `funny', and 1,326 had labels for `exciting'. Videos of duration less than 10s were excluded for this work. We refer to this dataset as Ads-cvpr17. The average duration of the ads in this dataset is 50:8 27:9 seconds. One of our goals in this work is to show that the unsupervised autoencoder representations learned from a similar, larger, and unlabeled database of ads can provide better representations for the classication tasks. For this purpose we obtained ads shortlisted for the Cannes Lions Film Festival (Cannes Lions, n.d.) through the years 2007{2017. This resulted in a dataset of 9,740 ads. The details of this dataset have been made publicly available 2 . Henceforth, we refer to this dataset as Ads-Cannes. The average duration of ads in this dataset is 90:3 80:1 seconds. 7.3 Methods We rst describe the choice of audio and video descriptors that we used in this work. We then present the procedure for training segment-level autoencoders using input frames to obtain uni- modal and joint representations, followed by classication methods for the binary tasks. 7.3.1 Unimodal representations Video representations Prior work (Hussain et al., 2017) has shown that features extracted from action recognition neural networks are eective toward automatic understanding on ads. Following this direction, we used the features from C3D network (Tran et al., 2015) which capture spatio-temporal features for recognizing actions in videos. 
It was pre-trained on the Sports-1M data (Karpathy et al., 2014) and 2 To be provided after blind review 108 time ( t) Time aligned frames Concatenate frames Final output representation Video features Audio features Video segment d v d a video frame-of-interest audio frame-of-interest v a joint representation a-to-a (A) (B) (C) Encoder f V Decoder g V g A f A f ̃ v f ̃ a g ̃ a g ̃ v Audio segment v-to-v inputs context inputs outputs Figure 7.3.1: Schematic diagram of segment-level autoencoders for (A) joint representation (B) audio: a-to-a, and (C) video: v-to-v. Input video and audio features of dimensions d v ;d a for an ad of duration t. Video and audio segments of length v ; a ne-tuned on UCF-101 (Soomro, Zamir, & Shah, 2012). Global average pooling (GAP) in CNNs has been shown to improve localization and discriminability for action recognition tasks (B. Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2016). Hence, we perform GAP after the nal convolution block in C3D network to obtain video features. We also replicated the results in (Hussain et al., 2017) using the fc7 features. These results were comparable to the results from GAP features. Audio representations Audio is an integral part of TV commercials because it can convey abstract concepts in a short duration (Ebrahimi et al., 2018). We use features from an audio event detection network, AudioSet (Gemmeke et al., 2017). The actions being considered in the C3D framework are dierent from the events classied in AudioSet. Hence, these features likely provide complementary information to the video features. AudioSet uses a modied VGG (Simonyan & Zisserman, 2014) architecture to classify audio events with log-mel spectrograms of audio as input. 109 7.3.2 Segment level autoencoders Segments from frames The ads considered in this work are of variable duration. Hence, the features obtained from the dierent modalities do not have a xed length. Additionally, the pretrained networks used for video and audio representations each use dierent number of input frames. In this context, sequence representation is a popular approach (See (Baltru saitis et al., 2018b) for a survey). Sequence modeling approaches commonly use a RNN or LSTM architecture to learn the map- ping between sequences in the two modalities, often explicitly modeling alignment with the use of attention (Sutskever, Vinyals, & Le, 2014). In this work, we approximate a sequence representa- tion by training an autoencoder at the segment level instead of sequence-to-sequence learning. We refer to the window consisting of a frame-of-interest and a xed context as a segment. This allows us to jointly model the multimodal representation at a shorter time-scale (about 7 seconds) thus generalizing easily to other ads. This is a common data augmentation approach used in sequence learning for datasets with longer sequences (Gemmeke et al., 2017). Segments are generated as follows: 1. Identify a frame-of-interest in video x v , and include seconds worth of frames for forward and backward context. 2. Identify the audio frame-of-interest x a corresponding to x v 3. Concatenate the frames from the resulting video segment of length v to form X v . Similarly, concatenate the frames of audio segment of length a to form X a . (See Fig. 7.3.1) Note that although v 6= a because of dierent sampling rates for each modality, these windows correspond to the same time duration. This helps ensure that the video and audio segments used in the autoencoders are time-aligned. 
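As a concrete illustration of the segment construction described above, the sketch below builds time-aligned video and audio segments from frame-level features. The hop sizes and context length follow the values reported later in Sec. 7.4.1 (C3D features every 0.64 s, AudioSet features every 0.96 s, roughly 3.2 s of context on each side, a shift of 3 video frames); the function and variable names are my own.

```python
import numpy as np

def make_segments(video_feats, audio_feats, video_hop=0.64, audio_hop=0.96,
                  context_sec=3.2, frame_shift=3):
    """Build time-aligned (video, audio) segments from frame-level features.

    video_feats: (T_v, 512) C3D features, one every `video_hop` seconds
    audio_feats: (T_a, 128) AudioSet features, one every `audio_hop` seconds
    Returns a list of (X_v, X_a) pairs, flattened to 11x512=5632 and 7x128=896.
    """
    v_ctx = int(round(context_sec / video_hop))   # 5 video frames of context
    a_ctx = int(round(context_sec / audio_hop))   # ~3 audio frames of context
    segments = []
    # Slide the video frame-of-interest with a shift of `frame_shift` frames.
    for v_idx in range(v_ctx, video_feats.shape[0] - v_ctx, frame_shift):
        t = v_idx * video_hop                      # time of the frame-of-interest
        a_idx = int(round(t / audio_hop))          # corresponding audio frame
        if a_idx - a_ctx < 0 or a_idx + a_ctx >= audio_feats.shape[0]:
            continue                               # skip segments lacking full context
        X_v = video_feats[v_idx - v_ctx:v_idx + v_ctx + 1].reshape(-1)
        X_a = audio_feats[a_idx - a_ctx:a_idx + a_ctx + 1].reshape(-1)
        segments.append((X_v, X_a))
    return segments
```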
\Tied" Crossmodal autoencoders We train three dierent autoencoders using the segment-level samples. A schematic diagram of the three networks is shown in Fig. 7.3.1. The encoders and decoders in these networks are a sequence 110 of fully connected (FC) layers. For brevity, we represent the parameters (weights and biases) of the encoder and decoder parts of the network as f() andg() respectively. The representation vectors are denoted as Z. Using this notation, the three architectures can be described as follows: (A) Joint autoencoder representation: (Z joint ) min fa;fv;ga;gv w v n X i=1 X (i) v Y (i) v 2 +w a n X i=1 X (i) a Y (i) a 2 (7.3.1) where h v =f v (X v ); h a =f a (X a ) Z joint = [h v ; h a ]; Y a =g a (Z joint ); Y v =g v (Z joint ) where, n is the batch size, and Y v and Y a are the decoded video and audio respectively. We minimize the weighted mean squared error (MSE). The weights w v and w a are set such that they sum to 1. In this architecture, the middle layer (joint representation: Z joint ) is formed by concatenating the encoded video and audio layers. The corresponding video and audio segments are decoded using this the two parts of this representation independently. It is important to note that this middle layer representation does not share the weights between the two encoder-decoders. We only use the loss of one direction i.e., audio-to-video to regularize the loss of the opposite direction, and vice-versa for each mini-batch during training. Hence the term, "tied" crossmodal autoencoders. A good analogy for this form of regularization is to think of the loss term from the other network as a form of early-stopping. This was also in part inspired from the co-training (Blum & Mitchell, 1998) methodology in machine learning which has been recently popularized within the deep learning frameworks (Qiao, Shen, Zhang, Wang, & Yuille, 2018). This is dierent from the \classical" autoencoders used for joint representation (Vukoti c et al., 2016), where a single common layer is used to encode the outputs of both modalities (shared representation). One benet of our network architecture over classical design is that we can handle missing data from either of the modalities. Our proposed model is comparable in design to the bidirectional symmetrical deep neural networks, BiDNN (Vukoti c et al., 2016). This network was trained for cross-modal translation where the weights are shared for the layers adjacent to the joint representation. But, unlike (Vukoti c et al., 2016), our objective is to obtain joint representation and not perform translation between the modalities. Hence we do not tie the weights for the layers adjacent to the middle layer. For example, a video feature that encodes information about the visual action (lunging) can be very dierent from the event captured by audio feature (burping: human sounds). But, the combined information from these two modalities may indicate that the ad is funny. We perform additional experiments to compare our proposed system with classical autoencoders, as well as the BiDNN. 111 Table 7.2: Unimodal performance: autoencoder representation vs. features (Acc., Accuracy (%); F 1 , F1 score (%) Task: Funny Exciting No. Samples 1923 1326 Majority class baseline (Acc.) 58.00 60.80 Performance measures Acc. F 1 Acc. 
F 1 C3D features 71.95 64.92 70.43 65.51 v-to-v representation (Z v ) 74.32 68.21 75.42 81.02 Audioset features 76.10 74.40 74.98 69.01 a-to-a representation (Z a ) 78.32 75.82 79.79 84.17 (B) Segment-level audio representation (Z a ), and (C) , segment-level video representation (Z v ): We construct symmetric autoencoders to obtain segment-level audio and video representations. While (A) captures the shared joint representation of the two modalities, these networks capture the information from the individual modalities. The three networks were trained independent of each other. The network layers were designed such that the dimension of middle layers was consistent; i.e.,jZ joint j = 2jZ v j = 2jZ a j. Finally, we concatenate the three vectors to obtain a multimodal representation of dimension 2jZ joint j. These are further averaged across time to obtain a single feature per ad. 7.3.3 Classication of ads as funny or exciting We use a support vector machine (SVM) classier with representations from the autoencoders as features. The parameters of the SVM are tuned on a development set, and the results are reported using a 10- fold cross validation. We used permutation testing to test for dierences in both the performance of the predicted labels with the ground-truth, as well as between dierent models. We conducted pairwise tests (with Bonferroni correction for multiple comparisons) to test the performance of the models trained on dierent representations. A signicant dierence in this pairwise test of these outcomes would indicate that the dierent representations provides complementary information for the classication task. 7.4 Experiments and Results In all our experiments, the multimodal autoencoders were trained on the Ads-Cannes dataset in an unsuper- vised fashion (with respect to the classication tasks). The middle layer representations for the Ads-cvpr17 dataset were then used for the classication tasks. First, we trained the autoencoders using the segments from 9,740 ads. We then performed experiments to analyze two claims: 1) representations from segment-level autoencoders trained on a larger dataset are better than the frame-level features, and 2) multimodal feature 112 representations provide more information than the individual modalities { for classication tasks on the Ads-cvpr17 dataset. In all cases, segment-level or frame-level features were averaged across time to obtain a single feature vector per sample. The binary classication tasks were peformed as described in Sec. 7.3.3. 7.4.1 Multimodal autoencoders As described in Sec. 7.3.1, we obtained video features for every 0.64s using the pre-trained C3D network. We obtained the output from the nal convolution layer, and performed global-average-pooling, resulting in a 512-dimension vector. We obtained audio features (128-dimensional) for every 0.96s from the publicly available pre-trained AudioSet model 3 . Audio and video segments were constructed with these `frame- level' features with a forward context and backward context of duration about 3.2 seconds. Each segment consisted of feature vectors corresponding to a duration of about 7s resulting in video and audio segments of size 11x512 (=5632) and 7x128(=896) respectively. We use a shift of 3 video-frames for obtaining consecutive video-segments. This resulted in a training set of 495,331 segments from the 9,740 ads. The encoder and decoder parts of the autoencoders (see Sec. 7.3.2) were designed in a symmetric fashion. 
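A minimal Keras sketch of the joint ("tied") architecture (A) and the objective in Eq. (7.3.1) is given below, using the encoder layer sizes, middle-layer dimension (n_v = n_a = 512) and loss weights (w_v = 0.75, w_a = 0.25) reported in the shorthand notation and Sec. 7.4.1 that follow; this is an illustrative reconstruction under those stated settings, not the released implementation.

```python
from keras.layers import Input, Dense, Dropout, concatenate
from keras.models import Model

d_v, d_a = 5632, 896      # flattened video (11x512) and audio (7x128) segments
n_v = n_a = 512           # dimensions of the encoded video/audio halves of Z_joint

# Video encoder f_v and audio encoder f_a (no weights shared across branches).
x_v = Input(shape=(d_v,), name='video_segment')
h = Dense(2816, activation='relu')(x_v)
h = Dropout(0.2)(h)
h = Dense(1402, activation='relu')(h)
h = Dropout(0.2)(h)
h = Dense(704, activation='relu')(h)
h_v = Dense(n_v, activation='relu', name='h_v')(h)

x_a = Input(shape=(d_a,), name='audio_segment')
h = Dense(448, activation='relu')(x_a)
h_a = Dense(n_a, activation='relu', name='h_a')(h)

# Joint representation Z_joint = [h_v; h_a]; the branches are "tied" only
# through the combined loss below, not through shared weights.
z_joint = concatenate([h_v, h_a], name='z_joint')

# Decoders g_v and g_a mirror the encoders and both read the full Z_joint.
g = Dense(704, activation='relu')(z_joint)
g = Dense(1402, activation='relu')(g)
g = Dense(2816, activation='relu')(g)
y_v = Dense(d_v, activation='linear', name='video_recon')(g)

g = Dense(448, activation='relu')(z_joint)
y_a = Dense(d_a, activation='linear', name='audio_recon')(g)

model = Model(inputs=[x_v, x_a], outputs=[y_v, y_a])
# Weighted MSE of Eq. (7.3.1): w_v = 0.75, w_a = 0.25
model.compile(optimizer='adam', loss='mse', loss_weights=[0.75, 0.25])
# model.fit([X_v, X_a], [X_v, X_a], ...) reconstructs each modality from Z_joint.
```

At inference time, only the encoders are needed; Z_joint is concatenated with the unimodal Z_v and Z_a representations and averaged over segments, as described in Sec. 7.3.2.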
We use a shorthand notation to describe the network architecture. (N.B: due to the symmetric nature of the design, it is sucient to describe the encoders) Video encoder f(X v ) :: INP[(5632)]! FC[2816]! Dropout(0.2) ! FC[1402]! Dropout(0.2)! FC[704]! FC[n v ], where INP = Input, FC = Fully connected layer with ReLu activation. [] indicates the number of nodes in the layer. Audio encoder f(X a ) :: INP[(896)]! FC[448]! FC[n a ]. The number of nodes in the middle layern a and n v was tuned over the set of parametersf16; 32; 64; 128; 256; 512; 1024g on the reconstruction loss of the training set. The networks were optimized with the Adam algorithm (Kinga & Adam, 2015). Training loss was monitored for parameter search because we train the models agnostic of the test dataset (i.e, Ads-cvpr17 ) Early stopping criterion (delta change of 10 6 ) on the training loss was used to terminate the network updates. The weights for the loss function, w v and w a were set to 0.75 and 0.25 respectively. Using these experiments we selected number of nodes for the middle layer, n a =n v = 512. The average reconstruction error (MSE) of the multimodal autoencoder was of the order, 10 3 for training set, and 10 2 for the test set. The MSE for the unimodal autoencoders (a-to-a and v-to-v) was of the order, 10 5 for training set, and 10 4 for the test set. 3 github.com/tensor ow/models/tree/master/research/audioset 113 Table 7.3: Performance evaluation of unimodal vs. joint representations from the autoencoders Method/Task Funny Exciting Performance Measure Acc. F 1 Acc. F 1 Baseline (on complete data (Hussain et al., 2017)) 78.6 { 78.2 { BiDNN (Vukoti c et al., 2016) 78.01 75.71 79.02 82.17 Classical autoencoder (Vukoti c et al., 2016) 71.54 61.92 77.10 81.02 Best uni-modal (Z a ) 78.32 75.82 79.79 84.17 Joint representation (Z joint ) 79.83 76.41 80.39 85.01 Multimodal representation 83.24 79.16 84.05 86.22 7.4.2 Results and Discussion Unimodal autoencoder representations Classication performance as measured by accuracy and F1-score shown in Table 7.2 supports our claim that the unimodal autoencoder representations perform better than the raw features { in the context of classifying whether an ad is funny or exciting. McNemar's chi-squared test 2 showed that the all the models outperform the majority class baseline. The audio-autoencoder representations (a-to-a) and the video-autoencoder (v-to-v) representations signicantly outperformed the input audio and video features respectively (p 0:01). This suggests that (segment-level) autoencoders trained on larger, unlabeled data can provide `better' representations for classication tasks. Multimodal autoencoder representations Performance evaluation results shown in Table 7.3 support our claim that multimodal information improves classication of whether an ad is funny or exciting. Note that the baseline results used in Table 7.3 are taken from (Hussain et al., 2017). These results are dierent when replicated on the data available (see acc. for C3D features in Table 7.2) . As such (Hussain et al., 2017) is a competitive baseline since it had at least 700 more samples than the Ads-cvpr17 dataset used in our work. Our best uni-modal performance was comparable (if not better) than this baseline. We did not perform a signicance test here due to lack of predicted labels from (Hussain et al., 2017). The concatenated multimodal representation (nal output representation in Fig. 7.3.1) improved the classication accuracy by about 5% over the baseline. 
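For completeness, the sketch below illustrates the kind of McNemar test mentioned above for comparing two classifiers on the same test samples; the array names are hypothetical, and this is a generic illustration rather than the exact test script used for this chapter.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_models_mcnemar(y_true, pred_a, pred_b):
    """McNemar's chi-squared test on the disagreements between two
    classifiers evaluated on the same test samples."""
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    # 2x2 table of (A correct?, B correct?) counts.
    table = [[np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
             [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
    result = mcnemar(table, exact=False, correction=True)
    return result.statistic, result.pvalue

# Example (hypothetical arrays): a-to-a representations vs. the majority-class
# baseline on the 'funny' task.
# stat, p = compare_models_mcnemar(y_funny, svm_a2a_preds, majority_preds)
```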
Pairwise permutation tests were conducted between the predicted outcomes for the two classication tasks from the four models (i.e., v-to-v, a-to-a, joint and multimodal representations). All pairwise tests 2 RejectH0: marginal probabilities of each outcome are the same atp< 0:01 114 audio video 0 corr elation 1 a-to-a v-to-v Joint representation audio video time(t ) Figure 7.4.1: Similarity of representations with each other and across time in a sample except one (exciting: a-to-a vs. joint representation) survived multiple comparsion correction ( p<< 0:01). This suggests that the unimodal, joint and multimodal representations provide complementary information for the classication tasks considered. The complementary nature of these representations, regardless of their contribution to a particular task was examined by computing similarity of these vectors with each other and across time of an ad sample. Fig. 7.4.1 shows that the unimodal and joint representations are uncorrelated with each other (o-diagonal blocks). Additionally, the correlation of the individual representations across the time within an ad is similar (by visual inspection of the diagonal blocks). Finally, as shown in Table 7.3, the proposed segment-level autoencoders outperform the classication models trained on representations from BiDNN (Vukoti c et al., 2016) and classical autoencoders. 115 7.5 Discussion We proposed tied crossmodal autoencoders to obtain multimodal audiovisual joint representations for analyz- ing advertisements. We show that representations obtained from training autoencoders on larger, unlabeled datasets are benecial for classication tasks. Our experiments show that the unimodal and joint represen- tations from the autoencoders provide complementary information, and improve classication performance over a competitive baseline, as well as compared to representations from classical and cross-modal autoen- coders. In the future, we would like to allow our proposed approach to be more exible in aligning segments across dierent modalities, which could be useful for representing asynchronous audio and visual events. A prominent drawback of this method was that it was pairwise in construction. For example, for three modalities, we would have to consider six dierent autoencoders. Hence training such systems would become computationally infeasible with increase in the number of modalities. 116 Chapter 8 Modeling Multimodal Event Streams as Temporal Point Processes In the previous chapter 7, we presented a crossmodal autoencoder based method to learn multimodal rep- resentations in a self-supervised fashion. In this chapter, we propose using temporal point processes for modeling multimodal event streams. This formulation addresses one of the fundamental limitations of many existing multimodal modeling frameworks that use self-supervision. 8.1 Introduction When we watch TV, a movie or an advertisement (ad), we experience dierent aspects of the presented content|visuals, sound design, and dialogue|at the same time. We are able to process the dierent aspects of what we are watching, in terms of, who the characters are, where and when they appear, how they are portrayed and what they are doing to comprehend the overall message. With the democratization of media content creation and the rise in streaming services, there is an increased impetus for developing human-centered media analytics to automatically understand multimedia content such as ads to assess their in uence on our everyday lives. 
This has led to the emergence of computational media intelligence (CMI, (Somandepalli et al., 2021)) which seeks to understand the stories told in media using what-why-how-where-when attributes to characterize the portrayal of people, places and topics in media content. Such computational human-centered analytics serve dierent stakeholders of the media world: from content creators and curators to policy makers and consumers to predict the impact of The work is being prepared for submission to ICMI 2021. 117 media on individuals and the society at large. A fundamental building block of CMI is multimodal machine learning (Baltru saitis, Ahuja, & Morency, 2018a). Similar to how humans perceive media content through dierent aspects such as visuals and sound design, multimodal machine learning can be used to model the information from the constituent modalities such as images, audio, and spoken language (text) available in multimedia. Historically, multimodal research has focused on developing fusion methods (see survey in (Gao, Li, Chen, & Zhang, 2020)) to learn relevant joint representations for downstream task classication. In the context of multimedia understanding, it is expensive and time-consuming to collect large-scale labeled data for end-to-end learning: a necessary component for the successful use of multimodal fusion methods. To address this limitation, there is a growing interest in exploring self-supervision (Jing & Tian, 2020) in the intersection of multimodal learning and media understanding research. In multimodal data, self-supervision deals with leveraging the correspondence between the naturally co-occurring modalities of images, audio and text to generate supervisory signals for representation learning. Unlike fusion methods, self-supervised representations can be trained on large amounts of unlabeled data and then adapted for downstream learning tasks using smaller labeled datasets. A few examples of this application include cross-modal auto-encoders (e.g., (Somandepalli et al., 2018; Vukoti c et al., 2016)), visual scene clustering using ambient sounds (Owens et al., 2016) and audio-visual active speaker localization (Sharma, Somandepalli, & Narayanan, 2019). However, these methods are only able to model a pair of modalities (for example, audio and video in (Sharma et al., 2019) or video and text (Vukoti c et al., 2016)) and may not generalize to incorporate additional modalities. Additionally, as observed by a recent fusion study (W. Wang, Tran, & Feiszli, 2020), multimodal networks \are often prone to overtting due to the increased capacity"; primarily caused by the variable rates of change of information in individual modalities. This problem is further compounded in a self-supervised setting due to the lack of labeled data to `lter out' relevant unimodal information or when some modalities are absent (e.g., no audio or closed captions in ad videos). Thus, there is an imminent need for developing self-supervised multimodal frameworks that can (1) incorporate a large number of modalities, (2) handle absent modalities and (2) account for the variable rate of information change in individual modalities. In this work, we propose a novel direction of jointly modeling the timing and type of events occurring across modalities of a video to characterize the multimodal content in a self-supervised fashion. The overview of the proposed method is illustrated in Fig 8.1.1 for an ad video. 
We begin by conceptualizing the time series of semantic classes detected in individual modalities as a multimodal event stream. As shown in the Figure 8.1.1, pre-trained models for audio, images, and text can be applied to a video to extract a sequence of timing-event pairs. The timing-event pairs successfully capture the variable rate of change of information 118 t t t thump narration hubbub applause boxing running baseball pitch person location (Nigeria) event {(t i , k i ), … } k i Audio events (N k =527) Video actions (N k = 101) Text entities (N k =18) t i Figure 8.1.1: Illustration of multimodal event streamsSf(t i ;k i );:::g in media content. The set of predictionsk i and associated timingt i from pre-trained models can be used to create a sequence of timing-event pairs (t i ;k i ). Marked temporal point process can then be used to learn self-supervised representations of the multimodal event streams that capture the underlying content. The example shown here is from the ad campaign by Nike, 2017 available at youtu.be/WYP9AGtLvRg. in individual modalities as they encode both the arrival time of a new event as well as the type of the event. In multimodal event streams, the timing information implicitly captures the duration of an event, i.e., no two consecutive events are the same. Here, the occurrence of a new event may be in uenced by what happened in the past, either within the same modality or from events of dierent modalities. Thus, self-supervision in the form of predicting the next arrival time and the type of event in a sequence can help capture the latent multimodal representation governing the observed event patterns over time. To model the overall timing pattern in a multimodal event stream, we use temporal point processes (TPP) by assuming that the arrival times can be modeled as observations drawn from a point process. Because we also need to model the event type information associated with the arrival times, we use \marked temporal point processes" 1 (Jacobsen, 2006). TPP modeling has been extensively used for many real-world applications such as neural spike modeling, earthquake prediction, and understanding human behavioral interactions (see survey (J. Yan, Xu, & Li, 2019)). However, they have not been explored for multimodal modeling, particularly in the context of media understanding. With the advances in deep learning and open-source access to pre-trained models, we can easily construct multimodal event streams by automatically detecting a wide range of semantic classes of events from the constituent modalities with high precision. In our proposed framework, marked TPP 1 In the related point process literature, the \events" in our formulation are referred to as \marks" 119 modeling of the multimodal event streams oers a scalable, self-supervised solution to model a large number of modalities by encoding them as event types alongside the arrival times. This formulation can naturally handle instances where certain modalities are absent in a portion of the data. The rest of the paper is organized as follows: In section 8.2, we discuss the relevant literature at the inter- section of self-supervised multimodal learning for media content analysis followed by temporal point process modeling. We then present the TPP methodology for modeling multimodal event streams in section 8.3 followed by the evaluation dataset and experimental setup in 8.4 and results in 8.5. 
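To make the event-stream construction concrete, here is a minimal sketch (with hypothetical input formats and function names) that merges per-modality detections into a single sequence of timing-event pairs (t_i, k_i), offsets the per-modality vocabularies so audio, video and text events occupy disjoint ID ranges, and drops immediate repetitions within a modality.

```python
def build_event_stream(detections, vocab_sizes):
    """Merge per-modality detections into one multimodal event stream.

    detections:  dict such as {'audio': [(t, k), ...], 'video': [...], 'text': [...]},
                 where t is the detection time (s) and k the class index within
                 that modality's vocabulary.
    vocab_sizes: dict of vocabulary sizes, e.g. {'audio': 527, 'video': 101, 'text': 18}.
    Returns a time-sorted list of (t_i, k_i) with globally unique event ids.
    """
    offsets, offset = {}, 0
    for name in sorted(vocab_sizes):           # fixed order -> reproducible ids
        offsets[name] = offset
        offset += vocab_sizes[name]

    stream = []
    for name, events in detections.items():
        last_k = None
        for t, k in sorted(events):
            if k == last_k:                    # drop immediate repetitions
                continue                       # within this modality
            stream.append((t, offsets[name] + k))
            last_k = k
    return sorted(stream)                      # order by arrival time across modalities

# Example: a stream over N_k = 527 + 101 + 18 = 646 possible event types.
# stream = build_event_stream(
#     {'audio': audio_dets, 'video': video_dets, 'text': text_dets},
#     {'audio': 527, 'video': 101, 'text': 18})
```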
8.2 Related Work We review here relevant existing work to underscore the promise of self-supervision for multimodal learning, particularly for media understanding applications. We then discuss TPP modeling, applications and its promise for modeling multimodal data. 8.2.1 Self-supervised multimodal learning The idea of self-supervision has a rich history in learning representations for unlabeled data since its intro- duction in the 1990 seminal work (Schmidhuber, 1990). Self-supervision deals with using naturally existing correspondences in data as signal for training. These correspondences are a result of way the data is col- lected (Somandepalli et al., 2020) or how the content is created (Jansen et al., 2020). The age of represen- tation learning has given rise to need for minimal or no manual labeling where the notion of self-supervision is enticing. Perhaps, one of the earliest large-scale application of this idea was to learn word representations called word2vec (Mikolov et al., 2013) using the context inherent to language use as training signal. Self- supervision has also been successfully applied in computer vision tasks by setting up pretext or proxy task to learn rich representations from unlabeled images. Some examples of proxy tasks include: solving a jigsaw puzzle (Doersch et al., 2015), image colorization (R. Zhang et al., 2016), time-aligning video frames (X. Wang & Gupta, 2015) and learning self-expressive representations for face clustering (Somandepalli & Narayanan, 2019). There are two components to self-supervised representation learning: (1) How to mine correspondences or what correspondences to consider? and (2) How to learn embeddings or a metric to use this correspondence. For multimodal applications, self-supervision leverages the co-occurrence between modalities natural in many domains, illustrated best by this example in (de Sa, 1994): \Hearing `mooing' and seeing cows tend to occur together". In domains such as videos or created content such as ads, it is easy to collect and time-align the dierent modalities to generate the correspondences necessary for self-supervised learning. This ideas was successfully applied for many multimodal learning tasks by trying to predict one modality from another. A 120 few exemplar applications include clustering video scenes using ambient sounds (Owens et al., 2016), ecient road segmentation using RGB and depth images (W. Wang, Wang, Wu, You, & Neumann, 2017), improving action recognition using RGB information and optical ow (Sayed, Brattoli, & Ommer, 2018), cross-modal autoencoders to extract unsupervised joint representations (Somandepalli et al., 2018), cross-modal matching for visual navigation using natural language (X. Wang et al., 2019) and cross-modal deep clustering (Alwassel et al., 2020). In terms of metric learning for self-supervision, current methods have mostly explored contrastive- and triplet-loss formulations (T. Chen, Kornblith, Norouzi, & Hinton, 2020; Tian, Krishnan, & Isola, 2019) where the goal to maximize the similarity between a signal and it's distorted version to capture the semantic information of the signal and not the transformation used create a distorted version. Triplet loss has also been explored in this space. Distortion via data augmentation has also been used for label augmentation in some self-supervised applications (Lee, Hwang, & Shin, 2020). 
For multimodal learning, contrastive loss has been used to maximize the similarity between two modalities of the same video while minimizing the similarity between modalities mined from dierent videos (Jansen et al., 2020). Other examples in this space include cross-modal reconstruction using autoencoders (Somandepalli et al., 2018; Vukoti c et al., 2016), cross-modal label prediction using classication loss (Sharma et al., 2019) and KL-divergence based loss to handle missing modalities (Sutter, Daunhawer, & Vogt, 2020). A major drawback of the existing self-supervision frameworks for multimodal learning is that they are predominantly pairwise in construction. For example, most approaches focus on predicting one modality from another (Somandepalli et al., 2018). A limiting factor to generalize these frameworks for more than two modalities, particularly in videos is that individual modalities present dierent amount of information rele- vant to the underlying concept (X. Li et al., 2020). Additionally, the rate of change of information in these modalities is highly variable across samples in a multimodal dataset (W. Wang et al., 2020). When working with labeled datasets, these problems have been addressed in fusion methods to some extent. For example, with the use of multimodal attention networks (X. Li et al., 2020; Ngiam et al., 2011) to identify important modalities, or online ltering methods based on external cues indicating reliability of a modality (Soman- depalli et al., 2016) or gradient regularization methods to prevent overtting between modalities (W. Wang et al., 2020). However, in the context of self-supervised learning with unlabeled data, these limitations have not been addressed. A promising direction of research in this context is using methods such as neural predictive coding (Jati & Georgiou, 2019) and contrastive predictive coding (Oord, Li, & Vinyals, 2018) in a multimodal setup. At a high level, the predictive coding methods convert the generative time series process to a classication task of predicting the \future" samples. Inspired by these frameworks, we propose using pre-trained models to 121 extract a sequence of events across dierent modalities to form a multimodal event stream. The proxy task for self-supervision is then to predict the next event in the sequence regardless of the modality in which the event is detected. This allows us to learn both the inter-modal and intra-modal relationships in multimodal data. Due to the variable nature of information presented in individual modalities, the sequence of events are not distributed uniformly in time. Thus, we propose the self-supervision proxy task of jointly modeling the arrival times and events as marked temporal point processes. 8.2.2 Temporal point process modeling Temporal point process (TPP) is a random process whose realizations are times corresponding to some arbitrary events distributed over a time period (Daley & Vere-Jones, 2007). The events may correspond to dierent types and when the type of the event is known, the processes are called as marked TPP (Ja- cobsen, 2006). TPP can be used to mathematically abstract dierent phenomena across several domains. Prominent examples, include modeling earthquakes and predicting aftershocks (Hawkes, 1971a, 1971b), for computational nance (Bacry, Iuga, Lasnier, & Lehalle, 2015) and modeling human interactions in sociol- ogy (Malmgren, Stouer, Motter, & Amaral, 2008). 
Recently, TPP have been successfully used to model a wide range of social media applications such as measuring online user engagement (Farajtabar et al., 2014), modeling the dynamics of fake news spread (Farajtabar et al., 2017). In all these applications of TPP modeling, assumptions are made about the arrival time dynamics to parameterize the underlying point process with an intensity function. The simplest form of intensity function is a constant that is used to parameterize a homogenous point process which assumes the duration between arrival times to be stationary and independent of each other. This often leads to what is referred to as the \the curse of model misspecication" (Du et al., 2016). To address this limitation, recently, recurrent neural networks (RNN) have been used to develop intensity-free learning methods to model TPP by directly estimating the condition density function of the arrival times. Additionally, classic TPP parameterizations do not generalize well for predicting the type of events as in the case of marked TPP (Rasmussen, 2011). Here RNN frameworks allow for predicting the type of the event as a straightforward classication task. More recently, other intensity-free learning methods without the need for RNN have also been proposed. For example, using fully connected neural network to model the cumulative conditional intensity function (Omi, Ueda, & Aihara, 2019), and using mixture distributions to directly model the conditional density function from which the arrival times are sampled (Shchur, Bilo s, & G unnemann, 2019). In this work, we use recurrent marked temporal point processes (RMTPP, (Du et al., 2016)) for self- supervised representation learning from multimodal event streams (see Fig. 8.1.1). RMTPP was shown to learn an ecient representation of the underlying dynamics from the event history without assuming a 122 Hidden state h j Embedding y j Event k j Timing t j , Figure 8.2.1: Architecture of recurrent marked temporal point processes (RMTPP) proposed in (Du et al., 2016). BCE denotes binary cross entropy loss for predicting the event type and f (t) denotes the conditional density function. The parameters ; v;w are system parameters specic to estimating f (t). parametric form. It was able to robustly model several parametric forms of point processes along with jointly predicting the type of event (Du et al., 2016). The exibility of RMTPP to represent dierent forms of point processes using RNN makes it an ideal candidate for modeling multimodal event streams as it is dicult to develop a generalized parametric form of how events occur in multiple modalities. 8.3 Recurrent Marked Temporal Point Process In this section we will review the RMTPP method proposed in (Du et al., 2016). Let us denote a sequence of timing-event pairs from a marked TPP asS =f(t j ;k j ) T j=1 g wheret j denotes the arrival time of eventk i and T is the duration of the sequence. By construction, the set of arrival times is ordered and strictly increasing and no two consecutive events are the same i.e.,k j 6=k j+1 . Let the total number of events be denoted byN k . The key idea of RMTPP is to let a RNN or its variants of LSTM or GRU model the nonlinear dependency of both the event types and the arrival times from past timing-event pairs. Figure 8.2.1 shows the overall architecture of the RMTPP model proposed in (Du et al., 2016). The model takes the sequence of timing-event pairsS as input. The timing-event pair (t j ;k j ) is fed as input to the RNN. 
The event k_j is passed through an embedding layer (for example, one-hot encoding) and concatenated with t_j as input to the RNN cell. The embedding h_j records the influence of all the past timing-event pairs up to (t_j, k_j). The conditional density function for the next time point can be represented as:

f^*(t_{j+1}) = f(t_{j+1} \mid \mathbf{h}_j) = f(d_{j+1} \mid \mathbf{h}_j)    (8.3.1)

where d_{j+1} is the inter-arrival time. As a result, we can now use the hidden state representation h_j to predict the next timing-event pair (t_{j+1}, k_{j+1}). The conditional density function f^*(t) is computed by first estimating the conditional intensity function \lambda^*(t) as follows:

f^*(t) = \lambda^*(t) \exp\left( -\int_{t_j}^{t} \lambda^*(\tau)\, d\tau \right)    (8.3.2)

Based on the hidden state representation h_j of the RNN, the conditional intensity function \lambda^*(t) was formulated in (Du et al., 2016) as follows:

\lambda^*(t) = \exp\left( b + w (t - t_j) + \mathbf{v}^{\top} \mathbf{h}_j \right)    (8.3.3)

where, among the learnable parameters, b is a scalar capturing the base intensity, w captures the local time dependency, and the vector v is used to estimate the influence of the past timing-event pairs encoded by h_j. The conditional density function is then estimated using the relation in equation 8.3.2. The overall network is trained using backpropagation through time, with a negative log-likelihood loss for estimating the next timing and a binary cross-entropy loss for predicting the next event in the sequence. In our work, the event k_j is modified to be multilabel, as events from different modalities may occur at the same time depending on the pre-trained models used.

8.4 Experiments

In this section, we first review the datasets used in this work for training and evaluation, followed by the different pre-trained models used to create multimodal event streams. We then briefly discuss how the RMTPP models are trained in a self-supervised fashion on the unlabeled dataset, followed by feature adaptation for downstream learning tasks on the labeled dataset.

8.4.1 Evaluation dataset and baselines

We use two datasets in this work, as summarized in Table 8.1. The primary dataset we consider is the video ads dataset released in (Hussain et al., 2017). We follow the same training and evaluation setup described in (Somandepalli et al., 2018). This dataset originally had 3,477 ads from YouTube. However, at the time of this work, only 2,720 ads were available. We consider five classification tasks: two binary classification tasks of whether an ad is `funny' or `exciting', a 5-class scale for whether the ad is effective, 30-class sentiment classification and 38-class topic classification. Videos of duration less than 10 s were excluded for this work. We refer to this dataset as Ads-cvpr17. The average duration of the ads in this dataset is 50.8 ± 27.9 seconds.

Table 8.1: Description of the unlabeled and labeled advertisements in our dataset

Dataset                                 No. videos    Duration (s)
CVPR-2017 ads (Hussain et al., 2017)    2720          50.8 ± 27.9
Cannes Film Festival Ads                9740          90.3 ± 80.1

Table 8.2: Performance evaluation of self-supervised representations from modeling multimodal event streams as temporal point processes

Method/task                                    Effective (5)  Exciting (2)  Funny (2)  Sentiment (30)  Topics (38)
Majority baseline                              38.53          60.2          60.05      23.09           10.33
Audio-only                                     41.76          74.98         76.1       28.2            20.9
Video-only                                     41.29          70.43         71.95      25.9            28.45
Text-only (N=1436)                             28.65          59.64         68.52      19.42           40.2
Best unimodal AE (Somandepalli et al., 2018)   42.42          79.91         78.32      33.12           35.21
AV-Multimodal AE (Somandepalli et al., 2018)   45.26          84.05         83.24      37.54           48.72
AV-RMTPP                                       44.32          78.11         75.44      37.01           50.22
AVT-RMTPP                                      47.54          80.21         82.52      44.94           51.34
One of our goals in this work is to show that the self-supervised representations learnt on a larger unlabeled dataset using TPP can provide better representations for the classication tasks. For this purpose, we obtained ads shortlisted for the Cannes Lions Film Festival (Cannes Lions, n.d.) through the years 2007{ 2017. This resulted in a dataset of 9,740 ads. Henceforth, we refer to this dataset as Ads-Cannes. The average duration of ads in this dataset is 90:3 80:1 seconds. 8.4.2 Creating multimodal event streams Video events Prior work (Hussain et al., 2017; Somandepalli et al., 2018) has shown that features extracted from action recognition neural networks are eective toward automatic understanding on ads. Following this direction, we used the actions predicted from C3D network (Tran et al., 2015) which capture the sequence of actions happening in a video. It was pre-trained on the Sports-1M data (Karpathy et al., 2014) and ne-tuned on UCF-101 (Soomro et al., 2012). The number of video events is N k = 101 (See Fig 8.1.1). Audio events Audio is an integral part of TV commercials because it can convey abstract concepts in a short duration. Audio event detection features have been shown to successfully improve classication model performance for 125 ads classication dataset in (Somandepalli et al., 2018). The actions being considered in the C3D framework are dierent from the events classied by audio. Hence, they are likely to provide complementary information to the visual actions. We use the audio event detections from the models made publicly available in (Kong et al., 2020). The number of audio events considered in our work is N k = 527. Text events For text modality, we used the closed captions that were automatically available with the ads data. We used a simple named entity recognition (NER) model available with Spacy library 2 . The number of entity events in our work is N k = 18. Named entities from text provide additional information complementary to the audio events and visual actions identied in the ads data. 8.4.3 Multimodal events as point processes Two distinct RMTPP models were trained using the audio, visual and text events as described in the previous sections: (1) AV-TPP model: Here, the multimodal event stream consisted of only audio and video events with a total number of events asN k = 628 and (2) AVT-TPP: with the audio, visual and text events in the multimodal event stream with total number of events asN k = 646. The embedding dimension (See Fig 8.2.1) was xed according to the number of classes in each model. The hidden state dimension was tuned over the setjh j j =f16; 32; 64; 128; 256g. The RMTPP network was optimized with the Adam algorithm (Kinga & Adam, 2015). The models were trained on the Ads-cannes dataset while the loss was monitored on the Ads-cvpr17 as the development set. Early stopping criterion (delta change of 10 6 ) on the monitored loss was used to terminate the network updates. The output self-supervised representations are the average across all hidden state representations from the RNN model. 8.4.4 Downstream task classication We use a support vector machine (SVM) classier with averaged RNN hidden-state representations as fea- tures. The parameters of the SVM are tuned on a development set, and the results are reported using a 10-fold cross validation. We used permutation testing for dierences in both the performance of the pre- dicted labels with the ground-truth, as well as between dierent models. 
We conducted pairwise tests (with Bonferroni correction for multiple comparisons) to test the performance of the models trained on different representations. For results, we report accuracy scores. Other performance metrics, mean average precision (mAP) and macro-averaged F1-score, were also monitored.

Footnote 2: Named entity recognition from Spacy: spacy.io/models/en

Figure 8.4.1: Qualitative visualization (t-SNE components 1 and 2) of the self-supervised representations from audio, visual and text multimodal event streams. A few sentiment classes are highlighted (e.g., Amused, Confident, Active, Alert, Amazed, Cheerful, Inspired, Alarmed, Eager, Angry, Proud).

8.5 Results and Discussion

The overall classification results are shown in Table 8.2. The AV-RMTPP representations, although built only from sparse event detections, surprisingly perform on par with the dense autoencoder representations, suggesting that the multimodal event streams capture sufficient information to predict the underlying tasks. Including event information from the text modality, AVT-RMTPP additionally improves performance on all tasks except funny and exciting. Nearly 7% improvement in accuracy is observed for both the sentiment and topic classification tasks by including text in the multimodal event streams. In order to qualitatively visualize what the hidden state representations of the AVT-RMTPP are learning, we look at low-dimensional representations using t-SNE. As shown in Figure 8.4.1, the RMTPP representations capture the underlying story themes associated with sentiments in ad videos.

Chapter 9

Conclusion and Future Work

At the outset of this dissertation, I presented my thesis statement:

Semi-supervised neural-based methods that leverage the inherent correspondence between multiple views and modalities can learn comprehensive representations from unlabeled data to develop robust machine perception.

In much of the existing multiview and multimodal learning literature, the terms views and modalities are used interchangeably. In the context of the application areas considered in this thesis, I started by presenting a unified framework in which views and modalities can be treated as allied but distinct concepts. Following this, I presented a multiview correlation framework that can incorporate information from a large number of views, in a view-agnostic manner, and handle absent or missing views. Extensive experiments showed that this method is able to develop a comprehensive representation of the underlying event. Next, I presented two different self-supervision methodologies for multimodal learning in unlabeled data: specifically, cross-modal autoencoders to learn joint audio-visual representations, and temporal point processes to model multimodal event streams. I was able to show that the multimodal self-supervised representations are able to capture the underlying concepts complementary to individual modalities. While this dissertation offers strong evidence to support the claim of "[...] can learn comprehensive representations", there are a few missing components to this overarching goal. First, developing generative models that can modify both the view and the signal of multiview data: I mostly considered a "shared representation" across multiple views and did not have a mechanism to incorporate view-specific information. Second, in multimodal modeling, a complete representation of multiple modalities in such data also requires an understanding of how the different modalities influence each other.
Disentangling the modality-specic and modality-shared information can help learn a holistic representation of multimodal data. 128 References 2020 Film: Historic Gender Parity in Family Films. (n.d.). Retrieved 2021- 02-01, from https://seejane.org/research-informs-empowers/2020-film-historic -gender-parity-in-family-films/ 2020 Hollywood Diversity Report: A dierent story behind the scenes. (n.d.). Retrieved 2021-01-29, from https://newsroom.ucla.edu/releases/2020-hollywood-diversity-report 4 ways to use X-Ray in Prime Video. (n.d.). Retrieved 2021-01-29, from https://www.amazon .com/primeinsider/video/pv-xray-tips.html Akaho, S. (2006). A kernel method for canonical correlation analysis. arXiv preprint cs/0609071 . Aljundi, R., Chakravarty, P., & Tuytelaars, T. (2016). Who's that actor? automatic labelling of actors in tv series starting from imdb images. In Accv (pp. 467{483). Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self- supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems, 33. Anderson, T. W. (1958). An introduction to multivariate statistical analysis (Vol. 2). Wiley New York. Andrew, G., Arora, R., Bilmes, J., & Livescu, K. (2013). Deep canonical correlation analysis. In International conference on machine learning (pp. 1247{1255). Arthur, D., & Vassilvitskii, S. (2007). K-means++: the advantages of careful seeding. In In proceedings of the 18th annual acm-siam symposium on discrete algorithms. Azimi, J., Zhang, R., Zhou, Y., Navalpakkam, V., Mao, J., & Fern, X. (2012). Visual appearance of display ads and its eect on click through rate. In Proceedings of the 21st acm international conference on information and knowledge management (pp. 495{504). 129 Bacry, E., Iuga, A., Lasnier, M., & Lehalle, C.-A. (2015). Market impacts and the life cycle of investors orders. Market Microstructure and Liquidity, 1(02), 1550009. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 . Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., & Bengio, Y. (2016). End-to-end attention- based large vocabulary speech recognition. In 2016 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 4945{4949). Baltru saitis, T., Ahuja, C., & Morency, L.-P. (2018a). Challenges and applications in multimodal machine learning. In The handbook of multimodal-multisensor interfaces (pp. 17{48). Baltru saitis, T., Ahuja, C., & Morency, L.-P. (2018b). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence. Banville, H., Albuquerque, I., Hyv arinen, A., Moat, G., Engemann, D., & Gramfort, A. (2019). Self-supervised representation learning from electroencephalography signals. In 2019 ieee 29th international workshop on machine learning for signal processing (mlsp) (p. 1-6). doi: 10.1109/MLSP.2019.8918693 Barnard, K., Duygulu, P., Forsyth, D., Freitas, N. d., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of machine learning research, 3(Feb), 1107{1135. Bartko, J. J. (1966). The intraclass correlation coecient as a measure of reliability. Psychological reports, 19(1), 3{11. Basu, S., Karki, M., Ganguly, S., DiBiano, R., Mukhopadhyay, S., Gayaka, S., . . . Nemani, R. (2017). Learning sparse feature representations using probabilistic quadtrees and deep belief nets. Neural Processing Letters, 45(3), 855{867. 
Beard, R., Das, R., Ng, R. W., Gopalakrishnan, P. K., Eerens, L., Swietojanski, P., & Miksik, O. (2018). Multi-modal sequence fusion via recursive attention for emotion recognition. In Proceedings of the 22nd conf. on computational natural language learning (pp. 251{259). Benton, A., Khayrallah, H., Gujral, B., Reisinger, D. A., Zhang, S., & Arora, R. (2017). Deep generalized canonical correlation analysis. arXiv preprint arXiv:1702.02519 . Bharadwaj, S., Arora, R., Livescu, K., & Hasegawa-Johnson, M. (2012). Multiview acoustic feature learning using articulatory measurements. In Intl. workshop on stat. machine learning for speech recognition. 130 Bian, J., Mei, X., & Zhang, J. (2018). A video face clustering approach based on sparse subspace representation. In Tenth international conference on digital image processing (icdip 2018) (Vol. 10806, p. 1080645). Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the 26th annual international acm sigir conference on research and development in informaion retrieval (pp. 127{134). Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on computational learning theory (pp. 92{100). Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2013). Finding actors and actions in movies. In Proceedings of the ieee international conference on computer vision (pp. 2280{2287). Bugeau, A., & P erez, P. (2008). Track and cut: simultaneous tracking and segmentation of multiple objects with graph cuts. EURASIP Journal on Image and Video Processing, 317278. Cai, X., Wang, C., Xiao, B., Chen, X., & Zhou, J. (2013). Regularized latent least square regression for cross pose face recognition. In Twenty-third international joint conference on articial intelligence. Cannes lions. (n.d.). www.canneslions.com. ([Online; accessed May-2017]) Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). Vggface2: A dataset for recognising faces across pose and age. In 2018 13th ieee international conference on automatic face & gesture recognition (fg 2018) (pp. 67{74). Chang, F.-J., Tuan Tran, A., Hassner, T., Masi, I., Nevatia, R., & Medioni, G. (2017). Faceposenet: Making a case for landmark-free face alignment. In Proceedings of the ieee international conference on computer vision workshops (pp. 1599{1608). Chang, X., Xiang, T., & Hospedales, T. M. (2018). Scalable and eective deep cca via soft decorrelation. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1488{1497). Chateld, K., Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 . Chaudhuri, K., Kakade, S. M., Livescu, K., & Sridharan, K. (2009). Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th annual international conference on 131 machine learning (pp. 129{136). Chen, M., Denoyer, L., & Arti eres, T. (2017). Multi-view data generation without view supervision. arXiv preprint arXiv:1711.00305 . Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In International conference on machine learning (pp. 1597{ 1607). Cheng, H., Zwol, R. v., Azimi, J., Manavoglu, E., Zhang, R., Zhou, Y., & Navalpakkam, V. (2012). Multimedia features for click prediction of new ads in display advertising. 
In Proceedings of the 18th acm sigkdd international conference on knowledge discovery and data mining (pp. 777{785). Chollet, F., et al. (2015). Keras. https://keras.io. Chopra, S., Hadsell, R., & LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verication. In 2005 ieee computer society conference on computer vision and pattern recognition (cvpr'05) (Vol. 1, pp. 539{546). Chung, J. S., & Zisserman, A. (2016). Lip reading in the wild. In Asian conference on computer vision (pp. 87{103). Cinbis, R. G., Verbeek, J., & Schmid, C. (2011). Unsupervised metric learning for face identication in tv video. In 2011 international conference on computer vision (pp. 1559{1566). Cohen, J. (2001). Dening identication: A theoretical look at the identication of audiences with media characters. Mass communication & society, 4(3), 245{264. Cour, T., Sapp, B., Nagle, A., & Taskar, B. (2010). Talking pictures: Temporal grouping and dialog-supervised person recognition. In 2010 ieee computer society conference on computer vision and pattern recognition (pp. 1014{1021). Daley, D. J., & Vere-Jones, D. (2007). An introduction to the theory of point processes: volume ii: general theory and structure. Springer Science & Business Media. Datta, S., Sharma, G., & Jawahar, C. (2018). Unsupervised learning of face representations. In 2018 13th ieee international conference on automatic face & gesture recognition (pp. 135{142). Dehak, N., Kenny, P., Dehak, R., Glembek, O., Dumouchel, P., Burget, L., . . . Castaldo, F. (2009). Support vector machines and joint factor analysis for speaker verication. In 2009 ieee international conference on acoustics, speech and signal processing (pp. 4237{4240). 132 Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2011). Front-end factor analysis for speaker verication. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788{798. Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D., & Dehak, R. (2011). Language recognition via i-vectors and dimensionality reduction. In Twelfth annual conference of the international speech communication association. Dekkers, G., Lauwereins, S., Thoen, B., Adhana, M. W., Brouckxon, H., van Waterschoot, T., . . . Karsmakers, P. (2017, November). The SINS database for detection of daily activities in a home environment using an acoustic sensor network. In Proceedings of the detection and classication of acoustic scenes and events 2017 workshop (dcase2017) (pp. 32{36). Dekkers, G., Vuegen, L., van Waterschoot, T., Vanrumste, B., & Karsmakers, P. (2018). DCASE 2018 Challenge - Task 5: Monitoring of domestic activities based on multi-channel acoustics (Tech. Rep.). KU Leuven. Retrieved from https://arxiv.org/abs/1807.11246 de Leeuw, J. (2011). Derivatives of generalized eigen systems with applications. Deng, J., Guo, J., Xue, N., & Zafeiriou, S. (2019). Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the ieee cvpr (pp. 4690{4699). Deng, J., Guo, J., Zhou, Y., Yu, J., Kotsia, I., & Zafeiriou, S. (2019). Retinaface: Single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641 . de Sa, V. R. (1994). Learning classication with unlabeled data. In Advances in neural information processing systems (pp. 112{119). Ding, Z., & Fu, Y. (2014). Low-rank common subspace for multi-view learning. In 2014 ieee international conference on data mining (pp. 110{119). Ding, Z., & Fu, Y. (2017). 
Robust multiview data analysis through collective low-rank subspace. IEEE transactions on neural networks and learning systems, 29(5), 1986{1997. Ding, Z., Shao, M., & Fu, Y. (2018). Robust multi-view representation: A unied perspective from multi-view learning to domain adaption. In Ijcai (pp. 5434{5440). Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the ieee international conference on computer vision (pp. 1422{1430). 133 Dorfer, M., Kelz, R., & Widmer, G. (2015). Deep linear discriminant analysis. arXiv preprint arXiv:1511.04707 . Dorfer, M., Schl uter, J., Vall, A., Korzeniowski, F., & Widmer, G. (2018). End-to-end cross- modality retrieval with cca projections and pairwise ranking loss. International Journal of Multimedia Information Retrieval, 7(2), 117{128. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., & Song, L. (2016). Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 1555{1564). Dumpala, S. H., Sheikh, I., Chakraborty, R., & Kopparapu, S. K. (2018). Sentiment classication on erroneous asr transcripts: A multi view learning approach. In 2018 ieee spoken language technology workshop (slt) (pp. 807{814). Dupont, W. D., & Plummer, W. D. (1990). Power and sample size calculations: a review and computer program. Controlled clinical trials, 11(2), 116{128. Ebrahimi, S., Vahabi, H., Prockup, M., & Nieto, O. (2018). Predicting audio advertisement quality. El Bolock, A., El Kady, A., Herbert, C., & Abdennadher, S. (2020). Towards a character-based meta recommender for movies. In Computational science and technology (pp. 627{638). Springer. El Khoury, E., S enac, C., & Joly, P. (2014). Audiovisual diarization of people in video content. Multimedia tools and applications, 68(3), 747{775. Everingham, M., Sivic, J., & Zisserman, A. (2006). Hello! my name is... buy"{automatic naming of characters in tv video. In Bmvc (Vol. 2, p. 6). Farajtabar, M., Du, N., Rodriguez, M. G., Valera, I., Zha, H., & Song, L. (2014). Shaping social activity by incentivizing users. Advances in neural information processing systems, 27. Farajtabar, M., Yang, J., Ye, X., Xu, H., Trivedi, R., Khalil, E., . . . Zha, H. (2017). Fake news mitigation via point process based intervention. In International conference on machine learning (pp. 1097{1106). Feng, Y., You, H., Zhang, Z., Ji, R., & Gao, Y. (2019). Hypergraph neural networks. In Proceedings of the aaai conference on articial intelligence (Vol. 33, pp. 3558{3565). Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2), 179{188. 134 Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. science, 972{976. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., . . . Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1), 2096{2030. Gao, J., Li, P., Chen, Z., & Zhang, J. (2020). A survey on deep learning for multimodal data fusion. Neural Computation, 32(5), 829{864. Gemmeke, J. F., Ellis, D. P., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., . . . Ritter, M. (2017). Audio set: An ontology and human-labeled dataset for audio events. In Acoustics, speech and signal processing (icassp), 2017 ieee international conference on (pp. 776{780). 
Ghaleb, E., Tapaswi, M., Al-Halah, Z., Ekenel, H. K., & Stiefelhagen, R. (2015). Accio: A data set for face track retrieval in movies across age. In Proceedings of the 5th acm on international conference on multimedia retrieval (pp. 455{458). Gross, R., Matthews, I., Cohn, J., Kanade, T., & Baker, S. (2010). Multi-pie. Image and Vision Computing, 28(5), 807{813. Guha, T., Huang, C.-W., Kumar, N., Zhu, Y., & Narayanan, S. S. (2015). Gender representation in cinematic content: A multimodal approach. In Proceedings of the 2015 acm on international conference on multimodal interaction (pp. 31{34). Guo, Y., Zhang, L., Hu, Y., He, X., & Gao, J. (2016). Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In European conference on computer vision (pp. 87{102). Haq, I. U., Muhammad, K., Ullah, A., & Baik, S. W. (2019). Deepstar: Detecting starring characters in movies. IEEE Access, 7, 9265{9272. Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12), 2639{2664. Harman, H. H. (1960). Modern factor analysis. Haurilet, M.-L., Tapaswi, M., Al-Halah, Z., & Stiefelhagen, R. (2016). Naming tv characters by watching and analyzing dialogs. In 2016 ieee wacv (pp. 1{9). Hawkes, A. G. (1971a). Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society: Series B (Methodological), 33(3), 438{443. 135 Hawkes, A. G. (1971b). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1), 83{90. Hebbar, R., Somandepalli, K., & Narayanan, S. (2019, May). Robust speech activity detection in movie audio: Data resources and experimental evaluation. In Proceedings of icassp. Heigold, G., Moreno, I., Bengio, S., & Shazeer, N. (2016). End-to-end text-dependent speaker verication. In 2016 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 5115{5119). Hershey, S., Chaudhuri, S., Ellis, D. P., Gemmeke, J. F., Jansen, A., Moore, R. C., . . . others (2017). Cnn architectures for large-scale audio classication. In Acoustics, speech and signal processing (icassp), 2017 ieee international conference on (pp. 131{135). Horst, P. (1961). Generalized canonical correlations and their applications to experimental data. Journal of Clinical Psychology, 17(4), 331{347. Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321{377. Hotelling, H. (1992). Relations between two sets of variates. In Breakthroughs in statistics (pp. 162{190). Springer. Hsu, W.-N., & Glass, J. (2018). Disentangling by partitioning: A representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264 . Huang, C.-W., & Narayanan, S. (2017). Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition. CoRR, abs/1706.02901. Huang, G. B., Mattar, M., Berg, T., & Learned-Miller, E. (2008, October). Labeled Faces in the Wild: A Database forStudying Face Recognition in Unconstrained Environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition. Huang, S.-Y., Lee, M.-H., & Hsiao, C. K. (2009). Nonlinear measures of association with kernel canonical correlation analysis and applications. Journal of Statistical Planning and Inference, 139(7), 2162{2174. Huo, J., & van Zyl, T. L. (2020). Unique faces recognition in videos. In 2020 ieee 23rd international conference on information fusion (fusion) (pp. 
1{7). Hussain, Z., Zhang, M., Zhang, X., Ye, K., Thomas, C., Agha, Z., . . . Kovashka, A. (2017). Automatic understanding of image and video advertisements. In 2017 ieee conference on computer vision and pattern recognition (cvpr) (pp. 1100{1110). 136 Jacobsen, M. (2006). Point process theory and applications: marked point and piecewise determin- istic processes. Springer Science & Business Media. Jansen, A., Ellis, D. P., Hershey, S., Moore, R. C., Plakal, M., Popat, A. C., & Saurous, R. A. (2019). Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal supervision. arXiv preprint arXiv:1911.05894 . Jansen, A., Ellis, D. P. W., Hershey, S., Moore, R. C., Plakal, M., Popat, A. C., & Saurous, R. A. (2020). Coincidence, categorization, and consolidation: Learning to recognize sounds with minimal supervision. In Icassp 2020 - 2020 ieee international conference on acoustics, speech and signal processing (icassp) (p. 121-125). doi: 10.1109/ICASSP40776.2020.9054137 Jaques, N., Chen, W., & Picard, R. W. (2015). Smiletracker: automatically and unobtrusively recording smiles and their context. In Proceedings of the 33rd annual acm conference extended abstracts on human factors in computing systems (pp. 1953{1958). Jati, A., & Georgiou, P. (2018). Neural predictive coding using convolutional neural networks towards unsupervised learning of speaker characteristics. Accepted in IEEE Transactions on Audio, Speech and Language Processing, arXiv preprint arXiv:1802.07860 . Jati, A., & Georgiou, P. (2019). Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(10), 1577{1589. Ji, X., Ju, Z., Wang, C., & Wang, C. (2016). Multi-view transition hmms based view-invariant human action recognition method. Multimedia Tools and Applications, 75(19), 11847{11864. Jiang, J., Bao, D., Chen, Z., Zhao, X., & Gao, Y. (2019). Mlvcnn: Multi-loop-view convolutional neural network for 3d shape retrieval. In Proceedings of the aaai conference on articial intelligence (Vol. 33, pp. 8513{8520). Jin, S., Su, H., Stauer, C., & Learned-Miller, E. (2017). End-to-end face detection and cast grouping in movies using erdos-renyi clustering. In Proceedings of the ieee iccv (pp. 5276{ 5285). Jing, L., & Tian, Y. (2020). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Kakade, S. M., & Foster, D. P. (2007). Multi-view regression via canonical correlation analysis. In International conference on computational learning theory (pp. 82{96). 137 Kan, M., Shan, S., Zhang, H., Lao, S., & Chen, X. (2015). Multi-view discriminant analysis. IEEE transactions on pattern analysis and machine intelligence, 38(1), 188{194. Kanagasundaram, A., Vogt, R., Dean, D. B., Sridharan, S., & Mason, M. W. (2011). I-vector based speaker recognition on short utterances. In Proceedings of the 12th annual conference of the international speech communication association (pp. 2341{2344). Kapferer, J.-N., Laurent, G., et al. (1985). Consumer involvement proles: a new and practical approach to consumer involvement (Tech. Rep.). Kapsouras, I., Tefas, A., Nikolaidis, N., Peeters, G., Benaroya, L., & Pitas, I. (2017). Multimodal speaker clustering in full length movies. Multimedia Tools and Applications, 76(2), 2223{ 2242. Karkkainen, K., & Joo, J. (2021). 
Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the ieee/cvf winter conference on applications of computer vision (pp. 1548{1558). Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classication with convolutional neural networks. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1725{1732). K epuska, V., & Klein, T. (2009). A novel wake-up-word speech recognition system, wake-up- word recognition task, technology and evaluation. Nonlinear Analysis: Theory, Methods & Applications, 71(12), e2772{e2789. Kettenring, J. R. (1971). Canonical analysis of several sets of variables. Biometrika, 58(3), 433{451. Khan, S. H., Guo, Y., Hayat, M., & Barnes, N. (2019). Unsupervised primitive discovery for improved 3d generative modeling. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 9739{9748). Khattar, D., Goud, J. S., Gupta, M., & Varma, V. (2019). Mvae: Multimodal variational autoen- coder for fake news detection. In The world wide web conference (pp. 2915{2921). Kim, M., & Doh, Y. Y. (2017). Computational modeling of players' emotional response patterns to the story events of video games. IEEE Transactions on Aective Computing, 8(2), 216-227. doi: 10.1109/TAFFC.2016.2519888 138 Kinga, D., & Adam, J. B. (2015). A method for stochastic optimization. In International conference on learning representations (iclr). Klemen, J., & Chambers, C. D. (2012). Current perspectives and methods in studying neural mechanisms of multisensory interactions. Neuroscience & Biobehavioral Reviews, 36(1), 111{ 133. Kong, Q., Cao, Y., Iqbal, T., Wang, Y., Wang, W., & Plumbley, M. D. (2020). Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2880{2894. Kose, N., Apvrille, L., & Dugelay, J. (2015). Facial makeup detection technique based on texture and shape analysis. In 2015 11th ieee international conference and workshops on automatic face and gesture recognition (fg) (Vol. 1, p. 1-7). doi: 10.1109/FG.2015.7163104 Kuhn, H. W. (1955). The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2), 83{97. Kulshreshtha, P., & Guha, T. (2018). An online algorithm for constrained face clustering in videos. In 2018 25th ieee international conference on image processing (icip) (p. 2670-2674). doi: 10.1109/ICIP.2018.8451343 Kulshreshtha, P., & Guha, T. (2020). Dynamic character graph via online face clustering for movie analysis. Multimedia Tools and Applications, 79(43), 33103{33118. Kumar, A., Rai, P., & Daume, H. (2011). Co-regularized multi-view spectral clustering. In Advances in neural information processing systems (pp. 1413{1421). LeCun, Y. (1998). The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/ . Ledoit, O., & Wolf, M. (2004, Feb). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365{411. Lee, H., Hwang, S. J., & Shin, J. (2020). Self-supervised label augmentation via input transforma- tions. In International conference on machine learning (pp. 5714{5724). Li, S., Li, Y., & Fu, Y. (2016). Multi-view time series classication: A discriminative bilinear pro- jection approach. In Proceedings of the 25th acm international on conference on information and knowledge management (pp. 989{998). 
139 Li, X., Wang, C., Tan, J., Zeng, X., Ou, D., Ou, D., & Zheng, B. (2020). Adversarial multimodal representation learning for click-through rate prediction. In Proceedings of the web conference 2020 (pp. 827{836). Li, Y., et al. (2018). A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering. Li, Z. (2017). The \celeb" series: A close analysis of audio-visual elements in 2008 us presidential campaign ads. Lian, W., Rai, P., Salazar, E., & Carin, L. (2015). Integrating features and similarities: Flexi- ble models for heterogeneous multiview data. In Twenty-ninth aaai conference on articial intelligence. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., & Song, L. (2017). Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the ieee cvpr (pp. 212{220). Livescu, K., & Stoehr, M. (2009). Multi-view learning of acoustic features for speaker recognition. In 2009 ieee workshop on automatic speech recognition & understanding (pp. 82{86). Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 250{256). Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(Nov), 2579{2605. Malmgren, R. D., Stouer, D. B., Motter, A. E., & Amaral, L. A. (2008). A poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences, 105(47), 18153{18158. McKeon, J. J. (1967). Canonical analysis: Some relations between canonical correlation, factor analysis, discriminant function analysis, and scaling theory (No. 13). Psychometric Society. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Ecient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 . Module: tfj TensorFlow Core v2.0.0. (n.d.). Retrieved 2020-12-28, from https://www.tensorflow .org/versions/r2.0/api docs/python/tf 140 Molina, J. F. G., Zheng, L., Sertdemir, M., Dinter, D. J., Sch onberg, S., & R adle, M. (2014). Incremental learning with svm for multimodal classication of prostatic adenocarcinoma. PloS one, 9(4), e93600. Mroueh, Y., Marcheret, E., & Goel, V. (2015). Deep multimodal learning for audio-visual speech recognition. In Acoustics, speech and signal processing (icassp), 2015 ieee international con- ference on (pp. 2130{2134). Nagrani, A., & Zisserman, A. (2018). From benedict cumberbatch to sherlock holmes: Character identication in TV series without a script. arXiv preprint arXiv:1801.10442 . Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. In Icml. Nielsen, A. A. (2002). Multiset canonical correlations analysis and multispectral, truly multitem- poral remote sensing data. IEEE transactions on image processing, 11(3), 293{305. Omi, T., Ueda, N., & Aihara, K. (2019). Fully neural network based model for general temporal point processes. arXiv preprint arXiv:1905.09690 . Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 . Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient sound provides supervision for visual learning. In European conference on computer vision (pp. 801{816). Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. 
British Machine Vision Association. Parra, L. C. (2018). Multi-set canonical correlation analysis simply explained. arXiv preprint arXiv:1802.03759 . Parra, L. C., Haufe, S., & Dmochowski, J. P. (2018). Correlated components analysis: Extracting reliable dimensions in multivariate data. stat, 1050, 26. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825{2830. Pham, H., Liang, P. P., Manzini, T., Morency, L.-P., & P oczos, B. (n.d.). Learning robust joint representations for multimodal sentiment analysis. 141 Pirsiavash, H., Vondrick, C., & Torralba, A. (2014). Assessing the quality of actions. In European conference on computer vision (pp. 556{571). Pnevmatikakis, A., & Polymenakos, L. (2009). Subclass linear discriminant analysis for video-based face recognition. Journal of Visual Communication and Image Representation, 543{551. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., . . . others (2011). The kaldi speech recognition toolkit. In Ieee 2011 workshop on automatic speech recognition and understanding. Prince, S. J., & Elder, J. H. (2007). Probabilistic linear discriminant analysis for inferences about identity. In 2007 ieee 11th international conference on computer vision (pp. 1{8). Qiao, S., Shen, W., Zhang, Z., Wang, B., & Yuille, A. (2018). Deep co-training for semi-supervised image recognition. In Proceedings of the european conference on computer vision (eccv) (pp. 135{152). Ramanathan, V., Joulin, A., Liang, P., & Fei-Fei, L. (2014). Linking people in videos with \their" names using coreference resolution. In European conference on computer vision (pp. 95{110). Rasmussen, J. G. (2011). Temporal point processes: the conditional intensity function. Lecture Notes, Jan. Rastogi, P., Van Durme, B., & Arora, R. (2015). Multiview lsa: Representation learning via generalized cca. In Proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 556{566). Rohrbach, A., Rohrbach, M., Tang, S., Joon Oh, S., & Schiele, B. (2017). Generating descriptions with grounded and co-referenced people. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 4979{4989). Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning(emnlp-conll) (pp. 410{420). Sadlier, D. A. (2002). Audio/visual analysis for high-speed tv advertisement detection from mpeg bitstream (Unpublished doctoral dissertation). Dublin City University. Sayed, N., Brattoli, B., & Ommer, B. (2018). Cross and learn: Cross-modal self-supervision. In German conference on pattern recognition (pp. 228{243). 142 Schmidhuber, J. (1990). Making the world dierentiable: On using self-supervised fully recurrent n eu al networks for dynamic reinforcement learning and planning in non-stationary environm nts. Schro, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unied embedding for face recog- nition and clustering. In Proceedings of the ieee cvpr (pp. 815{823). Schuller, B., Valstar, M., Eyben, F., McKeown, G., Cowie, R., & Pantic, M. (2011). Avec 2011{ the rst international audio/visual emotion challenge. In Aective computing and intelligent interaction (pp. 
415{424). Springer. Sellami, A., Dup e, F.-X., Cagna, B., Kadri, H., Ayache, S., Arti eres, T., & Takerkart, S. (2020). Mapping individual dierences in cortical architecture using multi-view representation learn- ing. arXiv preprint arXiv:2004.02804 . Shankar, S., Piratla, V., Chakrabarti, S., Chaudhuri, S., Jyothi, P., & Sarawagi, S. (2018). Gener- alizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745 . Sharma, A., Kumar, A., Daume, H., & Jacobs, D. W. (2012). Generalized multiview analysis: A discriminative latent space. In Computer vision and pattern recognition (cvpr), 2012 ieee conference on (pp. 2160{2167). Sharma, R., Somandepalli, K., & Narayanan, S. (2019). Toward visual voice activity detection for unconstrained videos. In 2019 ieee international conference on image processing (icip) (p. 2991-2995). doi: 10.1109/ICIP.2019.8803248 Sharma, V., Sarfraz, M. S., & Stiefelhagen, R. (2017). A simple and eective technique for face clustering in TV series. Sharma, V., Tapaswi, M., Sarfraz, M. S., & Stiefelhagen, R. (2019). Self-supervised learning of face representations for video face clustering. In 2019 14th ieee international conference on automatic face & gesture recognition (fg 2019) (pp. 1{8). Shchur, O., Bilo s, M., & G unnemann, S. (2019). Intensity-free learning of temporal point processes. arXiv preprint arXiv:1909.12127 . Shi, Y., & Jain, A. K. (2019). Probabilistic face embeddings. In Proceedings of the ieee iccv (pp. 6902{6911). Shi, Z., & Mueller, H. J. (2013). Multisensory perception and action: development, decision-making, and neural mechanisms. Frontiers in integrative neuroscience, 7, 81. 143 Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the ieee cvpr (pp. 761{769). Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 . Sivic, J., Everingham, M., & Zisserman, A. (2009). \who are you?"-learning person specic classiers from video. In 2009 ieee conference on computer vision and pattern recognition (pp. 1145{1152). Snoek, C. G., & Worring, M. (2005). Multimodal video indexing: A review of the state-of-the-art. Multimedia tools and applications, 25(1), 5{35. Snyder, D., Garcia-Romero, D., Povey, D., & Khudanpur, S. (2017). Deep neural network embed- dings for text-independent speaker verication. In Interspeech (pp. 999{1003). Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust dnn embeddings for speaker recognition. In 2018 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 5329{5333). Snyder, D., Ghahremani, P., Povey, D., Garcia-Romero, D., Carmiel, Y., & Khudanpur, S. (2016). Deep neural network-based speaker embeddings for end-to-end speaker verication. In 2016 ieee spoken language technology workshop (slt) (pp. 165{170). Soltanolkotabi, M., Elhamifar, E., Candes, E. J., et al. (2014). Robust subspace clustering. The Annals of Statistics, 42(2), 669{699. Somandepalli, K., Guha, T., Martinez, V. R., Kumar, N., Adam, H., & Narayanan, S. (2021). Computational media intelligence: Human-centered machine analysis of media. Proceedings of the IEEE. Somandepalli, K., Gupta, R., Nasir, M., Booth, B. M., Lee, S., & Narayanan, S. S. (2016). Online aect tracking with multimodal kalman lters. In Proceedings of the 6th international workshop on audio/visual emotion challenge (pp. 59{66). 
Somandepalli, K., Hebbar, R., & Narayanan, S. (2020). Multi-face: Self-supervised multiview adaptation for robust face clustering in videos. arXiv preprint arXiv:2008.11289 . Somandepalli, K., Kelly, C., Reiss, . E., Castellanos, F. X., Milham, M. P., & Di Martino, A. (2015). Short-term test{retest reliability of resting state fmri metrics in children with and without attention-decit/hyperactivity disorder. Developmental Cognitive Neuroscience, 15, 144 83{93. Somandepalli, K., Kumar, N., Guha, T., & Narayanan, S. S. (2017). Unsupervised discovery of character dictionaries in animation movies. IEEE Transactions on Multimedia, 539{551. Somandepalli, K., Kumar, N., Jati, A., Georgiou, P., & Narayanan, S. (2019). Multiview shared subspace learning across speakers and speech commands. Proc. Interspeech 2019 , 2320{2324. Somandepalli, K., Kumar, N., Travadi, R., & Narayanan, S. (2019a). Multimodal representation learning using deep multiset canonical correlation. Somandepalli, K., Kumar, N., Travadi, R., & Narayanan, S. (2019b). Multimodal representation learning using deep multiset canonical correlation. arXiv preprint arXiv:1904.01775 . Somandepalli, K., Martinez, V., Kumar, N., & Narayanan, S. (2018). Multimodal representation of advertisements using segment-level autoencoders. In Proceedings of the 20th acm international conference on multimodal interaction (pp. 418{422). Somandepalli, K., & Narayanan, S. (2019). Reinforcing self-expressive representation with con- straint propagation for face clustering in movies. In Icassp 2019-2019 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 4065{4069). Somandepalli, K., & Narayanan, S. (2020). Generalized multi-view shared subspace learning using view bootstrapping. arXiv preprint arXiv:2005.06038 . Soomro, K., Zamir, A. R., & Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 . Sridhar, S., Rempe, D., Valentin, J., Soen, B., & Guibas, L. J. (2019). Multiview aggregation for learning category-specic shape reconstruction. In Advances in neural information processing systems (pp. 2351{2362). Sridharan, K., & Kakade, S. M. (2008). An information theoretic framework for multi-view learn- ing. Srivastava, N., & Salakhutdinov, R. R. (2012). Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems (pp. 2222{2230). Su, H., Maji, S., Kalogerakis, E., & Learned-Miller, E. (2015). Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the ieee international conference on computer vision (pp. 945{953). 145 Sun, S., Liu, Y., & Mao, L. (2019). Multi-view learning for visual violence recognition with maximum entropy discrimination and deep features. Information Fusion, 50, 43{53. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104{3112). Sutter, T. M., Daunhawer, I., & Vogt, J. E. (2020). Multimodal generative learning utilizing jensen-shannon-divergence. arXiv preprint arXiv:2006.08242 . Tapaswi, M., B auml, M., & Stiefelhagen, R. (2012). \knock! knock! who is it?" probabilistic person identication in tv-series. In 2012 ieee cvpr (pp. 2658{2665). Tapaswi, M., Law, M. T., & Fidler, S. (2019). Video face clustering with unknown number of clusters. In Proceedings of the ieee iccv (pp. 5027{5036). Taskar, B., Segal, E., & Koller, D. (2001). 
Probabilistic classication and clustering in relational data. In International joint conference on articial intelligence (Vol. 17, pp. 870{878). Tian, Y., Krishnan, D., & Isola, P. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849 . Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Computer vision (iccv), 2015 ieee international conference on (pp. 4489{4497). Travadi, R., Segbroeck, M. V., & Narayanan, S. S. (2014). Modied-prior i-vector estimation for language identication of short duration utterances. In Fifteenth annual conference of the international speech communication association. Tsai, Y.-H. H., Liang, P. P., Zadeh, A., Morency, L.-P., & Salakhutdinov, R. (2018). Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176 . Tsekeridou, S., & Pitas, I. (1999, Jul). Audio-visual content analysis for content-based video indexing. In Proceedings ieee international conference on multimedia computing and systems (Vol. 1, p. 667-672 vol.1). doi: 10.1109/MMCS.1999.779279 Vallet, F., Essid, S., & Carrive, J. (2012). A multimodal approach to speaker diarization on tv talk-shows. IEEE transactions on multimedia, 15(3), 509{520. van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Van Noorden, R. (2020). The ethical questions that haunt facial-recognition research. Nature, 587(7834), 354{358. 146 Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 . Vicol, P., Tapaswi, M., Castrejon, L., & Fidler, S. (2018). Moviegraphs: Towards understanding human-centric situations from videos. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 8581{8590). Viola, P., & Jones, M. (2001). Robust real-time object detection. In International journal of computer vision. Vretos, N., Solachidis, V., & Pitas, I. (2011). A mutual information based face clustering algorithm for movie content analysis. Image and Vision Computing, 29(10), 693{705. Vukoti c, V., Raymond, C., & Gravier, G. (2016). Multimodal and crossmodal representation learning from textual and visual features with bidirectional deep neural networks for video hyperlinking. In Proceedings of the 2016 acm workshop on vision and language integration meets multimedia fusion (pp. 37{44). Vukoti c, V., Raymond, C., & Gravier, G. (2017). Generative adversarial networks for multimodal representation learning in video hyperlinking. In Proceedings of the 2017 acm on international conference on multimedia retrieval (pp. 416{419). Wang, F., Chen, L., Li, C., Huang, S., Chen, Y., Qian, C., & Change Loy, C. (2018). The devil of face recognition is in the noise. In Proceedings of the european conference on computer vision (eccv) (pp. 765{780). Wang, W., Arora, R., Livescu, K., & Bilmes, J. (2015). On deep multi-view representation learning. In International conference on machine learning (pp. 1083{1092). Wang, W., Tran, D., & Feiszli, M. (2020). What makes training multi-modal classication networks hard? In Proceedings of the ieee/cvf conference on computer vision and pattern recognition (pp. 12695{12705). Wang, W., Wang, N., Wu, X., You, S., & Neumann, U. (2017). Self-paced cross-modality transfer learning for ecient road segmentation. In 2017 ieee international conference on robotics and automation (icra) (pp. 1394{1401). Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. 
In Proceedings of the ieee international conference on computer vision (pp. 2794{2802). 147 Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.-F., . . . Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the ieee/cvf conference on computer vision and pattern recogni- tion (pp. 6629{6638). Wang, Y.-X., Xu, H., & Leng, C. (2013). Provable subspace clustering: When lrr meets ssc. In Advances in neural information processing systems (pp. 64{72). Wang, Z., Hu, R., Liang, C., Yu, Y., Jiang, J., Ye, M., . . . Leng, Q. (2015). Zero-shot person re- identication via cross-view consistency. IEEE Transactions on Multimedia, 18(2), 260{272. Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. CoRR, abs/1804.03209. Whitelam, C., Taborsky, E., Blanton, A., Maze, B., Adams, J., Miller, T., . . . others (2017). Iarpa janus benchmark-b face dataset. In Proceedings of the ieee conference on computer vision and pattern recognition workshops (pp. 90{98). Williamson, J. (1978). Decoding advertisements (Vol. 4). Marion Boyars London. Wolf, L., Hassner, T., & Maoz, I. (2011). Face recognition in unconstrained videos with matched background similarity. In Cvpr 2011 (pp. 529{534). Wu, B., Lyu, S., Hu, B.-G., & Ji, Q. (2013). Simultaneous clustering and tracklet linking for multi-face tracking in videos. In Proceedings of the ieee iccv (pp. 2856{2863). Wu, B., Zhang, Y., Hu, B.-G., & Ji, Q. (2013). Constrained clustering and its application to face clustering in videos. In Proceedings of the ieee cvpr (pp. 3507{3514). Wu, M., & Goodman, N. (2018). Multimodal generative models for scalable weakly-supervised learning. In Advances in neural information processing systems (pp. 5575{5585). Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 1912{1920). Xiang, Y., Alahi, A., & Savarese, S. (2015). Learning to track: Online multi-object tracking by decision making. In Proceedings of the ieee international conference on computer vision (pp. 4705{4713). Xiao, S., Tan, M., & Xu, D. (2014). Weighted block-sparse low rank representation for face clustering in videos. In Computer vision { eccv 2014: 13th european conference, zurich, 148 switzerland, september 6-12, 2014, proceedings, part vi (pp. 123{138). Springer International Publishing. doi: 10.1007/978-3-319-10599-4 9 Xu, C., Tao, D., & Xu, C. (2013). A survey on multi-view learning. arXiv preprint arXiv:1304.5634 . Yan, F., & Mikolajczyk, K. (2015). Deep correlation for matching images and text. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 3441{3450). Yan, J., Xu, H., & Li, L. (2019). Modeling and applications for temporal point processes. In Pro- ceedings of the 25th acm sigkdd international conference on knowledge discovery; data mining (p. 3227{3228). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/3292500.3332298 doi: 10.1145/3292500.3332298 Yang, X., Ramesh, P., Chitta, R., Madhvanath, S., Bernal, E. A., & Luo, J. (2017). Deep multimodal representation learning from temporal data. In Proceedings of the ieee conference on computer vision and pattern recognition (pp. 5447{5455). Yi, D., Lei, Z., Liao, S., & Li, S. Z. (2014). Learning face representation from scratch. 
arXiv preprint arXiv:1411.7923 . Zadeh, A., Liang, P. P., Mazumder, N., Poria, S., Cambria, E., & Morency, L.-P. (2018). Mem- ory fusion network for multi-view sequential learning. In Thirty-second aaai conference on articial intelligence. Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016, Oct). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499- 1503. doi: 10.1109/LSP.2016.2603342 Zhang, R., Isola, P., & Efros, A. A. (2016). Colorful image colorization. In European conference on computer vision (pp. 649{666). Zhang, S., Gong, Y., & Wang, J. (2016). Deep metric learning with improved triplet loss for face clustering in videos. In Pacic rim conference on multimedia (pp. 497{508). Zhang, T., & Kuo, C. C. J. (2001, May). Audio content analysis for online audiovisual data segmentation and classication. IEEE Transactions on Speech and Audio Processing, 9(4), 441-457. doi: 10.1109/89.917689 Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2016). Joint face representation adaptation and clustering in videos. In European conference on computer vision (pp. 236{251). 149 Zhang, Z., Yuan, Y., Shen, X., & Li, Y. (2018). Low resolution face recognition and reconstruction via deep canonical correlation analysis. In 2018 ieee international conference on acoustics, speech and signal processing (icassp) (p. 2951-2955). doi: 10.1109/ICASSP.2018.8461985 Zhao, J., Xie, X., Xu, X., & Sun, S. (2017, November). Multi-view learning overview. Inf. Fusion, 38(C), 43{54. doi: 10.1016/j.inus.2017.02.007 Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Computer vision and pattern recognition (cvpr), 2016 ieee conference on (pp. 2921{2929). Zhou, C., Zhang, C., Li, X., Shi, G., & Cao, X. (2014, jul). Video face clustering via constrained sparse representation. In 2014 ieee international conference on multimedia and expo (icme) (p. 1-6). Los Alamitos, CA, USA: IEEE Computer Society. doi: 10.1109/ICME.2014.6890188 150 Appendices 151 Appendix A Proof for total covariance This chapter presents the complete proof for the total covariance formulation used for generalized multiview correlation in Chapter 4. Total-view Covariance = Within-view covariance + Between-view covariance Consider the sum of R b and R w which includes M 2 terms. Note that we assume X l : l = 1;:::;M to have mean-zero columns. Therefore covariance estimation is just the cross-product: R w + R b = 1 M M X l=1 X l (X l ) > + 1 M M X k=1 M X l=1;l6=k X l (X k ) > = 1 M M X l=1 M X k=1 X l (X k ) > = 1 M M X l=1 X l M X l=1 X l > = R t The rst expression comes from the denition of R b and R w and R t denotes the total-view covariance matrix where the total-view matrix P M j=1 X j is a sum of the feature matrices across all the views considered. Thus, R t can be easily estimated as the covariance of a single total-view matrix, without having to consider the sum of M 2 M covariance matrices. Note that we excluded the normalization factor N 1 in the esimtation of the covariance terms above. This gives us the following useful relation which simplies many computations in practice. R t = R b + R w (A.0.1) 152 Appendix B Upper bound for multiview correlation This chapter presents a loose upper bound for the objective function used for generalized multiview correla- tion in Chapter 4. 
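Before stating the bound, the quantities it involves can be illustrated numerically. The sketch below is only illustrative, not part of the proposed method: it assumes synthetic zero-mean view matrices and uses a linear projection W from a generalized eigensolver in place of the learned network projection. It checks the total-covariance identity of Appendix A and evaluates the correlation ratio that the proposition bounds.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M, d, N = 5, 8, 1000                       # number of views, feature dim, samples

# Synthetic zero-mean view matrices X_l (d x N): a shared signal plus view-specific noise.
shared = rng.standard_normal((d, N))
X = [shared + 0.5 * rng.standard_normal((d, N)) for _ in range(M)]
X = [x - x.mean(axis=1, keepdims=True) for x in X]

R_w = sum(x @ x.T for x in X) / M          # within-view covariance
R_t = (sum(X) @ sum(X).T) / M              # covariance of the total-view matrix
R_b = R_t - R_w                            # between-view covariance via Eq. (A.0.1)

# Sanity check: R_b matches the explicit sum over the M^2 - M cross-view terms.
R_b_direct = sum(X[l] @ X[k].T for l in range(M) for k in range(M) if k != l) / M
assert np.allclose(R_b, R_b_direct)

# Projection W: top-k generalized eigenvectors of (R_b, R_w); columns satisfy W^T R_w W = I.
k = 3
_, W = eigh(R_b, R_w, subset_by_index=[d - k, d - 1])
rho = np.trace(W.T @ R_b @ W) / ((M - 1) * np.trace(W.T @ R_w @ W))
print(f"multiview correlation: {rho:.3f}")  # stays below 1, consistent with the bound below
```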
Proposition: Multi-view correlation objective is bounded above by 1 Recall the multi-view correlation objective for M views: M = max W 1 d(M 1) Tr W > R b W Tr(W > R w W) (B.0.1) It is desirable to have an upper bound for the objective similar to the correlation coecient metric which is normalized to have a maximum value of 1. Let us begin with the denition of the multi-view correlation matrix: = max W 1 M 1 W > R b W W > R w W (B.0.2) Here, W2R dk ;kd Dene a matrix Y l = W > X l 2 R kN ;k d where the column vectors y2 R k are a low-dimensional projection of the input features X. The column vector elements are y l i 2R :i = 1;:::;N;l = 1;:::;M with that the ratio in Eq. B.0.2, ignoring the max operation can be written as: 153 = 1 M 1 W > (X 1 X > 2 +::: + X M X > M1 )W W > (X 1 X > 1 +::: + X M X > M )W = 1 M 1 (Y 1 Y > 2 +::: + Y M Y > M1 ) (Y 1 Y > 1 +::: + Y M Y > M ) = 1 M 1 P i P l P k6=l y l i y k i P i P l (y l i ) 2 = 1 M 1 r b r w Using the relation r t = r b +r w from Sec. ??, to show that 1, we equivalently prove the following expression to be non-negative: 0 (M 1)r w r b = (M 1)r w (r t r w ) =Mr w r t =M X i X l (y l i ) 2 X i X l y l i 2 :=F Now, we need to nd the y l i that minimizes F . Therefore, take the gradient of F with respect to y l i and check if the curvature is non-negative where the gradient is zero. @F @y l i = 2My l i 2 X j X l y k j X l kl ji = 2My l i 2 X k y k j (B.0.3) @ 2 F @y l i @y k j = 2M lk ij 2 X t jt ji = 2 ji (M lk 1) :=J (B.0.4) Solving for @F @y = 0 has a unique solution: y l i = 1 M P k y k i = y i . Putting this result back gives F = 0 at this solution. To show this solution minimizes F and therefore < 1, we need to show that the Jacobian J in Eq. 5 has only non-negative eigenvalues. Note that there are only variables in Eq. 5. Thus, in a matrix form across all views we have J =MI M I M yielding non-negative eigenvalues. Hence 1 154 Appendix C Upper bound for bootstrapped multiview correlation This chapter presents a theoretical upper bound for the bootstrapped multiview correlation objective pro- posed in Chapter 4. C.1 Upper Bound for Bootstrapped Within-View Covariance Lemma C.1.1. (Subsampled view matrices, approximate isotropy) Let A be a md matrix created by subsamplingm views from a larger, unknown number of views. The rows A i of the matrix A are independent subgaussian random vectors in R d and a second moment matrix =EA i A i . Then for every t 0, with probability at least 1 2 exp ct 2 we have 1 m A > A max(; 2 ) where =C r d m + t p m (C.1.1) Here C;c> 0 depend only on the subgaussian norm K = max i kA i k 2 of the view space Proof. This is a straight-forward generalization of Theorem 5.39 in the matrix concentration theory book by Vershynin (Vershynin, 2010) for non-isotropic spaces. The proof involves covering argument which uses a netN to discretize the compact view space, which is all the vectors z in a unit sphereS d1 . Similar to thr analysis for isotropic case discussed in (Vershynin, 2010), we prove this lemma in three steps: 1. N Approximation: Bound the normkAzk 2 for all z2R d s.t.kzk 2 = 1 by discretizing the sphere with a -netN . 2. Concentration: Fix a vector z;kzk 2 = 1, and derive a tight bound ofkAzk 2 . 155 3. Union bound: Take a union bound for all the z in the net. Step 1:N Approximation. We use the following statement from matrix concetration theory of isotropic spaces (Vershynin, 2010): 9> 0; B > B I max(; 2 ) =) kBk 2 1 + (C.1.2) We evaluate the operator norm in eq. 
C.1.1 as follows: 1 m A > A = 1 m A > A 1 m EA > A = 1 m m i=1 A i A > i 1 m m i=1 EA i A > i Note that the feature vectors A i are arranged in rows of the view matrix A. Let D := 1 m P m i=1 A i A > i 1 m m i=1 EA i A > i . Choose a -netN such thatjNj 9 d which provides sucient coverage for the unit sphereS d1 . Thus, = 1=4. Then, for every z2N we have (using Lemma 5.4 in (Vershynin, 2010)), kDk max z2N kzk=1 jhDz; zij 1 1 2 0 max x2N kzk=1 z > Dz 2 max z2N z > Dz For some > 0, we want to show that the operator norm of D is concentrated as max z2N z > Dz 2 where := max(; 2 ) (C.1.3) Step 2: Concentration. Fix any vector z2S d1 and deneY i = A > i zEA > i z where A i are subgaussian random vectors by assumption withkA i k 2 = K. Thus, Y i i = 1;:::;m are independent subgaussian random variables. The subgaussian norm of Y i is calculated as, kY i k 2 = A > i zEA > i z 2 2 A > i z 2 (C.1.4) 2kA i k 2 kzk = 2K The above relation is an application of triangular and Jensen's inequalities:kXEXk 2kXk withjEXj EjXjjXj . Similarly,Y 2 i are independent subexponential random variables with the subexponential norm 156 K e =kY i k 1 kY i k 2 2 4K 2 . Finally, by denition of Y i , we have z > Dz = 1 m j m i=1 Y 2 i j (C.1.5) We use the exponential deviation inequality in Corollary 5.17 from (Vershynin, 2010) to control the summa- tion term in eq. C.1.5 to give: P z > Dz 2 =P 1 m j m i=1 Y 2 i j 2 (C.1.6) 2 exp " c min 2 4K 2 e ; 2K e m # Note that := max(; 2 ). If 1 then = 2 . Thus, min(; 2 ) = 2 . Using this and the fact that K 2kY i k 2 1, we get P z > Dz 2 ) 2 exp " c 1 K 4 2 m # (C.1.7) 2 exp " c 1 K 4 (C 2 d +t 2 ) # by substituting =C q d m + t p m and using (a +b) 2 a 2 +b 2 . Step 3: Union Bound. Using Boole's inequality to compute the union bound over all the vectors z in the netN with cardinalityjNj = 9 d , we get P ( max z2N 1 m A > A 2 ) 9 d 2 exp " c 1 K 4 (C 2 d +t 2 ) # (C.1.8) Pick a suciently large C =C K K 2 p log 9=c 1 , then the probability P ( max z2N 1 m A > A 2 ) 2 exp d + c1t 2 K 4 (C.1.9) 2 exp c 1 t 2 K 4 Thus with a high probability of at least 1 2 exp (ct 2 ) eq. C.1.1 holds. In other words, the deviation of the subsampled view matrix from the entire view space, in spectral sense isO(d=m) 157 Lemma C.1.2. Subsampled within-view covariance bound. Let X be the Nmd tensor whose elements A2R md are identically distributed matrices with rows A i representing m-views sampled from a larger set of views in R d . If A i are independent sub-gaussian vectors with second moment w , then for every t 0, with probability at least 1 2 exp (ct 2 ) , we have kR w w kN C 2 d +t 2 m (C.1.10) Here R w is the sum of within-view covariance matrices for m views and C > 0 depends only on the sub- gaussian norm K = max i kA i k 2 of the subsampled view space. Proof. Let us now consider the rearrangedm-view subsampled data tensor X2R Nmd = [A (1) ;:::; A (N) ]. Let A be the md view-specic data sampled identically for N times. Without loss of generality, assume the rows to be zero mean which makes covariance computation simpler. The rows A i are independent sub- gaussian vectors with second moment matrix =EA > A. The between-view covariance matrix R w for m views can be written as: R w = 1 m N X i=1 m X j=1 A j A j = N X i=1 1 m A (i)> A (i) (C.1.11) The matrix A is a sampling of m views from an unknown and larger number of views M for which the R w is constructed. 
We want to bound the dierence between this term and the within-view covariance of the 158 whole space using lemma C.1.1: kR w w k = N X i=1 1 m A (i)> A (i) N X i=1 (i) w = N X i=1 1 m A (i)> A (i) (i) w [By triangular inequality] N X i=1 1 m A (i)> A (i) (i) w [Identical sampling] =N 1 m A > AEA > A [From lemma C.1.1] N max (; 2 ) with =C r d m + t p m =N r Cd +t m 2 Because [d;m> 1 and (a +b) 2 a 2 +b 2 ] N C 2 d +t 2 m 159 C.2 Upper Bound for Bootstrapped Total-View Covariance Lemma C.2.1. (Subsampled total-view covariance bound) Let X be the Nmd tensor whose elements A2R md are identically distributed matrices with rows A i representing m-views sampled from a larger set of views inR d . Construct a total-view matrix X2R md by summing entries across all views. Let t be the second moment of the total-view space. Then, we have kR t t kNm (C.2.1) Here R t is the total-view covariance matrix and c 2 > 0 depends on the range of the total view space k such thatjXjk. Proof. Consider the m-view subsampled data tensor rearranged with feature vectors as rows to get X2 R Nmd = [A (1) ;:::; A (N) ] with rows of A as A i . Without loss of generality, assume the d-dimensional rows of A to be zero mean which makes estimating covariances simpler. The covariance R t of the total view matrix can be written as follows R t = 1 m N X i=1 m X i=1 A (i) m X i=1 A (i) > (C.2.2) = 1 m N X i=1 m X j=1 A (i) j m X j=1 A (i) j = 1 m N X i=1 m X j=1 A (i) j m X j=1 A (i) j = 1 m N X i=1 W i W > i We want to bound the dierence between this subsampled total-view covariance matrix and the second moment of the total-view space. Let a (i) = P m j=1 A (i) j for i = 1;:::;N. The vector a (i) is the sum-of-views. 160 We use a useful application of Jensen's inequality here:kXEXk 2kXk withjEXjEjXjjXj kR t t k = 1 m N X i=1 a (i) a (i)> N X i=1 (i) t [By triangular inequality] 1 m N X i=1 a (i) a (i)> (i) t [Identical sampling] = N m aa > Eaa > [Triangular and Jensen's inequality] N m aa > [From assumption:kak 2 = 1] = N m m 2 =Nm 161 C.3 Theorem: Error of the Bootstrapped Multi-view Correlation Theorem C.3.1. Let X = [A (1) ;:::; A (N) ] be the md matrices of m views sampled from an unknown number of views M. Let the rows A l of the view matrices A be independent subgaussian vectors in R d with kA l k 2 = 1 :l = 1;:::;m. Then for any t 0, with probability at least 1 2 exp ct 2 , we have m max 1;C m 2 ( p d +t) 2 Here, m and are the MV-CORR objectives for subsampled views m and the total number of views M respectively. The constant C depends only on the subgaussian norm K of the view space, with K = max i;l A (i) l 2 Proof. Starting from the objective dened in the main paper and ignoring the normalization factors, the objective m for m views can be rewritten as: m = Tr(W > R b W) Tr(W > R w W) = hR b + b b ; WW > i hR w + w w ; WW > i h b ; WW > i +kR t t k +kR w w k h w ; WW > ikR w w k where b and w are the second moment matrices for the between-view and within-view covariances respec- tively. This can be written using cyclical properties of trace function and relation between spectral norm and trace. Additionally note from the previous result that we can use total covariance to simplify the estimation of R b . That is, R b = R t R w . The rest follows through triangular inequalities. Observe that the ratioh b ; WW > i=h w ; WW > i is the optimal we are interested to bound the approximation m from. We can show thatjj 1. Additionally the two trace terms are sum of normalized eigenvalues (each bounded above by 1). 
Thush b ; WW > i2 [1;d] andh w ; WW > i2 [1;d]. Furthermore, from lemma C.1, we know that the norm term with R w is greater than 1 i.e.,kR t w k C d m > 1, because we always choose the embedding size to be greater than the number of views subsampled. With these inequalities. We can loosely bound the above inequality for m as: 162 m h b ; WW > i h w ; WW > i kR t t k +kR w w k kR w w k [From Lemmas C.1 and C.2] kR t t k kR w w k 2Nm NC ( p d+t) 2 m C 0 m 2 d O( m 2 d ) where C 0 is a constant term that depends only the subgaussian norm of the d-dimensional feature vectors. 163 Appendix D SAIL MultiFace dataset details This chapter lists the titles of all movie videos used in creating the SAIL MultiFace dataset described in Chapter 6. Movie titles of the videos used for harvesting face tracks The movie titles for the 240 movies spanning the years 2014|2018 in our dataset that we used for harvesting the 169K face tracks with weak labels at a movie level are listed in tables D.1 and D.2. 164 Table D.1: Movie titles - Part 1 102 Not Out Chappaquiddick Home Again 15 17 Paris Christopher Robin Hotel Transylvania 3 47 Meters Down Churchill House With A Clock In Its Walls 7 Days In Entebbe Close Encounters Of The Third Kind How To Be A Latin Lover A Bag Of Marbles Collide How To Be Single A Dogs Purpose Columbus Hurricane Heist A Great Wall Commuter I Am Not Your Negro A Haunted House 2 Condorito The Movie I Can Only Imagine A Most Wanted Man Crazy Rich Asians I Feel Pretty A Question Of Faith Creed Ii Indivisible A Quiet Passion Daddys Home 2 Insidious The Last Key A United Kingdom Darkest Hour Instant Family A-x-l Darkest Minds Isle Of Dogs About Last Night Devils Due Jane Addicted Diary Of A Wimpy Kid The Long Haul John Wick Adrift Dog Days Johnny English Strikes Again All Saints Dunkirk Jumanji Welcome To The Jungle Aloha Every Day Jurassic World Alpha Everybody Loves Somebody Just Getting Started An Inconvenient Sequel Truth To Power Everything Everything Justice League Ant-man Wasp Fantastic Beasts The Crimes Of Grimwald Keanu Arrival Finding Your Feet Kin As Above So Below First Man King Arthur Legend Of The Sword At Eternitys Gate Flatliners King Kong Skull Island Avengers Innity War Forever My Girl Leave No Trace Barbershop The Next Cut Geostorm Let There Be Light Battle Of The Sexes Ghost In The Shell Life Of The Party Beautifully Broken Gifted Lights Out Beauty And The Beast God Bless Broken Road Little Women Before I Fall God Bless The Broken Road Logan Begin Again Gods Not Dead A Light In Darkness Logan Lucky Bilal A New Breed Of Hero Going In Style Love Simon Birth Of The Dragon Goodbye Christopher Robin Love The Coopers Black Mass Goosebumps 2 Mamma Mia Here We Go Again Black Panther Gosnell Americas Biggest Serial Killer Marjorie Prime Bohemian Rhapsody Green Book Marshall Book Club] Guardians Of The Galaxy 2 Maudie Boyhood Happy Death Day Maze Runner Death Cure Breaking In Hate U Give Me Before You Central Intelligence Hearts Beat Loud Megan Leavey 165 Table D.2: Movie titles - Part 2 Midnight Sun Same Kind Of Dierent As Me The Lost City Of Z Miracle Season Samson The Man Who Invented Christmas Mission Impossible Fallout Seagull The Meg Moonlight Searching The Mountain Between Us Mortal Engines Secret In Their Eyes The Mummy Mr Holmes Seventh Son The Mystery Of Happiness Murder On The Orient Express Sgt Stubby American Hero The Nice Guys My Cousin Rachel Sherlock Gnomes The November Man Night School Show Dogs The Old Man The Gun Nightcrawler Skyscraper The Post Nutcracker And The 
Four Realms Slender Man The Promise Oceans 8 Solo Star Wars The Resurrection Of Gavin Stone Only The Brave Split The Sense Of An Ending Operation Finale Star Wars The Last Jedi The Shack Overboard Step The Single Moms Club Pacic Rim Uprising Steve Jobs The Space Between Us Pad Man Still Alice The Stray Paddington 2 Suicide Squad The Trip To Spain Paranormal Activity The Ghost Dimension Sully The Zookeepers Wife Paranormal Activity The Marked Ones Table 19 Thor Ragnarok Paris Can Wait Teen Titans Go To The Movies Tomb Raider Paul Apostle Of Christ The Accountant Top Five Peter Rabbit The Book Of Henry Transformers The Last Knight Phoenix Forgotten The Bookshop Tyler Perrys Boo 2 Pirates Of The Caribbean Dead Men Tell No Tales The Case For Christ Unbroken Path To Redemption Pitch Perfect 3 The Circle Uncle Drew Queen Of Katwe The Conjuring 2 Unnished Business Quiet Place The Dark Tower Valerian And The City Of A Thousand Planets Raazi The Equalizer Venom Railroad Tigers The Fate Of The Furious Victoria And Abdul Rampage The Founder War Dogs Ready Player One The Glass Castle Wildlife Rings The Greatest Showman Winchester Risk The Gunman Wish Upon Robin Hood The Hunger Games Mockingjay Pt2 Wonder Rock Dog The Judge Wonder Woman Roman J Israel Esq The Lady In The Van Wonderstruck Room The Last Match Wrinkle In Time Rumble The Indians Who Rocked The World The Legend Of Tarzan Xxx Return Of Xander Cage Sabans Power Rangers The Lobster Ya Veremos 166 Appendix E SAIL Movie Character Benchmark: Additional results This chapter presents additional implementation details and clustering performance metrics for the experi- ments presented in Chapter 6. Implementation details for baseline face verication experiments Four baseline methods were adopted for verication experiments, to determine the best performing feature- embeddings. We adapted the open-source implementations for each of the models: FaceNet 1 (Schro et al., 2015), SphereFace 2 (Liu et al., 2017), VggFace2 3 (Cao et al., 2018), and Probabilistic Face Embeddings 4 (PFE (Y. Shi & Jain, 2019)) For each of the methods, we perform two sets of experiments on IJB-B; a) directly on raw face images, b) on face images aligned using similarity transformation for 5 facial landmarks. For landmark detection, we use the outputs from Multi-task Cascaded Convolutional Networks (MTCNN (K. Zhang, Zhang, Li, & Qiao, 2016)). We use the mtcnn module available in PyPI to detect landmarks for eye centers, nose tip, and mouth corners in face images scaled to a xed size of (96, 112). We then use the codebase provided in the SphereFace git repository to align with a set of reference points using a similarity transformation. We use the following coordinates as reference points for each of the landmarks: 1 Facenet: github.com/davidsandberg/facenet 2 SphereFace: github.com/clcarwin/sphereface pytorch 3 VggFace2: github.com/WeidiXie/Keras-VGGFace2-ResNet50 4 PFE: github.com/seasonSH/Probabilistic-Face-Embeddings 167 i. Left eye center: [30.2946, 51.6963] ii. Right eye center: [65.5318, 51.5014] iii. Nose tip : [48.0252, 71.7366] iv. Left corner of mouth : [33.5493, 92.3655] v. Right corner of mouth : [62.7299, 92.2041] We then proceed to replicate the preprocessing steps for each of the method, which involves image scaling and standardization. The ROC curves for all verication results in IJB-B with and without face alignment and YTFaces are shown in Fig. E.0.1. 
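For reference, the landmark-based alignment step described above can be sketched as follows. This is a minimal illustration, assuming the mtcnn PyPI package and OpenCV are installed; the helper name align_face is hypothetical, and each baseline repository applies its own variant of this step (e.g., the SphereFace codebase ships its own transformation utilities).

```python
# Illustrative sketch: MTCNN five-point landmark detection + similarity-transform alignment.
import cv2
import numpy as np
from mtcnn import MTCNN

# Reference coordinates (in the 96x112 crop) for the five landmarks listed above.
REF_POINTS = np.float32([
    [30.2946, 51.6963],   # left eye center
    [65.5318, 51.5014],   # right eye center
    [48.0252, 71.7366],   # nose tip
    [33.5493, 92.3655],   # left corner of mouth
    [62.7299, 92.2041],   # right corner of mouth
])

detector = MTCNN()

def align_face(image_bgr, out_size=(96, 112)):
    """Detect five facial landmarks on the scaled face crop and warp it to the reference frame."""
    face = cv2.resize(image_bgr, out_size)  # landmarks are detected on the (96, 112) crop
    detections = detector.detect_faces(cv2.cvtColor(face, cv2.COLOR_BGR2RGB))
    if not detections:
        return None
    kp = detections[0]["keypoints"]
    src = np.float32([kp["left_eye"], kp["right_eye"], kp["nose"],
                      kp["mouth_left"], kp["mouth_right"]])
    # Similarity transform (rotation + uniform scale + translation) from the 5 point pairs.
    M, _ = cv2.estimateAffinePartial2D(src, REF_POINTS, method=cv2.LMEDS)
    if M is None:
        return None
    return cv2.warpAffine(face, M, out_size)
```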
Face clustering performance for the benchmarking dataset

Tables E.1, E.2, and E.3 present clustering performance measures for the videos in our benchmarking dataset. We report results without adaptation of the VggFace2 embeddings (No adaptation), with improved triplet-loss adaptation (ImpTriplet), and with multiview correlation adaptation (MvCorr). The two clustering methods we explored are affinity propagation (AP) and hierarchical agglomerative clustering (HAC). To summarize, MvCorr adaptation yielded significantly better clustering performance than the unadapted VggFace2 embeddings overall. This shows the benefit of adapting pretrained embeddings with domain-specific data. While AP gave clusters that were nearly 100% pure, it incurred over-clustering.

Table E.1: No adaptation: VggFace2 clustering performance for the benchmarking dataset. Measures: homogeneity (homog), completeness (comp), V-measure (v-meas), cluster purity (purity), over-clustering index (OCI); K is the number of clusters.

movie | AP: homog, comp, v-meas, purity, OCI | HAC: homog, comp, v-meas, purity, OCI | K
ALN | 95.5, 39.9, 56.3, 97.7, 8.2 | 90.1, 77.4, 83.2, 94.7, 1 | 10
BFF | 95.0, 64.5, 76.9, 96.7, 2.8 | 95.40, 95.25, 95.3, 97.3, 1 | 12
DD2 | 93.9, 41.0, 57.1, 97.1, 5.8 | 90.2, 76.1, 82.6, 95.0, 1 | 10
HF | 86.3, 53.3, 65.9, 86.8, 4.3 | 79.9, 73.7, 76.7, 82.4, 1 | 24
MT | 97.1, 51.7, 67.5, 98.6, 4.4 | 92.8, 84.8, 88.6, 95.9, 1 | 10
NH | 95.6, 43.2, 59.5, 97.9, 5.7 | 89.1, 71.8, 79.5, 93.6, 1 | 10
Mean | 93.9, 48.9, 63.9, 95.8, 5.2 | 89.6, 79.8, 84.3, 93.2, - | -

Table E.2: ImpTriplet adaptation: VggFace2+ImpTriplet clustering performance for the benchmarking dataset (same measures as Table E.1).

movie | AP: homog, comp, v-meas, purity, OCI | HAC: homog, comp, v-meas, purity, OCI | K
ALN | 96.6, 51.5, 57.4, 98.9, 9.1 | 91.7, 78.4, 81.2, 96.7, 1 | 10
BFF | 96.2, 65.6, 77.9, 98.0, 3.2 | 96.6, 96.3, 96.6, 99.1, 1 | 12
DD2 | 95.8, 52.6, 58.8, 98.6, 6.8 | 91.8, 78.0, 83.9, 96.7, 1 | 10
HF | 88.0, 55.2, 67.7, 88.5, 6.0 | 80.9, 75.4, 78.0, 83.7, 1 | 24
MT | 98.4, 53.4, 68.9, 99.9, 5.0 | 94.3, 86.5, 89.7, 97.5, 1 | 10
NH | 97.4, 49.7, 60.6, 99.4, 6.1 | 90.9, 73.6, 81.1, 94.9, 1 | 10
Mean | 95.4, 54.7, 65.2, 97.2, 6.0 | 91.1, 81.4, 85.1, 94.8, - | -

Table E.3: MvCorr adaptation: VggFace2+MvCorr clustering performance for the benchmarking dataset (same measures as Table E.1).

movie | AP: homog, comp, v-meas, purity, OCI | HAC: homog, comp, v-meas, purity, OCI | K
ALN | 97.8, 52.5, 60.4, 99.9, 10.1 | 92.5, 79.2, 86.8, 97.3, 1 | 10
BFF | 97.1, 68.8, 84.3, 99.0, 3.1 | 97.8, 96.7, 97.6, 99.9, 1 | 12
DD2 | 95.8, 53.5, 60.1, 99.8, 7.8 | 93.1, 79.0, 85.7, 96.9, 1 | 10
HF | 88.1, 58.4, 70.1, 94.0, 5.1 | 81.9, 76.0, 82.2, 87.1, 1 | 24
MT | 99.2, 63.9, 69.6, 98.2, 6.1 | 94.4, 87.6, 92.1, 98.8, 1 | 10
NH | 97.7, 55.1, 62.0, 98.0, 8.2 | 91.4, 74.3, 84.0, 96.3, 1 | 10
Mean | 95.9, 58.7, 67.7, 98.2, 6.7 | 91.9, 82.2, 88.1, 96.0, - | -

Figure E.0.1: ROC curves comparing the performance of face embeddings trained on images in videos for (a) the IJB-B dataset without face alignment, (b) the IJB-B dataset with model-specific face alignment, and (c) the YTFaces dataset. VGGFace2 and FaceNet embeddings not only performed well on IJB-B regardless of face alignment, but also performed consistently well on YTFaces, which is typically more challenging as it consists of in-the-wild videos mined from YouTube.
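For reference, the sketch below (assuming scikit-learn) shows one way the reported measures could be computed from face-track embeddings and identity labels. The over-clustering index used here is a stand-in, defined as the number of predicted clusters per labeled identity, which may differ from the exact definition used in Chapter 6; the function names are illustrative.

# Illustrative evaluation of AP and HAC face clustering with the measures
# reported in Tables E.1-E.3 (homogeneity, completeness, V-measure, purity,
# and a stand-in over-clustering index).
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.metrics import homogeneity_completeness_v_measure

def purity(labels_true, labels_pred):
    """Fraction of samples whose cluster's majority identity matches their own."""
    # Map arbitrary identity labels to consecutive integers for bincount.
    _, labels_true = np.unique(labels_true, return_inverse=True)
    labels_pred = np.asarray(labels_pred)
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        correct += np.bincount(members).max()
    return correct / len(labels_true)

def evaluate(embeddings, labels_true, n_identities):
    """Cluster embeddings with AP and HAC and report the tabulated measures."""
    results = {}
    ap = AffinityPropagation(random_state=0).fit(embeddings)
    hac = AgglomerativeClustering(n_clusters=n_identities).fit(embeddings)
    for name, pred in [("AP", ap.labels_), ("HAC", hac.labels_)]:
        h, c, v = homogeneity_completeness_v_measure(labels_true, pred)
        results[name] = {
            "homog": h, "comp": c, "v-meas": v,
            "purity": purity(labels_true, pred),
            # Stand-in over-clustering index: predicted clusters per identity.
            "OCI": len(np.unique(pred)) / n_identities,
        }
    return results

With this stand-in definition, HAC always yields an OCI of 1 because its number of clusters is fixed to the number of identities, which is consistent with the HAC columns above.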
Abstract
Human perception involves reconciling different sources of information that may appear conflicting because of the multisensory nature of our experiences and interactions. Similarly, machine perception can benefit from learning from multiple sources of measurements to develop a comprehensive model of an observed entity or event. This thesis focuses on addressing three open research problems in the area of multiview and multimodal machine learning: (1) how to learn robust representations from unlabeled or weakly labeled data; (2) how to model many (more than two) views/modalities; and (3) how to handle absent views/modalities.

I begin by presenting a unified framework to delineate views and modalities as allied but distinct concepts in learning paradigms, to facilitate subsequent modeling and analysis. For multiview learning, I propose deep multiview correlation to learn embeddings that capture the information shared across corresponding views such that they are discriminative of the underlying semantic class of events. Experiments on a diverse set of audio and visual tasks (multi-channel acoustic activity classification, spoken word recognition, 3D object classification, and pose-invariant face recognition) demonstrate the ability of deep multiview correlation to model a large number of views. This method also shows excellent capacity to generalize to view-agnostic settings and to cases where data from certain views is not available.

For multimodal learning, I explore self-supervision between images, audio, and text that naturally co-occur in data. I propose cross-modal autoencoders to learn joint audio-visual embeddings, and model the arrival times of multimodal events with marked temporal point processes. Experiments on a variety of tasks, including sentiment and topic classification in videos, show the benefit of cross-modal autoencoder embeddings in capturing information complementary to the individual modalities. The results underscore that point process modeling not only offers a scalable solution for modeling a large number of modalities but also captures the variable rate of change of information in individual modalities. The methods developed in this thesis have been successfully applied to large-scale media analytics tasks such as robust face clustering in movies and automatic understanding of video advertisements.
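As a purely illustrative sketch (not the exact formulation developed in this thesis), the snippet below shows one simple way to score a multiview correlation objective: each view is linearly projected into a shared space and the average pairwise correlation of the standardized projections is computed. The function name, the per-dimension standardization, and the pairwise averaging are assumptions made for this example.

# Generic multiview correlation score: average per-dimension correlation
# between all pairs of projected, standardized views.
import numpy as np

def multiview_correlation(views, projections, eps=1e-6):
    """views: list of (n_samples, d_i) arrays with row-wise correspondence.
    projections: list of (d_i, d) projection matrices, one per view."""
    z = []
    for x, w in zip(views, projections):
        y = x @ w
        y = y - y.mean(axis=0, keepdims=True)
        y = y / (y.std(axis=0, keepdims=True) + eps)  # per-dimension standardization
        z.append(y)
    n = z[0].shape[0]
    score, pairs = 0.0, 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            # Sum over dimensions of the correlation between views i and j.
            score += np.trace(z[i].T @ z[j]) / n
            pairs += 1
    return score / pairs

In practice such a score would be maximized with respect to the projections, for example by gradient ascent with the linear projections replaced by learned nonlinear encoders.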