TOWARDS GENERALIZABLE EXPRESSION AND EMOTION RECOGNITION

by
Yufeng Yin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2024

Copyright 2024 Yufeng Yin

Table of Contents

List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Unsupervised Domain Adaptation
  1.2 Multimodal Learning
  1.3 Generative Model Features
  1.4 Unsupervised Personalization
  1.5 Outline and Contributions
Chapter 2: Related Work
  2.1 Emotion Recognition
  2.2 Facial Action Unit Detection
  2.3 Representation Learning
    2.3.1 Domain Adaptation
    2.3.2 Contrastive Learning
    2.3.3 Multimodal Learning
    2.3.4 Normalization Layer
    2.3.5 Face Understanding with Generative Models
    2.3.6 Personalization
    2.3.7 Set Learning
Chapter 3: Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition
  3.1 Introduction
  3.2 Datasets and Features
    3.2.1 Datasets
    3.2.2 Labels
    3.2.3 Features
  3.3 Method
    3.3.1 Problem Formulation
    3.3.2 Notations
    3.3.3 Baseline Model
    3.3.4 Domain-Adversarial Neural Network
    3.3.5 Speaker-Invariant Domain-Adversarial Neural Network
  3.4 Experiments
    3.4.1 Training and Evaluation Details
    3.4.2 Experimental Results
    3.4.3 Modality Contribution Analysis
  3.5 Conclusions
Chapter 4: Contrastive Learning for Domain Transfer in Cross-Corpus Emotion Recognition
  4.1 Introduction
  4.2 Data
    4.2.1 Dataset
    4.2.2 Data Observation
  4.3 Method
    4.3.1 Notations
    4.3.2 Base Model
    4.3.3 Face wArping emoTion rEcognition (FATE)
  4.4 Experiments
    4.4.1 Experimental Design
    4.4.2 Evaluation Criterion
    4.4.3 Implementation Details
    4.4.4 Experimental Results
    4.4.5 Ablation Study
  4.5 Conclusions
Chapter 5: X-Norm: Exchanging Normalization Parameters for Bimodal Fusion
  5.1 Introduction
  5.2 Method
    5.2.1 Problem Formulation
    5.2.2 Motivation
    5.2.3 Overview
    5.2.4 Normalization Exchange (NormExchange)
    5.2.5 Positions for NormExchange
    5.2.6 X-Norm
  5.3 Experiments
    5.3.1 Datasets
    5.3.2 Architecture
    5.3.3 Baselines
    5.3.4 Implementation and Training Details
    5.3.5 Results and Discussion
    5.3.6 Ablation Study
    5.3.7 Qualitative Analysis of Normalization Parameters
  5.4 Conclusions
Chapter 6: FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features
  6.1 Introduction
  6.2 Method
    6.2.1 Problem Formulation
    6.2.2 Overview
    6.2.3 Model
    6.2.4 Training and Inference
  6.3 Experiments
    6.3.1 Datasets
    6.3.2 Implementation and Training Details
    6.3.3 Experimental Results
    6.3.4 Limitations
  6.4 Conclusions
Chapter 7: Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
  7.1 Introduction
  7.2 Method
    7.2.1 Problem Formulation
    7.2.2 Dataset
    7.2.3 Pre-trained Speech Encoder
    7.2.4 Personalization Gap
    7.2.5 Performance Variance across Speakers
    7.2.6 Personalized Adaptive Pre-training (PAPT)
    7.2.7 Personalized Label Distribution Calibration (PLDC)
  7.3 Experiments
    7.3.1 Implementation and Training Details
    7.3.2 Baselines
    7.3.3 Experimental Results on test-b
    7.3.4 Evaluations on Unseen Speakers
    7.3.5 Ablation Study
  7.4 Conclusions
Chapter 8: SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision
  8.1 Introduction
  8.2 EmoCeleb Dataset
    8.2.1 Labeling Dataset
    8.2.2 Unimodal Emotion Recognition
    8.2.3 Cross-modal Labeling
    8.2.4 Post-processing
    8.2.5 Label Quality Evaluation
  8.3 Method
    8.3.1 Backbone Encoder
    8.3.2 Personalized Feature Extractor
    8.3.3 Training Scheme
  8.4 Experiments
    8.4.1 Implementation and Training Details
    8.4.2 Datasets
    8.4.3 Experimental Results
  8.5 Conclusions
Chapter 9: Conclusions
Bibliography

List of Tables

3.1 Within-domain performance of the baseline model. A1, A2, V, and L represent VGGish acoustic, MFB acoustic, ResNet visual, and BERT lexical features. M and I stand for the MSP-Improv and IEMOCAP databases. ACC and UAR stand for Accuracy and Unweighted Average Recall. They are 0.33 when the detected labels are uniformly distributed.
3.2 Cross-domain performance of the unsupervised domain adaptation.
3.3 Cross-domain performance with the MFB acoustic features.
3.4 Modality contribution analysis for unsupervised domain adaptation.
4.1 CCC values for different directions of domain adaptation. Source means directly transferring the source model to the target domain. Target means both training and evaluating the model with the target data. A and V stand for arousal and valence respectively. * denotes values reported in the original work. Kossaifi et al. [86] and Deng et al. [37] use the visual modality for detection while Yang et al. [196] use the acoustic modality.
4.2 Ablation study for lrmult in FATE. CCC values of domain adaptation from Aff-Wild2 to SEWA are reported.
4.3 Ablation study for the preservation of subtle facial features. CCC values of domain adaptation from Aff-Wild2 to SEWA are reported. A and V stand for arousal and valence respectively.
5.1 Number of utterances of each class for IEMOCAP [20] and MSP-Improv [22].
5.2 Hyperparameters of X-Norm we use for the various benchmarks.
5.3 Comparison of X-Norm with different unimodal and multimodal baselines on IEMOCAP and MSP-Improv for emotion recognition and EPIC-KITCHENS for action recognition. A, V, T, R, and O stand for audio, vision, text, RGB, and optical flow respectively. ACC% and F1% stand for accuracy and weighted F1 score respectively. The numbers in the brackets are the standard deviations. X-Norm achieves comparable or superior performance to the existing methods.
5.4 Ablation study for X-Norm. We report the accuracy (ACC) on EPIC-KITCHENS (random seed is 0).
6.1 Region of Interest (ROI) for each action unit (AU). Scale is measured by inner-ocular distance (IOD). Landmark (Lmk) positions are illustrated in Figure 6.4.
6.2 Within-domain evaluation in terms of F1 score (↑). Except for GH-Feat and ME-GraphAU + FFHQ pre-train, all the baseline numbers are from the original papers. Our method has competitive performance compared to the state-of-the-art.
6.3 Cross-domain evaluation between DISFA and BP4D in terms of F1 scores (↑). Our model achieves superior performance compared to the baselines. * The numbers are reported from the original paper.
6.4 AU intensity estimation on DISFA [111] in terms of MSE (↓) and MAE (↓). Our method has competitive performance compared to the state-of-the-art.
6.5 Ablation study for FG-Net. F1 score (↑) is the metric. D and B stand for DISFA and BP4D. D → B means the model is trained on DISFA and tested on BP4D, and similarly for B → D. (i) Our method gets better performance than GH-Feat [195]. (ii) With every component, our method achieves the highest within-domain performance, while removing late features gets the best cross-domain performance.
7.1 Details and statistics of our splits for MSP-Podcast.
7.2 Evaluations on MSP-Podcast (test-b) in terms of CCC (↑). O-CCC refers to the overall CCC between the prediction and ground truth. A-CCC denotes the average CCC for each test speaker. Numbers in the brackets are the standard deviations calculated across speakers. Our proposed PAPT-FT achieves superior performance compared to the baselines.
7.3 Evaluations on unseen speakers (test-c).
7.4 Effect of different speaker embedding fusion positions.
8.1 Comparison of EmoCeleb with previous emotion recognition datasets. Mod indicates the available modalities, (a)udio, (v)ision, and (t)ext. TL denotes the total number of hours. # U and # S denote the number of utterances and speakers respectively. Our datasets are larger and have more speakers, with at least 50 utterances per speaker.
8.2 Number of utterances in each class for EmoCeleb.
8.3 Comparison with a single human annotator. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Both directions of weak label generation achieve superior performance compared to random guessing.
8.4 Comparison with existing emotion datasets. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. The model trained with EmoCeleb outperforms those trained with RAVDESS and CMU-MOSEI, which are manually labeled.
8.5 Speech emotion recognition on EmoCeleb-A and downstream datasets. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Average accuracy (A-ACC %, ↑) and average F1-score (A-F1 %, ↑) across speakers are also reported.
8.6 Visual emotion recognition on EmoCeleb-V and MSP-Improv. SetPeER surpasses the baseline methods across all evaluated metrics.
8.7 Ablations for SetPeER. Fusion refers to fusing the speaker embedding with audiovisual features for personalized emotion recognition. A and V stand for audio and vision modalities respectively.

List of Figures

3.1 Screenshots from MSP-Improv and IEMOCAP.
3.2 Label distributions for different domains.
3.3 The network architecture for the cross-domain emotion recognition models. The inputs from different modalities are passed through the models, in this case, MFB/VGGish for speech, ResNet for vision, and BERT for language. The baseline model only has the encoder and emotion classifier. The DANN model has the encoder, emotion classifier, and domain discriminator. The SIDANN has all four parts (encoder, emotion classifier, domain discriminator, and speaker discriminator).
4.1 The discrepancies across datasets (domains) result in lower emotion recognition performance. The top two rows are from Aff-Wild2 [84] and the bottom two rows are from SEWA [87].
4.2 Overview of the first order motion model [152]. The model takes a single image and a driving video as the input and generates an output video by retargeting animation.
4.3 Real and synthetic (Syn) video pairs for self-supervised contrastive pre-training. Driving videos are from the source domain (Aff-Wild2) and anchor faces are from the target domain (SEWA). The corresponding real and synthetic videos have the same facial movements but different subject appearances.
4.4 Heat maps for the different datasets.
4.5 Overview of the proposed Face wArping emoTion rEcognition (FATE). Unlike traditional DA models, FATE reverses the training order. FATE is first pre-trained with real-synthetic video pairs and then fine-tuned with labeled source data in a supervised manner. The encoder is a ResNet (2+1)D [168] up to the penultimate layer in addition to two fully-connected layers, and the classifier is one fully-connected layer.
5.1 Comparison between different modality fusion strategies (best viewed in color). There is no information exchanged in the late fusion (left) before the unimodal model outputs. Early fusion (middle) simply concatenates the unimodal representations. Our proposed method, X-Norm (right), condenses and encodes the hidden states into the normalization parameters to be exchanged between modalities.
5.2 Overview of the proposed X-Norm for bimodal fusion. Inputs of the two modalities are fed into two unimodal branches. In the proposed NormExchange layer, normalization parameters are exchanged. At last, the logits from each branch are weighted averaged.
5.3 Illustration of the proposed Normalization Exchange (NormExchange) layer (best viewed in color). NormEncoders (Normalization Encoders) condense and encode the hidden states into the normalization parameters. Then, we perform affine transformations with the normalized hidden states* and the opposite normalization parameters. At last, we utilize the skip connections to keep the original modality information. *Norm(·) can be either batch, layer, or instance normalization.
5.4 Screenshots from IEMOCAP [20], MSP-Improv [22], and EPIC-KITCHENS [35].
5.5 Qualitative analysis of the generated normalization parameters (best viewed in color). We show three examples of sentences from MSP-Improv. The sentences are on the left and the labels are on the right. <bos> and <eos> are tokens standing for begin of sentence and end of sentence. We compute the L1 norm of the normalization parameters for each token. Darker green indicates a higher norm.
6.1 Performance (F1 score ↑) gap between within- and cross-domain AU detection for DRML [212], JÂA-Net [148], ME-GraphAU [109], and the proposed FG-Net. The within-domain performance is averaged between DISFA and BP4D, while the cross-domain performance is averaged between BP4D to DISFA and DISFA to BP4D. The proposed FG-Net has the highest cross-domain performance, thus superior generalization ability.
6.2 Overview of our proposed pipeline. FG-Net first encodes the input image into a latent code using a StyleGAN2 encoder (e.g. pSp [133] here). In the decoding stage [77], we extract the intermediate multi-resolution feature maps and pass them through our Pyramid CNN Interpreter to detect AUs coded in the form of heatmaps. Mean Squared Error (MSE) loss is used for optimization between the ground truth and predicted heatmaps.
6.3 Visualizations of the ROI centers for DISFA (left) and BP4D (right). AU indices are labeled above or below.
6.4 The positions for the 68 facial landmarks. Image is adapted from the iBUG 300-W dataset [141].
6.5 Visualization of the generated ground-truth heatmaps on DISFA (first row) and BP4D (second row). We generate one heatmap for every AU which has two Gaussian windows with the maximum values at the two ROI centers (see Figure 6.3). The peak value is either 1 (red, AU is active) or −1 (blue, AU is inactive).
6.6 Case analysis on ME-GraphAU [109] and FG-Net. Models are trained on BP4D and tested on DISFA. Orange means active AU while blue means inactive AU. FG-Net is more accurate than ME-GraphAU.
6.7 Data efficiency evaluation with different numbers of samples. Our method is data-efficient and its performance trained on 1k samples is close to the whole set.
6.8 Visualization of the detected heatmaps for the ablation study. With all the components, FG-Net detects the most similar heatmaps to the ground-truth (GT) for within-domain evaluation. Removing late features results in the best cross-domain evaluation.
6.9 Visualization of the failure cases. FG-Net achieves inferior performance on AU9, AU15, and AU26.
7.1 Performance gap between speaker-dependent and speaker-independent models for valence estimation with varying numbers of training speakers.
7.2 Overview of our proposed method. (a) Personalized Adaptive Pre-Training (PAPT) pre-trains the HuBERT encoder with learnable speaker embeddings in a self-supervised manner. (b) Personalized Label Distribution Calibration (PLDC) finds similar training speakers and calibrates the predicted label distribution with the training label statistics.
8.1 Overview of cross-modal labeling: (i) Emotion recognition with two modalities (vision+text or audio+text) to provide weak supervision. (ii) Weak labels are retained when the two modalities are in sufficient agreement (measured by KL divergence). (iii) Inference results are averaged to generate a weak label for the target modality (audio or vision).
8.2 Examples of emotional expressions in EmoCeleb-A and EmoCeleb-V. Green solid lines denote the modalities used for cross-modal labeling, while red dashed lines refer to the target modalities.
8.3 SetPeER overview. The personalized feature extractor P generates layer-specific personalized embeddings from the input and feeds the embeddings to the backbone encoder layer. These personalized embeddings serve as contextual cues for the current layer, aiding in generating more targeted features. The weights of P are shared across layers. Additionally, we apply contrastive learning to the embeddings generated from P to enhance the consistency in producing personalized speaker embeddings.
8.4 t-SNE visualizations of speaker embeddings from MSP-Podcast. Blue points represent male speakers and orange points indicate female speakers. Representations by SetPeER (first row) show clear separation w.r.t. gender.
8.5 Impact of set size K on performance. Larger set sizes lead to higher performance.

Abstract

Emotions play a significant role in human creativity and intelligence, including rational decision-making. Emotion recognition is the process of identifying human emotions, e.g., happiness, sadness, anger, and neutral. External manifestations of emotions can be recognized by tracking emotional expressions. Recent deep learning approaches have shown promising performance for emotion recognition. However, the performance of automatic recognition methods degrades when evaluated across datasets or subjects, due to variations in humans, e.g., culture and gender, and environmental factors, e.g., camera and background. Expression and emotion annotations require laborious manual coding by trained annotators. The annotation is both time-consuming and expensive.
Thus, the manual annotations required by supervised learning methods present significant practical limitations when working with new datasets or individuals. Throughout this thesis, we delve into various methodologies aimed at enhancing the generalization capabilities of perception models with minimal human effort. Our investigation encompasses unsupervised domain adaptation, multimodal learning, generative model feature extraction, and unsupervised personalization, all of which enhance the adaptability of recognition models to unseen datasets or subjects lacking labels.

This thesis includes four major contributions. First, we advance the understanding of unsupervised domain adaptation by proposing innovative approaches to obtain domain-invariant and discriminative features without relying on target labels. This lays a foundation for improving model generalization across diverse datasets. Second, we introduce a novel architecture for bimodal fusion, enabling the extraction of meaningful representations for multimodal emotion and action recognition. Notably, our approach remains agnostic to specific tasks or architectures, underscoring its versatility and lightweight nature. Furthermore, our exploration into generative model feature extraction yields significant advancements in data efficiency for facial action unit detection. By extracting generalizable and semantic-rich features, our method achieves competitive performance even with a few training samples, thereby demonstrating its strong potential to solve expression recognition in real-life scenarios. Finally, we tackle the issue of unsupervised personalization for emotion recognition on unseen speakers. By leveraging domain-adapted pre-training and learnable speaker embeddings, coupled with cross-modal labeling to construct a large-scale weakly-labeled emotion dataset, our work facilitates the development of personalized emotion recognition systems at scale.

Overall, these contributions not only advance the scientific understanding of perception models and representation learning but also offer practical implications for real-world applications such as affective computing and personalized systems. By addressing key challenges in model generalization, multimodal learning, data efficiency, and personalized adaptation, our work significantly pushes the boundaries of the field, paving the way for more robust and adaptable perception models in diverse contexts.

Chapter 1: Introduction

Emotions are integral to human cognition, influencing not only creativity but also rational decision-making processes [126]. Emotion recognition, the ability to discern human emotional states such as happiness, sadness, anger, and neutrality, is pivotal for facilitating natural and intelligent interactions between humans and computers [126]. Previous research has predominantly focused on two primary emotion representations: Paul Ekman's six basic emotions model [45], which covers anger, disgust, fear, happiness, sadness, and surprise (commonly extended with a neutral category), and the dimensional emotion model [166], which employs continuous values to capture emotions' arousal and valence. Arousal signifies emotional intensity, while valence indicates whether the emotion is positive or negative. It is essential to clarify that within this thesis, emotion recognition pertains to perceived emotions rather than felt emotions. Felt emotions refer to an individual's conscious emotional experiences, whereas perceived emotions denote how others interpret or perceive emotional signals, such as speech, language, or facial expressions.

Facial expressions serve as potent communicative cues for conveying emotions and social attitudes [140]. The automatic detection of facial action units is a foundational component for objective facial expression analysis [46]. The Facial Action Coding System (FACS) provides a taxonomy of facial activities, enabling the description of expressions through anatomical action units, such as the lip corner puller (AU 12) [46]. Unlike subjective assessments of facial expressions, where consensus among raters may vary [10], FACS offers an objective measure linked to precise movements of facial muscles, facilitating the description of specific facial behaviors [110].

Over the past few years, deep learning approaches have shown promising performance for emotion recognition and AU detection [170, 176, 204, 212, 148, 109], requiring a large number of samples. Existing expression or emotion recognition methods are often evaluated with within-domain cross-validation, with training and testing data from the same dataset, and the generalization to other datasets (model trained and tested on different datasets) has not been widely investigated. As within-domain performance can be due to overfitting, cross-corpus performance can suffer a considerable loss. The performance loss is due to variations in environments, e.g., camera and background, and individual differences, e.g., culture and gender. One way to improve the generalization ability of a human perception model is to train the model with a larger and more diverse labeled dataset. However, expression and emotion annotations require laborious manual coding by trained annotators. The annotation is both time-consuming and expensive. Thus, it is not practical to acquire a large amount of labeled data from any new dataset or person. Therefore, in this thesis, to address the aforementioned problem, we explore different methods to improve the generalization of expression and emotion recognition with minimal human effort.

This thesis focuses on unsupervised and self-supervised representation learning approaches to enhance the generalization of expression and emotion recognition. We explore unsupervised domain adaptation to obtain domain-invariant and discriminative features without any target labels. Then, we propose a novel architecture for bimodal fusion to extract meaningful representations for multimodal emotion recognition and action recognition. Subsequently, we explore extracting generalizable and semantic-rich features from a generative model for generalizable facial action unit detection. Lastly, we discuss unsupervised personalization on unseen speakers for emotion recognition through cross-modal labeling and personalized feature representation learning.

1.1 Unsupervised Domain Adaptation

Machine learning performance can suffer from variations in data distributions between training and test data. Domain adaptation methods have been proposed to alleviate such problems. In particular, deep domain adaptation methods try to learn a representation that is both effective for the main task and invariant to domains [191].

Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition. In adversarial-based domain adaptation, e.g., DANN [53], a domain discriminator is trained to classify whether a data point is drawn from the source or the target domain. It is used to encourage domain confusion through an adversarial objective that minimizes the distance between the source and target domains [191]. The DANN model succeeds in reducing the domain bias between the source and target domains, but it fails to address the bias between speakers. We propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN). Specifically, based on the DANN model, we add a speaker discriminator to detect the speaker's identity. We add a GRL at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information. To confirm the effectiveness of our proposed model, we conduct within-domain and cross-domain experiments with multimodal data (speech, vision, and text). We evaluate our method on two publicly available fully-annotated audiovisual emotion databases (MSP-Improv [22] and IEMOCAP [20]). Our results indicate that the proposed SIDANN model outperforms the DANN model, confirming that SIDANN has better domain adaptation ability than DANN.
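To make the adversarial setup concrete, the following is a minimal PyTorch sketch of a gradient reversal layer with separate domain and speaker discriminator heads, in the spirit of DANN and the proposed SIDANN. The encoder, layer sizes, and head structures are illustrative assumptions, not the exact architecture used in Chapter 3.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None


def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)


class SIDANNSketch(nn.Module):
    """Encoder with an emotion head plus adversarial domain and speaker heads.

    The gradient reversal layers push the shared encoder to keep emotion
    information while unlearning domain- and speaker-specific cues.
    Dimensions and head depths here are placeholders.
    """

    def __init__(self, feat_dim=512, hidden=256, n_emotions=4, n_domains=2, n_speakers=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.domain_head = nn.Linear(hidden, n_domains)
        self.speaker_head = nn.Linear(hidden, n_speakers)

    def forward(self, x, lamb=1.0):
        z = self.encoder(x)
        return (
            self.emotion_head(z),                      # supervised on labeled source data
            self.domain_head(grad_reverse(z, lamb)),   # adversarial: source vs. target
            self.speaker_head(grad_reverse(z, lamb)),  # adversarial: speaker identity
        )


# Toy usage: emotion loss on source samples, adversarial losses on all samples.
model = SIDANNSketch()
features = torch.randn(8, 512)
emo_logits, dom_logits, spk_logits = model(features, lamb=0.5)
```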
Contrastive Learning for Domain Transfer in Cross-Corpus Emotion Recognition. Emotions result in subtle and localized changes in the face. However, unsupervised domain adaptation methods cannot guarantee to preserve the local features necessary for emotion recognition while reducing domain discrepancies in global features. This limits their ability to improve emotion recognition performance across corpora. To address the problem, we propose Face wArping emoTion rEcognition (FATE). Specifically, we employ first-order facial animation warping [152] to generate a synthetic dataset. We choose an anchor image (a face) from the target domain and drive its synthetic facial behavior from source video sequences, transferring the facial movements from source videos to the target subjects. Then, we apply contrastive learning with the instance-based Information Noise Contrastive Estimation (InfoNCE) loss [122, 129] to pre-train the encoder with the real and synthetic video pairs. Using this self-supervised pre-training, the encoder can learn domain-invariant information focusing on the facial features only. We conduct cross-domain experiments with three publicly available fully-annotated continuous emotion recognition databases (Aff-Wild2 [84], SEWA [87], and SEMAINE [112]). Our experimental results indicate that the proposed FATE model achieves enhanced emotion recognition performance across corpora.
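Below is a minimal sketch of an instance-based InfoNCE objective for this kind of pre-training, assuming each real clip and its warped synthetic counterpart form a positive pair while all other clips in the batch serve as negatives. The embedding dimension, temperature, and symmetric formulation are placeholders rather than the exact loss used in Chapter 4.

```python
import torch
import torch.nn.functional as F


def info_nce(real_emb, syn_emb, temperature=0.07):
    """Instance-based InfoNCE over paired real/synthetic clip embeddings.

    real_emb, syn_emb: (B, D) embeddings from a video encoder; row i of each
    tensor corresponds to the same facial movements (real source clip and its
    warped target-appearance counterpart).
    """
    real_emb = F.normalize(real_emb, dim=-1)
    syn_emb = F.normalize(syn_emb, dim=-1)
    logits = real_emb @ syn_emb.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(real_emb.size(0), device=real_emb.device)
    # Symmetric loss: real -> synthetic and synthetic -> real retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage with embeddings from any clip encoder (e.g., a ResNet (2+1)D backbone).
real = torch.randn(16, 128)   # embeddings of real source-domain clips
syn = torch.randn(16, 128)    # embeddings of the warped target-appearance clips
loss = info_nce(real, syn)
```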
1.2 Multimodal Learning

Multimodal learning aims to build models that can process and relate information from multiple modalities [7, 156, 6, 16]. In multimodal learning, leveraging multiple modalities to capture different views of the data is expected to enhance the model's capacity and robustness [7].

X-Norm: Exchanging Normalization Parameters for Bimodal Fusion. One challenge for multimodal learning is how to fuse different modalities and perform inference. Wang et al. [192] find that different modalities overfit and generalize at different rates, while Nagrani et al. [118] point out that machine perception models are typically modality-specific and optimized for unimodal benchmarks. Thus, using concatenation for modality fusion makes multimodal networks sub-optimal [192]. We propose X-Norm, a novel, simple, and efficient approach for bimodal fusion. X-Norm generates and exchanges normalization parameters between the modalities for fusion. Specifically, we propose a Normalization Exchange (NormExchange) layer which generates and exchanges the normalization parameters for the two modalities. We conduct extensive experiments on two different multimodal tasks, i.e., emotion recognition and action recognition, with different combinations of modalities and different architectures. Our experimental results show that X-Norm achieves comparable or superior performance compared to the existing methods.
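The following is a schematic sketch of the normalization-exchange idea: each branch predicts per-channel scale and shift parameters from its own hidden states, and the other branch applies them after normalization, with a skip connection preserving the original modality information (Chapter 5 describes the actual design). The pooling step, the choice of LayerNorm, and the dimensions are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class NormExchangeSketch(nn.Module):
    """Exchange normalization parameters between two unimodal branches."""

    def __init__(self, dim):
        super().__init__()
        self.norm_a = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_b = nn.LayerNorm(dim, elementwise_affine=False)
        # NormEncoders: condense each branch's hidden states into (gamma, beta).
        self.enc_a = nn.Linear(dim, 2 * dim)
        self.enc_b = nn.Linear(dim, 2 * dim)

    def forward(self, h_a, h_b):
        # Pool over time before predicting (gamma, beta) so the two sequences
        # need not have the same length (a simplification of this sketch).
        gamma_a, beta_a = self.enc_a(h_a.mean(dim=1, keepdim=True)).chunk(2, dim=-1)
        gamma_b, beta_b = self.enc_b(h_b.mean(dim=1, keepdim=True)).chunk(2, dim=-1)
        # Each branch is modulated by the *other* modality's parameters,
        # with a skip connection keeping the original modality information.
        out_a = h_a + gamma_b * self.norm_a(h_a) + beta_b
        out_b = h_b + gamma_a * self.norm_b(h_b) + beta_a
        return out_a, out_b


# Toy usage with token-level hidden states from two modality encoders.
layer = NormExchangeSketch(dim=256)
audio_h = torch.randn(4, 50, 256)   # (batch, audio tokens, dim)
text_h = torch.randn(4, 30, 256)    # (batch, text tokens, dim)
a_out, t_out = layer(audio_h, text_h)
```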
1.3 Generative Model Features

Generative models provide an estimate of the distribution of the training samples [17]. In the field of semantic segmentation, recent studies [208, 9] leverage a well-trained generative model to synthesize image-annotation pairs from only a few labeled examples (around 30 training samples). They show that the intermediate features of generative models exhibit semantic-rich representations that are well-suited for pixel-wise segmentation tasks in a few-shot manner.

FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features. Inspired by the success of GAN features in semantic segmentation, we propose FG-Net, a facial action unit detection method that can better generalize across domains. Specifically, FG-Net first encodes and decodes the input image with a StyleGAN2 encoder (pSp) [133] and a StyleGAN2 generator [77], trained on the FFHQ dataset [76]. Then, FG-Net extracts feature maps from the generator during decoding. To take advantage of the informative pixel-wise representations from the generator, FG-Net detects the AUs through heatmap regression. We propose a Pyramid CNN Interpreter which incorporates the multi-resolution feature maps in a hierarchical manner. The proposed module makes training efficient and captures essential information from nearby regions. We conduct extensive experiments with the widely-used DISFA [111] and BP4D [205] for AU detection. The results show that the proposed FG-Net method has strong generalization ability and achieves state-of-the-art cross-domain performance. We also showcase that FG-Net is a data-efficient approach: with only 100 training samples, it can achieve decent performance.

1.4 Unsupervised Personalization

There are individual differences in expressive behaviors driven by cultural norms and personality. This between-person variation can result in reduced emotion recognition performance. Therefore, personalization, i.e., adapting the machine learning model to a specific subject, is an important step in improving the generalization and robustness of emotion recognition. Most prior studies [79, 8, 29, 75, 160] are either limited by the number of subjects available in the existing emotion datasets or rely on a single data point for personalization, which compromises the reliability of the learned personalized features and hinders their applicability to unseen speakers. One notable exception is the study by Sridhar et al. [160], which proposes to find speakers in the training set whose acoustic patterns closely match those of the testing speakers to create an adaptation set. The approach needs additional training (model adaptation) at inference time, limiting its applicability to unseen speakers.

Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition. To achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method to compensate for label distribution shifts by finding similar speakers and leveraging their label distributions from the training set. Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.

SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision. One main challenge in emotion recognition is the inherent variability and subjectivity of emotional expressions, making a general-purpose emotion recognition model fail to perform consistently across a wide range of speakers [187]. Another significant obstacle in emotion recognition, particularly for personalized approaches, stems from the scarcity of appropriate data. Commonly used emotion recognition databases suffer from limitations such as a small number of speakers or insufficient samples per speaker. These constraints not only hinder progress in developing personalized emotion recognition systems but also pose significant challenges in evaluating personalized systems. From the modeling perspective, we introduce a novel approach called Set-based Personalized Representation Learning for Emotion Recognition (SetPeER). This model is designed to extract personalized information from as few as eight samples per speaker. Regarding data, we develop a framework to weakly label in-the-wild audiovisual videos. We use pre-trained models for text-, vision-, or audio-based emotion recognition to assign weak labels to a target modality from the remaining two modalities for a large dataset of unlabeled data with a large number of speakers. Through extensive experiments, we validate the usefulness of our dataset, and the comprehensive evaluation validates the effectiveness of our proposed model in comparison to existing personalized emotion recognition approaches.
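A small sketch of the cross-modal labeling rule outlined above: predictions from two source-modality models are kept only when they agree under a KL-divergence criterion, and their average becomes the weak label for the target modality. The symmetric KL formulation and the threshold value are illustrative assumptions, not the exact procedure of Chapter 8.

```python
import torch
import torch.nn.functional as F


def cross_modal_weak_label(probs_m1, probs_m2, kl_threshold=0.5):
    """Return (weak_labels, keep_mask) for a batch of samples.

    probs_m1, probs_m2: (B, C) class distributions predicted from the two
    source-modality models (e.g., vision+text and audio+text ensembles).
    """
    # Symmetric KL divergence between the two predicted distributions.
    kl_12 = F.kl_div(probs_m2.log(), probs_m1, reduction="none").sum(-1)  # KL(p1 || p2)
    kl_21 = F.kl_div(probs_m1.log(), probs_m2, reduction="none").sum(-1)  # KL(p2 || p1)
    disagreement = 0.5 * (kl_12 + kl_21)
    keep = disagreement < kl_threshold          # discard ambiguous samples
    weak = 0.5 * (probs_m1 + probs_m2)          # average the two predictions
    return weak, keep


# Toy usage: four samples, four emotion classes.
p1 = torch.softmax(torch.randn(4, 4), dim=-1)
p2 = torch.softmax(torch.randn(4, 4), dim=-1)
weak_labels, keep_mask = cross_modal_weak_label(p1, p2)
kept = weak_labels[keep_mask]                   # weak labels for the target modality
```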
1.5 Outline and Contributions

For the remainder of this thesis, we first discuss the related literature for emotion recognition, expression recognition, and different representation learning approaches in Chapter 2. We then dive into unsupervised domain adaptation for emotion recognition in Chapter 3 and Chapter 4. In Chapter 3, we propose a new model based on the DANN model to reduce both the domain bias and the speaker bias at the same time, while in Chapter 4, we propose to generate synthetic data with unlabeled target data and utilize contrastive learning to let the model learn domain-invariant information focusing on the facial features only. Then, we describe a novel bimodal fusion strategy in Chapter 5: we propose X-Norm, which generates and exchanges normalization parameters between the modalities for fusion. In Chapter 6, we explore the use of generative model features for generalizable AU detection. We propose FG-Net, which extracts generalizable and semantic-rich deep representations from a well-trained generative model and detects the AUs through heatmap regression. Lastly, we introduce two unsupervised personalization approaches for emotion recognition in Chapter 7 and Chapter 8. In Chapter 7, we propose to pre-train an encoder with speaker embeddings to learn robust representations conditioned on speakers, and then we design an unsupervised method to compensate for label distribution shifts with similar training speakers. In Chapter 8, we develop a framework to weakly label in-the-wild audiovisual videos for the design and evaluation of personalized emotion recognition systems, and we also design a novel approach to extract personalized information from a few utterances. With these approaches to enhance models' generalization ability, we conclude in Chapter 9 and discuss potential future directions for generalizable expression and emotion recognition.

The major contributions of this thesis can be summarized as follows.

• Unsupervised Domain Adaptation for Emotion Recognition. (i) Introduction of the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to address domain and speaker biases simultaneously. (ii) Proposal of the Face wArping emoTion rEcognition (FATE) model for enhancing emotion recognition performance across different datasets through facial animation warping and contrastive learning.

• Multimodal Learning. Development of X-Norm, a novel bimodal fusion approach, which exchanges normalization parameters between modalities to improve fusion efficiency and performance.

• Generative Model Features for Facial Action Unit Detection. Introduction of FG-Net, leveraging features from a well-trained generative model for facial action unit detection, achieving state-of-the-art performance across different datasets.

• Unsupervised Personalization for Emotion Recognition. (i) Design, development, and evaluation of a personalized adaptation approach with pre-trained speech encoders, achieving state-of-the-art results for valence estimation on the MSP-Podcast corpus. (ii) Development of Set-based Personalized Representation Learning for Emotion Recognition (SetPeER), enabling personalized emotion recognition with minimal data by utilizing weakly labeled in-the-wild audiovisual videos.

These contributions collectively advance the field of expression and emotion recognition by addressing challenges related to domain adaptation, multimodal fusion, feature extraction, and personalization, ultimately enhancing model generalization. Our work significantly advances the field, paving the way for more robust and adaptable perception models in diverse contexts. Beyond theoretical insights, our findings hold implications for real-world applications, particularly in domains like affective computing and personalized systems.

Chapter 2: Related Work

2.1 Emotion Recognition

Emotion recognition is the process of identifying human emotions, e.g., happy, sad, angry, and neutral. Over the past few years, deep learning methods have shown promising performance for emotion recognition. Tzirakis et al. [176] propose a multimodal system to perform an end-to-end spontaneous emotion detection task with Long Short-Term Memory (LSTM). Zhang et al. [204] propose a hybrid deep model which first produces audio-visual segment features with CNNs and then fuses the segment features with Deep Belief Networks (DBNs). Zhao et al. [210] present 1D and 2D CNN-LSTM networks to recognize emotions from speech. They investigate how to learn local correlations and global contextual information from raw audio clips and log-mel spectrograms. Wang et al. [190] propose the Self-Cure Network (SCN), which suppresses uncertainties efficiently and prevents deep networks from over-fitting uncertain facial images. Kossaifi et al. [86] propose the CP-Higher-Order Convolution (HO-CPConv) for spatio-temporal facial emotion analysis and report state-of-the-art results in both static and temporal emotion recognition.
Emotion Recognition Databases. Access to expansive, natural databases that capture the nuanced facets of emotional expression is essential for improving emotion recognition. Table 8.1 presents some of the widely used emotion recognition databases. Generally, emotion recognition datasets can be categorized into three main types. Acted databases constitute the first type, where speakers are directed to express specific emotions while reciting predetermined sentences. This method is employed in various databases such as RAVDESS [105] and CREMA-D [23]. The second type, and the most prevalent, consists of datasets captured within controlled laboratory environments. Participants are typically instructed to engage in interactions surrounding a given topic or to respond to emotion-inducing videos. Notable examples of this type include HUMAINE [43], SEWA [87], IEMOCAP [20], and MSP-Improv [22]. Lastly, the third type comprises fully natural utterances sourced from real-world settings, such as YouTube, and subsequently annotated through crowdsourcing. Datasets falling into this category include CMU-MOSEI [201], MSP-FACE [185], and MSP-Podcast [108]. Arguably, datasets of the third type are optimal for developing generalized emotion recognition systems applicable across diverse environments. Their potential is particularly promising as in-the-wild utterances are readily accessible from the internet. However, the expense associated with human annotations often impedes large-scale development efforts, especially in personalized emotion recognition, where both the dataset size and the number of samples per speaker are crucial. As illustrated in Table 8.1, existing large-scale emotion recognition datasets typically suffer from a scarcity of utterances per speaker. This thesis aims to bridge the gap by leveraging the wealth of in-the-wild data to construct a large-scale weakly-labeled dataset customized for training and evaluating personalized emotion recognition systems, and to explore the trade-off between annotation accuracy and automated labeling.

2.2 Facial Action Unit Detection

A facial action unit is an indicator of the activation of an individual muscle or a group of muscles, e.g., the cheek raiser (AU6). AUs are formalized by Paul Ekman in the Facial Action Coding System (FACS) [46]. Previous studies explore attention mechanisms [148, 164, 72] or self-supervised learning [26] to obtain discriminative representations for AU detection.

Zhao et al. [212] propose Deep Region and Multi-label Learning (DRML). DRML is trained with region learning (RL) and multi-label learning (ML) and is able to identify more specific regions for different AUs than conventional patch-based methods. Shao et al. [148] propose JÂA-Net for joint AU detection and face alignment. JÂA-Net uses adaptive attention learning to refine the attention map for each AU. Tang et al. [164] propose a joint strategy called PIAP-DF for AU representation learning. PIAP-DF involves pixel-level attention for each AU and individual feature elimination, and it utilizes unlabeled data to mitigate the negative impacts of wrong labels. Jacob et al. [72] combine transformer-based architectures with a region of interest (ROI) attention module, per-AU embeddings, and a correlation module to capture relationships between different AUs. Chang et al. [26] propose a knowledge-driven self-supervised representation learning framework. AU labeling rules are leveraged to design facial partition manners and determine correlations between facial regions.

Recent work on AU detection uses graph neural networks [209, 159, 109]. Zhang et al. [209] utilize a heatmap regression-based approach for AU detection. The ground-truth heatmaps are defined based on the ROI for each AU. Besides, the authors utilize graph convolution for feature refinement. Song et al. [159] propose a hybrid message-passing neural network with performance-driven structures (HMPPS). A performance-driven Monte Carlo Markov Chain sampling method is proposed for generating the graph structures. Besides, hybrid message passing is proposed to combine different types of messages. Luo et al. [109] propose an AU relationship modeling approach that learns a unique graph to explicitly describe the relationship between each pair of AUs of the target facial display.
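For intuition, the sketch below builds the kind of ROI-centered Gaussian heatmap target used in heatmap-regression AU detection, with the ±1 peak convention described for FG-Net in Figure 6.5. The map size, Gaussian bandwidth, and example coordinates are placeholders, not values from any of the cited methods.

```python
import torch


def au_heatmap(centers, active, size=64, sigma=3.0):
    """Build a ground-truth heatmap for one AU from its two ROI centers.

    centers: tensor of shape (2, 2) with (x, y) pixel coordinates of the ROI centers.
    active: bool; the Gaussian peaks are +1 if the AU is active, -1 otherwise.
    """
    ys, xs = torch.meshgrid(torch.arange(size), torch.arange(size), indexing="ij")
    heatmap = torch.zeros(size, size)
    for cx, cy in centers:
        # Gaussian window centered at the ROI center.
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = torch.maximum(heatmap, g)
    return heatmap if active else -heatmap


# Toy usage: one AU with ROI centers near the mouth corners of a 64x64 face crop.
hm = au_heatmap(torch.tensor([[20.0, 44.0], [44.0, 44.0]]), active=True)
target = torch.stack([hm])          # (num_AUs, H, W) regression target for an MSE loss
```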
Previous studies on AU detection achieve promising within-domain performance. However, the generalization ability, i.e., the cross-domain performance, of AU detection has not been widely investigated. Ertugrul et al. [47, 48] demonstrate that deep-learning-based AU detectors achieve poor cross-domain performance due to variations in cameras, environments, and subjects. Tu et al. [173] propose Identity-Aware Facial Action Unit Detection (IdenNet). IdenNet is jointly trained with AU detection and face clustering datasets that contain numerous subjects to improve the model's generalization ability. Yin et al. [198] propose to use domain adaptation and self-supervised patch localization to improve the cross-corpora performance for AU detection. However, this method requires data from the target domain for domain adaptation. Hernandez et al. [66] conduct an in-depth analysis of performance differences across subjects, genders, skin types, and databases. To address this gap, they propose deep face normalization (DeepFN), which transfers the facial expressions of different people onto a common facial template.

2.3 Representation Learning

2.3.1 Domain Adaptation

Supervised deep learning methods suffer from performance loss on unseen data due to covariate shift. Domain adaptation techniques are proposed to reduce discrepancies between different domains. Unsupervised Domain Adaptation (UDA) can be used to train a model with labeled data from the source domain (training dataset) and unlabeled data from the target domain (unseen dataset). The goal is to learn a representation that is both discriminative for the main learning task (e.g., emotion recognition) on the source domain and insensitive to the covariate shift between the domains. Wang et al. [191] define this kind of problem as homogeneous domain adaptation and divide it into three categories: discrepancy-based, adversarial-based, and reconstruction-based approaches.

The discrepancy-based approach aims to diminish the shift between the two domains by fine-tuning the deep network model [191]. Tzeng et al. [175] propose a new CNN architecture with an adaptation layer and an additional domain confusion loss to learn a representation that is both semantically meaningful and domain invariant. Long et al. [106] propose a Deep Adaptation Networks (DAN) architecture, which generalizes deep CNNs to the domain adaptation scenario. In this architecture, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. Rozantsev et al. [139] introduce a two-stream architecture, where one stream operates in the source domain and the other in the target domain; the weights in corresponding layers are related but not shared. Saito et al. [142] introduce a new approach that attempts to align the distributions of source and target by utilizing the task-specific decision boundaries.
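As a concrete illustration of matching mean embeddings across domains, here is a single-kernel RBF maximum mean discrepancy (MMD) penalty of the kind DAN-style methods add to task-layer activations. DAN itself uses a multi-kernel formulation and an unbiased estimator, so this is only a simplified sketch with an assumed bandwidth.

```python
import torch


def mmd_rbf(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between source and target features
    under a single RBF kernel. Minimizing it pulls the mean embeddings of
    the two domains together in the kernel space."""

    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))

    k_ss = kernel(source, source).mean()
    k_tt = kernel(target, target).mean()
    k_st = kernel(source, target).mean()
    return k_ss + k_tt - 2 * k_st


# Toy usage: penalize the distance between source and target batch features.
src_feat = torch.randn(32, 128)
tgt_feat = torch.randn(32, 128)
domain_loss = mmd_rbf(src_feat, tgt_feat)
```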
Regarding the adversarial-based approach, a domain discriminator is trained to classify whether a data point is drawn from the source or the target domain. It is used to encourage domain confusion through an adversarial objective that minimizes the distance between the source and target domains [191]. The Domain-Adversarial Neural Network (DANN) [53] integrates a gradient reversal layer (GRL) into the standard architecture to ensure that the feature distributions over the two domains are made similar. In contrast to DANN, the Adversarial Discriminative Domain Adaptation (ADDA) [174] model considers independent source and target mappings by untying the weights, and the parameters of the target model are initialized by the pre-trained source one. Wasserstein Distance Guided Representation Learning (WDGRL) [149] uses a domain critic to minimize the Wasserstein distance (with gradient penalty) between domains. Multi-Adversarial Domain Adaptation (MADA) [124] captures multimode structures to enable fine-grained alignment of different data distributions based on multiple domain discriminators. The Selective Adversarial Network (SAN) [24] addresses partial transfer learning from big domains to small domains where the target label space is a subspace of the source label space.

The third category is the reconstruction-based approach, which assumes that the data reconstruction of the source or target samples can help improve the performance of domain adaptation [191]. Bousmalis et al. [19] decouple domain adaptation from a specific task and train a model that changes images from the source domain to appear as if they were from the target domain while maintaining their original content. Hoffman et al. [68] propose a novel discriminatively trained Cycle-Consistent Adversarial Domain Adaptation (CyCADA) model. The model adapts representations at both the pixel and feature level and enforces cycle-consistency while leveraging a task loss, and it does not require aligned pairs.

Domain Adaptation for Emotion Recognition. Because of the multi-faceted information included in the speech signal [52], domain adaptation has been widely applied to speech-based emotion recognition. Li et al. [92] propose a machine learning framework to obtain speech emotion representations by limiting the effect of speaker variability in the speech signals. Gideon et al. [55] investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Emotions result in behavioral changes including facial expressions [52]. A variety of domain adaptation techniques have been explored for vision-based emotion recognition. Zhao et al. [215] develop a novel adversarial model for emotion distribution learning, termed EmotionGAN, which optimizes the Generative Adversarial Network (GAN) loss, semantic consistency loss, and regression loss. The EmotionGAN model can adapt source domain images such that they appear as if they were drawn from the target domain while preserving the annotation information. For cross-domain sentiment analysis, Glorot et al. [56] study the problem of domain adaptation for sentiment classifiers.
Domain Adaptation for Emotion Recognition. Because of the multi-faceted information included in the speech signal [52], domain adaptation has been widely applied to speech-based emotion recognition. Li et al. [92] propose a machine learning framework that obtains speech emotion representations by limiting the effect of speaker variability in the speech signals. Gideon et al. [55] investigate how knowledge can be transferred between three paralinguistic tasks: speaker, emotion, and gender recognition. Emotions also result in behavioral changes including facial expressions [52], and a variety of domain adaptation techniques have been explored for vision-based emotion recognition. Zhao et al. [215] develop an adversarial model for emotion distribution learning, termed EmotionGAN, which optimizes a Generative Adversarial Network (GAN) loss, a semantic consistency loss, and a regression loss; EmotionGAN can adapt source-domain images so that they appear as if drawn from the target domain while preserving the annotation information. For cross-domain sentiment analysis, Glorot et al. [56] study the problem of domain adaptation for sentiment classifiers. They demonstrate that a deep learning system based on Stacked Denoising Autoencoders with sparse rectifier units can perform an unsupervised feature extraction that is highly beneficial for the domain adaptation of sentiment classifiers. Moreover, these modalities are often combined for multimodal learning. For example, Jaiswal et al. [73] study how stress alters acoustic and lexical emotion detection; they use the GRL to decorrelate stress modulations from emotion representations. Zhao et al. [211] use an adversarial training procedure to investigate how emotion knowledge from Western European cultures can be transferred to Chinese culture with all three modalities (speech, vision, and language).

2.3.2 Contrastive Learning

Contrastive learning is an emerging paradigm in representation learning research [31, 57, 165, 129, 122, 96, 94]. The key idea behind contrastive learning is to utilize the Noise Contrastive Estimation (NCE) loss to maximize the similarity between the anchor and the positive samples and minimize the similarity between the anchor and the negative samples. Chen et al. [31] propose SimCLR and use data augmentations such as cropping, rotation, and color jittering to create positive pairs for anchor samples. In addition, contrastive learning combined with video representation learning is popular for modern action recognition tasks [57, 165, 129]: a 3D-CNN encoder first extracts representations from video clips, and the contrastive loss is then applied to learn the representations in a self-supervised fashion. Gordan et al. [57] use temporal cues to mine positive/negative samples, where clips from the same video are treated as positives and the rest as negatives. Furthermore, contrastive learning has also been applied to emotion recognition tasks [96, 94]. Lian et al. [96] use CNNs to extract representations from audio sequences, in which audio excerpts that belong to the same category form positive pairs.
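A minimal sketch of such an NCE-style objective in PyTorch is shown below; the in-batch-negatives formulation and the temperature value are illustrative assumptions, not the exact losses used by the cited works.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """anchors/positives: (batch, dim) embeddings; row i of `positives` is the positive
    for row i of `anchors`, and all other rows in the batch act as negatives."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature                    # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching indices are the positives
    return F.cross_entropy(logits, targets)
```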
2.3.3 Multimodal Learning

Multimodal learning aims to build models that can process and relate information from different modalities [7], under the assumption that combining multiple views or sensory perceptions of the same data enhances model capacity [7]. One challenge in multimodal learning is how to fuse different modalities and perform inference. Late fusion uses unimodal networks for inference and then fuses the decisions with a fusion mechanism, e.g., averaging [150] or voting schemes [116]. In early fusion [118, 192, 100, 177], the unimodal representations are first concatenated and then fed into a machine learning model such as a Multi-Layer Perceptron (MLP) to perform the inference. These two approaches are widely used due to their simplicity and efficiency. However, in late fusion there is no information shared between the unimodal networks, and for early fusion Wang et al. [192] point out that using concatenation for modality fusion makes multimodal networks sub-optimal: they can perform poorly, sometimes even worse than unimodal networks, when other modalities include confounding signals.

Multimodal Emotion Recognition. Recent studies [200, 63, 197, 184] mainly work on three modalities, i.e., audio, vision, and language, for multimodal emotion recognition. The Tensor Fusion Network (TFN) [200] combines the outer product of the unimodal representations with the unimodal representations themselves in a tensor, learning both intra-modality and inter-modality dynamics. Hazarika et al. [63] propose Modality-Invariant and -Specific Representations (MISA): features for each modality are projected into two distinct sub-spaces, where the first is modality-invariant and the second is modality-specific. These disentangled representations provide a holistic view of the multimodal data for modality fusion, which leads to the task predictions [63]. With the emergence of the Transformer architecture and its self-attention mechanism [184], several works [131, 172, 216] design modality fusion strategies based on cross-modal self-attention. Rahman et al. [131] propose the Multimodal Adaptation Gate (MAG), which fine-tunes the lexical features with the vision and acoustic modalities; in MAG, attention over the lexical and nonverbal dimensions fuses the multimodal data into another vector, which is subsequently added to the input lexical vector. Tsai et al. [172] propose the Multimodal Transformer (MulT), which utilizes a cross-modal self-attention mechanism where the query comes from one modality and the key and value come from the other ones. The cross-modal self-attention enables MulT to capture long-range contingencies regardless of the alignment assumption. Zheng et al. [216] propose Cascade Multi-head Attention (CMHA), which connects multiple multi-head attention modules in series and can mine modality interactions with heterogeneous and non-aligned properties. Empirical experiments show that these methods work better at narrowing the heterogeneity gap between modalities [131, 172, 216]. However, these methods are designed to be trained from scratch and cannot perform emotion recognition in an end-to-end manner, which makes them both expensive to train and inefficient at inference.

Multimodal Action Recognition. Action recognition is the process of recognizing human actions in image sequences [127]. Previous work [154, 25, 147, 93, 117, 118] on multimodal action recognition mainly utilizes information from the RGB, optical flow, depth (D), and skeleton modalities. Simon et al. [154] propose a two-stream late fusion of the RGB and optical flow modalities to address the lack of motion features, outperforming unimodal approaches. Carreira et al. [25] propose the two-stream Inflated 3D ConvNet (I3D), which expands very deep image classification ConvNets into 3D so that the model can learn spatiotemporal features from video while leveraging successful ImageNet architecture designs and even their parameters. Shahroudy et al. [147] propose a shared-specific feature factorization network to separate input multimodal signals into a hierarchy of components for RGB+D videos. Li et al. [93] propose a Skeleton-Guided Multimodal Network (SGM-Net), which makes full use of the complementarity of the RGB and skeleton modalities at the semantic feature level. Munro et al. [117] present a multimodal domain adaptation approach for fine-grained action recognition; they use a self-supervised task of predicting the correspondence of multiple modalities, which helps align the modalities and improves generalization. Wang et al. [192] identify two main reasons why multimodal networks can perform worse than unimodal ones: first, multimodal networks are prone to overfitting due to their increased capacity; second, different modalities overfit and generalize at different rates. The authors present Gradient-Blending, which computes an optimal blending of modalities based on their overfitting behaviors. Nagrani et al.
[118] use fusion bottlenecks for modality fusion at multiple layers. The proposed model is trained to collate and condense the relevant information in each modality and share only what is necessary, which improves fusion performance while reducing the computational cost.

2.3.4 Normalization Layer

In a normalization layer, the input vector is first normalized such that its mean is zero and its standard deviation is one. The normalized vector is then scaled and translated with affine transformation parameters that are learned during training [15]. Regarding the normalization operation, there are three common types of normalization layers, namely batch normalization [71], layer normalization [4], and instance normalization [178]. Previous studies find that normalization layers [71, 4, 178, 15] are crucial for stabilizing model training and for making the model converge faster and generalize better. Recent works [125, 70] show that meaningful information can be encoded into the normalization layers. Perez et al. [125] present Feature-wise Linear Modulation (FiLM) for visual reasoning: given the conditioning information, FiLM generates the normalization parameters of the batch normalization layers. In style translation, Huang et al. [70] propose Adaptive Instance Normalization (AdaIN), a simple yet effective approach that aligns the mean and variance of the content features with those of the style features. The authors find that AdaIN achieves a speed comparable to the fastest existing approach. In addition, through training with AdaIN, the style information is encoded into the normalization parameters, and the style of the generated image becomes similar to the given reference.
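The AdaIN operation itself can be sketched in a few lines; this is an illustrative re-implementation of the published formulation rather than the authors' code.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: re-normalize each channel of the content feature
    map (N, C, H, W) to match the per-channel mean and standard deviation of the style
    feature map."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```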
2.3.5 Face Understanding with Generative Models

Generative models provide an estimate of the distribution of the training samples [17]. Prior work utilizing generative models for face understanding has mainly focused on semantic segmentation [208, 9, 90] and landmark detection [208, 195]. Zhang et al. [208] introduce DatasetGAN, an automatic procedure that generates massive datasets of high-quality semantically segmented images with minimal human effort; the authors show how the GAN latent code can be decoded to produce a semantic segmentation of the image and allow the decoder to be trained with only a few labeled examples. Baranchuk et al. [9] demonstrate that feature maps from diffusion models [42] capture semantic information and appear to be excellent pixel-wise representations. Li et al. [90] propose semanticGAN, a generative adversarial network that captures the joint image-label distribution; semanticGAN showcases an extreme out-of-domain generalization ability, such as transferring from real faces to paintings, sculptures, and even cartoons and animal faces. Xu et al. [195] treat the pre-trained StyleGAN generator as a learned loss function and train a hierarchical encoder to obtain visual representations, namely GH-Feat, for input images; GH-Feat has strong transferability to both generative and discriminative tasks.

These studies show that the hidden states of generative models are powerful representations for face understanding. However, to the best of our knowledge, no existing work adapts such architectures to AU detection. Zhang et al. [208] and Baranchuk et al. [9] extract pixel-wise features and treat each pixel as a training sample, leading to extreme inefficiency due to the per-sample computational overhead. More importantly, inference with single-pixel features lacks the inductive bias (local features) shown to be crucial for AU detection in previous studies [212, 148]. In addition, semanticGAN [90] has to encode the input image into the latent space in an optimization-based manner at inference time, which is extremely time-consuming; as a result, semanticGAN can only be tested on a few samples, and this limitation does not allow training or testing with larger datasets. GH-Feat [195] is the closest method to ours in that it extracts latent-code representations from generative models. The major differences between GH-Feat and our method are: (i) GH-Feat extracts 1D latent-code features, while FG-Net further decodes the latent codes into images and obtains 2D feature maps; (ii) GH-Feat is trained in a multi-stage manner, while our method is end-to-end. GH-Feat utilizes the StyleGAN generator as a learned loss function, trains a hierarchical encoder, and then uses this encoder to extract visual representations for downstream tasks. The whole pipeline requires more than 700 GPU hours due to the complicated training process, whereas FG-Net only needs 10 GPU hours for training.

2.3.6 Personalization

Various modalities have been investigated for personalized emotion recognition, e.g., physiological signals [213, 214], speech [8, 29, 160, 169], and facial expressions [132, 146]. Bang et al. [8] introduce a framework for robust personalized speech emotion recognition, which incrementally provides a customized training model for a target user via virtual data augmentation; their method is evaluated on IEMOCAP [20] with ten speakers. Zhao et al. [213, 214] explore the impact of personality on emotional behavior through physiological signals using graph learning; their studies are conducted on the ASCERTAIN dataset [162], which comprises data from 58 subjects. Zen et al. [203] propose an SVM-based vision regression model that learns the relationship between a user's sample distribution and the parameters of that individual's classifier, and use the learned model to transfer to new users with unseen distributions. Chen et al. [29] develop a two-layer fuzzy random forest using features extracted with openSMILE [49] and train it on different categories of people generated via a fuzzy C-means clustering algorithm, demonstrating a potential performance gain on four subjects. Shahabinejad et al. [146] introduce an attention mechanism tailored for facial expression recognition (FER): an attention map generated by a face recognition (FR) network personalizes the FER process with FR features. However, their method relies on a single image for personalization, which raises concerns about the reliability of the personalization. Barros et al. [12] propose a Grow-When-Required network that learns person-specific features for seen speakers via a conditional adversarial autoencoder. Barros et al. [13] present Contrastive Inhibitory Adaption (CIAO), which adapts the last layer of facial encoders to model nuances in facial expressions across different datasets. Barros et al. [11] introduce a set of layers designed to learn both clusters of general facial expressions and individual behaviors through online learning and affective memories; however, the method is not applicable to unseen speakers.
Most prior studies are either limited by the number of subjects available in existing emotion datasets or rely on a single data point for personalization, which compromises the reliability of the learned personalized features and hinders their applicability to unseen speakers. Two notable exceptions are the studies by Sridhar et al. [160] and Tran et al. [169], which utilize MSP-Podcast [108] and benefit from its extensive range of subjects; however, that dataset is limited to the audio modality. Sridhar et al. [160] propose to find speakers in the training set whose acoustic patterns closely match those of the testing speakers to create an adaptation set. The approach needs additional training (model adaptation) at inference time, limiting its applicability to unseen speakers. Tran et al. [169] present PAPT, a personalized adaptive pre-training method in which the model is pre-trained with learnable speaker embeddings in a self-supervised manner, together with personalized label distribution calibration, which adjusts the predicted label distribution using label statistics from similar training speakers. PAPT has demonstrated superior effectiveness in personalization compared to Sridhar et al.'s method [169] while eliminating the need for retraining on new speakers.

2.3.7 Set Learning

Set representation learning extracts meaningful embeddings that are invariant to permutations of the set inputs. DeepSets [202] operates by independently processing the elements within a set and subsequently aggregating them using operations such as minimum, maximum, mean, or sum. Set Transformers [89] explore self-attention to model interactions between the elements of a set. In addition to designing permutation-invariant modules for set encoding, alternative set-learning methodologies have emerged, including methods that learn set representations by minimizing the disparity between an input set and a trainable reference set through bipartite matching [155] or optimal transport [61].

Chapter 3
Speaker-Invariant Adversarial Domain Adaptation for Emotion Recognition

3.1 Introduction

Emotions play a significant role not only in human creativity and intelligence but also in rational human thinking and decision-making. To enable natural and intelligent interaction with humans, computers need the ability to recognize and express emotions [126]. Over the past few years, deep learning approaches have shown promising performance for emotion recognition [170, 176, 204]. However, constructing a large-scale emotion benchmark is both time-consuming and expensive. As a result, it is unrealistic to construct a large fully-annotated database every time we perform an emotion recognition task in a new domain. Deep domain adaptation has emerged as a learning technique to address the lack of massive amounts of labeled data [191]. Using publicly available fully-annotated audiovisual emotion databases (e.g., MSP-Improv [22] and IEMOCAP [20]), we can apply deep domain adaptation techniques, e.g., DANN [53], to recognize emotions on an unlabeled dataset. In adversarial-based domain adaptation, e.g., DANN [53], a domain discriminator is trained to classify whether a data point is drawn from the source or the target domain; it encourages domain confusion through an adversarial objective that minimizes the distance between the source and target domains [191]. The Domain-Adversarial Neural Network (DANN) [53] is trained to minimize the classification loss (for source samples) while maximizing the domain confusion loss via the use of the GRL.
The DANN model succeeds in reducing the domain bias between the source and the target domains, but it fails to address the bias between speakers. There are multiple speakers in the MSP-Improv and IEMOCAP databases, each with their own individual appearance and voice characteristics. Though the DANN model can detect and remove the bias between domains, the bias between speakers remains, which reduces performance. To address this problem, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN). Figure 3.3 shows the network architectures of the cross-domain emotion recognition models. Specifically, based on the DANN model, we add a speaker discriminator to detect the speaker's identity, with a GRL at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information.

To confirm the effectiveness of our proposed model, we conduct within-domain and cross-domain experiments with multimodal data (speech, vision, and text). We evaluate our method on two publicly available fully-annotated audiovisual emotion databases (MSP-Improv [22] and IEMOCAP [20]). Specifically, the MSP-Improv and IEMOCAP databases have 8,438 and 10,039 utterances respectively, produced by 22 speakers in total, each labeled with both arousal and valence values. We extract two kinds of acoustic features: (i) the Mel Filter Bank (MFB) acoustic features and (ii) the VGGish [67, 54] acoustic representations. We obtain the visual features from the penultimate layer of ResNet-152 [65], and we use the pre-trained BERT [40] to transform the text of each utterance into a vector. For the within-domain experiments, we train and test the baseline model with five-fold speaker-independent cross-validation. For the cross-domain experiments, we first train the baseline with the labeled source data and then train the domain adaptation models (DANN [53] and SIDANN) by fine-tuning the baseline with the labeled source data and unlabeled target data. We then test all three models on the whole target domain.

The results of the within-domain experiments show that the multimodal model with the MFB acoustic and BERT lexical features has the best performance for arousal detection, while the multimodal model with the MFB acoustic, ResNet visual, and BERT lexical features achieves the best performance for valence detection. For the cross-domain experiments, our results indicate that the proposed SIDANN model outperforms the DANN model (+5.6% and +2.8% on average for detecting arousal and valence), confirming that the SIDANN has a better domain adaptation ability than the DANN.

The major contributions of this chapter are as follows.
• We study the unsupervised domain adaptation problem for emotion recognition with multimodal data including speech, vision, and language. We conduct detailed experiments to explore the domain adaptation performance of different modalities and their combinations.
• We study the problem of how to reduce both the domain bias and the speaker bias. Based on the DANN model, we propose the Speaker-Invariant Domain-Adversarial Neural Network to separate the speaker bias from the domain bias. Specifically, we add a speaker discriminator to detect the speaker's identity, with a GRL at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information.
• The experimental results confirm that the SIDANN has a better domain adaptation ability than the DANN.
3.2 Datasets and Features

In this section, we introduce in detail the datasets we use to evaluate the methods.

Figure 3.1: Screenshots from MSP-Improv (a) and IEMOCAP (b).

3.2.1 Datasets

Two public datasets are used to study the UDA problem for emotion recognition: (1) the MSP-Improv dataset [22]; and (2) the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [20]. Both are audiovisual databases with arousal and valence labels. Videos from both databases are shot in a laboratory, so they have similar environments.

MSP-Improv. The MSP-Improv database is an acted audiovisual emotional database that explores emotional behaviors during acted and improvised dyadic interaction. Overall, the corpus consists of 8,438 turns (over 9 hours) of emotional sentences and 12 speakers (6 males and 6 females).

IEMOCAP. The IEMOCAP database is an acted, multimodal, and multispeaker database. It contains approximately 12 hours of audiovisual data, including video, speech, motion capture of the face, and text transcriptions. Overall, the dataset has 10,039 utterances and 10 speakers (5 males and 5 females).

Screenshots from the two databases are shown in Figure 3.1a and Figure 3.1b. Videos are recorded from different angles, and the video resolutions also differ.

Figure 3.2: Label distributions for the different domains: (a) arousal; (b) valence.

3.2.2 Labels

Each utterance in MSP-Improv and IEMOCAP has labels for both arousal and valence on a five-point Likert scale. Following the label processing method of [73], we bin the labels into one of three classes, defined as {"low": [1, 2.75], "mid": (2.75, 3.25], "high": (3.25, 5]}. The overall distribution for arousal is {"low": 45.69%, "mid": 20.72%, "high": 33.59%} and for valence is {"low": 25.49%, "mid": 23.14%, "high": 51.37%}; the label distributions are therefore imbalanced. Moreover, the label distributions vary between datasets (see Figure 3.2).
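A minimal sketch of this binning step (the boundary handling simply follows the class definition above):

```python
def bin_label(score: float) -> str:
    """Map a 1-5 Likert rating to the low/mid/high classes used in this chapter."""
    if score <= 2.75:
        return "low"
    elif score <= 3.25:
        return "mid"
    return "high"

assert bin_label(2.75) == "low" and bin_label(3.0) == "mid" and bin_label(4.2) == "high"
```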
3.2.3 Features

Behavior from three modalities is analyzed. The speakers' spoken content is manually transcribed in IEMOCAP and automatically recognized in MSP-Improv. Videos are used to track facial expressions, and speech prosody is analyzed from the audio.

Speech. The Mel Filter Bank (MFB) consists of overlapping triangular filters with cutoff frequencies determined by the center frequencies of the two adjacent filters [28]. The MFB acoustic features have shown strong domain transferability in previous work on emotion recognition [73]. We use the same extraction method as [73]: we extract the 40-dimensional MFB features using a 25-millisecond Hamming window with a step size of 10 milliseconds. As a result, we have a T × 40 feature matrix for each utterance, where T is the number of time steps. Deep neural networks trained on large quantities of data are able to learn powerful representations [40, 65, 67, 54]. Therefore, we also utilize VGGish [67, 54] to extract a deep, generalized acoustic representation across domains. VGGish is a deep convolutional neural network trained on audio spectrograms extracted from a large database of videos to recognize an ontology of 632 audio events, for example, vehicle noise, music genres, and human locomotion [67, 54]. Following the acoustic feature extraction method of [157], we use the 128-dimensional embedding generated by VGGish after dimensionality reduction with Principal Component Analysis (PCA). We use a hop size of 33 ms, which means a 128-dimensional vector is extracted for every 33 ms of the audio signal. As a result, we have a T × 128 feature matrix for each utterance, where T is the number of time steps.

Vision. We first sample the videos at a 30 fps rate and crop the speaker's face in each frame with OpenCV.∗ To extract a generalized visual representation across domains, we extract the activations of the penultimate layer of ResNet-152 [65] trained on ImageNet [39], feeding the network with the cropped faces from each frame. As a result, we have a T × 2048 matrix for each utterance, where T denotes the number of frames.

Language. To represent the spoken words, we use the pre-trained BERT [40] to map the spoken utterances to a representation. Bidirectional Encoder Representations from Transformers (BERT) [40] is a method for learning a language model that can be trained on a large amount of data in an unsupervised manner. This pre-trained model is very effective at representing a sequence of terms as a fixed-length vector, and BERT representations achieve state-of-the-art results on multiple natural language understanding tasks [157]. In this chapter, we use pre-trained BERT to transform the text of each utterance into a 768-dimensional vector. IEMOCAP includes manual transcriptions, which we use for the language analysis. We transcribe MSP-Improv using Google Cloud enhanced Automatic Speech Recognition (ASR)† to generate the text data. We discard 271 of the 8,438 utterances for which the ASR fails to detect any speech; as a result, we use 8,167 utterances in total from the MSP-Improv database. Finally, we z-normalize all the features from the three modalities (acoustic, visual, and lexical) for each speaker, by subtracting their mean value and dividing by their standard deviation.

∗ https://opencv.org/
† https://cloud.google.com/speech-to-text/docs/enhanced-models
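For illustration, the MFB and ResNet-152 visual features described above could be extracted roughly as follows. This is only a sketch: it assumes torchaudio/torchvision, a 16 kHz sampling rate, and ImageNet-normalized 224×224 face crops; the window and hop sizes follow the text, while everything else is an assumption rather than the exact pipeline used here.

```python
import torch
import torchaudio
import torchvision

SAMPLE_RATE = 16000  # assumed audio sampling rate

# 40-dimensional Mel filter bank features: 25 ms Hamming window, 10 ms hop.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=int(0.025 * SAMPLE_RATE),
    win_length=int(0.025 * SAMPLE_RATE),
    hop_length=int(0.010 * SAMPLE_RATE),
    n_mels=40,
    window_fn=torch.hamming_window,
)

def extract_mfb(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (1, num_samples) -> (T, 40) log-Mel filter bank features."""
    return torch.log(mel(waveform) + 1e-6).squeeze(0).t()

# ResNet-152 penultimate-layer activations (2048-d) for cropped face frames.
resnet = torchvision.models.resnet152(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()  # drop the classification head
resnet.eval()

@torch.no_grad()
def extract_visual(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) normalized face crops -> (T, 2048) features."""
    return resnet(frames)
```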
3.3 Method

In this section, we first introduce the problem formulation and notations. Then we explain the network architectures and the detailed training strategies for both the baseline and the UDA models. Their network architectures are shown in Figure 3.3, and the pseudo-code for training the SIDANN model for one epoch is shown in Algorithm 1.

Figure 3.3: The network architecture of the cross-domain emotion recognition models. The inputs from the different modalities are passed through the model: MFB/VGGish for speech, ResNet for vision, and BERT for language. The baseline model only has the encoder and the emotion classifier. The DANN model has the encoder, emotion classifier, and domain discriminator. The SIDANN has all four parts (encoder, emotion classifier, domain discriminator, and speaker discriminator).

3.3.1 Problem Formulation

Multimodal Emotion Recognition: Given a set of utterances $S$, each utterance $x_i \in S$ is represented as $x_i = \{x_i^a, x_i^v, x_i^l\}$, where $x_i^a$, $x_i^v$, and $x_i^l$ denote the (a)coustic, (v)isual, and (l)exical features, respectively. We aim to detect the arousal and valence values $a_i$ and $v_i$ for each utterance using the multimodal inputs and functions $F_a(\cdot)$ and $F_v(\cdot)$:
$a_i = F_a(x_i)$,  (3.1)
$v_i = F_v(x_i)$.  (3.2)

3.3.2 Notations

Let the source dataset be $D_s = \{(x_1, e_1, s_1, d_1), \ldots, (x_M, e_M, s_M, d_M)\}$ and the target dataset be $D_t = \{(x_{M+1}, s_{M+1}, d_{M+1}), \ldots, (x_{M+N}, s_{M+N}, d_{M+N})\}$, where $M$ and $N$ are the numbers of source and target utterances, respectively. $x_i = \{x_i^a, x_i^v, x_i^l\}$ is the extracted feature and $e_i$ is the emotion label (arousal or valence value); we do not have emotion labels for the target dataset. $s_i$ denotes the speaker identity and $d_i$ is the domain label, where $d_i = 0$ means $x_i$ belongs to the source domain and $d_i = 1$ means it belongs to the target domain. Therefore, $d_i = 0$ for $i = 1, 2, \ldots, M$ and $d_i = 1$ for $i = M+1, M+2, \ldots, M+N$.

3.3.3 Baseline Model

We use the multimodal approach of [73]. It is worth noting that Jaiswal et al. [73] only utilize the acoustic and lexical features to recognize emotions; in addition, the lexical features they extract are sequential, whereas ours are not. Therefore, for the visual part we use the same architecture as the acoustic one, while for the lexical part we simply use a linear layer as the encoder. The network architecture is shown in Figure 3.3. The baseline model contains only two components: the encoder and the emotion classifier. We treat each part as a mapping. The encoder $G_e$ outputs a fixed-size representation $f$ given $x$ (acoustic, visual, and lexical features). The emotion classifier $G_c$ maps $f$ to a probability distribution $e$ over the emotion label space of three classes (low, mid, or high). We denote the vectors of parameters from all layers in the encoder and the emotion classifier as $\theta_e$ and $\theta_c$:
$f = G_e(x; \theta_e)$,  (3.3)
$e = G_c(f; \theta_c)$.  (3.4)
The uni-modal baseline takes a single-stream (acoustic, visual, or lexical) input, the bi-modal baseline takes a two-stream input, and the tri-modal baseline takes a three-stream input (acoustic, visual, and lexical). The goal of the model is to minimize the cross-entropy loss, defined as
$\mathcal{L}_{\mathrm{Baseline}} = \mathcal{L}_{\mathrm{emotion}} = \sum_{(x_i, e_i) \in D_s} \mathcal{L}_{CE}(G_c(G_e(x_i; \theta_e); \theta_c), e_i)$,  (3.5)
where $\mathcal{L}_{CE}$ is the cross-entropy loss.

3.3.4 Domain-Adversarial Neural Network

The Domain-Adversarial Neural Network (DANN) [53] minimizes the classification loss (for source samples) while maximizing the domain confusion loss. The DANN integrates a gradient reversal layer (GRL) into the standard architecture to ensure that the feature distributions over the two domains are similar. Based on the baseline architecture, we add a domain discriminator to discriminate whether the output of the encoder is from the source or the target domain; there is a gradient reversal layer (GRL) at the beginning of the domain discriminator. The DANN thus has three components: encoder, emotion classifier, and domain discriminator. The domain discriminator $G_d$ maps $f$ to a probability distribution $d$ over the domain label space of two classes (source or target).
We denote the vector of parameters from all layers in the domain discriminator as $\theta_d$. Therefore, we have
$d = G_d(f; \theta_d)$.  (3.6)
The objective function of the model has two parts: the task-specific loss and the domain loss. The task-specific loss is the same as the baseline objective function shown in Equation 3.5. The domain loss is defined as
$\mathcal{L}_{\mathrm{domain}} = \sum_{(x_i, d_i) \in D_s \cup D_t} \mathcal{L}_{CE}(G_d(f_i; \theta_d), d_i) = \sum_{(x_i, d_i) \in D_s \cup D_t} \mathcal{L}_{CE}(G_d(G_e(x_i; \theta_e); \theta_d), d_i)$,  (3.7)
where $\mathcal{L}_{CE}$ is the cross-entropy loss. The objective of the DANN is to maximize the performance of the emotion classifier while minimizing the performance of the domain discriminator. Overall, the goal of the DANN model is defined as
$\mathcal{L}_{\mathrm{DANN}} = \mathcal{L}_{\mathrm{emotion}} - \lambda \cdot \mathcal{L}_{\mathrm{domain}}$,  (3.8)
where $\lambda$ is the hyper-parameter that controls the trade-off between the two objectives that shape the features during learning [53].

3.3.5 Speaker-Invariant Domain-Adversarial Neural Network

Although the DANN model can remove the domain bias between the source and the target domain, it ignores the bias between speakers. There are 12 speakers in the MSP-Improv database and 10 in the IEMOCAP database. These 22 speakers have individual styles for expressing emotions. Therefore, during DANN training, the model mixes these two sources of bias together, resulting in poor performance. To address this problem, we propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN). Specifically, we add a speaker discriminator to detect the speaker's identity. Similar to the DANN model, we add a GRL at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information. With the speaker discriminator, the model can separate the speaker bias from the domain bias. Overall, the SIDANN has four parts: encoder, emotion classifier, domain discriminator, and speaker discriminator. The speaker discriminator $G_s$ maps $f$ to a probability distribution $s$ over the speaker label space of 22 classes. We denote the vector of parameters from all layers in the speaker discriminator as $\theta_s$. Therefore, we have
$s = G_s(f; \theta_s)$.  (3.9)
Besides the task-specific loss (Equation 3.5) and the domain loss (Equation 3.7), the objective function of the SIDANN has the speaker loss, which is defined as
$\mathcal{L}_{\mathrm{speaker}} = \sum_{(x_i, s_i) \in D_s \cup D_t} \mathcal{L}_{CE}(G_s(f_i; \theta_s), s_i) = \sum_{(x_i, s_i) \in D_s \cup D_t} \mathcal{L}_{CE}(G_s(G_e(x_i; \theta_e); \theta_s), s_i)$,  (3.10)
where $\mathcal{L}_{CE}$ is the cross-entropy loss.
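Both adversarial terms act on the encoder through gradient reversal. A minimal PyTorch sketch of this mechanism is given below; it is an illustration of how the three losses can be combined in one backward pass, not the exact training code used here (module definitions, batch layout, and variable names are assumptions).

```python
import torch
import torch.nn.functional as F
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def sidann_loss(encoder, emo_clf, dom_disc, spk_disc,
                x_src, e_src, x_all, d_all, s_all, lambda1, lambda2):
    """Combined SIDANN objective. A single backward pass gives the encoder the gradient of
    L_emotion - lambda1 * L_domain - lambda2 * L_speaker (via the reversed gradients), while
    the two discriminators receive the ordinary gradients of their own cross-entropy losses."""
    f_src = encoder(x_src)   # labeled source features
    f_all = encoder(x_all)   # source + target features

    loss_emotion = F.cross_entropy(emo_clf(f_src), e_src)
    loss_domain = F.cross_entropy(dom_disc(GradReverse.apply(f_all, lambda1)), d_all)
    loss_speaker = F.cross_entropy(spk_disc(GradReverse.apply(f_all, lambda2)), s_all)
    return loss_emotion + loss_domain + loss_speaker
```

Minimizing this sum with a single optimizer over all four modules reproduces, up to the optimizer state, the per-module updates of Algorithm 1 below.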
The objective of the SIDANN is to maximize the performance of the emotion classifier while minimizing the performance of the domain discriminator and the speaker discriminator. Combining Equations 3.5, 3.7, and 3.10, the goal of the SIDANN model is defined as
$\mathcal{L}_{\mathrm{SIDANN}} = \mathcal{L}_{\mathrm{emotion}} - \lambda_1 \cdot \mathcal{L}_{\mathrm{domain}} - \lambda_2 \cdot \mathcal{L}_{\mathrm{speaker}}$,  (3.11)
where $\lambda_1$ and $\lambda_2$ are the hyperparameters that control the trade-off between the three objectives that shape the features during learning. The pseudo-code for training the SIDANN model for one epoch is shown in Algorithm 1.

Algorithm 1: Train the SIDANN for one epoch. For the Adam optimizer, we use the default values of $\alpha = 0.0001$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The batch size $m$ is 256.
Require: the batch size $m$ and the Adam hyperparameters $\alpha$, $\beta_1$, $\beta_2$.
Require: parameters for the encoder $\theta_e$, emotion classifier $\theta_c$, domain discriminator $\theta_d$, and speaker discriminator $\theta_s$, and their corresponding mappings $G_e$, $G_c$, $G_d$, and $G_s$.
Require: weights for the domain loss $\lambda_1$ and the speaker loss $\lambda_2$.
  $m' \leftarrow m/2$
  $n_1 \leftarrow (\text{number of source samples})/m'$
  $n_2 \leftarrow (\text{number of target samples})/m'$
  $n \leftarrow \min(n_1, n_2)$
  for batch $= 1, \ldots, n$ do
    Sample a half batch $\{x_i, e_i, s_i, d_i\}_{i=1}^{m'}$ from the source data
    Sample a half batch $\{x_i, s_i, d_i\}_{i=m'+1}^{m}$ from the target data
    $X_s \leftarrow \{x_i\}_{i=1}^{m'}$;  $X \leftarrow \{x_i\}_{i=1}^{m}$
    $f_s \leftarrow G_e(X_s)$;  $f \leftarrow G_e(X)$
    $\hat{e} \leftarrow G_c(f_s)$;  $\hat{d} \leftarrow G_d(f)$;  $\hat{s} \leftarrow G_s(f)$
    $\mathcal{L}_E \leftarrow -\frac{1}{m'}\sum_{i=1}^{m'} e_i \cdot \log(\hat{e}_i)$
    $\mathcal{L}_D \leftarrow -\frac{1}{m}\sum_{i=1}^{m} d_i \cdot \log(\hat{d}_i)$
    $\mathcal{L}_S \leftarrow -\frac{1}{m}\sum_{i=1}^{m} s_i \cdot \log(\hat{s}_i)$
    $\theta_e \leftarrow \mathrm{Adam}(\nabla_{\theta_e}[\mathcal{L}_E - \lambda_1 \cdot \mathcal{L}_D - \lambda_2 \cdot \mathcal{L}_S], \theta_e, \alpha, \beta_1, \beta_2)$
    $\theta_c \leftarrow \mathrm{Adam}(\nabla_{\theta_c}[\mathcal{L}_E], \theta_c, \alpha, \beta_1, \beta_2)$
    $\theta_d \leftarrow \mathrm{Adam}(\nabla_{\theta_d}[\mathcal{L}_D], \theta_d, \alpha, \beta_1, \beta_2)$
    $\theta_s \leftarrow \mathrm{Adam}(\nabla_{\theta_s}[\mathcal{L}_S], \theta_s, \alpha, \beta_1, \beta_2)$
  end for

3.4 Experiments

In this section, we describe the experimental design and the training details, and we report and discuss the experimental results.

3.4.1 Training and Evaluation Details

The baseline model is trained for a maximum of 50 epochs, and we stop the training if the validation loss does not improve for five consecutive epochs. Given the imbalanced nature of our data, we utilize an imbalanced dataset sampler‡ to re-balance the training class distributions. The model is trained with the Adam [81] optimizer (initial learning rate $10^{-4}$) with a dynamic learning rate decay§ based on the validation loss. We use the default parameters for the Adam optimizer, and the batch size is 256. All models are implemented in PyTorch [123]. We use validation samples (20% of the source data) for hyper-parameter selection and early stopping. The hyperparameters for the baseline include: the width of the convolution layers {64, 128}, the kernel size of the convolution layers {2, 3}, the kernel size of the max-pooling layers {2}, the number of GRU layers {2, 3}, the width of the linear layer in the encoder {32}, the width of the linear layer in the emotion classifier {32, 64}, and the dropout rate {0.3}.

The UDA models are trained for a fixed 25 epochs, since we do not have labels for the target domain. They are trained with the Adam optimizer with a fixed learning rate, which is also a hyperparameter; the optimizer uses the default parameters and the batch size is again 256. The network structures of the domain discriminator and the speaker discriminator are exactly the same as that of the emotion classifier. The hyperparameters for the DANN include the learning rate {1e-5, 3e-5, 1e-4, 3e-4, 1e-3} and λ {0.1, 0.3, 1, 3, 10}, while those for the SIDANN include the learning rate {1e-5, 3e-5, 1e-4, 3e-4, 1e-3}, λ1 {0.1, 0.3, 1, 3, 10}, and λ2 {0.1, 0.3, 1, 3, 10}, where the meanings of λ, λ1, and λ2 are explained in Section 3.3.4 and Section 3.3.5. We utilize Accuracy (ACC) and Unweighted Average Recall (UAR) to evaluate the performance; both are 0.33 when the detected labels are uniformly distributed.

‡ https://github.com/ufoym/imbalanced-dataset-sampler
§ https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.ReduceLROnPlateau

3.4.2 Experimental Results

Within-domain Evaluation. To evaluate the baseline model, we train and test it with five-fold speaker-independent cross-validation. Specifically, we evaluate the performance of the unimodal, bimodal, and trimodal models.

Table 3.1: Within-domain performance of the baseline model. A1, A2, V, and L represent the VGGish acoustic, MFB acoustic, ResNet visual, and BERT lexical features. M and I stand for the MSP-Improv and IEMOCAP databases. ACC and UAR stand for Accuracy and Unweighted Average Recall; both are 0.33 when the detected labels are uniformly distributed.
(a) Results for arousal detection.
Modality   ACC (M)  ACC (I)  UAR (M)  UAR (I)  Avg
A1         0.619    0.577    0.509    0.525    0.558
A2         0.672    0.593    0.492    0.602    0.590
V          0.568    0.474    0.415    0.492    0.487
L          0.513    0.510    0.409    0.473    0.476
A1+V       0.601    0.569    0.455    0.532    0.539
A2+V       0.646    0.556    0.503    0.489    0.549
A1+L       0.587    0.578    0.467    0.527    0.540
A2+L       0.684    0.587    0.587    0.550    0.602
V+L        0.547    0.505    0.428    0.466    0.487
A1+V+L     0.623    0.573    0.513    0.517    0.557
A2+V+L     0.644    0.570    0.500    0.530    0.561

(b) Results for valence detection.
Modality   ACC (M)  ACC (I)  UAR (M)  UAR (I)  Avg
A1         0.428    0.466    0.417    0.431    0.436
A2         0.422    0.455    0.489    0.489    0.464
V          0.503    0.499    0.472    0.462    0.484
L          0.513    0.618    0.499    0.576    0.552
A1+V       0.503    0.518    0.493    0.462    0.494
A2+V       0.489    0.510    0.471    0.478    0.487
A1+L       0.538    0.611    0.515    0.571    0.559
A2+L       0.538    0.629    0.527    0.583    0.569
V+L        0.539    0.614    0.520    0.541    0.554
A1+V+L     0.554    0.643    0.534    0.555    0.572
A2+V+L     0.537    0.638    0.553    0.571    0.575

Table 3.1 displays the within-domain performance of the baseline model; we evaluate 11 models with different feature combinations in total. For arousal detection (Table 3.1a), the MFB combined with the BERT features has the best ACC scores, while the MFB features achieve the highest UAR scores on both the MSP-Improv and IEMOCAP databases. Further, the MFB combined with the BERT features works better than the MFB features alone on average. Among the unimodal methods, both acoustic features perform better than the other modalities (vision and language), and the lexical features perform the worst on average. For valence detection (Table 3.1b), the VGGish combined with the ResNet and BERT features achieves the best ACC scores on both the MSP-Improv and IEMOCAP databases; however, the MFB combined with the ResNet and BERT features has the best performance on average. Among the unimodal methods, the lexical features perform best while the acoustic features perform worst, the exact opposite of arousal detection: the acoustic features are informative for arousal detection, and the lexical features are powerful for valence detection. Past work [120, 60] showed that speech works better for arousal detection and that language is better able to capture valence. Facial expression is also better at detecting valence than arousal; see the AVEC challenge results [145, 181, 180, 135, 179, 137, 134, 136].

Table 3.2: Cross-domain performance of the unsupervised domain adaptation.
(a) Results for arousal detection. Inputs are the MFB acoustic features and the BERT lexical features.
Model       ACC (M→I)   ACC (I→M)   UAR (M→I)   UAR (I→M)   Avg
Baseline    0.241(.03)  0.291(.05)  0.186(.02)  0.245(.03)  0.241
DANN [53]   0.321(.01)  0.266(.06)  0.271(.01)  0.279(.02)  0.309
SIDANN      0.392(.03)  0.390(.08)  0.371(.02)  0.308(.01)  0.365

(b) Results for arousal detection. Inputs are the MFB acoustic features and the ResNet visual features.
Model       ACC (M→I)   ACC (I→M)   UAR (M→I)   UAR (I→M)   Avg
Baseline    0.263(.02)  0.284(.06)  0.188(.01)  0.277(.03)  0.253
DANN [53]   0.388(.03)  0.407(.09)  0.344(.02)  0.336(.03)  0.369
SIDANN      0.415(.01)  0.506(.07)  0.422(.03)  0.379(.03)  0.430

(c) Results for valence detection. Inputs are the MFB acoustic features, the ResNet visual features, and the BERT lexical features.
Model       ACC (M→I)   ACC (I→M)   UAR (M→I)   UAR (I→M)   Avg
Baseline    0.381(.02)  0.407(.01)  0.442(.02)  0.406(.01)  0.409
DANN [53]   0.460(.02)  0.456(.02)  0.409(.01)  0.456(.03)  0.445
SIDANN      0.480(.01)  0.500(.03)  0.431(.02)  0.482(.03)  0.473

Cross-domain Evaluation. We set one database as the source domain and the other as the target domain. Thus, we have two directions of domain adaptation (M → I and I → M, where M is MSP-Improv and I is IEMOCAP).
For the baseline model, we use 80% of the source data for training and 20% for validation, where the training and validation data are speaker-independent. For the DANN and the SIDANN, we train them by fine-tuning the baseline model with the labeled source data and unlabeled target data. We then test all three models on the whole target domain.

We show the results of the cross-domain evaluation in Table 3.2. We input the MFB and BERT features for detecting arousal and the MFB, ResNet, and BERT features for detecting valence, since these two combinations have the highest within-domain performance on average for each task; these results are reported in Table 3.2a and Table 3.2c. The numbers in brackets are the standard deviations. The results indicate that our proposed model performs significantly better than the DANN and the baseline under a t-test (at p < 0.1). Specifically, the SIDANN outperforms the DANN by 5.6% and 2.8% on average for detecting arousal and valence, confirming that the SIDANN has a better domain adaptation ability than the DANN. Though the SIDANN is the best-performing model, it performs poorly at detecting arousal. Based on the modality contribution analysis in Section 3.4.3, we speculate that the lexical features are not helpful for detecting arousal. Therefore, we replace the BERT lexical features with the ResNet visual features and display the results in Table 3.2b. The results show that the MFB combined with the ResNet features works better than the MFB combined with the BERT features on all the evaluation metrics; specifically, the former outperforms the latter by 6.1% on average, and the result is significant at p < 0.1 under a t-test.

Table 3.3: Cross-domain performance with the MFB acoustic features.
(a) Results for arousal detection.
Model                ACC (M→I)  ACC (I→M)  UAR (M→I)  UAR (I→M)  Avg
Baseline             0.258      0.368      0.201      0.224      0.263
DANN [53]            0.367      0.414      0.445      0.306      0.383
SIDANN               0.407      0.496      0.452      0.428      0.446
Jaiswal et al. [73]  -          -          -          0.402      -

(b) Results for valence detection.
Model                ACC (M→I)  ACC (I→M)  UAR (M→I)  UAR (I→M)  Avg
Baseline             0.464      0.367      0.402      0.364      0.399
DANN [53]            0.470      0.376      0.442      0.406      0.424
SIDANN               0.550      0.407      0.482      0.452      0.473
Jaiswal et al. [73]  -          -          -          0.439      -

Table 3.4: Modality contribution analysis for unsupervised domain adaptation. Each cell reports the M→I / I→M scores.
(a) Results for arousal detection.
Model       VGGish ACC     VGGish UAR     ResNet ACC     ResNet UAR     BERT ACC       BERT UAR       Avg
Baseline    0.311 / 0.360  0.244 / 0.300  0.405 / 0.280  0.340 / 0.297  0.276 / 0.317  0.259 / 0.290  0.307
DANN [53]   0.503 / 0.559  0.467 / 0.353  0.453 / 0.430  0.413 / 0.367  0.313 / 0.323  0.304 / 0.310  0.400
SIDANN      0.491 / 0.556  0.485 / 0.376  0.415 / 0.437  0.485 / 0.378  0.315 / 0.327  0.309 / 0.313  0.407

(b) Results for valence detection.
Model       VGGish ACC     VGGish UAR     ResNet ACC     ResNet UAR     BERT ACC       BERT UAR       Avg
Baseline    0.424 / 0.360  0.376 / 0.386  0.323 / 0.350  0.343 / 0.316  0.494 / 0.453  0.475 / 0.449  0.396
DANN [53]   0.450 / 0.401  0.395 / 0.402  0.499 / 0.400  0.459 / 0.381  0.506 / 0.458  0.484 / 0.458  0.441
SIDANN      0.481 / 0.400  0.414 / 0.405  0.501 / 0.408  0.510 / 0.400  0.515 / 0.467  0.477 / 0.465  0.454

3.4.3 Modality Contribution Analysis

To figure out the contribution of each modality, we re-run the cross-domain experiment with a single modality (acoustic, visual, or lexical). Specifically, we first train unimodal models on the source domain and then fine-tune them.
The results of the modality contribution analysis are reported in Table 3.3 and Table 3.4. Table 3.3 shows the cross-domain performance with the MFB acoustic features. The proposed SIDANN model performs better than the DANN and baseline models for both arousal and valence; specifically, the SIDANN outperforms the DANN by 6.3% and 4.9% on average when detecting arousal and valence values, respectively. The proposed model also achieves higher UAR than the numbers reported in [73]. The results for the other three kinds of features (VGGish, ResNet, and BERT) are reported in Table 3.4. The SIDANN has a slight advantage over the DANN (+0.7% and +1.3% for arousal and valence). Additionally, we find that the BERT lexical features perform worst for arousal detection but best for valence detection, which is consistent with the results of the within-domain experiments.

3.5 Conclusions

In this chapter, we study the Unsupervised Domain Adaptation (UDA) problem for emotion recognition with multimodal data including speech, vision, and language. We propose the Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to separate the speaker bias from the domain bias. Specifically, we add a speaker discriminator to detect the speaker's identity, with a gradient reversal layer at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information. The cross-domain experimental results indicate that the proposed SIDANN model outperforms the DANN model (+5.6% and +2.8% on average for detecting arousal and valence), confirming that the SIDANN has a better domain adaptation ability than the DANN. Though the multimodal methods perform better than the unimodal methods in the within-domain experiments, the unimodal methods perform better in the cross-domain experiments. Therefore, in future work we need to explore additional multimodal fusion techniques to address this problem. We also plan to evaluate our proposed model on other tasks to assess its general ability to reduce between-subject variance.

Chapter 4
Contrastive Learning for Domain Transfer in Cross-Corpus Emotion Recognition

4.1 Introduction

Emotions wield significant influence over human creativity and are crucial to human intelligence and decision-making. For computers to engage in natural and intelligent interactions with humans, they require the capability to perceive and convey emotions [126]. In recent years, deep learning methodologies have demonstrated promising efficacy in emotion recognition [176, 204, 210]. However, emotion recognition performance drops across corpora as a result of variations in sensors (i.e., cameras), environments (i.e., lighting conditions and backgrounds), and subjects (i.e., ethnicity, age, gender, etc.). Figure 4.1 shows an example of such variations across two commonly used datasets for emotion recognition.

Figure 4.1: The discrepancies across datasets (domains) result in lower emotion recognition performance. The top two rows are from Aff-Wild2 [84] and the bottom two rows are from SEWA [87].

Supervised emotion recognition requires laborious manual coding by trained annotators. As a result, it is not practical to label every new dataset or person, and unsupervised domain adaptation methods are more suitable to address the domain shift for such problems. Emotions result in subtle and localized changes in the face. However, unsupervised domain adaptation methods like the Domain-Adversarial Neural Network (DANN) [53] or Deep Adaptation Networks (DAN)
[106] cannot guarantee to preserve the local features necessary for emotion recognition while reducing domain discrepancies in global features. This limits their ability to improve emotion recognition performance across corpora. Another approach for leveraging unlabeled data is self-supervised learning: self-supervised representation learning leverages proxy supervision, which has great potential for improving performance in computer vision applications [104].

To address this problem, we propose Face wArping emoTion rEcognition (FATE) (see Figure 4.5). Unlike traditional domain adaptation models, in which the base model is first trained with the source data and then fine-tuned with the source and target data, we reverse the training order. Specifically, we employ first-order facial animation warping [152] to generate a synthetic dataset (see Figure 4.3). We choose an anchor image (a face) from the target domain and drive its synthetic facial behavior with source video sequences, transferring the facial movements from the source videos to the target subjects. Then, we apply contrastive learning with the instance-based Information Noise Contrastive Estimation (InfoNCE) loss [122, 129] to pre-train the encoder with the real and synthetic video pairs. The corresponding pairs of real and synthetic videos have the same facial movements but different subject appearances from different domains. Through this self-supervised pre-training, the encoder can learn domain-invariant information, focusing on the facial features only. Finally, we utilize the labeled source data to fine-tune the encoder and the classifier. FATE can keep the subtle facial features after fine-tuning: since the domain spaces are aligned through contrastive learning, supervised training of the domain-invariant encoder with the labeled source data improves the performance across corpora.

To confirm the effectiveness of our proposed model, we conduct cross-domain experiments with three publicly available fully-annotated continuous emotion recognition databases (Aff-Wild2 [84], SEWA [87], and SEMAINE [112]), using only the facial behaviors for emotion recognition. The Aff-Wild2, SEWA, and SEMAINE databases have in total 2,786,201, 946,932, and 1,440,000 frames with 458, 262, and 24 subjects, respectively, and each frame is labeled with both arousal and valence values. Our experimental results indicate that the proposed FATE model achieves enhanced emotion recognition performance across corpora. Specifically, our proposed model outperforms all the DA baselines in terms of the average Concordance Correlation Coefficient for both arousal and valence (see Table 4.1), showing that FATE has a better domain generalization ability for emotion recognition.

The major contributions of this chapter are as follows.
• We utilize the first-order motion model to warp target faces with source video sequences. With the generated data and contrastive learning, the encoder learns domain-invariant information and focuses on the facial features only. To the best of our knowledge, we are the first to study visual emotion recognition with contrastive learning.
• We propose a novel domain adaptation model, FATE, for emotion recognition, which reverses the training order of traditional domain adaptation models. The proposed model can preserve the subtle facial features after fine-tuning.
• Our experimental results indicate that the proposed FATE model substantially increases emotion recognition performance across corpora.
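For reference, the Concordance Correlation Coefficient (CCC) used to evaluate this chapter's models follows its standard definition and can be computed as in the short PyTorch sketch below (an illustration, not the evaluation script used here).

```python
import torch

def ccc(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Concordance Correlation Coefficient between 1-D tensors of predictions and labels."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * covariance / (
        pred.var(unbiased=False) + gold.var(unbiased=False) + (pred_mean - gold_mean) ** 2
    )
```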
4.2 Data

In this section, we introduce the datasets we use in our experiments, as well as the synthetic data we use to pre-train FATE. Models are evaluated on three widely used benchmarks for emotion recognition, i.e., Aff-Wild2 [84], SEWA [87], and SEMAINE [112]. All three datasets are recorded in real-world settings.

4.2.1 Datasets

Aff-Wild2 is an extension of the previous Aff-Wild [83] database. Aff-Wild2 consists of 558 videos from 458 subjects with 2,786,201 frames, showing both subtle and extreme human behaviors in real-world settings. As for the annotations, 558 videos are annotated with valence and arousal values, 84 videos have annotations for the seven basic facial expressions, and 63 videos have facial action unit (AU) labels. We use the cropped and aligned images provided by the organizers [82]. As the Aff-Wild2 test set is not released, we use the released validation set for testing and randomly divide the training set into a training and a validation subset (with an 85/15 split).

SEWA contains 1,525 minutes of audio-visual data of people's reactions to adverts from 398 individuals and includes annotations for facial landmarks, facial action unit intensities, various vocalizations, verbal cues, mirroring, rapport, continuously valued valence, arousal, and liking, and prototypic examples (templates) of (dis)liking and sentiment. As we focus on visual emotion recognition, we exclude the videos annotated from the video and the audio at the same time as well as the audio-only ones. After this filtering, we use a subset with 262 subjects and 946,932 frames in total. All frames are cropped and aligned using facial landmarks. We randomly divide the dataset into a training, a validation, and a test subset (with a 70/20/10 split).

SEMAINE is a video database of human-agent interactions. It is recorded in a Wizard-of-Oz experiment where participants held conversations with an operator who adopted various roles designed to evoke emotional reactions [27]. It contains 1,440,000 frames with 24 subjects, and each frame is annotated with arousal, expectancy, power, and valence values. There is no official split for this dataset; thus, similar to SEWA, we randomly divide it into a training, a validation, and a test subset (with a 70/20/10 split).

Figure 4.2: Overview of the first-order motion model [152]. The model takes a single image and a driving video as input and generates an output video by retargeting animation.

Valence and arousal for all three datasets take values in the range [-1, +1]. All the data partitions are subject-independent, which means that subjects in one fold do not appear in the other two folds. Due to computational resource limitations, we only use randomly selected parts of the data from the Aff-Wild2, SEWA, and SEMAINE datasets. To balance the number of samples from each dataset, we use 10%, 30%, and 15% of the data from Aff-Wild2, SEWA, and SEMAINE, respectively. For all three datasets, we only use the facial behaviors for emotion recognition.

Synthetic Dataset. Warping is widely used for image animation [151, 58]. Image warping is a geometric transformation that maps pixels or feature points from one location to another. We utilize the pre-trained first-order motion model from [152] to drive and warp target faces with the source video sequences (see Figure 4.2).
The model takes a single image and a driving video as input and generates an output video by retargeting animation. It decouples subject appearance and motion information by extracting a motion representation using a self-supervised formulation, and then utilizes this representation to generate a backward optical flow and an occlusion map. The backward optical flow is used to perform backward warping on the feature map for motion translation. The occlusion map is used to automatically estimate the parts of the subject that are not visible in the source image but should be inferred from context, for example, showing teeth while smiling. Finally, the subject image, the optical flow, and the occlusion map are used together to render the target video sequences.

Figure 4.3: Real and synthetic (Syn) video pairs for self-supervised contrastive pre-training. Driving videos are from the source domain (Aff-Wild2) and anchor faces are from the target domain (SEWA). The corresponding real and synthetic videos have the same facial movements but different subject appearances.

Three directions of face warping are performed: Aff-Wild2 to SEWA, SEWA to Aff-Wild2, and Aff-Wild2 to SEMAINE. For Aff-Wild2 to SEWA and SEWA to Aff-Wild2, we randomly select 150 anchor images and driving videos to perform the face warping, while for Aff-Wild2 to SEMAINE, we randomly select 90 image and video pairs to generate the synthetic data. After the warping, we obtain the synthetic dataset, namely DSyn; we refer to the corresponding source video sequences as DReal. Specifically, we randomly select 80% of the video sequence pairs for training and the remaining 20% for validation. Examples of the real-synthetic video pairs are shown in Figure 4.3.

4.2.2 Data Observation

We show the label distributions and the correlation between the arousal and valence values with heat maps (see Figure 4.4), where the x-axis is arousal and the y-axis is valence.

Figure 4.4: Heat maps of the arousal-valence distributions for the different datasets: (a) Aff-Wild2; (b) SEWA; (c) SEMAINE.

In Aff-Wild2 (Figure 4.4a), most of the arousal values are positive. There are three main trends as arousal increases: first, valence is directly proportional to arousal; second, valence is inversely proportional to arousal; the last trend lies on the left side of the x-axis, where the valence values are around zero. In SEWA (Figure 4.4b), the distribution is more disordered than in Aff-Wild2; there is only one trend, with valence directly proportional to arousal, and the data near zero are more scattered than in Aff-Wild2. In SEMAINE (Figure 4.4c), the data are more scattered than in the other two datasets, and there is no obvious trend in the heat map.

4.3 Method

In this section, we introduce the notations and present the detailed training strategies for the proposed Face wArping emoTion rEcognition (FATE) model. The overview of the proposed model is shown in Figure 4.5, and the pseudo-code for training the FATE model for one epoch is shown in Algorithm 2.
4.3.1 Notations

Let the source dataset be D_s = {(x^s_1, y^s_1), ..., (x^s_N, y^s_N)} and the target dataset be D_t = {x^t_1, ..., x^t_M}, where x and y denote the images and the emotion labels (arousal or valence). N is the number of source samples and M is the number of target samples.

Figure 4.5: The overview of the proposed Face wArping emoTion rEcognition (FATE). (a) Step 1: Pre-train with real-synthetic (Real-Syn) video pairs with contrastive learning and the InfoNCE loss; representations of a positive pair should be as similar as possible. (b) Step 2: Fine-tune with source data; the learning rate for the encoder is the one for the classifier multiplied by a small constant. Unlike traditional DA models, FATE reverses the training order: it is first pre-trained with real-synthetic video pairs and then fine-tuned with labeled source data in a supervised manner. The encoder is a ResNet (2+1)D [168] up to the penultimate layer followed by two fully-connected layers, and the classifier is one fully-connected layer.

4.3.2 Base Model

The base model has two parts: the encoder F and the classifier G. The encoder F contains the ResNet (2+1)D [168] (pre-trained on Kinetics-400 [78]) up to the penultimate layer and two fully connected layers, while the classifier G contains one fully connected layer. Either a single frame or multiple frames can be fed into the base model: for pre-training with contrastive learning, the input is multiple frames, and for emotion recognition, the input is a single frame.

4.3.3 Face wArping emoTion rEcognition (FATE)

Visual emotion recognition is a challenging task due to the subtlety of the facial movements, requiring local features for their detection. Domain adaptation models like DANN [53] or DAN [106] cannot guarantee to preserve the local features necessary for visual emotion recognition while reducing domain discrepancies in global features. This limits their ability to improve emotion recognition performance across corpora. Therefore, to keep the subtle facial features, unlike the traditional domain adaptation models in which the base model is first trained with the source data and then fine-tuned with the source and target data, we reverse the training order. Specifically, we first utilize contrastive learning with the InfoNCE loss [122, 129] to pre-train the encoder F with D_Real and D_Syn. Then we employ the labeled source data to fine-tune the encoder F and the classifier G.

Step 1: Pre-training. We utilize contrastive learning with the instance-based Information Noise Contrastive Estimation (InfoNCE) loss [122, 129] to perform the self-supervised pre-training. During this stage, we do not need any emotion labels. For each mini-batch during training, given b pairs of real and synthetic video sequences, we first randomly sample f continuous frames for each pair of videos, resulting in {r^k, s^k}_{k=1}^{b}. Then we feed them into the encoder F to extract the temporal representations {z^k_r, z^k_s}_{k=1}^{b}. (z^i_r, z^j_s) is a positive pair if i = j; otherwise it is a negative pair (r and s stand for real and synthetic). For each real sample r^i (i = 1, 2, ..., b), the InfoNCE loss L_i is defined as follows.
L_i = -\log \frac{\exp\left(S(z^i_r, z^i_s)/\tau\right)}{D_i}, (4.1)

D_i = \sum_{k=1, k \neq i}^{b} \left[ \exp\left(S(z^i_r, z^k_r)/\tau\right) + \exp\left(S(z^i_r, z^k_s)/\tau\right) \right], (4.2)

where S(u, v) stands for the normalized cosine similarity between the two vectors u and v, defined as S(u, v) = u^\top v / (\|u\|_2 \|v\|_2), and \tau (> 0) is the temperature parameter. Lastly, we average the InfoNCE loss over the mini-batch:

L_{NCE} = \frac{1}{b} \sum_{i=1}^{b} L_i. (4.3)

The InfoNCE loss pulls the positive real-synthetic video clip pairs closer and pushes them away from the other clips within the mini-batch. Each pair of video clips has exactly the same facial movements but distinct subject appearances from different domains. Once the encoder F has converged, F is invariant to the domain information and focuses on facial features only.

Algorithm 2 Pre-train the encoder F in one epoch.
Require: The batch size b and the number of batches for one epoch of training t.
Require: Hyper-parameters for the SGD optimizer: learning rate lr and momentum m.
Require: b pairs of real and synthetic videos {v^k_r, v^k_s}_{k=1}^{b}.
Require: The number of frames for one video clip f.
Require: The temperature parameter for the InfoNCE loss \tau.
Require: Parameters \theta for the encoder and the corresponding mapping F.
for batch = 1, 2, ..., t do
    for k = 1, 2, ..., b do
        Sample f continuous frames r^k = {r^k_i}_{i=1}^{f} for the real video v^k_r from D_Real.
        Select the corresponding frames s^k = {s^k_i}_{i=1}^{f} for the synthetic video v^k_s from D_Syn.
        z^k_r ← F(r^k)
        z^k_s ← F(s^k)
    end for
    S ← {z^k_r, z^k_s}_{k=1}^{b}
    loss ← L_NCE(S, \tau)
    \theta ← SGD[\nabla_\theta(loss), \theta, lr, m]
end for

The pseudo-code for pre-training the proposed model in one epoch is shown in Algorithm 2.

Step 2: Fine-tuning. After pre-training, we employ the labeled source data D_s to fine-tune both the encoder F and the classifier G. To ensure smaller updates to the encoder weights during this step, the learning rate for the encoder is set to be smaller than that of the classifier by a constant factor lr_mult (≤ 1). The loss function is the Mean Squared Error (MSE):

L_f = \frac{1}{N} \sum_{i=1}^{N} (y^s_i - \hat{y}^s_i)^2, (4.4)

where y^s is the ground-truth label and \hat{y}^s is the prediction. During pre-training, the high-level representations of video clips with similar facial movements are pulled closer together; thus, the encoder F is trained to focus on facial features and to be invariant to the subject appearances from different domains. During fine-tuning, the model is trained to improve the emotion recognition performance. The learning rate for the encoder is set to be much smaller than that of the classifier, so the encoded high-level representations can keep these subtle facial features for improved emotion recognition performance. Since the domain spaces have already been aligned before fine-tuning, training the model with only the labeled source data improves the performance across corpora.
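To make the pre-training objective of Step 1 concrete, the mini-batch InfoNCE loss of Equations 4.1-4.3 can be written compactly as in the sketch below. This is a minimal illustration that assumes the clip representations have already been produced by the encoder; the function and tensor names are placeholders rather than the released implementation.

import torch
import torch.nn.functional as F

def fate_info_nce(z_real, z_syn, tau=0.1):
    # z_real, z_syn: (b, d) encoder outputs for b paired real/synthetic clips.
    b = z_real.size(0)
    zr = F.normalize(z_real, dim=-1)
    zs = F.normalize(z_syn, dim=-1)
    sim_rr = zr @ zr.t() / tau              # S(z_r^i, z_r^k) / tau
    sim_rs = zr @ zs.t() / tau              # S(z_r^i, z_s^k) / tau
    pos = sim_rs.diag()                     # S(z_r^i, z_s^i) / tau (numerator of Eq. 4.1)
    off_diag = 1.0 - torch.eye(b)           # excludes k = i, as in Eq. 4.2
    denom = (sim_rr.exp() * off_diag).sum(dim=1) + (sim_rs.exp() * off_diag).sum(dim=1)
    loss_i = -(pos.exp() / denom).log()     # Eq. 4.1
    return loss_i.mean()                    # Eq. 4.3, averaged over the mini-batch

# Example: a mini-batch of b = 8 paired clip representations of dimension 128.
loss = fate_info_nce(torch.randn(8, 128), torch.randn(8, 128))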
4.4 Experiments

In this section, we describe the experimental design and the implementation details. We also report and discuss the experimental results.

4.4.1 Experimental Design

Three directions of cross-domain experiments are conducted: Aff-Wild2 to SEWA, SEWA to Aff-Wild2, and Aff-Wild2 to SEMAINE. First, the target data is randomly partitioned into training, validation, and test sets. We employ the labeled source data and the unlabeled target data for unsupervised domain adaptation. We use all the source data and the unlabeled training partition of the target data for training. For our own proposed model, we only use the source data for fine-tuning. For all the models, we use the second and third folds of the target data for validation and test. We implement three domain adaptation methods, i.e., DANN [53], DAN [106], and ADDA [174], with the base model.

Table 4.1: CCC values for different directions of domain adaptation. Source means directly transferring the source model to the target domain. Target means both training and evaluating the model with the target data. A and V stand for arousal and valence respectively. * denotes values reported in the original work. Kossaifi et al. [86] and Deng et al. [37] use the visual modality for detection while Yang et al. [196] use the acoustic modality.

Direction             | Aff-Wild2 to SEWA | SEWA to Aff-Wild2 | Aff-Wild2 to SEMAINE | A Avg | V Avg
                      |   A        V      |   A        V      |    A         V       |       |
Source                | 0.105    0.253    | 0.052    0.103    | 0.060     -0.003     | 0.072 | 0.118
DANN [53]             | 0.206    0.207    | 0.080    0.114    | 0.242      0.163     | 0.176 | 0.161
DAN [106]             | 0.172    0.423    | 0.088    0.056    | 0.203      0.143     | 0.154 | 0.207
ADDA [174]            | 0.125    0.078    | 0.044    0.081    | 0.094      0.113     | 0.088 | 0.091
FATE (ours)           | 0.264    0.500    | 0.120    0.187    | 0.221      0.347     | 0.202 | 0.344
Target                | 0.338    0.557    | 0.346    0.181    | 0.463      0.297     | 0.383 | 0.345
Kossaifi et al. [86]* | 0.440    0.600    |   -        -      |   -          -       |   -   |   -
Deng et al. [37]*     |   -        -      | 0.454    0.440    |   -          -       |   -   |   -
Yang et al. [196]*    |   -        -      |   -        -      | 0.680      0.506     |   -   |   -

4.4.2 Evaluation Criterion

The Concordance Correlation Coefficient (CCC) is the evaluation metric. CCC produces values in [-1, +1], with +1 indicating perfect concordance and -1 perfect discordance [84]. The higher the CCC value, the better the fit between the annotations and the predictions [84]. CCC is defined as

\rho_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}, (4.5)

where s_x^2 and s_y^2 are the variances of the ground truth and the prediction respectively, \bar{x} and \bar{y} are the corresponding mean values, and s_{xy} is the covariance.

4.4.3 Implementation Details

All methods are implemented in PyTorch [123]. Code and model weights are made available for the sake of reproducibility*. For the base model, the hidden sizes of the two fully-connected layers in the encoder are 256 and 128 respectively. The hidden size of the fully-connected layer in the classifier is 128.

For the DA baselines, we first train the base model with the Adam [81] optimizer (learning rate 2e-4, weight decay 1e-4) for 25 epochs with a batch size of 64. We then train the DA models by fine-tuning the base model. We train all the baselines with the Adam [81] optimizer (weight decay 1e-4) for 10 epochs with a batch size of 64. Hyper-parameters include the learning rate lr ∈ {1e-5, 3e-5, 1e-4, 3e-4, 1e-3} and the trade-off weight λ ∈ {0.3, 1, 3} in the domain adaptation model.

For our proposed FATE model, during pre-training, the encoder is trained with the Stochastic Gradient Descent (SGD) [18] optimizer (learning rate 1e-3, momentum 0.9) for 25 epochs with a batch size of 120. The temperature parameter for the InfoNCE loss τ is 0.1. Next, we fine-tune both the encoder F and the classifier G with the labeled source data only. During fine-tuning, we train the two parts with the Adam optimizer (weight decay 1e-4) for 25 epochs with a batch size of 64. The learning rate for the classifier is 2e-4. To make sure the weights of the encoder do not change too much, the learning rate for the encoder is set to a fraction of the learning rate for the classifier, at the rate of lr_mult = 0.3 (a hyper-parameter). All the hyper-parameters are selected based on the validation results.

* https://github.com/intelligent-human-perception-laboratory/Face-Warping-Emotion-Recognition
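For reference, the CCC of Equation 4.5, which is used to report all of the results that follow, can be computed with a few lines of NumPy. This is a minimal sketch over 1-D arrays of annotations and predictions; the function name is a placeholder.

import numpy as np

def concordance_cc(y_true, y_pred):
    # Concordance Correlation Coefficient, Eq. 4.5.
    x_mean, y_mean = y_true.mean(), y_pred.mean()
    x_var, y_var = y_true.var(), y_pred.var()
    cov = np.mean((y_true - x_mean) * (y_pred - y_mean))
    return 2 * cov / (x_var + y_var + (x_mean - y_mean) ** 2)

# Example: perfectly concordant predictions give a CCC of 1.
print(concordance_cc(np.array([0.1, -0.3, 0.6]), np.array([0.1, -0.3, 0.6])))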
4.4.4 Experimental Results

In Table 4.1, "Source" means directly evaluating the source model on the target domain without domain adaptation, whereas "Target" means training and evaluating the model with the target data. Therefore, the "Source" and "Target" results are the lower and upper bounds respectively. We include the results from [86], [37], and [196], which are state-of-the-art supervised emotion recognition models, for comparison.

When evaluated across corpora, we observe a large drop in performance for the base model in all three directions, especially for valence detection from Aff-Wild2 to SEMAINE. This demonstrates the challenging nature of cross-domain emotion recognition. The domain adaptation models can mitigate this problem to some extent. All the domain adaptation methods improve the emotion recognition performance on average, with the exception of ADDA for valence detection, which decreases the averaged performance. DANN and DAN are the best-performing domain adaptation baselines for arousal and valence detection respectively in terms of the averaged performance.

The proposed FATE model achieves superior results compared to the domain adaptation baseline models. For both arousal and valence detection, FATE achieves the best averaged performance in terms of CCC value. For valence detection, FATE narrows the gap between supervised and unsupervised learning (a decrease from 0.227 to 0.001 in terms of CCC value). The results demonstrate the effectiveness of pre-training with generated face-warping data and contrastive learning.

4.4.5 Ablation Study

lr_mult in FATE. In order to measure the effect of lr_mult on FATE's performance, we evaluate and report the performance of FATE from Aff-Wild2 to SEWA with different values of lr_mult ranging from 0 to 1 (see Table 4.2). When lr_mult = 0, FATE is reduced to a self-supervised model, whereas lr_mult = 1 turns the model into direct transfer learning. In both cases, the performance is limited. When lr_mult is between 0 and 1, both the arousal and valence performances are better than when lr_mult is 0 or 1. Specifically, for both arousal and valence detection, FATE achieves the best performance (0.264 and 0.500 CCC respectively) when lr_mult = 0.3. The results suggest that both the pre-training and the fine-tuning of FATE are essential to cross-domain emotion recognition.

Table 4.2: Ablation study for lr_mult in FATE. CCC values of domain adaptation from Aff-Wild2 to SEWA are reported.

lr_mult | Arousal | Valence
0.0     | 0.054   | 0.053
0.03    | 0.179   | 0.317
0.1     | 0.214   | 0.406
0.3     | 0.264   | 0.500
0.6     | 0.180   | 0.423
1.0     | 0.042   | 0.199

Preservation of subtle facial features. In order to show that FATE preserves subtle facial features, we evaluate all the DA models from Aff-Wild2 to SEWA with both low- and high-intensity data from the target test fold (see Table 4.3). Low-intensity samples have non-zero absolute arousal/valence values below 0.5, while high-intensity samples have absolute values above 0.5.

Table 4.3: Ablation study on the preservation of subtle facial features. CCC values of domain adaptation from Aff-Wild2 to SEWA are reported. A and V stand for arousal and valence respectively.

Method      | A low | A high | V low  | V high
Source      | 0.082 | 0.127  | 0.091  | 0.134
DANN [53]   | 0.054 | 0.219  | -0.046 | 0.135
DAN [106]   | 0.104 | 0.170  | 0.129  | 0.315
ADDA [174]  | 0.040 | 0.051  | 0.089  | -0.003
FATE (ours) | 0.148 | 0.259  | 0.250  | 0.358

We observe that all the models perform better on high-intensity data than on low-intensity data, except ADDA for valence detection.
FATE has the best performance in terms of all the four groups of data, proving that FATE preserves subtle facial features better than the other baselines. 4.5 Conclusions In this chapter, we study the problem of the cross-domain visual emotion recognition. Traditional DA models might not preserve local features which are necessary for visual emotion recognition while reducing domain discrepancies in global features. To address this problem, we reverse the training order. Specifically, we employ first-order facial animation warping to generate a synthetic dataset with the target data identities for contrastive pre-training. Then, we fine-tune the base model with the labeled source data. Our experiments with cross-domain evaluation indicate that the proposed FATE model substantially outperforms the existing domain adaptation models, suggesting that the proposed model has a better domain generalizability for emotion recognition. The proposed method only requires single still images with 55 no labels from new subjects in order to generalize better to the target samples. This will allow creating a generalizable encoder through the collection of a large amount of still images. 56 Chapter 5 X-Norm: Exchanging Normalization Parameters for Bimodal Fusion 5.1 Introduction Multimodal learning aims to build models that can process and relate information from multiple modalities [7, 156, 6, 16]. In multimodal learning, leveraging multiple modalities to capture different views of the data is expected to enhance the model capacity and robustness [7]. Multimodal learning has a broad range of applications, e.g., emotion recognition [172], sentiment analysis [156], speech recognition [2], and action recognition [117, 118]. One challenge for multimodal learning is how to fuse different modalities and perform inference. There are two common approaches, (1) late fusion (Figure 5.1 left) which uses unimodal networks for inference and then fuses the decisions; (2) early fusion (Figure 5.1 middle) which concatenates the unimodal representations and then performs inference. These two methods are efficient due to their simplicity. However, in late fusion, there is no information shared between the unimodal networks. In early fusion, Wang et al. [192] find that different modalities overfit and generalize at different rates while Nagrani et al. [118] point out that machine perception models are typically modality-specific and optimized for unimodal benchmarks. Thus using concatenation for modality fusion makes the multimodal networks sub-optimal [192]. Therefore, such methods result in poor performance, at times worse than the unimodal models. 57 Layer Late Fusion Early Fusion X-Norm (Ours) Modality 1 Modality 2 Normalization Parameters Figure 5.1: Comparison between different modality fusion strategies (best viewed in color). There is no information exchanged in the late fusion (left) before the unimodal model outputs. Early fusion (middle) simply concatenates the unimodal representations. Our proposed method, X-Norm, (right) condenses and encodes the hidden states into the normalization parameters to be exchanged between modalities. With the emergence of the Transformer architecture [184], recent works [172, 131, 216, 199] use the cross-modal self-attention mechanism where the query is from one modality while the key and value are from the other ones. The cross-modal self-attention mechanism is capable of capturing long-term dependencies regardless of the alignment between modalities. 
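As a concrete illustration of this cross-modal attention pattern (a generic sketch, not the exact layers used in the cited works), one modality can attend to another with a standard multi-head attention module in which the query comes from one stream and the key and value come from the other:

import torch
import torch.nn as nn

d_model, n_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_feats = torch.randn(2, 20, d_model)   # queries from the text stream
audio_feats = torch.randn(2, 50, d_model)  # keys/values from the audio stream

# Each text position gathers information from all audio positions,
# regardless of how the two sequences are aligned in time.
fused, attn_weights = cross_attn(query=text_feats, key=audio_feats, value=audio_feats)
print(fused.shape)  # torch.Size([2, 20, 256])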
However, these methods are designed to be trained from scratch, and their high complexity hinders their utility for training and inference in an end-to-end manner.

Nagrani et al. [118] identified that condensing the information shared between the modalities can improve fusion performance. In style translation, Huang et al. [70] propose Adaptive Instance Normalization (AdaIN), which aligns the mean and variance of the content features with those of the style features. The authors find that the style information can be encoded into the normalization parameters and thus makes the style of the generated image similar to the given reference. Inspired by these findings, we explore a modality fusion strategy that only shares limited but meaningful information, i.e., normalization parameters.

In this chapter, we propose X-Norm, a novel, simple, and efficient approach for bimodal fusion. X-Norm generates and exchanges the normalization parameters between the modalities for fusion (see Figures 5.1 and 5.2). Specifically, we propose a Normalization Exchange (NormExchange) layer (see Figure 5.3) which generates and exchanges the normalization parameters for the two modalities. We insert the NormExchange layers in the encoders to share the modality-specific information and therefore implicitly align the feature spaces. X-Norm performs inference in a late-fusion manner, computing a weighted average of the two modalities' logits. We note that X-Norm is task-agnostic, architecture-agnostic, and lightweight.

To show the effectiveness of our proposed method, we conduct extensive experiments on two different multimodal tasks, i.e., emotion recognition and action recognition, with different combinations of modalities and different architectures. Specifically, we evaluate (1) speech-based (audio and language) emotion recognition with a Transformer-based architecture on IEMOCAP [20]; (2) vision-language emotion recognition with a Transformer-based architecture on MSP-Improv [21]; and (3) CNN-based action recognition on EPIC-KITCHENS-100 [35] with the RGB and optical flow modalities. Our experimental results show that X-Norm achieves comparable or superior performance compared to the existing methods.

The major contributions of this chapter are as follows.

• We present a novel approach for bimodal fusion, X-Norm, which generates and exchanges the normalization parameters between the modalities and implicitly aligns the feature spaces. X-Norm is task-agnostic, architecture-agnostic, and lightweight.

• Extensive experiments with different tasks and architectures show that X-Norm achieves comparable or superior performance compared with the existing methods, with lower training costs.

5.2 Method

In this section, we present X-Norm, a novel and lightweight method that utilizes the exchange of normalization parameters to implicitly align the modalities.

Figure 5.2: Overview of the proposed X-Norm for bimodal fusion. Inputs of the two modalities are fed into two unimodal branches. In the proposed NormExchange layer, normalization parameters are exchanged. At last, the logits from each branch are combined by a weighted average.

5.2.1 Problem Formulation

Multimodal emotion recognition.
Given an utterance set S, for every x ∈ S, we have its three modalities x_t, x_v, x_a, where t, v, and a denote text, vision, and audio respectively. The goal is to detect the categorical emotion y for x using a function f(·):

y = f(x_t, x_v, x_a), (5.1)

where x_t and x_a are 2D tensors; the first dimension is the sequence length and the second one is the feature dimension. x_v is a 4D tensor; the first dimension is the number of channels, the second one is the number of frames, and the last two dimensions are the height and width respectively. Note that, for illustrative purposes, we show three modalities here. However, in this chapter, we only work with models with two modalities, e.g., language and audio or language and vision.

Multimodal action recognition. Given a video set S, for every x ∈ S, we have two modalities x_r and x_o, where r and o denote RGB and optical flow respectively. The goal is to detect the action y for x using a function f(·):

y = f(x_r, x_o), (5.2)

where x_r and x_o are 4D tensors, similar to x_v mentioned above.

Figure 5.3: Illustration of the proposed Normalization Exchange (NormExchange) layer (best viewed in color). NormEncoders (Normalization Encoders) condense and encode the hidden states into the normalization parameters. Then, we perform affine transformations with the normalized hidden states* and the opposite normalization parameters. At last, we utilize the skip connections to keep the original modality information. *Norm(·) can be either batch, layer, or instance normalization.

5.2.2 Motivation

To address the problem that concatenation makes the multimodal networks sub-optimal, Nagrani et al. [118] propose to condense the relevant information in each modality and share only what is necessary. In style translation, Huang et al. [70] found that the style information can be encoded into the normalization parameters: using those normalization parameters to align the mean and variance of the content features makes the style of the generated image similar to the given reference. Thus, inspired by these findings, we explore a modality fusion strategy that only shares limited but meaningful information, i.e., mean and variance.

5.2.3 Overview

We show the overall pipeline of X-Norm for bimodal fusion in Figure 5.2. The general idea of X-Norm is to share only the limited but meaningful normalization parameters between the modalities. To achieve this, we propose a Normalization Exchange (NormExchange) layer (see Figure 5.3) which generates and exchanges the normalization parameters for the two modalities. We insert the NormExchange layers in the encoders to share the modality-specific information and therefore implicitly align the modalities. At last, X-Norm performs inference in a late-fusion manner which computes a weighted average of the two modalities' logits. The pseudo-code for training the proposed X-Norm for one epoch is shown in Algorithm 3.

5.2.4 Normalization Exchange (NormExchange)

We show the overview of the proposed Normalization Exchange layer in Figure 5.3.
To condense the information shared between the modalities and implicitly align the modalities at the same time, we generate and exchange the normalization parameters for the two modalities. Suppose we exchange the normalization parameters after the l-th layers in the encoders, and let the outputs of the l-th layers be h_1 and h_2 respectively. Given the features from the two modalities, we first generate the normalization parameters

⟨α_1, β_1⟩ = G_1(h_1), ⟨α_2, β_2⟩ = G_2(h_2), (5.3)

where G_i is the normalization encoder and α_i and β_i are the normalization parameters (i = 1, 2). These normalization parameters are then exchanged. In order to keep the original modality information while exchanging the normalization parameters, we utilize skip connections, which are also widely used with layer normalization [99]. Overall, a single NormExchange layer is defined as follows:

X-Norm(h_1) = h_1 + α_2 · Norm(h_1) + β_2,
X-Norm(h_2) = h_2 + α_1 · Norm(h_2) + β_1, (5.4)

where Norm(·) is the standard z-score normalization which makes the mean zero and the standard deviation one. Norm(·) can be either batch, layer, or instance normalization; in practice, we choose layer normalization following the Transformer architecture [184].

Algorithm 3 Train X-Norm for one epoch. The input has two modalities of data. The optimizer is AdamW.
Require: The number of training batches m.
Require: Encoders E_i, classifiers C_i, and normalization encoders G_i (i = 1, 2).
Require: Parameters θ for the encoders, classifiers, and normalization encoders.
Require: The number of layers L in the encoders.
Require: The list a of insert positions for the NormExchange layers.
Require: The weight λ for averaging the logits of the two modalities.
for batch = 1, ..., m do
    Sample a batch {x_1, x_2, y} from the training data.
    h_1, h_2 ← x_1, x_2
    for l = 1, ..., L do
        l_1, l_2 ← the l-th layers in E_1 and E_2
        h_1, h_2 ← l_1(h_1), l_2(h_2)
        # NormExchange
        if l ∈ a then
            α_1, β_1 ← G_1(h_1)
            α_2, β_2 ← G_2(h_2)
            h_1 ← h_1 + α_2 · Norm(h_1) + β_2
            h_2 ← h_2 + α_1 · Norm(h_2) + β_1
        end if
    end for
    p_1, p_2 ← C_1(h_1), C_2(h_2)
    p ← λ · p_1 + (1 − λ) · p_2
    loss ← Cross-entropy(y, p)
    θ ← AdamW(∇_θ(loss), θ)
end for

In the NormExchange layer, only limited information (the normalization parameters) is shared between the modalities. Therefore, even if the feature spaces of the different modalities are not aligned well, each branch will not be affected by confounding signals from the other one.

5.2.5 Positions for NormExchange

To avoid too many model parameters and keep training and inference efficient, we restrict the number of NormExchange layers to no greater than three. Specifically, we insert the NormExchange layers after the first, the middle, and the last layers in the encoders so that both shallow and deep features are shared between the modalities.

5.2.6 X-Norm

X-Norm has two branches, one for each modality (see Figure 5.2). Each branch consists of one encoder E_i and one classifier C_i (i = 1, 2). We insert one or several NormExchange layer(s) in the encoders. In X-Norm, only the normalization parameters in the encoders are shared between the modalities, while there is no information shared in the classifiers. At last, X-Norm fuses the modalities by weighting the logits:

p_1 = C_1(E_1(x_1)), p_2 = C_2(E_2(x_2)),
p = λ · p_1 + (1 − λ) · p_2, (5.5)

where p denotes the final logits and λ is a hyper-parameter. The objective function for the model is the cross-entropy loss

L = -\frac{1}{n} \sum_{i=1}^{n} y_i \log(p_i), (5.6)

where n is the number of training samples, y_i is the ground truth of the i-th sample, and p_i is the corresponding softmax output for its class.
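A minimal PyTorch sketch of the NormExchange layer in Equations 5.3 and 5.4 is given below. For simplicity it assumes the two streams carry token-style hidden states of the same width and sequence length; the class and variable names are placeholders, and the released implementation may shape the normalization encoders differently.

import torch
import torch.nn as nn

class NormExchange(nn.Module):
    # Exchange normalization parameters between two modality streams (Eqs. 5.3-5.4).
    def __init__(self, dim):
        super().__init__()
        # Normalization encoders G_1, G_2: predict (alpha, beta) from the hidden states.
        self.norm_encoder1 = nn.Linear(dim, 2 * dim)
        self.norm_encoder2 = nn.Linear(dim, 2 * dim)
        self.norm = nn.LayerNorm(dim)  # Norm(.) is layer normalization in practice

    def forward(self, h1, h2):
        a1, b1 = self.norm_encoder1(h1).chunk(2, dim=-1)  # <alpha_1, beta_1> = G_1(h_1)
        a2, b2 = self.norm_encoder2(h2).chunk(2, dim=-1)  # <alpha_2, beta_2> = G_2(h_2)
        # Each stream is modulated by the other stream's parameters, with a
        # skip connection keeping the original modality information.
        out1 = h1 + a2 * self.norm(h1) + b2
        out2 = h2 + a1 * self.norm(h2) + b1
        return out1, out2

# Example: two streams with hidden size 256 and a shared sequence length of 20.
layer = NormExchange(256)
o1, o2 = layer(torch.randn(2, 20, 256), torch.randn(2, 20, 256))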
Training X-Norm is efficient. First, the normalization encoders are the only additional learnable parameters. Second, with the advancement of large-scale pre-trained unimodal models [101, 69], X-Norm converges fast and generalizes well.

5.3 Experiments

In this section, we first describe the datasets and evaluation metrics (Section 5.3.1). The architecture for X-Norm is introduced in Section 5.3.2 and the baseline methods in Section 5.3.3. We show the implementation and training details in Section 5.3.4. We present the experimental results in Section 5.3.5, followed by an ablation study in Section 5.3.6 and qualitative results in Section 5.3.7.

Figure 5.4: Screenshots from (a) IEMOCAP [20], (b) MSP-Improv [22], and (c) EPIC-KITCHENS [35].

5.3.1 Datasets

To show the effectiveness of our proposed model, we evaluate X-Norm on two different tasks, namely emotion recognition and action recognition. Two public datasets are used to study multimodal emotion recognition: the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [20] and the MSP-Improv dataset [22]. For multimodal action recognition from an egocentric view, we use EPIC-KITCHENS-100 [35]. Screenshots from the three datasets are shown in Figure 5.4.

Datasets for Emotion Recognition

• IEMOCAP [20] is an acted, multimodal, and multispeaker dataset. It contains approximately 12 hours of audiovisual data, including video, speech, motion capture of the face, and text transcriptions. The dataset has five sessions and each session has one male and one female speaker. In this chapter, all IEMOCAP subsets are used. Overall, the dataset has 10,039 utterances. IEMOCAP includes manual transcriptions that we use for language analysis. Due to the less-than-ideal video quality, i.e., non-frontal view and low resolution (see Figure 5.4a), we choose to use the language and audio modalities for emotion recognition.

• MSP-Improv [22] is an acted audiovisual emotion dataset that explores emotional behaviors during acted and improvised dyadic interaction. Videos in MSP-Improv have 29.97 frames per second. MSP-Improv has six sessions where each session has one male and one female speaker. Overall, the corpus consists of 8,438 turns (over 9 hours) of emotional sentences. To show the effectiveness of our proposed method on other bimodal fusion setups, we use a different combination of modalities from IEMOCAP, i.e., language and vision, for emotion recognition. We transcribe MSP-Improv using Google Cloud enhanced Automatic Speech Recognition (ASR)* to generate the language data. All frames are cropped and aligned using the facial landmarks detected with OpenCV†. We discard the utterances that are less than 16 frames long or for which ASR fails to detect any speech.

For both emotion recognition datasets, we adopt the common four-class classification evaluation protocol [172], i.e., the happy, sad, angry, and neutral classes. In IEMOCAP, we merge excited with happy to achieve a more balanced label distribution [119, 92]. We discard the samples of the other classes and give the statistics of the two datasets in Table 5.1.

Table 5.1: Number of utterances of each class for IEMOCAP [20] and MSP-Improv [22].

Dataset    | Happy | Angry | Sad   | Neutral | Total
IEMOCAP    | 1,636 | 1,103 | 1,084 | 1,708   | 5,531
MSP-Improv | 2,454 | 863   | 786   | 3,369   | 7,472

IEMOCAP and MSP-Improv have five and six sessions respectively.
Following previous work [73, 74, 197], for IEMOCAP/MSP-Improv, models are evaluated by five/six-fold cross-validation, with four/five sessions for training and one for validation. The evaluation metrics are accuracy (ACC) and weighted F1- score (F1). We obtain the average and standard deviation of classification metrics within the five/six folds as a measure of overall performance. ∗ https://cloud.google.com/speech-to-text/docs/enhanced-models † https://opencv.org/ 66 Datasets for Action Recognition. EPIC-KITCHENS-100 [35] is the largest dataset for fine-grained action recognition in first-person (egocentric) vision. It is an extension of the EPIC-KITCHENS-55 [34]. The dataset has 55 hours of recording for daily kitchen activities, densely annotated with start/end times for each action or interaction [35]. Videos in EPIC-KITCHENS have 59.94 frames per second. The data is collected from 32 kitchens. We use the publicly available RGB and computed optical flow available in the dataset [35]. Due to the computing resource limitations, we select the three largest subsets (kitchens P01, P08, and P22) in the number of the training action instances for training and testing [117]. EPIC-KITCHENS has a large class imbalance offering additional challenges [117]. To ensure sufficient examples per class without balancing the training-set, we only analyze the performance for the eight largest action classes (’take’, ’put’, ’wash’, ’open’, ’close’, ’insert’, ’turn-on’, ’cut’), which form around 75% of the training action segments. As EPIC-KITCHENS does not release the annotations for the test-set, we use the released validation set to test on and randomly divide the original training set into a training and a validation subset with an 85%/15% split. The number of action segments in the train/validation/test sets is 11,127/1,893/2,115 respectively, where a segment is a time-stamped excerpt, with an action label [117]. The evaluation metric is accuracy (ACC). We compute and report the average and standard deviation with three different random seeds (0, 1, and 2) as the overall performance. 5.3.2 Architecture To show the effectiveness of the proposed X-Norm for modality fusion, we evaluate it with two different architectures. Specifically, we use Transformer-based and CNN-based models for emotion recognition and action recognition respectively. 67 Architecture for Emotion Recognition. We use a Transformer-based architecture for emotion recognition. We use RoBERTa-large [101] as the text encoder, TimeSformer [14] as the vision encoder, and HuBERT-large [69] as the audio encoder. RoBERTa and HuBERT are implemented by HuggingFace’s Transformers [193]. We use the released pre-trained weights for initialization. TimeSformer‡ is implemented with the officially released source codes. Since TimeSformer is originally designed for action recognition, we pre-train it with Aff-Wild2 [85] which is the largest dataset for emotion recognition in-the-wild. The normalization encoder in the NormExchange layer consists of one fully-connected layer. Architecture for Action Recognition. For action recognition, we use I3D [25], which is a CNN-based architecture, for both RGB and optical flow modalities. I3D is adopted from the publicly available code repository§ . The public repository provides the pre-trained weights for I3D with RGB and optical flow frames on ImageNet [38]. The normalization encoder in the NormExchange layer contains one 3D convolutional layer. 
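To illustrate the two normalization-encoder variants described above, each can be as small as a single layer producing an (alpha, beta) pair. The sketch below shows one plausible shaping of their outputs; the channel counts and the 1x1x1 kernel are placeholder choices rather than the exact configuration used in the experiments.

import torch
import torch.nn as nn

# Transformer branches (emotion recognition): one fully-connected layer over token features.
class FCNormEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 2 * dim)

    def forward(self, h):                           # h: (batch, tokens, dim)
        alpha, beta = self.fc(h).chunk(2, dim=-1)
        return alpha, beta

# CNN (I3D) branches (action recognition): one 3D convolution over spatio-temporal features.
class Conv3dNormEncoder(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv3d(channels, 2 * channels, kernel_size=1)

    def forward(self, h):                           # h: (batch, channels, T, H, W)
        alpha, beta = self.conv(h).chunk(2, dim=1)
        return alpha, beta

alpha_t, beta_t = FCNormEncoder(256)(torch.randn(2, 20, 256))
alpha_v, beta_v = Conv3dNormEncoder(64)(torch.randn(2, 64, 8, 14, 14))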
5.3.3 Baselines The baseline methods that we compare our method against on the three datasets are as follows. • Early fusion first concatenates representations of different modalities and then feeds the concatenated features into an MLP to get the final logits. • Late fusion averages the final logits for all modalities. • Tensor Fusion Network (TFN) [200] in which the representations are combined by the tensor outer product. ‡ https://github.com/facebookresearch/TimeSformer § https://github.com/piergiaj/pytorch-i3d 68 Table 5.2: Hyperparameters of X-Norm we use for the various benchmarks. IEMOCAP MSP-IMPROV EPIC-KITCHENS Modality 1 Text Text RGB Modality 2 Audio Vision Optical flow Batch size 128 32 16 Learning rate 1e-3 1e-3 1e-4 Average weight λ 0.5 0.5 0.3 • Multimodal Transformer (MulT) [172] utilizes the cross-modal self-attention mechanism for modality fusion, where the query is one modality while the key and value are the other ones. In the original MulT paper [172], the authors formulate emotion recognition on IEMOCAP as a multi-label task where each class is a binary classification task. However, we formulate emotion recognition as a four-class classification task. • Modality-Invariant and -Specific Representations (MISA) [63] disentangles the features of each modality into a modality-invariant and a modality-specific vector which provides a holistic view of the modality fusion. • Gradient-Blending (G-Blend) [192] computes an optimal blending of modalities based on their overfitting behaviors. 5.3.4 Implementation and Training Details All methods are implemented in PyTorch [123]. Code and model weights are available, for the sake of reproducibility¶ . We use a machine with two Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPUs with eight NVIDIA Quadro RTX8000 GPUs for all the experiments. We train all the models with the AdamW optimizer [107] for 50 epochs. The weight decay is 1e-4. The gradient clipping is 1.0. The dropout rate is 0.1. We stop the training procedure if the validation loss is ¶ https://github.com/intelligent-human-perception-laboratory/XNorm 69 Table 5.3: Comparison of X-Norm with different unimodal and multimodal baselines on IEMOCAP and MSP-IMPROV for emotion recognition and EPIC-KITCHENS for action recognition. A, V, T, R, and O stand for audio, vision, text, RGB, and optical flow respectively. ACC% and F1% stand for accuracy and weighted F1 score respectively. The numbers in the brackets are the standard deviations. X-Norm achieves comparable or superior performance to the existing methods. Dataset IEMOCAP MSP-IMPROV EPIC-KITCHENS Model Modality ACC ↑ F1 ↑ Modality ACC ↑ F1 ↑ Modality ACC ↑ RoBERTa [101] T 70.5 (2.4) 70.6 (2.4) T 57.5 (3.0) 56.2 (3.4) - - HuBERT [69] A 71.0 (2.9) 70.8 (2.8) - - - - - TimeSformer [14] - - - V 67.0 (4.8) 65.4 (5.1) - - I3D [25] - - - - - - R 61.4 I3D [25] - - - - - - O 60.0 Early fusion A+T 77.1 (1.3) 77.0 (1.0) V+T 57.0 (4.0) 55.5 (4.1) R+O 65.7 Late fusion A+T 77.5 (2.0) 77.7 (2.4) V+T 57.2 (3.1) 56.3 (3.2) R+O 65.3 TFN [200] A+T 77.3 (1.7) 77.6 (1.7) V+T 47.6 (6.8) 33.3 (10.4) R+O 66.8 MulT [172] A+T 64.5 (1.3) 64.5 (1.4) V+T 66.7 (3.2) 63.3 (4.4) R+O 38.5 MISA [63] A+T 29.9 (4.0) 14.8 (3.3) V+T 45.5 (6.5) 28.7 (6.9) R+O 26.6 G-Blend [192] A+T 73.2 (1.8) 73.2 (1.8) V+T 57.5 (4.5) 55.5 (4.4) R+O 71.4 (0.9) X-Norm (Ours) A+T 79.1 (2.4) 79.2 (2.3) V+T 73.6 (4.2) 72.8 (4.2) R+O 71.6 (1.3) not decreased in the last five epochs. Table 5.2 shows the settings of the various X-Norms that we train for different benchmarks. 
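For concreteness, the per-benchmark settings listed in Table 5.2 can be collected into a small configuration mapping such as the one below (a sketch; the key names are placeholders):

# Hyperparameters of X-Norm per benchmark, following Table 5.2.
XNORM_CONFIGS = {
    "IEMOCAP":       {"modalities": ("text", "audio"),       "batch_size": 128, "lr": 1e-3, "lambda": 0.5},
    "MSP-IMPROV":    {"modalities": ("text", "vision"),      "batch_size": 32,  "lr": 1e-3, "lambda": 0.5},
    "EPIC-KITCHENS": {"modalities": ("rgb", "optical_flow"), "batch_size": 16,  "lr": 1e-4, "lambda": 0.3},
}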
The hyperparameters that we use include: learning rate (1e-3, 1e-4, 1e-5), batch size (16, 32, 64, 128) and lambda (0.1, 0.3, 0.5, 0.7, 0.9). We select these hyperparameters based on the performance on the validation set. 5.3.5 Results and Discussion We show the quantitative results in Table 5.3. The numbers in the brackets are the standard deviations. For IEMOCAP and MSP-Improv, the standard deviations are computed over five and six folds respectively. For EPIC-KITCHENS, we train the models with three different random seeds (0, 1, and 2) and report the average and standard deviations as the overall performance. For unimodal models, we find that vision performs better than language for emotion recognition while RGB is better than optical flow for action recognition. 70 For multimodal baseline approaches, we observe that the performance of some multimodal baseline methods are worse than the unimodal results. Especially on MSP-Improv, the vision-only model outperforms all the multimodal approaches except X-Norm, showing the extreme challenge of multimodal fusion. On IEMOCAP and EPIC-KITCHENS, early fusion, late fusion, and TFN succeed in multimodal fusion while MulT and MISA perform poorly, especially MISA on IEMOCAP, with a performance (29.9% ACC) close to random chance (25.0%). G-Blend works slightly better than unimodal models on IEMOCAP and outperforms other baselines on EPIC-KITCHENS. Overall, the proposed X-Norm achieves superior or comparable performance to the existing methods. Especially on MSP-Improv, X-Norm performs significantly better than the other models (p-value < 0.05 for two-tailed t-test). Training Cost. We evaluate the training time (minutes) for all the end-to-end methods on EPIC-KITCHENS: early fusion: 119.5, late fusion: 119.1, TFN: 117.2, G-Blend: 245.9, and X-Norm:131.5. G-Blend and X-Norm have the best performance while X-Norm only needs 53% training time compared to G-Blend. 5.3.6 Ablation Study Number and Positions of the NormExchange Layers. We investigate the effect of the number and positions of the NormExchange layers in X-Norm. Specifically, we select one or several layers from the first, the middle, and the last layers in the encoders and insert the NormExchange layers after them. We evaluate and report the accuracy of X-Norm on EPIC-KITCHENS (see Table 5.4a). When the number of NormExchange layers is zero, X-Norm is reduced to a late fusion achieving the lowest performance (64.8% ACC). For adding a single NormExchange layer, the middle position is the best. X-Norm achieves the best performance (73.3%) with all the three NormExchange layers. This suggests that there is a benefit to having both shallow and deep features being shared between the modalities with NormExchange. 71 Table 5.4: Ablation study for X-Norm. We report the accuracy (ACC) on EPIC-KITCHENS (random seed is 0). (a) Ablation study for the number and positions of the NormExchange layers in X-Norm. # layers Position ACC ↑ 0 none 64.8 1 first 68.8 1 middle 71.6 1 last 70.7 2 first+last 69.8 2 first+middle 70.0 2 middle+last 70.3 3 first+middle+last 73.3 (b) Component contribution analysis for the NormExchange layer. Components ACC ↑ none 65.3 -skip connection 66.1 -mean 69.1 -variance 69.4 all 73.3 (c) Performance for X-Norm with various average weight λ. Value ACC ↑ 0.1 71.4 0.3 73.3 0.5 71.2 0.7 69.4 0.9 66.0 Component contribution analysis for the NormExchange layer. We investigate the component contribution for the NormExchange layer. 
Specifically, we remove the skip connection or the mean or the variance from the NormExchange layer. We evaluate and report the accuracy of X-Norm on EPIC-KITCHENS in Table 5.4b. The results show that all three components are essential to X-Norm while removing the skip connection has the largest performance drop. Value of the average weight. We investigate the effect of average weight λ in X-Norm. Specifically, we evaluate and report the accuracy of X-Norm on EPIC-KITCHENS (see Table 5.4c). Our results show that when λ = 0.3 (0.3 for RGB and 0.7 for optical flow), X-Norm achieves the best performance (73.3% ACC). Overall, X-Norm performs better with a smaller weight on RGB. However, RGB performs better than optical flow for unimodal action recognition (61.4% vs 60.0% ACC). We speculate this to be due to the model putting more effort into the inferior modality to avoid being over-reliant on the superior modality. We observe similar but more extreme results (0.1/0.9 for RGB/optical flow) for G-Blend [192], which computes the weights for every single modality by gradient estimation. 72 Figure 5.5: Qualitative analysis of the generated normalization parameters (best viewed in color). We show three examples of sentences from MSP-Improv. The sentences are on the left and the labels are on the right. <bos> and <eos> are tokens standing for begin of sentence and end of sentence. We compute the L1 norm of the normalization parameters for each token. Darker green indicates a higher norm. 5.3.7 Qualitative Analysis of Normalization Parameters In this section, we show that meaningful information is encoded into the normalization parameters. Specifically, given the 2D generated normalization parameters, where the first dimension is the sequence length and the second one is the feature dimension, we compute the L1 norm over the sequence dimension. We provide three examples of sentences with different emotion labels on MSP-Improv in Figure 5.5. Darker green indicates a higher norm over the token in this position. We observe that the peaks of those norms align with our human intuition. “great” in the first sentence indicates the happy emotion. “can’t” in the second sentence emphasizes the sadness. The contraction (“’t”) together with the word “believe” in the last sentence jointly contribute to the anger. 5.4 Conclusions In this chatper, we introduce X-Norm, a novel approach for modality fusion. X-Norm can effectively align the modalities by exchanging the parameters in normalization layers. Since only normalization parameters are engaged, the proposed fusion mechanism is lightweight and easy to implement. We evaluate X-Norm through extensive experiments on two multimodal tasks, i.e. , emotion and action recognition, and three databases. The quantitative and qualitative analyses show that normalization parameters can encode the modality-specific information and exchanging these between the two modalities 73 can implicitly align the modalities, thus enhancing the model capacity. Ablation study results show the importance of exchanging normalization parameters across multiple layers. Robust and lightweight multimodal fusion techniques are essential for building socially intelligent and autonomous systems. Methods such as X-Norm have the potential to take us closer to that goal. 74 Chapter 6 FG-Net: Facial Action Unit Detection with Generalizable Pyramidal Features 6.1 Introduction Automatic detection of facial action units is a fundamental block for objective facial expression analysis [46]. 
Manual annotations for facial action units are cumbersome and costly, as they require trained coders to label each frame individually. Common AU datasets, i.e., DISFA [111] and BP4D [205], only contain a limited number of subjects (27 and 41 subjects respectively). Recent methods for AU detection [212, 148, 109] focus on deep representation learning, requiring a large number of samples. Existing AU detection methods are often evaluated with within-domain cross-validation, with training and testing data from the same dataset, and the generalization to other datasets (model trained and tested on different datasets) has not been widely investigated. As within-domain performance can be due to overfitting, cross-corpus performance can suffer a considerable loss (see comparisons in Figure 6.1).

Figure 6.1: Performance (F1 score ↑) gap between within- and cross-domain AU detection for DRML [212], JÂA-Net [148], ME-GraphAU [109], and the proposed FG-Net. The within-domain performance is averaged between DISFA and BP4D, while the cross-domain performance is averaged between BP4D to DISFA and DISFA to BP4D. The proposed FG-Net has the highest cross-domain performance and thus superior generalization ability.

In the field of semantic segmentation, recent studies [208, 9] leverage a well-trained generative model to synthesize image-annotation pairs from only a few labeled examples (around 30 training samples). They show that the intermediate features of generative models exhibit semantic-rich representations that are well-suited for pixel-wise segmentation tasks in a few-shot manner. Li et al. [90] showcase extreme out-of-domain generalization ability from such approaches. However, to the best of our knowledge, no existing work adapts such architectures to AU detection, potentially due to the following limitations: (i) the high dimensionality of the pixel-wise features results in inefficient training, and (ii) inference with pixel-level features lacks the information from the nearby regions, which is crucial to AU detection [212, 148].

In this chapter, inspired by the success of GAN features in semantic segmentation, we propose FG-Net, a facial action unit detection method that can better generalize across domains. The general idea of FG-Net is to extract generalizable and semantic-rich deep representations from a well-trained generative model (see Figure 6.1). Specifically, FG-Net first encodes and decodes the input image with a StyleGAN2 encoder (pSp) [133] and a StyleGAN2 generator [77] trained on the FFHQ dataset [76]. Then, FG-Net extracts feature maps from the generator during decoding. To take advantage of the informative pixel-wise representations from the generator, FG-Net detects the AUs through heatmap regression. We propose a Pyramid CNN Interpreter which incorporates the multi-resolution feature maps in a hierarchical manner. The proposed module makes the training efficient and captures essential information from nearby regions.

Thanks to the powerful features from the generative model pre-trained on a large and diverse facial image dataset, the proposed FG-Net obtains a strong generalization ability and data efficiency for AU detection. To demonstrate the effectiveness of our proposed method, we conduct extensive experiments with the widely-used DISFA [111] and BP4D [205] datasets for AU detection.
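The dataflow just described (and detailed in Section 6.2) can be sketched as follows. This is a schematic illustration only: the encoder and generator below are random stand-ins for the pre-trained pSp encoder and StyleGAN2 generator (whose real interfaces differ), the feature-map shapes and channel counts are placeholders, and the interpreter is a stripped-down version of the Pyramid CNN Interpreter presented later.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterpreterBlock(nn.Module):
    # One interpreter block: Interpolate -> Conv -> ReLU -> BatchNorm (Figure 6.2).
    def __init__(self, in_ch, out_ch, out_size):
        super().__init__()
        self.out_size = out_size
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)
        return self.bn(F.relu(self.conv(x)))

class PyramidInterpreter(nn.Module):
    # c_i = C_i(c_{i-1} + f_i); the last block outputs one heatmap per AU (Eqs. 6.4-6.5).
    def __init__(self, channels=(512, 256, 128), n_aus=8, out_res=128):
        super().__init__()
        out_channels = list(channels[1:]) + [n_aus]
        out_sizes = [32, 64, out_res]
        self.blocks = nn.ModuleList([InterpreterBlock(c_in, c_out, s)
                                     for c_in, c_out, s in zip(channels, out_channels, out_sizes)])

    def forward(self, feats):
        c = 0
        for block, f in zip(self.blocks, feats):
            c = block(c + f)
        return c                                     # (B, n_aus, out_res, out_res)

class DummyEncoder(nn.Module):
    # Stand-in for the pSp encoder: image -> latent code w+ (18 x 512).
    def forward(self, img):
        return torch.randn(img.size(0), 18, 512)

class DummyGenerator(nn.Module):
    # Stand-in for the StyleGAN2 generator exposing three of its k feature maps.
    def forward(self, w_plus):
        b = w_plus.size(0)
        return [torch.randn(b, 512, 16, 16),
                torch.randn(b, 256, 32, 32),
                torch.randn(b, 128, 64, 64)]

encoder, generator, interpreter = DummyEncoder(), DummyGenerator(), PyramidInterpreter()
img = torch.randn(2, 3, 128, 128)
pred_heatmaps = interpreter(generator(encoder(img)))
loss = F.mse_loss(pred_heatmaps, torch.randn_like(pred_heatmaps))  # vs. ground-truth heatmaps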
The results show that the proposed FG-Net method has a strong generalization ability and achieves state-of-the-art cross-domain performance (see Figure 6.1). In addition, FG-Net achieves comparable or superior within-domain performance to the existing methods. Finally, we showcase that FG-Net is a data-efficient approach: with only 100 training samples, it can achieve decent performance.

Our major contributions are as follows.

• We propose FG-Net, a data-efficient method for generalizable facial action unit detection. To the best of our knowledge, we are the first to utilize StyleGAN model features for AU detection.

• Extensive experiments on the widely-used DISFA and BP4D datasets show that FG-Net has a strong generalization ability for heatmap-based AU detection, achieving superior cross-domain performance and maintaining competitive within-domain performance compared to the state-of-the-art.

• FG-Net is data-efficient. The performance of FG-Net trained on 1k samples is close to that of the whole set (∼100k).

6.2 Method

6.2.1 Problem Formulation

Facial Action Unit Detection. Given a video set S, for each frame x ∈ S, the goal is to detect the occurrence of each AU a_i (i = 1, 2, ..., n) using a function F(·):

a_1, a_2, ..., a_n = F(x), (6.1)

where n is the number of AUs and a_i = 1 if the i-th AU is active, otherwise a_i = 0.

Figure 6.2: Overview of our proposed pipeline. FG-Net first encodes the input image into a latent code using a StyleGAN2 encoder (e.g. pSp [133] here). In the decoding stage [77], we extract the intermediate multi-resolution feature maps and pass them through our Pyramid CNN Interpreter to detect AUs coded in the form of heatmaps. Mean Squared Error (MSE) loss is used for optimization between the ground-truth and predicted heatmaps.

6.2.2 Overview

Figure 6.2 illustrates an overview of the proposed FG-Net. FG-Net first encodes and decodes the input image with the pSp encoder [133] and the StyleGAN2 generator [77] pre-trained on the FFHQ dataset [76]. During the decoding, FG-Net extracts feature maps from the generator. Leveraging the features extracted from a generative model trained on a large-scale and diverse dataset, FG-Net offers higher generalizability for AU detection. To take advantage of the pixel-wise representations from the generator, FG-Net is designed to detect the AUs using heatmap regression. To keep the training efficient and capture both local and global information, a Pyramid-CNN Interpreter is proposed to incorporate the multi-resolution feature maps in a hierarchical manner and detect the heatmaps representing facial action units.

Figure 6.3: Visualizations of the ROI centers for DISFA (left) and BP4D (right). AU indices are labeled above or below.

6.2.3 Model

Prerequisites. Our proposed method is built on top of the StyleGAN2 generator [77]. The StyleGAN2 generator decodes a latent code z ∈ Z sampled from N(0, I) to an image. The latent code z is first mapped to a style code w ∈ W by a mapping function. Both z and w have 512 dimensions. There are k synthesis blocks (in practice k = 9) and each block has two convolution layers and one upsampling layer.
Each convolution layer is followed by an adaptive instance normalization (AdaIN) layer [70] which is controlled by the style code w. However, for image inversion, which encodes images into the latent space, the W-space has limited expressiveness and thus cannot fully reconstruct the input [194]. Therefore, prior works [1, 133] extend the W-space to the W+-space, where a different style code w is fed to each AdaIN layer. The W+-space alleviates the reconstruction distortion. The dimension of w+ ∈ W+ is 18 × 512.

Feature Extraction from StyleGAN2. To extract features from the StyleGAN2 generator, we first encode the input image to the latent space and then decode the latent code. Prior work [90] encodes the input image in an optimization-based manner. Optimization-based methods iteratively optimize a reconstruction objective, which is extremely time-consuming. Instead, we utilize the pSp encoder E [133] to encode the input image x ∈ X and get the latent code w+ ∈ W+ via w+ = E(x). Despite the efficient encoding of the pSp encoder, the generator features may not capture the key facial features for AU detection. To address this problem, we fine-tune the encoder and the generator during training.

Table 6.1: Region of Interest (ROI) for each action unit (AU). Scale is measured by inner-ocular distance (IOD). Landmark (Lmk) positions are illustrated in Figure 6.4.

AU | Description          | ROI Center
1  | Inner Brow Raiser    | Lmk 22, 23
2  | Outer Brow Raiser    | Lmk 18, 27
4  | Brow Lowerer         | Brow center
6  | Cheek Raiser         | 1 scale below eye center
7  | Lid Tightener        | Lmk 39, 44
9  | Nose Wrinkler        | Lmk 40, 43
10 | Upper Lip Raiser     | Lmk 51, 53
12 | Lip Corner Puller    | Lmk 49, 55
14 | Dimpler              | Lmk 49, 55
15 | Lip Corner Depressor | Lmk 49, 55
17 | Chin Raiser          | 0.5 scale below Lmk 57, 59
23 | Lip Tightener        | Lmk 52, 58
24 | Lip Pressor          | Lmk 52, 58
25 | Lips Part            | Lmk 52, 58
26 | Jaw Drop             | Lmk 57, 59

Then, we decode the latent code with the StyleGAN2 generator G [77] to obtain the image x' = G(w+). During decoding, we extract the intermediate activations from the generator. To keep the training efficient, unlike the previous work [208] which extracts the outputs from all the AdaIN layers [70], we only extract the hidden states after the second AdaIN layer in each block. We denote the feature maps obtained from the k blocks as {f_1, f_2, ..., f_k} = G'(w+) = G'(E(x)).

Heatmap Detection. The proposed method detects the AU occurrences with a heatmap regression-based approach. We generate the ground-truth heatmaps following the previous work [209]. We first define the Region of Interest (ROI) for each AU by selecting two points on the face based on the most representative landmarks (see Figure 6.3; detailed positions are provided in Table 6.1). Then, we generate the ground-truth heatmaps from the definition of the ROIs. Figure 6.5 gives a visualization of the ground-truth heatmaps on DISFA and BP4D.

Figure 6.4: The positions of the 68 facial landmarks. Image is adapted from the iBUG 300-W dataset [141].

Figure 6.5: Visualization of the generated ground-truth heatmaps on DISFA (first row) and BP4D (second row). We generate one heatmap for every AU, which has two Gaussian windows with the maximum values at the two ROI centers (see Figure 6.3). The peak value is either 1 (red, AU is active) or -1 (blue, AU is inactive).

Formally, given a face image x ∈ R^{w×h×3}, we generate n ground-truth heatmaps m_1, m_2, ..., m_n ∈ R^{w×h} from the AU labels, where n is the number of AUs. Specifically, for heatmap m_i, we add two Gaussian windows g^1_i and g^2_i with the maximum value at the two ROI centers c^1_i and c^2_i, following [209]:

g^j_i(p) = \lambda_i \exp\left(-\frac{\|p - c^j_i\|_2^2}{2\sigma^2}\right), j = 1, 2, (6.2)

m_i(p) = g^1_i(p) + g^2_i(p), (6.3)

where p is the pixel location (p ∈ [1, w] × [1, h]), \lambda_i is the indicator denoting whether the i-th AU is active (\lambda_i = 1 if a_i = 1, otherwise \lambda_i = -1), and \sigma is the standard deviation. We clip the heatmaps into the range [-1, 1] to make sure the peak value is either 1 or -1.
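A minimal NumPy sketch of this heatmap construction (Equations 6.2 and 6.3) for a single AU is shown below. The image size, sigma, and ROI-center coordinates are placeholder values, not those of any particular dataset.

import numpy as np

def au_heatmap(centers, active, w=128, h=128, sigma=20.0):
    # Ground-truth heatmap for one AU: two Gaussian windows at the ROI centers,
    # scaled by +1 (active) or -1 (inactive) and clipped to [-1, 1] (Eqs. 6.2-6.3).
    lam = 1.0 if active else -1.0
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:
        m += lam * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return np.clip(m, -1.0, 1.0)

# Example: an active AU with its two ROI centers near the mouth corners.
heatmap = au_heatmap(centers=[(45, 90), (83, 90)], active=True)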
After feature extraction from the generative model, prior work [208, 9] upsamples the features to the input resolution and concatenates them along the channel dimension. Then, each pixel is treated as a training sample and a multi-layer perceptron (MLP) is trained to detect the semantic class. Simply upsampling and concatenating all the feature maps results in redundant and high-dimensional features (in practice the number of channels is 6080), thus leading to inefficient training and inference. More importantly, using single-pixel features for inference lacks the spatial context from nearby regions, which is crucial to AU detection, as demonstrated in previous studies [212, 148].

To address these problems, we propose a multi-scale Pyramid-CNN Interpreter H for heatmap-based AU detection which incorporates the multi-resolution feature maps in a hierarchical manner (see Figure 6.2). Specifically, the Pyramid-CNN Interpreter H contains k pyramid levels, where k is the number of feature maps extracted from the generator. In each pyramid level, the hidden states from the previous pyramid level c_{i-1} are first summed with the feature map from the generator f_i and then passed through an interpreter block C_i. Each interpreter block consists of one Interpolate, one Convolution, one ReLU, and one BatchNorm layer. The final AU heatmap is m = c_k. Specifically,

c_0 = 0, c_i = C_i(c_{i-1} + f_i), i = 1, 2, ..., k, (6.4)

m = c_k = H(f_1, f_2, ..., f_k). (6.5)

6.2.4 Training and Inference

Training. The learning objective is the Mean Squared Error (MSE) loss between the ground-truth heatmap m and the detected heatmap \hat{m}: L = \|m - \hat{m}\|_2^2.

Inference. For each detected heatmap \hat{m}_i, we sum over the whole heatmap. If the sum is greater than 0, the corresponding AU is predicted as active; otherwise it is predicted as inactive.

6.3 Experiments

6.3.1 Datasets

We select two publicly available datasets, i.e., DISFA [111] and BP4D [205]. These two datasets are widely used for AU detection and are captured from different subjects with different backgrounds and lighting conditions. DISFA [111] includes videos from 27 subjects, with around 130,000 frames. Each frame has labels for eight AU intensities (1, 2, 4, 6, 9, 12, 25, and 26).
6.3 Experiments

6.3.1 Datasets

We select two publicly available datasets, i.e., DISFA [111] and BP4D [205]. These two datasets are widely used for AU detection and are captured from different subjects with different backgrounds and lighting conditions. DISFA [111] includes videos from 27 subjects, with around 130,000 frames. Each frame has labels for eight AU intensities (1, 2, 4, 6, 9, 12, 25, and 26). Following the settings of previous studies [212, 148, 109], we map AU intensities greater than 1 to the positive class. BP4D [205] consists of videos from 41 subjects with around 146,000 frames. Each frame has labels for 12 AU occurrences (1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, and 24). We use dlib [80] to detect the 68 facial landmarks for all the frames and FFHQ-alignment to align them. The detected landmarks are also used for generating the ground-truth heatmaps (see Figure 6.5).

6.3.2 Implementation and Training Details

All methods are implemented in PyTorch [123]. Code and model weights are available for the sake of reproducibility.* We use a machine with two Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPUs and eight NVIDIA Quadro RTX8000 GPUs for all the experiments. Each image is resized to 128 × 128. We train the proposed model with the AdamW optimizer [107] for 15 epochs with a batch size of 8 on a single GPU. The learning rate is 5e−5. The weight decay is 5e−4. The gradient clipping is set to 0.1. σ for the heatmaps (Equation 6.2) is 20.0. The dropout rate is 0.1.

* https://github.com/ihp-lab/FG-Net

Table 6.2: Within-domain evaluation in terms of F1 score (↑). Except for GH-Feat and ME-GraphAU + FFHQ pre-train, all the baseline numbers are from the original papers. Our method has competitive performance compared to the state-of-the-art.

(a) Within-domain evaluation on DISFA [111].

Methods | AU1 | AU2 | AU4 | AU6 | AU9 | AU12 | AU25 | AU26 | Avg.
DRML [212] | 17.3 | 17.7 | 37.4 | 29.0 | 10.7 | 37.7 | 38.5 | 20.1 | 26.7
IdenNet [173] | 25.5 | 34.8 | 64.5 | 45.2 | 44.6 | 70.7 | 81.0 | 55.0 | 52.6
SRERL [91] | 45.7 | 47.8 | 59.6 | 47.1 | 45.6 | 73.5 | 84.3 | 43.6 | 55.9
UGN-B [158] | 43.3 | 48.1 | 63.4 | 49.5 | 48.2 | 72.9 | 90.8 | 59.0 | 60.0
HMP-PS [159] | 38.0 | 45.9 | 65.2 | 50.9 | 50.8 | 76.0 | 93.3 | 67.6 | 61.0
FAT [72] | 46.1 | 48.6 | 72.8 | 56.7 | 50.0 | 72.1 | 90.8 | 55.4 | 61.5
Zhang et al. [209] | 55.0 | 63.0 | 74.6 | 45.3 | 35.2 | 75.3 | 93.5 | 54.4 | 62.0
JÂA-Net [148] | 62.4 | 60.7 | 67.1 | 41.1 | 45.1 | 73.5 | 90.9 | 67.4 | 63.5
PIAP [164] | 50.2 | 51.8 | 71.9 | 50.6 | 54.5 | 79.7 | 94.1 | 57.2 | 63.8
Chang et al. [26] | 60.4 | 59.2 | 67.5 | 52.7 | 51.5 | 76.1 | 91.3 | 57.7 | 64.5
ME-GraphAU [109] | 54.6 | 47.1 | 72.9 | 54.0 | 55.7 | 76.7 | 91.1 | 53.0 | 63.1
ME-GraphAU + FFHQ pre-train | 46.1 | 44.8 | 72.4 | 48.2 | 48.1 | 70.3 | 90.9 | 55.4 | 59.5
GH-Feat [195] | 16.9 | 13.8 | 39.1 | 37.1 | 16.7 | 65.0 | 78.7 | 28.1 | 36.9
Ours | 63.6 | 66.9 | 72.5 | 50.7 | 48.8 | 76.5 | 94.1 | 50.1 | 65.4

(b) Within-domain evaluation on BP4D [205].

Methods | AU1 | AU2 | AU4 | AU6 | AU7 | AU10 | AU12 | AU14 | AU15 | AU17 | AU23 | AU24 | Avg.
DRML [212] | 36.4 | 41.8 | 43.0 | 55.0 | 67.0 | 66.3 | 65.8 | 54.1 | 33.2 | 48.0 | 31.7 | 30.0 | 48.3
IdenNet [173] | 50.5 | 35.9 | 50.6 | 77.2 | 74.2 | 82.9 | 85.1 | 63.0 | 42.2 | 60.8 | 42.1 | 46.5 | 59.3
SRERL [91] | 46.9 | 45.3 | 55.6 | 77.1 | 78.4 | 83.5 | 87.6 | 63.9 | 52.2 | 63.9 | 47.1 | 53.3 | 62.9
UGN-B [158] | 54.2 | 46.4 | 56.8 | 76.2 | 76.7 | 82.4 | 86.1 | 64.7 | 51.2 | 63.1 | 48.5 | 53.6 | 63.3
HMP-PS [159] | 53.1 | 46.1 | 56.0 | 76.5 | 76.9 | 82.1 | 86.4 | 64.8 | 51.5 | 63.0 | 49.9 | 54.5 | 63.4
FAT [72] | 51.7 | 49.3 | 61.0 | 77.8 | 79.5 | 82.9 | 86.3 | 67.6 | 51.9 | 63.0 | 43.7 | 56.3 | 64.2
Zhang et al. [209] | 52.6 | 47.0 | 61.4 | 76.8 | 79.2 | 83.5 | 88.6 | 60.4 | 49.3 | 62.6 | 50.8 | 49.6 | 63.5
JÂA-Net [148] | 53.8 | 47.8 | 58.2 | 78.5 | 75.8 | 82.7 | 88.2 | 63.7 | 43.3 | 61.8 | 45.6 | 49.9 | 62.4
PIAP [164] | 54.2 | 47.1 | 54.0 | 79.0 | 78.2 | 86.3 | 89.5 | 66.1 | 49.7 | 63.2 | 49.9 | 52.0 | 64.1
Chang et al. [26] | 53.3 | 47.4 | 56.2 | 79.4 | 80.7 | 85.1 | 89.0 | 67.4 | 55.9 | 61.9 | 48.5 | 49.0 | 64.5
ME-GraphAU [109] | 52.7 | 44.3 | 60.9 | 79.9 | 80.1 | 85.3 | 89.2 | 69.4 | 55.4 | 64.4 | 49.8 | 55.1 | 65.5
ME-GraphAU + FFHQ pre-train | 51.1 | 38.8 | 57.0 | 76.8 | 78.9 | 83.2 | 88.3 | 64.1 | 44.0 | 61.5 | 44.5 | 45.2 | 61.1
GH-Feat [195] | 42.7 | 43.2 | 47.6 | 73.5 | 66.2 | 75.6 | 83.8 | 54.2 | 43.9 | 62.4 | 41.9 | 45.0 | 56.7
Ours | 52.6 | 48.8 | 57.1 | 79.8 | 77.5 | 85.6 | 89.3 | 68.0 | 45.6 | 64.8 | 46.2 | 55.7 | 64.3
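As a rough illustration of the optimization setup listed in Section 6.3.2, the loop below wires together the stated hyperparameters (AdamW, learning rate 5e−5, weight decay 5e−4, gradient clipping at 0.1, 15 epochs, MSE heatmap loss). The `model` and `loader` objects are placeholders, and clipping by global norm is an assumption, since the section does not specify the clipping variant.

```python
import torch
from torch.nn.utils import clip_grad_norm_

def train_fg_net(model, loader, epochs=15, device="cuda"):
    opt = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=5e-4)
    mse = torch.nn.MSELoss()
    model.to(device).train()
    for _ in range(epochs):
        for images, gt_heatmaps in loader:        # batch size 8 in the paper
            pred = model(images.to(device))       # (B, n_AU, H, W) detected heatmaps
            loss = mse(pred, gt_heatmaps.to(device))
            opt.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=0.1)  # gradient clipping 0.1
            opt.step()
```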
6.3.3 Experimental Results

The models are evaluated for within-domain and cross-domain performance in addition to data efficiency. Cross-domain evaluation enables us to measure the generalization ability of our AU detection method. For all the experiments, the F1 score (↑) is the evaluation metric.

Within-domain Evaluation. We perform within-domain evaluation on the widely used DISFA and BP4D datasets. We follow the same evaluation protocols as the previous studies [148, 159, 109]. Both datasets are evaluated with subject-independent 3-fold cross-validation. We use two folds for training and one fold for validation and iterate three times. We compare FG-Net to the state-of-the-art AU detection methods, including DRML [212], IdenNet [173], SRERL [91], UGN-B [158], HMP-PS [159], FAT [72], Zhang et al. [209], JÂA-Net [148], PIAP [164], Chang et al. [26], and ME-GraphAU [109]. These baseline numbers are reported from the original papers. Previous methods do not use the FFHQ dataset for training. Thus, to make the comparison fair, two baselines are implemented and compared, i.e., ME-GraphAU + FFHQ pre-train and GH-Feat [195]. Specifically, we first pre-train ME-GraphAU's backbones (ResNet and Swin Transformer) with the FFHQ dataset and its facial expression labels. Then we train ME-GraphAU with the pre-trained backbones. GH-Feat extracts features from generative models and is pre-trained on FFHQ in a self-supervised manner. For both baselines, we implement them with the officially released source code.

Table 6.2 reports the within-domain results regarding the average performance over AUs. We provide detailed results for every individual AU in the supplementary material. On DISFA, FG-Net outperforms all the baseline methods and achieves an average F1 score of 65.4. The major improvement comes from AU1 and AU2. On BP4D, FG-Net achieves competitive performance. These results demonstrate that the pixel-wise features extracted from StyleGAN2 are beneficial for heatmap-based AU detection.

Table 6.3: Cross-domain evaluation between DISFA and BP4D in terms of F1 scores (↑). Our model achieves superior performance compared to the baselines. * The numbers are reported from the original paper.

Method | DISFA → BP4D (AU 1, 2, 4, 6, 12, Avg.) | BP4D → DISFA (AU 1, 2, 4, 6, 12, Avg.)
DRML [212] | 19.4, 16.9, 22.4, 58.0, 64.5, 36.3 | 10.4, 7.0, 16.9, 14.4, 22.0, 14.1
JÂA-Net [148] | 10.9, 6.7, 42.4, 52.9, 68.3, 36.2 | 12.5, 13.2, 27.6, 19.2, 46.7, 23.8
ME-GraphAU [109] | 36.5, 30.3, 35.8, 48.8, 62.2, 42.7 | 43.3, 22.5, 41.7, 23.0, 34.9, 33.1
ME-GraphAU + FFHQ pre-train | 20.1, 32.9, 38.0, 64.0, 73.0, 45.6 | 51.2, 14.4, 54.4, 17.7, 30.6, 33.7
GH-Feat [195] | 29.4, 30.0, 37.1, 64.0, 73.5, 46.8 | 18.9, 15.2, 27.5, 52.7, 50.1, 32.9
Patch-MCD* [198] | - | 34.3, 16.6, 52.1, 33.5, 50.4, 37.4
IdenNet* [173] | - | 20.1, 25.5, 37.3, 49.6, 66.1, 39.7
Ours | 51.4, 46.0, 36.0, 49.6, 61.8, 49.0 | 61.3, 70.5, 36.3, 42.2, 61.5, 54.4

Cross-domain Evaluation. We perform two directions of cross-domain evaluation, i.e., DISFA to BP4D and BP4D to DISFA.
For each direction, we use two folds and one fold of data from the source domain as the training and validation sets, respectively, and use the target data as the testing set. We compare the proposed method with DRML [212], JÂA-Net [148], ME-GraphAU [109], ME-GraphAU + FFHQ pre-train, and GH-Feat [195], since they are open-source and we can use the officially released source code and model weights to conduct the experiments. In addition, we compare with Patch-MCD [198] and IdenNet [173]; their numbers are reported from the original papers. Ertugrul et al. [47, 48] and Hernandez et al. [66] do not report F1 scores for the cross-domain performance of the aforementioned directions.

We report the cross-domain results in Table 6.3. As expected, compared to the within-domain performance, all the baseline methods suffer a considerable performance loss when evaluated across corpora. In particular, when evaluated from BP4D to DISFA, the baseline methods' performance (average F1) drops by more than 30%, which demonstrates the challenging nature of cross-domain AU detection and the importance of developing generalizable AU detection. Compared with DRML and JÂA-Net, ME-GraphAU achieves higher cross-domain performance. We suspect it is because it utilizes pre-trained models (ResNet [65] and Swin Transformer [102]) as the backbones. In addition, when we continue pre-training ME-GraphAU with the FFHQ dataset, we observe a further performance boost in both directions of cross-domain evaluation. Similarly, GH-Feat, which is trained on the FFHQ dataset, also obtains superior performance compared to DRML and JÂA-Net. The experimental results show the effectiveness of pre-training on the FFHQ dataset since it is a large and diverse facial image dataset.
Moreover, Patch-MCD utilizes unsupervised domain adaptation with unlabeled target data, while IdenNet is jointly trained on AU detection and face clustering datasets (CelebA [103]). Thus, with additional face data, these two methods have better cross-domain performance than the aforementioned baselines. For both directions of cross-domain evaluation, our proposed method achieves superior performance compared to the baselines. Specifically, when evaluated from BP4D to DISFA, FG-Net outperforms the baselines by 15% in terms of the average F1 score. The major improvement comes from AU1 and AU2, which is consistent with the findings in the within-domain evaluation. Overall, the results showcase that features extracted from the StyleGAN2 generator are generalizable and thus improve the performance of cross-domain AU detection, showing its potential to solve AU detection in a real-life scenario.

We present two qualitative examples of cross-domain prediction in Figure 6.6. The models are trained on the BP4D dataset and tested on the DISFA dataset. ME-GraphAU fails in those two cases while the proposed FG-Net method accurately predicts the action units.

Figure 6.6: Case analysis on ME-GraphAU [109] and FG-Net. Models are trained on BP4D and tested on DISFA. Orange means active AU while blue means inactive AU. FG-Net is more accurate than ME-GraphAU.

AU Intensity Estimation. Unlike AU detection, which is a binary classification problem, intensity estimation provides a discrete regression from input face images. FG-Net addresses the AU detection problem using a heatmap regression, which can be extended to AU intensity estimation. Specifically, following [121], the peak value of the heatmap is the corresponding AU intensity (ranging from 0 to 5), and we take the maximum of each heatmap as the predicted AU intensity. The results on DISFA are shown in Table 6.4. The evaluation metrics are mean squared error (MSE ↓) and mean absolute error (MAE ↓). The results show that FG-Net also achieves competitive performance compared to the state-of-the-art for AU intensity estimation.

Table 6.4: AU intensity estimation on DISFA [111] in terms of MSE (↓) and MAE (↓). Our method has competitive performance compared to the state-of-the-art.

Metric | Method | AU1 | AU2 | AU4 | AU5 | AU6 | AU9 | AU12 | AU15 | AU17 | AU20 | AU25 | AU26 | Avg.
MSE | 2DC [98] | 0.32 | 0.39 | 0.53 | 0.26 | 0.43 | 0.30 | 0.25 | 0.27 | 0.61 | 0.18 | 0.37 | 0.55 | 0.37
MSE | HR [121] | 0.41 | 0.37 | 0.70 | 0.08 | 0.44 | 0.30 | 0.29 | 0.14 | 0.26 | 0.16 | 0.24 | 0.39 | 0.32
MSE | APs [143] | 0.68 | 0.59 | 0.40 | 0.03 | 0.49 | 0.15 | 0.26 | 0.13 | 0.22 | 0.20 | 0.35 | 0.17 | 0.30
MSE | Ours | 0.25 | 0.21 | 0.47 | 0.07 | 0.35 | 0.20 | 0.27 | 0.15 | 0.27 | 0.16 | 0.23 | 0.40 | 0.25
MAE | KJRE [207] | 1.02 | 0.92 | 1.86 | 0.70 | 0.79 | 0.87 | 0.77 | 0.60 | 0.80 | 0.72 | 0.96 | 0.94 | 0.91
MAE | CCNN-IT [189] | 0.73 | 0.72 | 1.03 | 0.21 | 0.72 | 0.51 | 0.72 | 0.43 | 0.50 | 0.44 | 1.16 | 0.79 | 0.66
MAE | KBSS [206] | 0.48 | 0.49 | 0.57 | 0.08 | 0.26 | 0.22 | 0.33 | 0.15 | 0.44 | 0.22 | 0.43 | 0.36 | 0.33
MAE | SCC-Heatmap [51] | 0.16 | 0.16 | 0.27 | 0.03 | 0.25 | 0.13 | 0.32 | 0.15 | 0.20 | 0.09 | 0.30 | 0.32 | 0.20
MAE | Ours | 0.19 | 0.16 | 0.36 | 0.06 | 0.31 | 0.17 | 0.32 | 0.18 | 0.27 | 0.15 | 0.34 | 0.41 | 0.25

Data Efficiency Evaluation. To further evaluate the generalization capacity of the proposed approach, an investigation of its learning capability with limited samples is conducted through within-domain evaluation. In this evaluation, a subset of the training data is randomly selected, while the testing data remains unchanged to facilitate assessment. The model is trained using four different sample sizes: 100, 1k, 5k, and 10k. A comparative analysis is performed between our method and two other approaches, namely ME-GraphAU [109] and ME-GraphAU + FFHQ pre-train. It is noteworthy that ME-GraphAU + FFHQ pre-train and our method employ the same pre-training dataset.

Figure 6.7: Data efficiency evaluation with different numbers of samples: (a) evaluation on DISFA; (b) evaluation on BP4D. Our method is data-efficient and its performance when trained on 1k samples is close to that of the whole set.

The efficiency evaluation results, depicted in Figure 6.7, demonstrate the impact of data scarcity on performance for both datasets. Notably, ME-GraphAU [109] exhibits remarkably low F1 scores when trained with 100 and 1k samples on the DISFA dataset, as well as with 100 samples on the BP4D dataset.
This outcome can be attributed to the limited and sparse nature of the training set, causing ME-GraphAU to predict inactive AUs predominantly. By contrast, the performance of ME-GraphAU improves when pre-trained on the FFHQ dataset, underscoring the effectiveness of utilizing this extensive and diverse facial dataset for pre-training. However, even with 100 samples from the DISFA dataset, the performance of ME-GraphAU remains at 0. In comparison, FG-Net outperforms ME-GraphAU + FFHQ pre-train when trained with partial training data for both datasets. Notably, FG-Net trained on 1k samples achieves performance levels approaching those of the full training set. Furthermore, even with a mere 100 training samples, FG-Net manages to achieve commendable performance. These results serve as evidence of the robust generalization ability exhibited by our proposed method when confronted with limited data.

Ablation Study. We conduct three ablation experiments: (i) We compare to the existing upscaling and concatenating of features proposed in [208, 9] (upscale & concat). (ii) We directly compare to using the latent code to predict the activations of AUs (latent code). (iii) We explore the best blocks for extracting feature maps. Specifically, we divide the features extracted from the nine synthesis blocks into three groups, where each group has three feature maps, and denote them as the early, middle, and late groups. Each time, we remove one group. We perform both within- and cross-domain evaluations for the ablation study. Note that for within-domain evaluation, we use two folds for training and one fold for validation.

Table 6.5: Ablation study for FG-Net. F1 score (↑) is the metric. D and B stand for DISFA and BP4D. D → B means the model is trained on DISFA and tested on BP4D, and similarly for B → D. (i) Our method gets better performance than GH-Feat [195]. (ii) With every component, our method achieves the highest within-domain performance, while removing late features gets the best cross-domain performance.

Method | D | B | Avg. | D → B | B → D | Avg.
Upscale & concat | 64.2 | 62.7 | 63.4 | 42.5 | 35.9 | 39.2
Latent code | 68.4 | 58.8 | 63.6 | 46.4 | 47.3 | 46.9
- Early | 68.1 | 61.7 | 64.9 | 37.9 | 47.2 | 42.6
- Middle | 67.4 | 63.1 | 65.3 | 48.5 | 38.0 | 43.3
- Late | 67.4 | 62.8 | 65.1 | 51.2 | 56.6 | 53.9
FG-Net | 68.9 | 63.6 | 66.3 | 49.0 | 54.4 | 51.7

Table 6.5 shows the within- and cross-domain performance on DISFA and BP4D. (i) We observe that FG-Net outperforms Upscale & concat for both within- and cross-domain settings. We believe inference with single-pixel features lacks the inductive bias of considering local features, which is necessary for AU detection. (ii) FG-Net outperforms latent code for predicting AU activations for both within- and cross-domain experiments. We think using the heatmap regression allows the model to localize where the AUs occur and improves the model's capacity. In addition, compared with the 2D feature maps, the latent codes lose the semantic-rich representations. (iii) For the contributions of different feature maps, we observe that removing any one of the feature groups lowers the within-domain performance. Surprisingly, removing late features achieves the highest cross-domain performance.
We suspect it is because the late features contain more high-frequency and domain-specific information, which reduces the model's generalization ability.

We visualize the ground-truth and detected heatmaps for the ablation study in Figure 6.8. For the within-domain evaluation, models are trained and tested with BP4D; for the cross-domain evaluation, models are trained with BP4D and tested with DISFA. For latent code, we directly use it to predict the AU activations; thus, we do not have detected heatmaps for latent code. For within-domain evaluation, FG-Net detects all AUs correctly, whereas the other methods output the wrong prediction for AU2 (outer brow raiser), showing that FG-Net achieves the best within-domain performance with every component. For cross-domain evaluation, both using all features and removing late features detect all AUs correctly. However, removing late features results in a more accurate heatmap for AU12 than using all features.

Figure 6.8: Visualization of the detected heatmaps for the ablation study. With all the components, FG-Net detects the heatmaps most similar to the ground-truth (GT) for within-domain evaluation. Removing late features results in the best cross-domain evaluation.

6.3.4 Limitations

In the within-domain evaluation, FG-Net achieves inferior results on AU9 (nose wrinkler), AU15 (lip corner depressor), and AU26 (jaw drop). Failure cases are shown in Figure 6.9. We suspect it is because the FFHQ dataset lacks faces with such active AUs, and thus the StyleGAN2 features cannot capture the corresponding information well. In addition, these failure AUs are not common in DISFA and BP4D, thus they do not appear in the cross-domain evaluations and we cannot evaluate the generalization for them.

Figure 6.9: Visualization of the failure cases. FG-Net achieves inferior performance on AU9, AU15, and AU26.

FG-Net addresses the AU detection problem using a heatmap regression. Though our method can be extended to AU intensity estimation, there are only three common AUs for intensity estimation between BP4D and DISFA (6, 12, and 17), with no AU on the eyebrows. Thus, we cannot properly evaluate the generalization ability of FG-Net for AU intensity estimation.

6.4 Conclusions

In this chapter, we propose FG-Net, a data-efficient method for generalizable facial action unit detection. FG-Net extracts generalizable and semantic-rich features from the generative model. A Pyramid-CNN Interpreter is proposed to detect AUs coded as heatmaps, which makes the training efficient and captures essential information from nearby regions. The experimental results demonstrate the challenging nature of cross-domain AU detection and the importance of developing generalizable AU detection. We show that the proposed FG-Net method has a strong generalization ability when evaluated across corpora or trained with limited data, demonstrating its strong potential to solve action unit detection in a real-life scenario.

Social Implications. Our work falls within the broad domain of facial expression analysis. Despite potential benefits, any surveillance technology can be misused, and sensitive private information may be revealed by malicious actors. Mitigation strategies for such misuses include restrictive licensing and government regulations.

Chapter 7

Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition

This chapter is co-authored with Minh Tran.

7.1 Introduction

With the ubiquity of voice assistive technologies, speech emotion recognition (SER) is becoming increasingly important as it allows for a more natural and intuitive interaction between humans and machines.
Although SER technology has made significant progress in recent years, accurately detecting emotions from speech remains a challenging task. This is partly due to the vast variability in how people express their feelings through speech, which can depend on culture [163], gender [88], or age [115], among others. Personalization is a promising solution to address the variability of emotional expression in speech. By tailoring emotion recognition systems to match individuals' unique expressive behaviors, the approach can lead to a more robust and inclusive model that is better equipped to accurately detect emotions for a wide range of users.

Existing studies on personalized emotion recognition generally use hand-crafted features of speech on datasets with a small number of speakers (ten or fewer) [29, 79, 75, 186, 8]. Recently, SER systems have achieved state-of-the-art results [161, 32] via fine-tuning large pre-trained speech encoders such as HuBERT [69] or wav2vec2.0 [5]. This raises three important questions: (1) What happens to the personalization gap as the number of speakers increases for fine-tuned encoders? (2) How do existing personalization methods behave when the input speech features are not fixed? (3) How can we incorporate personalization with pre-trained encoders to boost performance?

In this chapter, we perform extensive experiments on the MSP-Podcast corpus [108] with more than 1,000 speakers to answer these questions. We first show that as the number of speakers increases, the personalization gap (the performance difference between speaker-dependent and speaker-independent models) of fine-tuned models decreases, which motivates the need for methods that adapt the pre-trained weights for personalization prior to fine-tuning. Hence, we propose to continue the pre-training process of the speech encoder jointly with speaker embeddings (see Figure 7.2 (a)). We also introduce a simple yet effective unsupervised personalized calibration step to adjust the label distribution per speaker for better accuracy (see Figure 7.2 (b)). The proposed methods are unsupervised, requiring no prior knowledge of the test labels. Experimental results on arousal and valence estimation show that the proposed methods achieve state-of-the-art results for valence estimation while consistently outperforming the encoder fine-tuning baseline and a recent personalization method evaluated on the same dataset [160].

The major contributions of this chapter are as follows.

• We propose a method for personalized adaptive pre-training to adjust existing speech encoders for a fixed set of speakers.
• We propose an unsupervised personalized post-inference technique to adjust the label distributions.
• We provide extensive experimental results along with an ablation study to demonstrate the effectiveness of the methods.
• We further show that our methods can be extended to unseen speakers without the need to re-train any component, achieving superior performance compared to the baselines.

7.2 Method

7.2.1 Problem Formulation

Unsupervised personalized speech emotion recognition: We are given a speech dataset D = \{(u_i, y_i, s_i)\}_{i=1}^N containing N utterances with emotion labels (arousal or valence) and speaker IDs. We assume access to all information except for the emotion labels of the test set during the training phase. Our goal is to produce a robust emotion recognition model that performs better than a model exposed to the same amount of data excluding speaker ID information.
We further want our method to be extensible to new speakers outside of D.

7.2.2 Dataset

We use the MSP-Podcast corpus [108] as our dataset D. MSP-Podcast is the largest corpus for speech emotion recognition in English, containing emotionally rich podcast segments retrieved from audio-sharing websites. Each utterance in the dataset is annotated using crowd-sourcing with continuous labels of arousal, valence, and dominance, along with categorical emotions. In this chapter, we focus on arousal and valence estimation. The labels range from 1 to 7. The dataset contains pre-defined train, validation, and test sets, namely D_tr, D_val, and D_te, which are subject independent. We use two versions of the dataset, namely v1.6 and v1.10, for the experiments. To be consistent with prior studies [97, 160, 161], most of our experiments are based on MSP-Podcast v1.6. We remove all the utterances marked with "Unknown" speakers in accordance with our problem formulation. Following Sridhar et al. [160], we split the test set into two subsets, test-a and test-b, that share the same set of speakers. Each speaker in test-a contributes 200s of speech in total, while test-b contains the rest of the recordings. test-a is used to train speaker-dependent models along with the train set. For experiments on unseen speakers, we evaluate the models on the speakers who are in the v1.10 test set but not in the v1.6 test set, namely test-c. Table 7.1 provides the details and statistics of our splits for the MSP-Podcast dataset.

Table 7.1: Details and statistics of our splits for MSP-Podcast.

Split | train | validation | test-a | test-b | test-c
# utterances | 26470 | 5933 | 1684 | 8434 | 7304
# speakers | 987 | 41 | 50 | 50 | 62
Total duration | 44.2h | 10h | 2.9h | 13.8h | 9.5h
Corpus version | v1.6 | v1.6 | v1.6 | v1.6 | v1.10

7.2.3 Pre-trained Speech Encoder

In this chapter, we use HuBERT [69] as our pre-trained encoder E due to its superior performance [161, 188]. HuBERT consists of two main components, namely a 1D CNN and a Transformer encoder [184]. The 1D CNN takes raw waveforms as inputs and returns low-level feature representations of speech. Then, the features are passed into the Transformer encoder to generate high-level feature representations via the self-attention mechanism. During the pre-training process, HuBERT first generates pseudo-labels by performing K-means clustering on pre-extracted features, e.g., MFCCs. Then, the model learns in a self-supervised manner through the task of predicting pseudo-labels for randomly masked frames. Therefore, the pre-training loss L_pt for HuBERT can be defined as the sum of the cross-entropy loss computed over the masked frames.

7.2.4 Personalization Gap

To motivate our proposed methodology, we investigate the potential gain from personalization of fine-tuned HuBERT on valence regression (the dimension with the most potential gain from personalization, as demonstrated by Sridhar et al. [160]). In particular, we first create subsets D^k_tr of D_tr with k speakers, where k ∈ {50, 100, 250, 500, 987}. Speaker-independent models with k speakers are trained on the D^k_tr sets. To make the results stable, we ensure that D^i_tr ⊂ D^j_tr for all i < j. For speaker-dependent models with k speakers, we randomly remove 50 speakers (the number of speakers in test-a) from D^k_tr to get D̂^k_tr, and fine-tune the models on D̂^k_tr ∪ test-a.
For all experiments, we fine-tune the HuBERT-base encoder on the generated sets and report the performance on test-b. Since test-a and test-b share the same set of speakers, we consider the performance of speaker-dependent models to loosely correlate with the performance of supervised personalization methods, and hence, the performance gap between speaker-dependent and speaker-independent models captures the potential gain from personalization. Figure 7.1 demonstrates the inverse relationship between k and the performance gap. The evaluation metric is the Concordance Correlation Coefficient (CCC ↑). It suggests that given sufficiently large and diverse training data, the pre-trained encoders become robust enough to learn both the general emotional patterns and the unique characteristics of different groups of speech expressions, such that supervised training of the model on the test speakers leads to marginal gains.

Figure 7.1: Performance gap between speaker-dependent and speaker-independent models for valence estimation with varying numbers of training speakers.

Hence, to enhance the performance of the pre-trained encoders for a target speaker, we can: (1) make the input data personalized (pre-processing); (2) modify the weights of the pre-trained encoder for the target speaker; or (3) adjust the label predictions to be more personalized (post-processing). Existing studies on personalized SER, e.g., [79, 8, 160], focus on the first approach. In this chapter, we explore the other two alternatives.

Figure 7.2: Overview of our proposed method. (a) Personalized Adaptive Pre-Training (PAPT) pre-trains the HuBERT encoder with learnable speaker embeddings in a self-supervised manner. (b) Personalized Label Distribution Calibration (PLDC) finds similar training speakers and calibrates the predicted label distribution with the training label statistics.

7.2.5 Performance Variance Across Speakers

Though simply fine-tuning HuBERT achieves promising overall performance on MSP-Podcast for speech emotion recognition, Wagner et al. [188] find that there is a huge variance across the per-speaker performance. We investigate whether the performance variance is due to feature shift or label shift. Specifically, to measure the feature and label shift for each target speaker, we calculate the KL divergence between the feature and label distributions of the target speaker and those of the whole training set. Then we calculate the Pearson correlation coefficient (PCC) between the feature/label shift and the speaker performance. For arousal estimation, we find that the PCC between the feature shift and the regression performance is −0.714, while the PCC between the label shift and performance is −0.502. The results suggest that both feature and label shifts contribute to the performance variance. Moreover, the correlation between the feature shift and label shift is 0.285, which suggests the potential of using features to detect and remove label shifts.
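The sketch below illustrates the shift analysis of Section 7.2.5. How the distributions are estimated is not specified in the text, so discretizing the continuous labels into shared histograms before computing the KL divergence is an assumption; `scipy` provides both the KL divergence and the Pearson correlation.

```python
import numpy as np
from scipy.stats import entropy, pearsonr

def label_shift(speaker_labels, train_labels, bins=20, eps=1e-8):
    """KL divergence between a speaker's label distribution and the training
    label distribution, with both continuous label sets binned into a shared
    histogram (binning scheme is an assumption)."""
    lo = min(train_labels.min(), speaker_labels.min())
    hi = max(train_labels.max(), speaker_labels.max())
    p, _ = np.histogram(speaker_labels, bins=bins, range=(lo, hi))
    q, _ = np.histogram(train_labels, bins=bins, range=(lo, hi))
    return entropy(p + eps, q + eps)        # scipy normalizes and returns KL(p || q)

def shift_performance_corr(per_speaker_shift, per_speaker_ccc):
    """Pearson correlation between the per-speaker shift and per-speaker CCC."""
    r, _ = pearsonr(per_speaker_shift, per_speaker_ccc)
    return r
```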
7.2.6 Personalized Adaptive Pre-training (PAPT)

Inspired by prior work on task-adaptive pre-training [62] and the problem of feature shift described above, we propose to perform adaptive pre-training on D = \{(u_i, s_i)\}_{i=1}^N along with trainable speaker embeddings in a self-supervised manner. Specifically, in addition to the original speech encoder E, we train a speaker embedding network S to extract the speaker embedding e_i = S(s_i) ∈ R^d, where d is the embedding size of the Transformer. Then, the speaker embedding e_i is summed with the utterance feature f_i = E(u_i) to get a personalized feature representation f^p_i = f_i + e_i. For personalized pre-training, f^p_i is used to compute the pre-training loss (cross-entropy) on pseudo-label prediction for masked frames:

L_{pt} = -\sum_{i=1}^{N_b} \sum_{t=1}^{M_i} \log P(l_{it} \mid f^p_{it}),   (7.1)

where N_b is the number of utterances in the batch, M_i is the number of masked frames for utterance u_i, and l_{it} denotes the pseudo-label for the t-th masked frame in utterance u_i. For ER downstream tasks, we reduce the temporal dimension of f^p_i by mean-pooling and feed the output to a fully-connected layer to produce the label predictions.

7.2.7 Personalized Label Distribution Calibration (PLDC)

Motivated by the problem of label distribution shift described above, we further want to add a personalized post-inference technique to correct the predicted label distributions. Specifically, given the predictions for a target speaker, the main idea is to identify the most similar speakers from the train set based on feature similarity and use their label distribution statistics (means and standard deviations) to calibrate the predicted label distribution of the target test speaker. In particular, for speaker s in both the train and test set, we extract the features for each utterance of s and average them to form the speaker vector

v_s = \frac{\sum_{k=1}^{N_s} \bar{E}^p_{ft}(u^k_s)}{N_s},   (7.2)

where E^p_{ft} denotes the ER-fine-tuned model of E^p (the personalized adapted version of E), \bar{E}^p_{ft}(u^k_s) denotes the mean-pooled vector representation for utterance u^k_s, and N_s is the number of utterances from speaker s. Then, for each speaker in the test set, we retrieve the top-k most similar speakers in the train set based on the cosine similarity between the speaker vectors. Next, we average the label distribution statistics from the retrieved speakers to get an estimate of the mean \bar{\mu} and standard deviation \bar{\sigma}. Finally, each predicted label y for the target speaker is shifted as

\tilde{y} = \frac{y - \mu}{\sigma} \times \bar{\sigma} + \bar{\mu},   (7.3)

where µ and σ are the mean and standard deviation of the predicted label distribution. Optionally, if we want to only shift the mean or the standard deviation, we can replace \bar{\mu} with µ or \bar{\sigma} with σ in the above equation, respectively.
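A minimal sketch of the PLDC procedure (Equations 7.2–7.3), assuming the speaker vectors and the per-speaker label statistics of the training speakers have already been pre-computed; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def speaker_vector(utterance_feats):
    """Eq. 7.2: average the mean-pooled utterance representations of one speaker.
    utterance_feats: list of (D,) tensors, one per utterance."""
    return torch.stack(utterance_feats).mean(dim=0)

def calibrate(preds, train_vectors, train_stats, test_vector, k=5,
              shift_mean=False, shift_std=True):
    """PLDC: retrieve the top-k most similar training speakers by cosine
    similarity and rescale the predicted label distribution (Eq. 7.3).

    preds:         (N,) predicted labels for the target test speaker.
    train_vectors: (S, D) speaker vectors for the S training speakers.
    train_stats:   (S, 2) per-speaker label mean and std from the train set.
    """
    sims = F.cosine_similarity(test_vector.unsqueeze(0), train_vectors, dim=1)
    top = sims.topk(k).indices
    mu_bar, sigma_bar = train_stats[top].mean(dim=0)   # averaged retrieved statistics
    mu, sigma = preds.mean(), preds.std()
    new_mu = mu_bar if shift_mean else mu              # optional mu-only / sigma-only shift
    new_sigma = sigma_bar if shift_std else sigma
    return (preds - mu) / sigma * new_sigma + new_mu
```

The default of shifting only σ reflects the finding reported later in Table 7.2 that σ shifting is the most reliable variant.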
7.3 Experiments

7.3.1 Implementation and Training Details

We perform adaptive pre-training for ten epochs using the Adam optimizer with a linear learning rate scheduler (5% warm-up and a maximum learning rate of 1e−5) on a single NVIDIA Quadro RTX8000 GPU. The models are adaptively pre-trained on the combination of the official train, validation, and test-b sets and validated on test-a. All other settings are identical to HuBERT's pre-training configurations. For downstream fine-tuning experiments, we add a light interpreter on top of the HuBERT encoder to process the mean-pooled extracted representations. The interpreter consists of two fully-connected layers of size {128, 32} with ReLU activation, 1D BatchNorm, and a dropout ratio of 0.1 in-between the layers. The downstream models are fine-tuned for at most ten epochs using the Adam optimizer (5e−5 learning rate) with early stopping. Following prior work [160], the models are optimized with a CCC loss L_CCC = 1 − CCC for arousal and valence estimation. All of our experiments are performed with the HuBERT-large architecture, except for the personalization gap experiments, as the model used to generate the pseudo-labels for HuBERT-base is not publicly available. We report two evaluation metrics, namely the Overall CCC (O-CCC), which concatenates the predictions on all test speakers before computing a single CCC score for the test set, and A-CCC, which denotes the average of the CCC scores computed for each test speaker.

7.3.2 Baselines

We compare our method to three baselines: (1) Vanilla-FT, in which E is fine-tuned on D_tr; (2) B2, the data weighting method proposed by Sridhar et al. [160]; and (3) Task-Adaptive Pre-Training (TAPT), in which the encoder E is continually pre-trained on D for ten epochs.

7.3.3 Experimental Results on test-b

Table 7.2: Evaluations on MSP-Podcast (test-b) in terms of CCC (↑). O-CCC refers to the overall CCC between the prediction and ground truth. A-CCC denotes the average CCC over the test speakers. Numbers in brackets are the standard deviations calculated across speakers. Our proposed PAPT-FT achieves superior performance compared to the baselines.

Method | Arousal O-CCC | Arousal A-CCC | Valence O-CCC | Valence A-CCC
Vanilla-FT | 0.712 | 0.512 | 0.607 | 0.514
B2 | 0.735 | 0.517 | 0.650 | 0.569
TAPT-FT | 0.717 | 0.518 | 0.630 | 0.542
PAPT-FT | 0.740 | 0.531 (0.177) | 0.663 | 0.569 (0.133)
+ µ shift | 0.722 | 0.528 (0.190) | 0.660 | 0.566 (0.134)
+ σ shift | 0.732 | 0.541 (0.168) | 0.662 | 0.578 (0.131)
+ (µ, σ) shift | 0.713 | 0.540 (0.178) | 0.657 | 0.575 (0.131)

Table 7.2 shows the comparison between our proposed methods and the baselines on MSP-Podcast. Compared to the best-performing baselines, our methods achieve superior performance on both arousal and valence estimation, with gains of 0.023 and 0.009 on arousal and valence A-CCC, respectively. Notably, we achieve state-of-the-art results for the task of valence estimation, in which our Overall-CCC score reaches 0.665 (on the whole test set of MSP-Podcast v1.6) compared to 0.627 as reported by Srinivasan et al. [161]. When using PLDC, we observe a significant increase in A-CCC, which suggests performance improvement for individual speakers. However, we can also see that as A-CCC improves with PLDC, O-CCC generally decreases. We attribute this to the high variance in the number of utterances per speaker in the test set. Furthermore, Table 7.2 also demonstrates that PLDC consistently achieves the best performance when we only perform σ shifting, while µ shifting often reduces both A-CCC and O-CCC. We hypothesize that it is more difficult to estimate the mean than the (high) variance for a speaker with a wide range of arousal/valence labels.
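For reference, the CCC loss used to fine-tune the downstream models (L_CCC = 1 − CCC) can be written in a few lines of PyTorch; the epsilon term is a numerical-stability assumption rather than part of the definition.

```python
import torch

def ccc_loss(pred, target, eps=1e-8):
    """1 - Concordance Correlation Coefficient for a batch of predictions.
    pred, target: 1-D tensors of arousal or valence values."""
    pred_mu, tgt_mu = pred.mean(), target.mean()
    pred_var = pred.var(unbiased=False)
    tgt_var = target.var(unbiased=False)
    cov = ((pred - pred_mu) * (target - tgt_mu)).mean()
    ccc = 2.0 * cov / (pred_var + tgt_var + (pred_mu - tgt_mu) ** 2 + eps)
    return 1.0 - ccc
```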
7.3.4 Evaluations on Unseen Speakers

We further validate the robustness of our method on unseen speakers (test-c). We directly make inferences with E^p_{ft} on test-c without re-training any components. Specifically, for each utterance from an unseen speaker, we provide E^p_{ft} with a training speaker embedding as a proxy for the unseen speaker. We apply the same strategy used in our PLDC module, in which we compute a vector representation for the current unseen speaker and each speaker in the train set as in Equation 7.2. However, we use the original pre-trained encoder E instead of E^p_{ft}, as the model cannot extract a feature representation for the current (unseen) speaker without a proxy speaker. We then use the (seen) speaker in the train set with the highest similarity score as a proxy for the current speaker. The retrieved proxy speakers can later be used for the PLDC module to further boost prediction performance, as demonstrated in Table 7.3. Our proposed methods outperform the baselines by a significant margin, with gains of up to 0.030 and 0.026 on A-CCC for arousal and valence estimation, respectively. It is important to note that the B2 method [160] is not applicable in this case, as it would require re-adjustment of the data sample weights given the new speakers, which requires re-training the model.

Table 7.3: Evaluations on unseen speakers (test-c).

Method | Arousal O-CCC | Arousal A-CCC | Valence O-CCC | Valence A-CCC
Vanilla-FT | 0.360 | 0.263 | 0.243 | 0.290
TAPT-FT | 0.384 | 0.267 | 0.339 | 0.306
PAPT-FT | 0.398 | 0.280 | 0.321 | 0.322
+ µ shift | 0.374 | 0.284 | 0.299 | 0.320
+ σ shift | 0.386 | 0.294 | 0.320 | 0.332
+ (µ, σ) shift | 0.363 | 0.297 | 0.301 | 0.329

7.3.5 Ablation Study

Table 7.4: Effect of different speaker embedding fusion positions.

Metric | Last | First | Prefix | None
L^{val}_{pt} (↓) | 2.78 | 2.85 | 2.81 | 3.15
A-CCC (↑) | 0.531 | 0.519 | 0.528 | 0.512

Table 7.4 shows the experimental results for arousal estimation on test-b of fine-tuned encoders (without PLDC) adaptively pre-trained with different fusion positions of the speaker embeddings. In particular, Last refers to our proposed setting, in which the speaker embeddings are added to the output of the Transformer encoder; First refers to the speaker embeddings being added to the inputs of the first layer of the Transformer encoder; and Prefix refers to the setting in which the speaker embeddings are concatenated as prefixes to the inputs of the Transformer encoder. None refers to the vanilla HuBERT encoder. We also provide L^{val}_{pt}, the best pre-training loss on the validation set, i.e., test-a, during the PAPT phase. We find that Last provides the best results.

7.4 Conclusions

In this chapter, we propose two methods to adapt pre-trained speech encoders for personalized speech emotion recognition, namely PAPT, which jointly pre-trains speech encoders with speaker embeddings to produce personalized speech representations, and PLDC, which performs distribution calibration for the predicted labels based on retrieved similar speakers. We validate the effectiveness of the proposed techniques via extensive experiments on the MSP-Podcast dataset, in which our models consistently outperform strong baselines and reach state-of-the-art performance for valence estimation. We further demonstrate the robustness of the personalized models for unseen speakers.

Chapter 8

SetPeER: Set-based Personalized Emotion Recognition with Weak Supervision

This chapter is co-authored with Minh Tran.

8.1 Introduction

Emotion recognition is a foundational block for developing socially intelligent AI, with its applications spanning various domains from healthcare to user satisfaction assessment. Over recent years, there has been notable progress in the field, driven by advancements in deep learning and multimodal data processing [169, 75, 161, 146]. Despite these advances, there are challenges in building robust and generalizable emotion recognition, including the intricacies of cross-modal fusion and the scarcity of labeled data.
One of the main challenges in emotion recognition is the inherent variability and subjectivity of emotional expressions, making a general-purpose emotion recognition model fail to perform consistently across a wide range of speakers [187]. Emotions are multifaceted constructs shaped by various factors, including culture, upbringing, personality, and situational context. As such, designing algorithms capable of accurately discerning and categorizing emotions across diverse individuals and contexts poses a formidable task. To address the issue, there have been several works exploring personalized emotion recognition for visual and speech tasks [146, 132, 169, 160]. For example, Shahabinejad et al. [146] jointly train a face recognition and a visual emotion recognition model, enabling the model to learn personalized emotion representations. Sridhar et al. [160] propose a speaker matching method to find the most similar speakers in a fixed training set to use as data augmentation to train personalized speech emotion recognition systems. However, most of these methods do not apply to other modalities, are not extensible to unseen speakers, or require model re-training for new speakers, significantly reducing their utility.

Another significant obstacle in emotion recognition, particularly for personalized approaches, stems from the scarcity of appropriate data. Prior efforts in personalized emotion recognition have predominantly focused on the speech modality due to data availability. However, these approaches were mainly trained and evaluated on a limited number of speakers [29, 79, 75, 186]. While recent advancements, such as the development of large pre-trained models like HuBERT [69] or Wav2Vec2 [5], have narrowed the personalization gap, challenges persist in the availability of comprehensive datasets for personalized visual emotion recognition. Although databases like MSP-Podcast [108] have been utilized for personalized speech emotion recognition [160, 169], they often lack consistent labeling across speakers and do not support personalized visual emotion recognition due to their unimodality. As shown in Table 8.1, commonly used emotion recognition databases suffer from limitations such as a small number of speakers or insufficient samples per speaker. These constraints not only hinder progress in developing personalized emotion recognition systems but also pose significant challenges in personalized systems evaluation.

This chapter addresses the aforementioned challenges in personalized emotion recognition. From the modeling perspective, we introduce a novel approach called Set-based Personalized Representation Learning for Emotion Recognition (SetPeER). This model is designed to extract personalized, contextual information from as few as eight samples per speaker. Notably, SetPeER exhibits versatility across different modalities by merely adjusting the backbone encoder architecture, e.g., HuBERT [69] for audio and VideoMAE [167] for vision, and remains effective on unseen speakers without requiring any retraining of components. At the core of SetPeER is a Personalized Feature Extractor module P that encodes a set of utterances from the same speaker into meaningful speaker embeddings. These embeddings are then injected into pre-trained encoders to generate personalized features, thereby enhancing the model's ability to capture individual nuances in emotional expression.

Regarding data, we develop a framework to weakly label in-the-wild audiovisual videos.
Specifically, we use pre-trained models for text, vision, or audio-based emotion recognition to assign weak labels to a target modality from the remaining two modalities (text and audio, or text and vision) on a large dataset of unlabeled data with a large number of speakers. To improve label quality, we only keep the utterances whose labels agree between the two source modalities for training emotion recognition models for the third modality. With the scalability of the proposed weak labeling approach, we introduce EmoCeleb-A and EmoCeleb-V, two large-scale datasets with substantially more speakers and samples per speaker than existing emotion recognition datasets.

Through extensive experiments, we validate the usefulness of EmoCeleb-A and EmoCeleb-V. First, we demonstrate the superior performance of our proposed weak labeling pipeline compared to random guessing. Moreover, our findings indicate that models trained with our dataset can surpass those trained with human-annotated data in zero-shot out-of-domain evaluations, underscoring the role of scalability and diversity in enhancing generalization capabilities. We further use EmoCeleb-A and EmoCeleb-V to both train and evaluate SetPeER, alongside established emotion recognition datasets, namely MSP-Podcast [108] for audio and MSP-Improv [22] for vision tasks. The comprehensive evaluation validates the effectiveness of our proposed model in comparison to existing personalized emotion recognition approaches.

The major contributions of this chapter are as follows.

• We use cross-modal labeling to create a large-scale weakly-labeled emotion dataset, i.e., EmoCeleb. The dataset is more than 150 hours long and contains around 1,500 speakers with at least 50 utterances for each speaker.
• We propose a novel personalization method with set learning. The model learns a representative speaker embedding with only eight unlabeled utterances for a new speaker.
• Extensive experiments show our generated dataset's validity and utility. Experiments also demonstrate effectiveness in personalized emotion recognition.

Table 8.1: Comparison of EmoCeleb with previous emotion recognition datasets. Mod indicates the available modalities, (a)udio, (v)ision, and (t)ext. TL denotes the total number of hours. # U and # S denote the number of utterances and speakers, respectively. Our datasets are larger and have more speakers, with at least 50 utterances per speaker.

Dataset | Mod | TL (h) | # U | # S | Per-Speaker Mean | Per-Speaker Median
RAVDESS [105] | {a,v} | 1.5 | 1.4K | 24 | 60 | 60
AFEW [41] | {a,v} | 2.5 | 1.6K | 0.3K | 5 | -
HUMAINE [43] | {a,v} | 4 | 50 | 4 | 13 | -
RECOLA [138] | {a,v} | 4 | 46 | 46 | 1 | 1
SEWA [87] | {a,v} | 5 | 0.5K | 0.4K | 1 | 1
SEMAINE [113] | {a,v} | 7 | 0.3K | 21 | 13 | 6
CREMA-D [23] | {a,v} | 8 | 7.4K | 91 | 82 | 82
MSP-Improv [22] | {a,v} | 9 | 8.4K | 12 | 0.7K | 0.7K
VAM [59] | {a,v} | 12 | 0.5K | 20 | 25 | -
IEMOCAP [20] | {a,v} | 12 | 10K | 10 | 1.0K | 1.0K
MSP-Face [185] | {a,v} | 25 | 9.4K | 0.4K | 23 | 15
CMU-MOSEI [201] | {a,v,t} | 66 | 24K | 1.0K | 24 | 4
MSP-Podcast [108] | {a,t} | 71 | 43K | 1.0K | 40 | 12
EmoCeleb-A | {a} | 159 | 74K | 1.5K | 50 | 50
EmoCeleb-V | {v} | 162 | 75K | 1.5K | 50 | 50

8.2 EmoCeleb Dataset

Existing datasets for audiovisual emotion recognition have few speakers or lack enough data points per individual. This motivates us to develop a novel dataset via cross-modal labeling, i.e., utilizing information from one or more modalities to annotate another. Our approach enables the development of a large-scale emotion dataset with weak labels suitable for training and evaluating personalized emotion recognition systems. Figure 8.1 provides an overview of the EmoCeleb dataset generation process.
To enhance the utility of the dataset, we diverge from previous approaches such as the one by Albanie et al. [3], which utilizes 108 It was the first song that we wrote for the record and it just felt really exciting. Vision Text Vision Emotion Classifier Text Emotion Classifier Text Prob. Vision Prob. Averaged Prob. Weak Audio Label = H Audio Audio Emotion Classifier KL Divergence Discard Low Agreement High Agreement Audio Prob. KL Divergence Discard Low Agreement High Agreement Averaged Prob. Weak Vision Label = H Figure 8.1: Overview of cross-modal labeling: (i) Emotion recognition with two modalities (vision+text or audio+text) to provide weak supervision. (ii) Weak labels are retained when two modalities are in sufficient agreement (measured by KL divergence). (iii) Inference results are averaged to generate a weak label for the target modality (audio or vision). a single modality input for cross-modal distillation (from vision to audio). Instead, we perform emotion recognition using two modalities to provide weak supervision. Weak labels are retained only when the emotion recognition results from both modalities agree. In particular, with the three modalities (vision, audio, and text), we perform cross-modal labeling in two directions: combining vision and text to label audio (EmoCeleb-A) and leveraging audio and text to label vision (EmoCeleb-V). 8.2.1 Labeling Dataset We perform weak labeling on VoxCeleb2 [33] which is an audiovisual dataset for speaker recognition. VoxCeleb2 includes interview videos of celebrities from YouTube, which contains over 1M utterances with more than 6K speakers. The dataset is roughly gender balanced (61% male), and the speakers span a wide range of ethnicities, accents, professions and ages [33]. The dataset provides the identities and apparent gender information for the speakers, but it does not have any emotion labels. We only use the English portion∗ of VoxCeleb2, which contains 1,326 hours of content. ∗ https://github.com/facebookresearch/av_hubert/blob/main/avhubert/preparation/data/vox-en.id.gz 109 Table 8.2: Number of utterances in each class for EmoCeleb. Dataset Neutral Anger Happiness Surprise Total EmoCeleb-A 45,288 3,682 21,466 3,664 74,100 EmoCeleb-V 39,774 6,909 19,168 9,259 75,110 8.2.2 Unimodal Emotion Recognition Vision. For vision-based analysis, we utilize Masked Auto-Encoder (MAE) [64] as the backbone. We begin by initializing the MAE with a pre-trained checkpoint† , which is trained on the EmotionNet dataset [50]. Subsequently, we perform supervised training on the Aff-Wild2 dataset [85], for frame-level emotion recognition. We carry out frame-level inference for every utterance in the VoxCeleb2 dataset and employ average-pooling to aggregate the results, thereby obtaining utterance-level logits for categorical emotions. Audio. In the audio domain, we adopt an open-source model‡ based on WavLM [30] trained on the MSPPodcast dataset [108] for speech emotion recognition. Text. The VoxCeleb2 dataset [33] does not provide transcripts. Thus, we first use Whisper [130] for speech recognition. Then, we employ an open-source model§ for text emotion recognition. This model is built upon RoBERTa [101] and has been trained on diverse text emotion datasets sourced from Twitter, Reddit, student self-reports, and television dialogue utterances, e.g. , GoEmotions [36], AIT-2018 [114], MELD [128], and CARER [144]. 
The aforementioned methods produce logits corresponding to Ekman’s six basic emotions [44], namely, anger, disgust, fear, happiness, sadness, and surprise, in addition to neutral. 8.2.3 Cross-modal Labeling We illustrate the labeling process using the vision + text → audio direction as a representative example. The approach used in the alternate direction (audio + text → vision) is identical. † https://github.com/AIM3-RUC/ABAW4 ‡ https://huggingface.co/3loi/SER-Odyssey-Baseline-WavLM-Categorical § https://huggingface.co/j-hartmann/emotion-english-roberta-large 110 I was so happy today. He was like jumping in the pool and everything like it's so happy. Happiness And how can you live a lie and ask other people to live the truth? Without having a life, not just an a inner life, but some kind of life experience. But the hell are you going to say? I felt quite well. I had a great communication with my teammates, they held me a lot. Everything you can imagine helps me get better. That was incredible. I'm just impressed as a work environment. It looks fantastic. It really does. EmoCeleb-A EmoCeleb-V Happiness Anger Anger Surprise Surprise Figure 8.2: Examples of emotional expressions in EmoCeleb-A and EmoCeleb-V. Green solid lines denote the modalities used for cross-modal labeling, while red dashed lines refer to the target modalities. For a given utterance x, we independently generate the logits for categorical emotions with vision and text, denoted as hv and ht , respectively. Weak labels are retained only when the two modalities are in agreement. We compute the Kullback-Leibler (KL) divergence between the inference results from both modalities. If the KL divergence exceeds 1, we discard the data point. If the KL divergence is less than or equal to 1, we average the inference results to formulate a weak label for the audio: hˆ a = 1 2 (hv + ht). The threshold is selected based on agreement with ground truth labels in CMU-MOSEI dataset and the balance of label distribution. The predicted category yˆa is then obtained by selecting the argument with the maximum value in hˆ a. 111 Table 8.3: Comparison with a single human annotator. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Both directions of weak label generation achieve superior performance compared to random guessing. CMU-MOSEI MSP-Face Method ACC F1 ACC F1 Random 30.5 24.9 27.4 24.6 EmoCeleb-A 50.8 36.4 43.4 41.2 EmoCeleb-V 57.2 42.2 41.8 38.4 Human 70.8 49.6 69.4 69.2 8.2.4 Post-processing EmoCeleb exhibits a highly imbalanced distribution of labels, particularly with a sparse representation of disgust, fear, and sadness. This scarcity is likely attributed to the nature of the VoxCeleb2 dataset, which predominantly comprises interview videos featuring celebrities. Within such contexts, expressions of these three emotions are uncommon. Thus, we remove these three emotion classes. Furthermore, to ensure that each speaker has sufficient utterances for effective training and evaluation of personalized emotion recognition models, we also discard speakers with fewer than 50 utterances. After this procedure, we have over 150 hours of content for both directions of cross-modal labeling. Specifically, EmoCeleb-A contains 1,480 speakers with 74,100 utterances, and EmoCeleb-V includes 1,494 speakers with 75,110 utterances. Importantly, each speaker contributes a minimum of 50 utterances. Detailed statistics of the two datasets are provided in Tables 8.1 and 8.2. Examples of emotions in EmoCeleb are shown in Figure 8.2. 
8.2.5 Label Quality Evaluation

We evaluate our weak labels through (i) comparison with human annotations and (ii) comparison of the utility of the labels for model training against existing emotion recognition datasets. To maintain consistency with the label space of EmoCeleb, our analysis focuses on four emotions: anger, happiness, surprise, and neutral.

Comparison with human annotations. We apply the same labeling process to two established emotion datasets, namely CMU-MOSEI [201] and MSP-Face [185]. We assess the generated weak labels against the ground truth provided by these datasets. Since both datasets collect annotations from multiple annotators, we additionally evaluate the performance of a single annotator's judgments against the consensus ground truth. The results of these evaluations are detailed in Table 8.3. We show that both directions of weak label generation achieve superior performance compared to random guessing (Random).

Table 8.3: Comparison with a single human annotator. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Both directions of weak label generation achieve superior performance compared to random guessing.

Method | CMU-MOSEI ACC | CMU-MOSEI F1 | MSP-Face ACC | MSP-Face F1
Random | 30.5 | 24.9 | 27.4 | 24.6
EmoCeleb-A | 50.8 | 36.4 | 43.4 | 41.2
EmoCeleb-V | 57.2 | 42.2 | 41.8 | 38.4
Human | 70.8 | 49.6 | 69.4 | 69.2

Comparison with existing emotion datasets. We conduct a zero-shot evaluation to benchmark our datasets against existing emotion recognition datasets. We train an emotion recognition model (HuBERT [69] for audio and VideoMAE [167] for vision) on one source dataset and subsequently evaluate its performance on a different target dataset. Detailed results of the zero-shot transfer are provided in Table 8.4. We exclude IEMOCAP [20] from the vision model evaluations due to its non-frontal face views, which create a significant domain gap and affect the assessment of model generalization. Additionally, MSP-Improv [22] is not selected as a target dataset because it lacks the surprise emotion class. The results indicate that EmoCeleb not only surpasses random guessing (Random) but also outperforms two established emotion datasets, RAVDESS and CMU-MOSEI. Both evaluations demonstrate the efficacy and utility of our weakly-labeled dataset.

Table 8.4: Comparison with existing emotion datasets. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. The model trained with EmoCeleb outperforms models trained on RAVDESS and CMU-MOSEI, which are manually labeled.

Source | IEMOCAP Audio (ACC, F1) | MSP-Face Audio (ACC, F1) | MSP-Face Vision (ACC, F1)
Random | 35.5, 24.5 | 27.4, 24.6 | 27.4, 24.6
RAVDESS [105] | 31.3, 28.0 | 21.0, 16.5 | 12.9, 6.7
CMU-MOSEI [201] | 39.1, 29.9 | 27.7, 20.8 | 32.3, 18.5
MSP-Podcast [108] | 53.8, 38.0 | 39.1, 34.7 | -
EmoCeleb | 48.1, 31.9 | 35.7, 30.1 | 33.5, 26.9

Figure 8.3: SetPeER overview. The personalized feature extractor P generates layer-specific personalized embeddings from the input and feeds the embeddings to the backbone encoder layer. These personalized embeddings serve as contextual cues for the current layer, aiding in generating more targeted features. The weights of P are shared across layers. Additionally, we apply contrastive learning to the embeddings generated by P to enhance the consistency in producing personalized speaker embeddings.

8.3 Method

The goal of SetPeER is to learn personalized representations for emotion recognition using a set of K utterances from a single speaker.
Drawing inspiration from recent advancements in set-based representation learning, our approach derives personalized speaker representations from the input set of utterances. These personalized representations are then used to condition the features generated by the deep encoders, as illustrated in Figure 8.3. SetPeER comprises two main components: a multi-layer backbone encoder E designed to produce high-level representations from audio/visual input signals, and a lightweight personalized feature extractor P aimed at generating personalized representations from input sets.

8.3.1 Backbone Encoder

The backbone encoder E produces high-level feature representations from raw audio or video inputs. Although SetPeER is applicable to many backbone encoders with a transformer architecture, we adopt the widely used HuBERT [69] and VideoMAE [167] as the backbone encoders for our audio and vision modalities, respectively. As a high-level overview, both architectures consist of two main components: a feature extractor $E_0$ that extracts low-level representations from raw audio or video inputs, and a deep encoder $E'$ that extracts high-level representations from the extracted low-level features. For HuBERT [69], $E_0$ consists of several layers of 1D Convolutional Neural Networks (1D-CNN) that generate features at 20 ms audio frames from raw waveforms sampled at 16 kHz. For VideoMAE [167], $E_0$ is a space-time cube embedding that maps 3D raw video tokens to patches with a pre-specified channel dimension. The deep encoder $E'$ for both architectures is a stack of N transformer encoder layers [184], i.e., $E' = \{E_1, \ldots, E_N\}$, where the output of the i-th layer $E_i$ is $x_i = E_i(x_{i-1})$ for $i \in [1, 2, \ldots, N]$ and $x_0$ denotes the features produced by $E_0$. The output of the last layer $x_N$ is temporally mean-pooled and fed to linear layers to produce the emotion classification predictions.

8.3.2 Personalized Feature Extractor

The objective of P is to produce personalized embeddings given a set of utterances. A key property of set-based learning is permutation invariance, i.e., the output for a set remains the same regardless of the ordering of the input. We follow previous work [202, 89] and use permutation-invariant modules to build the personalized feature extractor P. Specifically, P consists of several linear layers to reduce the dimensionality of the inputs, a transformer encoder layer (without positional encoding), and a Vector Quantization module [182] to discretize the learned representations into meaningful concepts.

Formally, we want to extract personalized features for each encoder layer in $E'$, given a set of utterances of the same speaker $S_x = \{x^1, x^2, \ldots, x^K\}$. For the first encoder layer, P takes as input the temporally mean-pooled features extracted from $E_0$, while for each remaining layer, P takes as input $p_l$, the temporally mean-pooled features extracted from the preceding encoder layer. In other words, $p_1 = \text{Pool}(E_0(S_x))$ and $p_l = \text{Pool}(x_{l-1})$. The dimension of $p_l$ is $\mathbb{R}^{K \times D}$, where K is the size of the input set and D is the feature dimension. Given $p_l$, P extracts the speaker embeddings for the set as follows.

Dimensionality reduction. We want to keep the parameter count of SetPeER analogous to that of the original encoders to demonstrate the effectiveness of the proposed method. Hence, we first use a linear layer $L_1$ to reduce the dimensionality of the inputs from D to C, and we share P across all layers of E using a learnable layer embedding $\Phi(\cdot)$: $q_l = L_1(p_l + \Phi(l))$.
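To make the per-layer input preparation and the shared dimensionality reduction concrete, the sketch below shows one possible PyTorch implementation of $q_l = L_1(p_l + \Phi(l))$. It is a minimal reconstruction from the description above rather than the released implementation; the module name is illustrative, the default dimensions follow the HuBERT configuration reported later in the chapter (D = 768, C = 64), and realizing $\Phi(\cdot)$ as an nn.Embedding is an assumption.

```python
import torch
import torch.nn as nn

class LayerConditionedReduction(nn.Module):
    """Shared linear reduction L1 with a learnable layer embedding Phi(l)."""

    def __init__(self, num_layers: int, dim_in: int = 768, dim_out: int = 64):
        super().__init__()
        self.layer_embedding = nn.Embedding(num_layers, dim_in)  # Phi(l)
        self.reduce = nn.Linear(dim_in, dim_out)                 # L1: D -> C

    def forward(self, p_l: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # p_l: (K, D) temporally mean-pooled features for a set of K utterances
        idx = torch.tensor([layer_idx], device=p_l.device)
        phi = self.layer_embedding(idx)                          # (1, D), broadcast over K
        return self.reduce(p_l + phi)                            # q_l: (K, C)
```

Because the same module is reused for every encoder layer, only the layer embedding distinguishes which layer the pooled set features came from, which is what allows P to be shared across layers without growing the parameter count.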
Contextualized feature learning. Then, we leverage a transformer encoder layer T [184] to generate contextualized representations for the set of processed vectors. We do not add any positional encoding to $q_l$ to ensure permutation invariance: $r_l = T(q_l)$.

Personalized embedding generation. Next, we average the produced contextualized representations to generate a single vector representing the set, and we use another linear layer $L_2$ to resize the generated embedding to a target output dimension of size $Q \times C$, where Q denotes the number of embeddings per speaker we want to extract:

$$s_l = L_2(\text{Pool}(r_l, \text{dim}=0)). \quad (8.1)$$

Quantized speaker representation codebooks. VQ-VAE is a popular technique for acquiring a quantized codebook of image elements, facilitating the autoregressive synthesis of images. We extend Vector Quantization (VQ) [182] to create personalized speaker embeddings with two main motivations. First, certain individual attributes, such as race, gender, and age, are inherently discrete. Moreover, VQ facilitates the creation of compact and generalized feature representations by filtering out irrelevant information from the continuous space. For VQ, we use a discrete codebook $Z = \{z_i\}_{i=1}^{M}$, where $z_i \in \mathbb{R}^C$, to generate Q embeddings from $s_l$, where M denotes the number of entries in the codebook. In particular, we first reshape $s_l$ into $\mathbb{R}^{Q \times C}$. Then, for each personalized vector of size C in $s_l$, we look up the nearest entry j in Z and output the corresponding representation $z_j$. During back-propagation, we use a straight-through gradient estimator as in [182]. Finally, we use a linear layer $L_3$ to map the produced speaker embeddings from C to D:

$$t_l = L_3(\text{VQ}(s_l)) \in \mathbb{R}^{Q \times D}. \quad (8.2)$$

8.3.3 Training Scheme

In Section 8.3.2, we present the personalized embedding extraction process of P for a single speaker. In each training step, SetPeER receives B sets of labeled utterances, each representing a speaker and consisting of K utterances. We utilize P to derive personalized speaker embeddings for every layer of E. These embeddings are concatenated with the contextualized features extracted from the previous layer (or with the features from $E_0$ for the first layer), thereby integrating personal information into the features generated at each layer. This technique is commonly called prefix tuning [95]:

$$x_l = E_l([t_l; x_{l-1}]), \quad (8.3)$$

where $E_l$ is the l-th layer of $E'$. We later examine different strategies for fusing the extracted personalized embeddings with the deep contextualized features in an ablation study. Finally, the encoder's output is temporally mean-pooled and fed into a linear head to predict emotions against the ground-truth labels using the cross-entropy loss $\mathcal{L}_{CE}(\tilde{y}, y)$.

Consistency-aware embedding generation. Ideally, P should produce identical outputs given two sets of utterances from the same speaker. Hence, to enhance the consistency of P in producing personalized speaker embeddings, we propose to use contrastive representation learning. Specifically, given the input $p \in \mathbb{R}^{K \times D}$ for the personalized feature extractor, we split it into two equal subsets $p^1, p^2 \in \mathbb{R}^{\frac{K}{2} \times D}$. We use P to extract the speaker embeddings $t^1$ and $t^2$ for these two sets, and we enhance P's ability to extract consistent features with an InfoNCE contrastive loss [122]:

$$\mathcal{L}_{NCE} = -\frac{1}{B} \sum_{i=1}^{B} \log \left[ \frac{\exp(t_i^1 \cdot t_i^2 / \tau)}{\sum_{k \neq i} \exp(t_i^1 \cdot t_k^2 / \tau) + \exp(t_i^1 \cdot t_i^2 / \tau)} \right], \quad (8.4)$$

where B represents the number of speakers we train in one batch and τ stands for the temperature parameter.
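A minimal sketch of this consistency objective is shown below. It assumes the two half-set embeddings have already been flattened to one vector per speaker and L2-normalized before the dot product, and it uses an illustrative default temperature; none of these details are specified above, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def consistency_infonce(t1: torch.Tensor, t2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE loss (Eq. 8.4) encouraging matching embeddings for the two
    half-sets of the same speaker. t1, t2: (B, D') one embedding per speaker."""
    t1 = F.normalize(t1, dim=-1)          # L2 normalization is an assumption
    t2 = F.normalize(t2, dim=-1)
    logits = t1 @ t2.T / tau              # (B, B) pairwise similarities
    targets = torch.arange(t1.size(0), device=t1.device)
    # Diagonal entries are the positive pairs; off-diagonals act as negatives.
    return F.cross_entropy(logits, targets)
```

With the targets placed on the diagonal, the cross-entropy over the similarity matrix averages exactly the per-speaker terms of Eq. 8.4 over the batch.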
Overall, SetPeER is optimized with the following loss function with hyper-parameters $\lambda_1$, $\lambda_2$, and $\lambda_3$: $\mathcal{L} = \lambda_1 \mathcal{L}_{CE} + \lambda_2 \mathcal{L}_{NCE} + \lambda_3 \mathcal{L}_{VQ}$, where $\mathcal{L}_{VQ}$ is the commitment loss associated with Vector Quantization. More details on the commitment loss are in [182].

8.4 Experiments

8.4.1 Implementation and Training Details

All methods are implemented in PyTorch [123]. We provide code in the supplementary materials. The code and datasets will be published upon acceptance.

Model architecture. We adopt HuBERT-base [69] and VideoMAE-tiny [167] as our audio and vision encoders, respectively. It is important to note that we aim to develop and validate a general model suitable for personalization across various backbone architectures. Consequently, we select two backbones that are widely used across various audio and vision tasks but have a relatively small number of parameters for efficient training. Both architectures consist of 12 transformer encoder layers, with feature dimension D = 768 for HuBERT and D = 384 for VideoMAE. We use the same personalization network P for both audio and vision experiments, in which C = 64, Q = 4, and M = 512. This results in ∼400K additional trainable parameters, about 0.5% of the number of parameters of HuBERT-base [69] and 1.2% of the number of parameters of VideoMAE-tiny [167].

Model training. We optimize the network weights using the AdamW optimizer [107] on a single NVIDIA Quadro RTX 8000 GPU. The weight decay is 1e-4 and the gradient clipping threshold is 1.0. We train all models for 100 epochs with a learning rate of 3e-5. We set $\lambda_1 = 1.0$, $\lambda_2 = 0.1$, and $\lambda_3 = 0.1$ for the training loss weights. To facilitate set learning, our data loaders are designed at the speaker level. Specifically, during training, a batch comprises B speakers, each composed of a set of K utterances randomly drawn from all utterances of the corresponding speaker. Consequently, within an epoch, SetPeER encounters every speaker in the dataset, though not necessarily all utterances. During testing, we conduct inference on one speaker at a time, i.e., B = 1, accommodating varying numbers of utterances (K) per speaker; however, we ensure that the model never encounters more than K utterances within a single batch. In all our experiments, we set B = 8 and K = 8 during training.

8.4.2 Datasets

We divide EmoCeleb into train, validation, and test sets with a distribution ratio of 70%, 10%, and 20%, respectively, on a speaker-independent basis. This means speakers included in the training set are excluded from the validation and test sets to ensure no overlap. Additionally, we perform experiments on two benchmark emotion datasets, i.e., MSP-Improv [22] and MSP-Podcast [108], to evaluate the utility of our weakly-labeled dataset and the effectiveness of our proposed method. While MSP-Podcast has been used in prior research on personalized speech emotion recognition [169, 160], no suitable dataset has emerged with both high-quality visual data and a diverse pool of speakers for audiovisual emotion recognition experiments. Existing datasets like CMU-MOSEI and MSP-Face offer visual information with a large speaker pool; however, CMU-MOSEI lacks speaker identity labels, while MSP-Face yields performance akin to random guessing [185]. Consequently, for visual evaluation, we opted for MSP-Improv alongside EmoCeleb-V. Although MSP-Improv features a small number of speakers (12), it remains a popular choice in the current visual and audio-visual emotion recognition literature.
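The speaker-level set sampling described in the implementation details above (B speakers per batch, K utterances per speaker) could be implemented along the following lines. The dataset interface, function name, and argument names here are hypothetical; this is a sketch of the sampling scheme, not the released data loader.

```python
import random
from collections import defaultdict

def speaker_set_batches(utterances, speakers_per_batch=8, set_size=8, seed=0):
    """Yield batches of (speaker_id, [utterance_ids]) with up to K utterances per speaker.

    `utterances` is assumed to be an iterable of (utterance_id, speaker_id) pairs.
    Each epoch visits every speaker once, sampling K of their utterances at random.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for utt_id, spk_id in utterances:
        by_speaker[spk_id].append(utt_id)

    speakers = list(by_speaker)
    rng.shuffle(speakers)
    for start in range(0, len(speakers), speakers_per_batch):
        batch = []
        for spk in speakers[start:start + speakers_per_batch]:
            utts = by_speaker[spk]
            k = min(set_size, len(utts))  # EmoCeleb guarantees >= 50 utterances per speaker
            batch.append((spk, rng.sample(utts, k)))
        yield batch
```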
Table 8.5: Speech emotion recognition on EmoCeleb-A and downstream datasets. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Average accuracy (A-ACC %, ↑) and average F1-score (A-F1 %, ↑) across speakers are also reported.

                  EmoCeleb-A                MSP-Podcast-4             MSP-Podcast-8             MSP-Improv
Method            ACC   A-ACC  F1    A-F1   ACC   A-ACC  F1    A-F1   ACC   A-ACC  F1    A-F1   ACC   A-ACC  F1    A-F1
HuBERT [69]       47.7  51.2   46.5  41.8   47.3  46.2   49.0  43.3   24.5  25.5   22.4  21.4   54.1  54.0   51.8  51.5
HuBERT-PT         -     -      -     -      49.4  46.1   49.9  42.0   24.4  26.8   23.9  24.1   56.0  55.7   53.8  53.3
PAPT [169]        48.6  53.4   47.1  42.1   50.0  48.3   50.9  43.5   25.2  27.4   24.8  24.4   56.2  56.0   53.6  53.4
SetPeER (ours)    50.0  54.3   49.1  45.5   51.7  51.1   52.6  47.0   26.1  28.5   26.0  25.2   57.3  57.6   54.2  54.0

MSP-Improv is an acted audiovisual emotional database that explores emotional behaviors during acted and improvised dyadic interactions [22]. The dataset consists of 8,438 turns (over 9 hours) of emotional sentences, categorized into four primary emotions: neutral, happiness, sadness, and anger. The corpus has six sessions, and each session has one male and one female speaker (12 in total). We use sessions 1-4 as the training set, session 5 as the validation set, and session 6 as the testing set.

MSP-Podcast is the largest corpus for speech emotion recognition in English. The dataset contains speech segments from podcast recordings. Each utterance in the dataset is annotated using crowd-sourcing with continuous labels of arousal, valence, and dominance, along with categorical emotions. In this chapter, we exclude any samples that lack speaker identification. This refinement yields a total of 42,541 utterances, encompassing over 71 hours of emotional speech content. The corpus provides an official data split and has eight emotion classes: neutral, happiness, sadness, anger, surprise, fear, disgust, and contempt. We conduct the downstream evaluation in two ways: (i) we use the subset with the four emotion categories in EmoCeleb (MSP-Podcast-4); (ii) we use all eight emotion categories (MSP-Podcast-8).

8.4.3 Experimental Results

Quantitative analysis. We pre-train SetPeER on EmoCeleb and then fine-tune it on the downstream datasets with supervised emotion recognition. Thus, we report model performance on both EmoCeleb and the downstream datasets. Accuracy (ACC %, ↑) and F1-score (F1 %, ↑) are the evaluation metrics. Additionally, we report the average accuracy (A-ACC, ↑) and the average F1-score (A-F1, ↑) across speakers.

Table 8.6: Visual emotion recognition on EmoCeleb-V and MSP-Improv. SetPeER surpasses the baseline methods across all evaluated metrics.

                  EmoCeleb-V                MSP-Improv
Method            ACC   A-ACC  F1    A-F1   ACC   A-ACC  F1    A-F1
VideoMAE [167]    38.6  33.6   36.7  27.0   52.8  52.0   49.9  48.2
VideoMAE-PT       -     -      -     -      54.1  54.3   52.7  51.7
PAPT [169]        39.2  33.4   38.0  27.2   54.5  54.7   53.0  52.8
SetPeER (ours)    39.4  34.0   38.6  27.9   57.3  55.7   56.7  55.0

Three baseline methods are implemented and compared. We do not benchmark our method against existing approaches tuned for maximum performance with more complex backbone architectures, as the backbone in SetPeER is interchangeable.

• Vanilla backbones. We train HuBERT / VideoMAE on EmoCeleb and the downstream datasets starting from the official checkpoints.

• Pre-trained backbones (PT). We pre-train HuBERT / VideoMAE on EmoCeleb and then fine-tune them on the downstream tasks.

• PAPT [169] trains speaker embeddings on an extensive set of training speakers in a self-supervised fashion.
These embeddings are then incorporated into the generated features via prefix tuning for personalized emotion recognition. In the testing phase on unseen speakers, the method identifies the most closely aligned speakers from the training set and uses the corresponding trained embeddings to generate personalized features. SetPeER differs from PAPT in two key aspects. First, while PAPT requires two training stages for personalization, our model can be trained directly with labels, bypassing the need for initial self-supervised training. Second, PAPT relies on a diverse and large training speaker set for matching unseen speakers, whereas our method performs well with fewer speakers (see Table 8.6). Efficiency-wise, our model eliminates the need to match each test speaker with every training speaker, substantially reducing inference time. Nevertheless, as far as we know, PAPT remains the only prior personalization method that handles unseen speakers without retraining any components.

Figure 8.4: t-SNE visualizations of speaker embeddings from MSP-Podcast. (a) Training set, full model. (b) Testing set, full model. (c) Training set, w/o VQ. (d) Testing set, w/o VQ. Blue points represent male speakers and orange points indicate female speakers. Representations by SetPeER (first row) show clear separation w.r.t. gender.

For a fair comparison, we initialize the backbone encoder of PAPT with our pre-trained backbone (PT) on EmoCeleb. The audio and vision performances are provided in Tables 8.5 and 8.6, respectively. Results in both tables show that our proposed method outperforms all other competing methods across a variety of metrics. We observe that HuBERT-PT consistently outperforms HuBERT across both the audio and visual experiments. This underscores the suitability of our datasets, EmoCeleb-A and EmoCeleb-V, not only as effective evaluation datasets for personalization but also as promising resources for large-scale pre-training in emotion recognition tasks. SetPeER further boosts the performance of HuBERT-PT by a large margin, especially in the per-speaker metrics (A-ACC and A-F1), demonstrating the effectiveness of the proposed personalized feature extraction pipeline. Compared to PAPT [169], we not only demonstrate superior performance overall but also remain effective on the MSP-Improv dataset with a limited number of training speakers (ten speakers). On the other hand, PAPT achieves only marginal improvements over HuBERT-PT on the MSP-Improv dataset for both the audio and visual modalities, indicating its limitations when confronted with a small pool of training speakers.

Table 8.7: Ablations for SetPeER. Fusion refers to fusing the speaker embedding with audiovisual features for personalized emotion recognition. A and V stand for the audio and vision modalities, respectively.

                    MSP-Podcast-4 (A)   MSP-Improv (V)
Modules             ACC    F1           ACC    F1
SetPeER             51.7   52.6         57.3   56.7
  − L_NCE           51.3   51.6         55.4   54.9
  − VQ              51.0   51.8         56.4   55.1
Fusion Strategy
  Prefix [95]       51.7   52.6         57.3   56.7
  Addition          50.9   51.4         53.8   48.7
  Cross-attn [171]  48.2   49.5         55.9   52.8

Qualitative analysis. To understand the information captured in the speaker embeddings, we inspect what the personalized feature extractor P learns. In particular, we investigate the relation between the extracted speaker embeddings and gender, which is the only demographic information available for the MSP-Podcast dataset [108].¶ Figures 8.4a and 8.4b display the 2D t-SNE visualizations [183] of speaker embeddings from MSP-Podcast, with each point representing an utterance.
Colors denote gender, with blue representing male and orange representing female speakers. It is evident that SetPeER can generate linearly separable features with respect to gender, even without explicit gender labels. This not only showcases SetPeER's capability to produce useful personalized features but also underscores the significance of gender in emotion recognition, aligning with the literature [153].

¶ We cannot produce the t-SNE plot for our visual model due to the limited speaker pool of MSP-Improv.

Ablations. We perform extensive ablation studies to demonstrate the effectiveness of each component, as shown in Table 8.7. (i) Contrastive loss $\mathcal{L}_{NCE}$. Removing the contrastive loss leads to notable performance degradation, with approximately a 2% decrease in both accuracy and F1-score on MSP-Improv (V). This underscores the importance of maintaining uniform representations across various inputs from the same speaker. (ii) Vector Quantization. Quantizing the personalized speaker embeddings proves effective, increasing the F1 metric by 1% on MSP-Podcast-4 (A) and by 1.8% on MSP-Improv (V). Furthermore, in Figures 8.4c and 8.4d, we can see a clear degradation in cluster quality when the model is trained without the VQ module. (iii) Fusion strategy. The information in the speaker embedding is fused with the input of each layer to provide personalized emotion recognition; this chapter uses prefix tuning [95], which temporally prepends $t_l$ to $x_{l-1}$, for this purpose. Besides prefix tuning, we explore two other fusion strategies, namely addition and cross-attention [171]. For addition, we set Q = 1 and directly add $t_l$ to $x_{l-1}$. For cross-attention, we adapt the cross-attention formulation proposed by Tsai et al. [171], where the keys and values are $x_{l-1}$ and the queries are $t_l$. Overall, prefix tuning exhibits notably superior performance compared to the other two fusion strategies. The discrepancy likely arises because addition is constrained to a fixed number of embeddings (Q = 1), whereas cross-attention suffers from information loss. (iv) Set size K. In Figure 8.5, we investigate the impact of the set size K on SetPeER's learning process. Ideally, a larger set size enables a more precise construction of personalized information, leading to more accurate predictions. However, the practicality of a large set size is often limited by the availability of samples per speaker. Therefore, it is crucial to find a value of K that balances performance and practicality. As expected, SetPeER becomes increasingly effective as K increases, yet the returns appear to diminish at K = 8.

Figure 8.5: Impact of set size K on performance. Larger set sizes lead to higher performance.

8.5 Conclusions

In this study, we introduce SetPeER, a modality-agnostic framework designed for personalized emotion recognition. Our approach leverages cross-modal labeling to curate a large dataset for both training and evaluating personalized emotion recognition models. We present an innovative personalized architecture, enhanced with set learning, which is adept at efficiently learning distinctive speaker features. Through comprehensive experiments, we showcase the utility of the EmoCeleb dataset and the superior efficacy of the proposed method for personalized emotion recognition, outperforming baseline models on the MSP-Podcast and MSP-Improv benchmarks.
Chapter 9: Conclusions

In this thesis, we explore different machine learning methods to enhance the generalization ability of expression and emotion recognition with minimal human effort, including unsupervised domain adaptation, multimodal learning, generative model features, and unsupervised personalization.

In Chapter 3, we propose a Speaker-Invariant Domain-Adversarial Neural Network (SIDANN) to separate the speaker bias from the domain bias. Specifically, we add a speaker discriminator to detect the speaker's identity. There is a gradient reversal layer at the beginning of the discriminator so that the encoder can unlearn the speaker-specific information. The cross-domain experimental results indicate that the proposed SIDANN has a better domain adaptation ability than the DANN.

In Chapter 4, we observe that traditional DA models might not preserve the local features that are necessary for visual emotion recognition while reducing domain discrepancies in global features. To address this problem, we reverse the training order. Specifically, we employ first-order facial animation warping to generate a synthetic dataset with the target data identities for contrastive pre-training. Then, we fine-tune the base model with the labeled source data. Our experiments with cross-domain evaluation indicate that the proposed FATE model substantially outperforms existing domain adaptation models, suggesting that the proposed model has better domain generalizability for emotion recognition.

In Chapter 5, we introduce X-Norm, a novel approach for modality fusion. X-Norm can effectively align the modalities by exchanging the parameters in normalization layers. Since only normalization parameters are engaged, the proposed fusion mechanism is lightweight and easy to implement. We evaluate X-Norm through extensive experiments on two multimodal tasks, i.e., emotion and action recognition, and three databases. The quantitative and qualitative analyses show that normalization parameters can encode modality-specific information, and that exchanging these parameters between the two modalities can implicitly align the modalities, thus enhancing the model capacity.

In Chapter 6, we propose FG-Net, a data-efficient method for generalizable facial action unit detection. FG-Net extracts generalizable and semantic-rich features from a generative model. A Pyramid CNN-Interpreter is proposed to detect AUs coded as heatmaps, which makes the training efficient and captures essential information from nearby regions. We show that the proposed FG-Net method has a strong generalization ability when evaluated across corpora or trained with limited data, demonstrating its strong potential for action unit detection in real-life scenarios.

In Chapter 7, we propose two methods to adapt pre-trained speech encoders for personalized speech emotion recognition, namely PAPT, which jointly pre-trains speech encoders with speaker embeddings to produce personalized speech representations, and PLDC, which performs distribution calibration for the predicted labels based on retrieved similar speakers. We validate the effectiveness of the proposed techniques via extensive experiments. We further demonstrate the robustness of the personalized models for unseen speakers.

In Chapter 8, we introduce SetPeER, a modality-agnostic framework designed for personalized emotion recognition. Our approach leverages cross-modal labeling to curate a large dataset for both training and evaluating personalized emotion recognition models.
We present an innovative personalized architecture, enhanced with set learning, which is adept at efficiently learning distinctive speaker features. Through comprehensive experiments, we showcase the utility of the EmoCeleb dataset and the superior efficacy of the proposed method for personalized emotion recognition, outperforming baseline models on the MSP-Podcast and MSP-Improv benchmarks.

Our proposed methods leverage unlabeled data and meaningful representation learning to amplify the capabilities of expression and emotion recognition models, particularly for unseen datasets or subjects. By harnessing the potential of unlabeled data and bolstering the generalization capacities of perception models, we lay the groundwork for more effective and insightful analyses of human behavior. This not only enhances our understanding of human behavior but also propels advancements in fields spanning psychology, sociology, human-computer interaction, and beyond.

Future Directions. Moving forward, there are several promising avenues for further exploration. One avenue involves adapting existing pre-trained models for more generalizable human behavior understanding, for instance, leveraging large vision-language models to generate captions for expressive behaviors and thereby provide valuable weak supervision signals. Additionally, we can explore the use of diffusion models to generate high-quality affective data tailored to specific subjects, facilitating personalized training or data augmentation. Another promising direction is to delve into downstream applications, translating robust and generalizable behavior understanding into real-world scenarios. This could include applications such as therapist empathy analysis in mental health settings, detection of communication difficulties in social conversations, and engagement detection in virtual group meetings. These future directions hold the potential to deepen our understanding and application of human behavior analysis in various domains.

Bibliography

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. "Image2stylegan: How to embed images into the stylegan latent space?" In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 4432–4441.

[2] Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Deep audio-visual speech recognition". In: IEEE transactions on pattern analysis and machine intelligence (2018).

[3] Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. "Emotion recognition in speech using cross-modal transfer in the wild". In: Proceedings of the 26th ACM international conference on Multimedia. 2018, pp. 292–301.

[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer Normalization". In: arXiv preprint arXiv:1607.06450 (2016).

[5] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. "wav2vec 2.0: A framework for self-supervised learning of speech representations". In: Advances in neural information processing systems 33 (2020), pp. 12449–12460.

[6] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. "Multimodal Machine Learning: A Survey and Taxonomy". In: arXiv preprint arXiv:1705.09406 (2017).

[7] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. "Multimodal machine learning: A survey and taxonomy". In: IEEE transactions on pattern analysis and machine intelligence 41.2 (2018), pp. 423–443.
[8] Jaehun Bang, Taeho Hur, Dohyeong Kim, Thien Huynh-The, Jongwon Lee, Yongkoo Han, Oresti Banos, Jee-In Kim, and Sungyoung Lee. “Adaptive data boosting technique for robust personalized speech emotion in emotionally-imbalanced small-sample environments”. In: Sensors 18.11 (2018), p. 3744. [9] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. “Label-efficient semantic segmentation with diffusion models”. In: arXiv preprint arXiv:2112.03126 (2021). 129 [10] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M. Martinez, and Seth D. Pollak. “Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements”. In: Psychological Science in the Public Interest 20.1 (2019). PMID: 31313636, pp. 1–68. doi: 10.1177/1529100619832930. [11] Pablo Barros, Emilia Barakova, and Stefan Wermter. “Adapting the interplay between personalized and generalized affect recognition based on an unsupervised neural framework”. In: IEEE Transactions on Affective Computing 13.3 (2020), pp. 1349–1365. [12] Pablo Barros, German Parisi, and Stefan Wermter. “A personalized affective memory model for improving emotion recognition”. In: International Conference on Machine Learning. PMLR. 2019, pp. 485–494. [13] Pablo Barros and Alessandra Sciutti. “Ciao! a contrastive adaptation mechanism for non-universal facial expression recognition”. In: 2022 10th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE. 2022, pp. 1–8. [14] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. “Is Space-Time Attention All You Need for Video Understanding?” In: arXiv preprint arXiv: Arxiv-2102.05095 (2021). [15] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. “Understanding batch normalization”. In: Advances in neural information processing systems 31 (2018). [16] Paulo Blikstein. “Multimodal learning analytics”. In: Proceedings of the third international conference on learning analytics and knowledge. 2013, pp. 102–106. [17] Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. “Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models”. In: arXiv preprint arXiv:2103.04922 (2021). [18] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 421–436. [19] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. “Unsupervised pixel-level domain adaptation with generative adversarial networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3722–3731. [20] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. “IEMOCAP: Interactive emotional dyadic motion capture database”. In: Language resources and evaluation 42.4 (2008), p. 335. [21] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost. “MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception”. In: IEEE Transactions on Affective Computing 8.1 (2017), pp. 67–80. doi: 10.1109/TAFFC.2016.2515617. 130 [22] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi, and Emily Mower Provost. “MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception”. In: IEEE Transactions on Affective Computing 8.1 (2016), pp. 67–80. 
[23] Houwei Cao, David G Cooper, Michael K Keutmann, Ruben C Gur, Ani Nenkova, and Ragini Verma. “Crema-d: Crowd-sourced emotional multimodal actors dataset”. In: IEEE transactions on affective computing 5.4 (2014), pp. 377–390. [24] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Michael I Jordan. “Partial transfer learning with selective adversarial networks”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 2724–2732. [25] Joao Carreira and Andrew Zisserman. “Quo vadis, action recognition? a new model and the kinetics dataset”. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308. [26] Yanan Chang and Shangfei Wang. “Knowledge-Driven Self-Supervised Representation Learning for Facial Action Unit Recognition”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 20417–20426. [27] Emile Chapuis, Pierre Colombo, Matteo Manica, Matthieu Labeau, and Chloé Clavel. “Hierarchical Pre-training for Sequence Labelling in Spoken Dialog”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020, pp. 2636–2648. [28] Yashpalsing Chavhan, ML Dhore, and Pallavi Yesaware. “Speech emotion recognition using support vector machine”. In: International Journal of Computer Applications 1.20 (2010), pp. 6–9. [29] Luefeng Chen, Wanjuan Su, Yu Feng, Min Wu, Jinhua She, and Kaoru Hirota. “Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction”. In: Information Sciences 509 (2020), pp. 150–163. [30] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. “Wavlm: Large-scale self-supervised pre-training for full stack speech processing”. In: IEEE Journal of Selected Topics in Signal Processing 16.6 (2022), pp. 1505–1518. [31] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for contrastive learning of visual representations”. In: International conference on machine learning. PMLR. 2020, pp. 1597–1607. [32] Li-Wei Chen and Alexander Rudnicky. “Exploring wav2vec 2.0 fine-tuning for improved speech emotion recognition”. In: arXiv preprint arXiv:2110.06309 (2021). [33] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. “Voxceleb2: Deep speaker recognition”. In: arXiv preprint arXiv:1806.05622 (2018). 131 [34] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. “Scaling Egocentric Vision: The EPIC-KITCHENS Dataset”. In: European Conference on Computer Vision (ECCV). 2018. [35] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. “The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 43.11 (2021), pp. 4125–4141. doi: 10.1109/TPAMI.2020.2991965. [36] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. “GoEmotions: A dataset of fine-grained emotions”. In: arXiv preprint arXiv:2005.00547 (2020). [37] Didan Deng, Zhaokang Chen, and Bertram E Shi. “Multitask emotion recognition with incomplete labels”. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020). 
IEEE. 2020, pp. 592–599. [38] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A large-scale hierarchical image database”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848. [39] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “Imagenet: A large-scale hierarchical image database”. In: 2009 IEEE conference on computer vision and pattern recognition. Ieee. 2009, pp. 248–255. [40] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 4171–4186. [41] Abhinav Dhall, Roland Goecke, Simon Lucey, Tom Gedeon, et al. “Collecting large, richly annotated facial-expression databases from movies”. In: IEEE multimedia 19.3 (2012), p. 34. [42] Prafulla Dhariwal and Alexander Nichol. “Diffusion models beat gans on image synthesis”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 8780–8794. [43] Ellen Douglas-Cowie, Roddy Cowie, Ian Sneddon, Cate Cox, Orla Lowry, Margaret Mcrorie, Jean-Claude Martin, Laurence Devillers, Sarkis Abrilian, Anton Batliner, et al. “The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data”. In: Affective Computing and Intelligent Interaction: Second International Conference, ACII 2007 Lisbon, Portugal, September 12-14, 2007 Proceedings 2. Springer. 2007, pp. 488–500. [44] Paul Ekman. “An argument for basic emotions”. In: Cognition & emotion 6.3-4 (1992), pp. 169–200. [45] Paul Ekman. “Are there basic emotions?” In: (1992). [46] Paul Ekman. Facial action coding system. 1977. 132 [47] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. “Cross-domain au detection: Domains, learning approaches, and measures”. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE. 2019, pp. 1–8. [48] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. “Crossing Domains for AU Coding: Perspectives, Approaches, and Measures”. In: IEEE transactions on biometrics, behavior, and identity science 2.2 (2020), pp. 158–171. [49] Florian Eyben, Martin Wöllmer, and Björn Schuller. “Opensmile: the munich versatile and fast open-source audio feature extractor”. In: Proceedings of the 18th ACM international conference on Multimedia. 2010, pp. 1459–1462. [50] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. “Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 5562–5570. [51] Yingruo Fan, Jacqueline Lam, and Victor Li. “Facial action unit intensity estimation via semantic correspondence learning with dynamic graph convolution”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 07. 2020, pp. 12701–12708. [52] Kexin Feng and Theodora Chaspari. “A Review of Generalizable Transfer Learning in Automatic Emotion Recognition”. In: Frontiers in Computer Science 2 (2020), p. 9. [53] Yaroslav Ganin and Victor Lempitsky. “Unsupervised Domain Adaptation by Backpropagation”. 
In: Proceedings of the 32nd International Conference on Machine Learning. Ed. by Francis Bach and David Blei. Vol. 37. Proceedings of Machine Learning Research. Lille, France: PMLR, July 2015, pp. 1180–1189. url: http://proceedings.mlr.press/v37/ganin15.html. [54] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. “Audio set: An ontology and human-labeled dataset for audio events”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 776–780. [55] John Gideon, Soheil Khorram, Zakaria Aldeneh, Dimitrios Dimitriadis, and Emily Mower Provost. “Progressive Neural Networks for Transfer Learning in Emotion Recognition”. In: Proc. Interspeech 2017 (2017), pp. 1098–1102. [56] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Domain adaptation for large-scale sentiment classification: A deep learning approach”. In: (2011). [57] Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi. “Watching the world go by: Representation learning from unlabeled videos”. In: arXiv preprint arXiv:2003.07990 (2020). [58] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. “Coordinate-based texture inpainting for pose-guided human image generation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 12135–12144. 133 [59] Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. “The Vera am Mittag German audio-visual emotional speech database”. In: 2008 IEEE international conference on multimedia and expo. IEEE. 2008, pp. 865–868. [60] Hatice Gunes and Björn Schuller. “Categorical and dimensional affect analysis in continuous input: Current trends and future directions”. In: Image and Vision Computing 31.2 (2013), pp. 120–136. [61] Dan dan Guo, Long Tian, Minghe Zhang, Mingyuan Zhou, and Hongyuan Zha. “Learning Prototype-oriented Set Representations for Meta-Learning”. In: International Conference on Learning Representations. 2021. [62] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. “Don’t stop pretraining: Adapt language models to domains and tasks”. In: arXiv preprint arXiv:2004.10964 (2020). [63] Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. “MISA: Modality-Invariant and-Specific Representations for Multimodal Sentiment Analysis”. In: arXiv preprint arXiv:2005.03545 (2020). [64] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. “Masked autoencoders are scalable vision learners”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022, pp. 16000–16009. [65] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778. [66] Javier Hernandez, Daniel McDuff, Alberto Fung, Mary Czerwinski, et al. “DeepFN: towards generalizable facial action unit recognition with deep face normalization”. In: arXiv preprint arXiv:2103.02484 (2021). [67] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. “CNN architectures for large-scale audio classification”. In: 2017 ieee international conference on acoustics, speech and signal processing (icassp). IEEE. 2017, pp. 131–135. 
[68] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. “CyCADA: Cycle-Consistent Adversarial Domain Adaptation”. In: Proceedings of the 35th International Conference on Machine Learning. 2018. [69] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. “Hubert: Self-supervised speech representation learning by masked prediction of hidden units”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), pp. 3451–3460. [70] Xun Huang and Serge Belongie. “Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization”. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Oct. 2017. 134 [71] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: International conference on machine learning. PMLR. 2015, pp. 448–456. [72] Geethu Miriam Jacob and Bjorn Stenger. “Facial action unit detection with transformers”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 7680–7689. [73] Mimansa Jaiswal, Zakaria Aldeneh, and Emily Mower Provost. “Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning”. In: 2019 International Conference on Multimodal Interaction. 2019, pp. 174–184. [74] Mimansa Jaiswal and Emily Mower Provost. “Privacy enhanced multimodal neural representations for emotion recognition”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 05. 2020, pp. 7985–7993. [75] Ning Jia and Chunjun Zheng. “Two-level discriminative speech emotion recognition model with wave field dynamics: A personalized speech emotion recognition method”. In: Computer Communications 180 (2021), pp. 161–170. [76] Tero Karras, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, pp. 4401–4410. [77] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. “Analyzing and improving the image quality of stylegan”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, pp. 8110–8119. [78] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. “The kinetics human action video dataset”. In: arXiv preprint arXiv:1705.06950 (2017). [79] Jae-Bok Kim and Jeong-Sik Park. “Multistage data selection-based unsupervised speaker adaptation for personalized speech emotion recognition”. In: Engineering applications of artificial intelligence 52 (2016), pp. 126–134. [80] Davis E. King. “Dlib-ml: A Machine Learning Toolkit”. In: Journal of Machine Learning Research 10 (2009), pp. 1755–1758. [81] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014). [82] D Kollias, A Schulc, E Hajiyev, and S Zafeiriou. “Analysing Affective Behavior in the First ABAW 2020 Competition”. In: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG), pp. 794–800. 135 [83] Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, and Stefanos Zafeiriou. 
“Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond”. In: International Journal of Computer Vision 127.6 (2019), pp. 907–929. [84] Dimitrios Kollias and Stefanos Zafeiriou. “Aff-wild2: Extending the aff-wild database for affect recognition”. In: arXiv preprint arXiv:1811.07770 (2018). [85] Dimitrios Kollias and Stefanos Zafeiriou. “Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace”. In: arXiv preprint arXiv:1910.04855 (2019). [86] Jean Kossaifi, Antoine Toisoul, Adrian Bulat, Yannis Panagakis, Timothy M Hospedales, and Maja Pantic. “Factorized higher-order CNNs with an application to spatio-temporal emotion estimation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 6060–6069. [87] Jean Kossaifi, Robert Walecki, Yannis Panagakis, Jie Shen, Maximilian Schmitt, Fabien Ringeval, Jing Han, Vedhas Pandit, Antoine Toisoul, Bjoern W Schuller, et al. “Sewa db: A rich database for audio-visual emotion and sentiment research in the wild”. In: IEEE transactions on pattern analysis and machine intelligence (2019). [88] Ann M Kring and Albert H Gordon. “Sex differences in emotion: expression, experience, and physiology.” In: Journal of personality and social psychology 74.3 (1998), p. 686. [89] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. “Set transformer: A framework for attention-based permutation-invariant neural networks”. In: International conference on machine learning. PMLR. 2019, pp. 3744–3753. [90] Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, and Sanja Fidler. “Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 8300–8311. [91] Guanbin Li, Xin Zhu, Yirui Zeng, Qing Wang, and Liang Lin. “Semantic relationships guided representation learning for facial action unit recognition”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 01. 2019, pp. 8594–8601. [92] Haoqi Li, Ming Tu, Jing Huang, Shrikanth Narayanan, and Panayiotis Georgiou. “Speaker-Invariant Affective Representation Learning via Adversarial Training”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 7144–7148. [93] Jianan Li, Xuemei Xie, Qingzhe Pan, Yuhan Cao, Zhifu Zhao, and Guangming Shi. “SGM-Net: Skeleton-guided multimodal network for action recognition”. In: Pattern Recognition 104 (2020), p. 107356. [94] Mao Li, Bo Yang, Joshua Levy, Andreas Stolcke, Viktor Rozgic, Spyros Matsoukas, Constantinos Papayiannis, Daniel Bone, and Chao Wang. “Contrastive Unsupervised Learning for Speech Emotion Recognition”. In: arXiv preprint arXiv:2102.06357 (2021). 136 [95] Xiang Lisa Li and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021, pp. 4582–4597. [96] Zheng Lian, Ya Li, Jianhua Tao, and Jian Huang. “Speech emotion recognition via contrastive loss under siamese networks”. In: Proceedings of the Joint Workshop of the 4th Workshop on Affective Social Multimedia Computing and First Multi-Modal Affective Computing of Large-Scale Multimedia Data. 2018, pp. 21–26. 
[97] Wei-Cheng Lin and Carlos Busso. “Chunk-level speech emotion recognition: A general framework of sequence-to-one dynamic temporal modeling”. In: IEEE Transactions on Affective Computing (2021). [98] Dieu Linh Tran, Robert Walecki, Stefanos Eleftheriadis, Bjorn Schuller, Maja Pantic, et al. “Deepcoder: Semi-parametric variational autoencoders for automatic facial action coding”. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3190–3199. [99] Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, and Yuexian Zou. “Rethinking skip connection with layer normalization”. In: Proceedings of the 28th International Conference on Computational Linguistics. 2020, pp. 3586–3598. [100] Kuan Liu, Yanen Li, Ning Xu, and Prem Natarajan. “Learn to Combine Modalities in Multimodal Deep Learning”. In: arXiv preprint arXiv: Arxiv-1805.11730 (2018). [101] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: arXiv preprint arXiv: Arxiv-1907.11692 (2019). [102] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. “Swin transformer: Hierarchical vision transformer using shifted windows”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 10012–10022. [103] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. “Deep Learning Face Attributes in the Wild”. In: Proceedings of International Conference on Computer Vision (ICCV). Dec. 2015. [104] Leili Tavabi Liupei Lu and Mohammad Soleymani. “Self-Supervised Learning for Facial Action Unit Recognition through Temporal Consistency”. In: British Machine Vision Conference (BMVC). 2020. [105] Steven R Livingstone and Frank A Russo. “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English”. In: PloS one 13.5 (2018), e0196391. [106] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. “Learning transferable features with deep adaptation networks”. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37. JMLR. org. 2015, pp. 97–105. 137 [107] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: International Conference on Learning Representations (ICLR). New Orleans, LA, USA: OpenReview, 2019. [108] Reza Lotfian and Carlos Busso. “Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings”. In: IEEE Transactions on Affective Computing 10.4 (2017), pp. 471–483. [109] Cheng Luo, Siyang Song, Weicheng Xie, Linlin Shen, and Hatice Gunes. “Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition”. In: arXiv preprint arXiv:2205.01782 (2022). [110] Brais Martinez, Michel F Valstar, Bihan Jiang, and Maja Pantic. “Automatic analysis of facial actions: A survey”. In: IEEE transactions on affective computing (2017). [111] S Mohammad Mavadati, Mohammad H Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F Cohn. “Disfa: A spontaneous facial action intensity database”. In: IEEE Transactions on Affective Computing 4.2 (2013), pp. 151–160. [112] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder. “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent”. 
In: IEEE transactions on affective computing 3.1 (2011), pp. 5–17.
[113] Gary McKeown, Michel F Valstar, Roderick Cowie, and Maja Pantic. “The SEMAINE corpus of emotionally coloured character interactions”. In: 2010 IEEE international conference on multimedia and expo. IEEE. 2010, pp. 1079–1084.
[114] Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. “Semeval-2018 task 1: Affect in tweets”. In: Proceedings of the 12th international workshop on semantic evaluation. 2018, pp. 1–17.
[115] Joann M Montepare and Heidi Dobish. “Younger and older adults’ beliefs about the experience and expression of emotions across the life span”. In: Journals of Gerontology Series B: Psychological Sciences and Social Sciences 69.6 (2014), pp. 892–896.
[116] Emilie Morvant, Amaury Habrard, and Stéphane Ayache. “Majority vote of diverse classifiers for late fusion”. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer. 2014, pp. 153–162.
[117] Jonathan Munro and Dima Damen. “Multi-modal domain adaptation for fine-grained action recognition”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 122–132.
[118] Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. “Attention bottlenecks for multimodal fusion”. In: Advances in Neural Information Processing Systems 34 (2021).
[119] Michael Neumann and Ngoc Thang Vu. “Improving speech emotion recognition with unsupervised representation learning on unlabeled speech”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 7390–7394.
[120] Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space”. In: IEEE Transactions on Affective Computing 2.2 (2011), pp. 92–105.
[121] Ioanna Ntinou, Enrique Sanchez, Adrian Bulat, Michel Valstar, and Yorgos Tzimiropoulos. “A transfer learning approach to heatmap regression for action unit intensity estimation”. In: IEEE Transactions on Affective Computing (2021).
[122] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. “Representation learning with contrastive predictive coding”. In: arXiv preprint arXiv:1807.03748 (2018).
[123] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. “Automatic differentiation in PyTorch”. In: NeurIPS Autodiff Workshop. 2017.
[124] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. “Multi-adversarial domain adaptation”. In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[125] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. “Film: Visual reasoning with a general conditioning layer”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.
[126] Rosalind W Picard. Affective computing. MIT press, 2000.
[127] Ronald Poppe. “A survey on vision-based human action recognition”. In: Image and vision computing 28.6 (2010), pp. 976–990.
[128] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. “MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. 2019.
[129] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. “Spatiotemporal contrastive video representation learning”. In: arXiv preprint arXiv:2008.03800 (2020).
[130] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. “Robust speech recognition via large-scale weak supervision”. In: International Conference on Machine Learning. PMLR. 2023, pp. 28492–28518.
[131] Wasifur Rahman, Md. Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. “Integrating Multimodal Information in Large Pretrained Transformers”. In: arXiv preprint arXiv:1908.05787 (2019).
[132] Martina Rescigno, Matteo Spezialetti, and Silvia Rossi. “Personalized models for facial emotion recognition through transfer learning”. In: Multimedia Tools and Applications 79.47 (2020), pp. 35811–35828.
[133] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. “Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2021.
[134] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, Heysem Kaya, Maximilian Schmitt, Shahin Amiriparian, Nicholas Cummins, Denis Lalanne, Adrien Michaud, et al. “AVEC 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition”. In: Proceedings of the 2018 on audio/visual emotion challenge and workshop. 2018, pp. 3–13.
[135] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, and Maja Pantic. “AVEC 2015: The 5th international audio/visual emotion challenge and workshop”. In: Proceedings of the 23rd ACM international conference on Multimedia. 2015, pp. 1335–1336.
[136] Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, Leili Tavabi, Maximilian Schmitt, Sina Alisamir, Shahin Amiriparian, Eva-Maria Messner, et al. “AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition”. In: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. 2019, pp. 3–12.
[137] Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. “Avec 2017: Real-life depression, and affect recognition workshop and challenge”. In: Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge. 2017, pp. 3–9.
[138] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne. “Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions”. In: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG). IEEE. 2013, pp. 1–8.
[139] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. “Beyond sharing weights for deep domain adaptation”. In: IEEE transactions on pattern analysis and machine intelligence 41.4 (2018), pp. 801–814.
[140] Magdalena Rychlowska, Rachael E Jack, Oliver GB Garrod, Philippe G Schyns, Jared D Martin, and Paula M Niedenthal. “Functional smiles: Tools for love, sympathy, and war”. In: Psychological science 28.9 (2017), pp. 1259–1270.
[141] Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. “300 faces in-the-wild challenge: The first facial landmark localization challenge”. In: Proceedings of the IEEE international conference on computer vision workshops. 2013, pp. 397–403.
[142] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. “Maximum classifier discrepancy for unsupervised domain adaptation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3723–3732.
[143] Enrique Sanchez, Mani Kumar Tellamekala, Michel Valstar, and Georgios Tzimiropoulos. “Affective processes: stochastic modelling of temporal context for emotion and facial expression recognition”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 9074–9084.
[144] Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. “CARER: Contextualized affect representations for emotion recognition”. In: Proceedings of the 2018 conference on empirical methods in natural language processing. 2018, pp. 3687–3697.
[145] Björn Schuller, Michel Valster, Florian Eyben, Roddy Cowie, and Maja Pantic. “AVEC 2012: the continuous audio/visual emotion challenge”. In: Proceedings of the 14th ACM international conference on Multimodal interaction. 2012, pp. 449–456.
[146] Mostafa Shahabinejad, Yang Wang, Yuanhao Yu, Jin Tang, and Jiani Li. “Toward personalized emotion recognition: A face recognition based attention method for facial emotion recognition”. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). IEEE. 2021, pp. 1–5.
[147] Amir Shahroudy, Tian-Tsong Ng, Yihong Gong, and Gang Wang. “Deep multimodal feature analysis for action recognition in rgb+d videos”. In: IEEE transactions on pattern analysis and machine intelligence 40.5 (2017), pp. 1045–1058.
[148] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. “JAA-Net: Joint facial action unit detection and face alignment via adaptive attention”. In: International Journal of Computer Vision 129.2 (2021), pp. 321–340.
[149] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. “Wasserstein distance guided representation learning for domain adaptation”. In: Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). 2018.
[150] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. “Black holes and white rabbits: Metaphor identification with visual features”. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2016, pp. 160–170.
[151] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. “Animating arbitrary objects via deep motion transfer”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2377–2386.
[152] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. “First Order Motion Model for Image Animation”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc., 2019. url: https://proceedings.neurips.cc/paper/2019/file/31c0b36aef265d9221af80872ceb62f9-Paper.pdf.
[153] Maxim Sidorov, Alexander Schmitt, Eugene Semenkin, and Wolfgang Minker. “Could speaker, gender or age awareness be beneficial in speech-based emotion recognition?” In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). 2016, pp. 61–68.
[154] Karen Simonyan and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos”. In: Advances in neural information processing systems 27 (2014).
[155] Konstantinos Skianis, Giannis Nikolentzos, Stratis Limnios, and Michalis Vazirgiannis. “Rep the set: Neural networks for learning set representations”. In: International conference on artificial intelligence and statistics. PMLR. 2020, pp. 1410–1420.
[156] Mohammad Soleymani, David Garcia, Brendan Jou, Björn Schuller, Shih-Fu Chang, and Maja Pantic. “A survey of multimodal sentiment analysis”. In: Image and Vision Computing 65 (2017), pp. 3–14.
[157] Mohammad Soleymani, Kalin Stefanov, Sin-Hwa Kang, Jan Ondras, and Jonathan Gratch. “Multimodal Analysis and Estimation of Intimate Self-Disclosure”. In: 2019 International Conference on Multimodal Interaction. 2019, pp. 59–68.
[158] Tengfei Song, Lisha Chen, Wenming Zheng, and Qiang Ji. “Uncertain graph neural networks for facial action unit detection”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 7. 2021, pp. 5993–6001.
[159] Tengfei Song, Zijun Cui, Wenming Zheng, and Qiang Ji. “Hybrid message passing with performance-driven structures for facial action unit detection”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 6267–6276.
[160] Kusha Sridhar and Carlos Busso. “Unsupervised personalization of an emotion recognition system: The unique properties of the externalization of valence in speech”. In: IEEE Transactions on Affective Computing 13.4 (2022), pp. 1959–1972.
[161] Sundararajan Srinivasan, Zhaocheng Huang, and Katrin Kirchhoff. “Representation learning through cross-modal conditional teacher-student training for speech emotion recognition”. In: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2022, pp. 6442–6446.
[162] Ramanathan Subramanian, Julia Wache, Mojtaba Khomami Abadi, Radu L Vieriu, Stefan Winkler, and Nicu Sebe. “ASCERTAIN: Emotion and personality recognition using commercial sensors”. In: IEEE Transactions on Affective Computing 9.2 (2016), pp. 147–160.
[163] Antje von Suchodoletz and Robert Hepach. “Cultural values shape the expression of self-evaluative social emotions”. In: Scientific Reports 11.1 (2021), pp. 1–14.
[164] Yang Tang, Wangding Zeng, Dafei Zhao, and Honggang Zhang. “PIAP-DF: Pixel-Interested and Anti Person-Specific Facial Action Unit Detection Net with Discrete Feedback Learning”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 12899–12908.
[165] Li Tao, Xueting Wang, and Toshihiko Yamasaki. “Self-supervised video representation learning using inter-intra contrastive framework”. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 2193–2201.
[166] Antoine Toisoul, Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. “Estimation of continuous valence and arousal levels from faces in naturalistic conditions”. In: Nature Machine Intelligence 3.1 (2021), pp. 42–50.
[167] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. “Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training”. In: Advances in neural information processing systems 35 (2022), pp. 10078–10093.
[168] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. “A closer look at spatiotemporal convolutions for action recognition”. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2018, pp. 6450–6459.
[169] Minh Tran, Yufeng Yin, and Mohammad Soleymani. “Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition”. In: Proc. INTERSPEECH 2023 (2023).
[170] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou. “Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network”. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2016, pp. 5200–5204.
[171] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. “Multimodal transformer for unaligned multimodal language sequences”. In: Proceedings of the conference. Association for computational linguistics. Meeting. Vol. 2019. NIH Public Access. 2019, p. 6558.
[172] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. “Multimodal Transformer for Unaligned Multimodal Language Sequences”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 6558–6569. doi: 10.18653/v1/P19-1656.
[173] Cheng-Hao Tu, Chih-Yuan Yang, and Jane Yung-jen Hsu. “Idennet: Identity-aware facial action unit detection”. In: 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE. 2019, pp. 1–8.
[174] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. “Adversarial discriminative domain adaptation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 7167–7176.
[175] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. “Deep domain confusion: Maximizing for domain invariance”. In: arXiv preprint arXiv:1412.3474 (2014).
[176] Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou. “End-to-end multimodal emotion recognition using deep neural networks”. In: IEEE Journal of Selected Topics in Signal Processing 11.8 (2017), pp. 1301–1309.
[177] Panagiotis Tzirakis, George Trigeorgis, Mihalis A Nicolaou, Björn W Schuller, and Stefanos Zafeiriou. “End-to-end multimodal emotion recognition using deep neural networks”. In: IEEE Journal of Selected Topics in Signal Processing 11.8 (2017), pp. 1301–1309.
[178] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Instance normalization: The missing ingredient for fast stylization”. In: arXiv preprint arXiv:1607.08022 (2016).
[179] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. “Avec 2016: Depression, mood, and emotion recognition workshop and challenge”. In: Proceedings of the 6th international workshop on audio/visual emotion challenge. 2016, pp. 3–10.
[180] Michel Valstar, Björn Schuller, Kirsty Smith, Timur Almaev, Florian Eyben, Jarek Krajewski, Roddy Cowie, and Maja Pantic. “Avec 2014: 3d dimensional affect and depression recognition challenge”. In: Proceedings of the 4th international workshop on audio/visual emotion challenge. 2014, pp. 3–10.
[181] Michel Valstar, Björn Schuller, Kirsty Smith, Florian Eyben, Bihan Jiang, Sanjay Bilakhia, Sebastian Schnieder, Roddy Cowie, and Maja Pantic. “AVEC 2013: the continuous audio/visual emotion and depression recognition challenge”. In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge. 2013, pp. 3–10.
[182] Aaron Van Den Oord, Oriol Vinyals, et al. “Neural discrete representation learning”. In: Advances in neural information processing systems 30 (2017).
[183] Laurens Van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE.” In: Journal of machine learning research 9.11 (2008).
[184] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information processing systems 30 (2017).
[185] Andrea Vidal, Ali Salman, Wei-Cheng Lin, and Carlos Busso. “MSP-face corpus: a natural audiovisual emotional database”. In: Proceedings of the 2020 international conference on multimodal interaction. 2020, pp. 397–405.
[186] Nikolaos Vryzas, Lazaros Vrysis, Rigas Kotsakis, and Charalampos Dimoulas. “Speech emotion recognition adapted to multimodal semantic repositories”. In: 2018 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP). IEEE. 2018, pp. 31–35.
[187] Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, and Björn W Schuller. “Dawn of the transformer era in speech emotion recognition: closing the valence gap”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[188] Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Florian Eyben, and Björn W Schuller. “Dawn of the transformer era in speech emotion recognition: closing the valence gap”. In: arXiv preprint arXiv:2203.07378 (2022).
[189] Robert Walecki, Vladimir Pavlovic, Björn Schuller, Maja Pantic, et al. “Deep structured learning for facial action unit intensity estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3405–3414.
[190] Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, and Yu Qiao. “Suppressing uncertainties for large-scale facial expression recognition”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 6897–6906.
[191] Mei Wang and Weihong Deng. “Deep visual domain adaptation: A survey”. In: Neurocomputing 312 (2018), pp. 135–153.
[192] Weiyao Wang, Du Tran, and Matt Feiszli. “What makes training multi-modal classification networks hard?” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 12695–12705.
[193] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. “Huggingface’s transformers: State-of-the-art natural language processing”. In: arXiv preprint arXiv:1910.03771 (2019).
[194] Weihao Xia, Yulun Zhang, Yujiu Yang, Jing-Hao Xue, Bolei Zhou, and Ming-Hsuan Yang. “Gan inversion: A survey”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
[195] Yinghao Xu, Yujun Shen, Jiapeng Zhu, Ceyuan Yang, and Bolei Zhou. “Generative hierarchical features from synthesizing images”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 4432–4442.
[196] Zixiaofan Yang and Julia Hirschberg. “Predicting Arousal and Valence from Waveforms and Spectrograms Using Deep Neural Networks.” In: INTERSPEECH. 2018, pp. 3092–3096.
[197] Yufeng Yin, Baiyu Huang, Yizhen Wu, and Mohammad Soleymani. “Speaker-invariant adversarial domain adaptation for emotion recognition”. In: Proceedings of the 2020 International Conference on Multimodal Interaction. 2020, pp. 481–490.
[198] Yufeng Yin, Liupei Lu, Yizhen Wu, and Mohammad Soleymani. “Self-Supervised Patch Localization for Cross-Domain Facial Action Unit Detection”. In: 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021). IEEE. 2021, pp. 1–8.
[199] Seunghyun Yoon, Seokhyun Byun, Subhadeep Dey, and Kyomin Jung. “Speech emotion recognition using multi-hop attention mechanism”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 2822–2826.
[200] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. “Tensor Fusion Network for Multimodal Sentiment Analysis”. In: arXiv preprint arXiv:1707.07250 (2017).
[201] AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. “Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2236–2246.
[202] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. “Deep sets”. In: Advances in neural information processing systems 30 (2017).
[203] Gloria Zen, Enver Sangineto, Elisa Ricci, and Nicu Sebe. “Unsupervised domain adaptation for personalized facial emotion recognition”. In: Proceedings of the 16th international conference on multimodal interaction. 2014, pp. 128–135.
[204] Shiqing Zhang, Shiliang Zhang, Tiejun Huang, Wen Gao, and Qi Tian. “Learning affective features with a hybrid deep model for audio–visual emotion recognition”. In: IEEE Transactions on Circuits and Systems for Video Technology 28.10 (2017), pp. 3030–3043.
[205] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard. “Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database”. In: Image and Vision Computing 32.10 (2014), pp. 692–706.
[206] Yong Zhang, Weiming Dong, Bao-Gang Hu, and Qiang Ji. “Weakly-supervised deep convolutional neural network learning for facial action unit intensity estimation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 2314–2323.
[207] Yong Zhang, Baoyuan Wu, Weiming Dong, Zhifeng Li, Wei Liu, Bao-Gang Hu, and Qiang Ji. “Joint representation and estimator learning for facial action unit intensity estimation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 3457–3466.
[208] Yuxuan Zhang, Huan Ling, Jun Gao, Kangxue Yin, Jean-Francois Lafleche, Adela Barriuso, Antonio Torralba, and Sanja Fidler. “Datasetgan: Efficient labeled data factory with minimal human effort”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 10145–10155.
[209] Zheng Zhang, Taoyue Wang, and Lijun Yin. “Region of interest based graph convolution: A heatmap regression approach for action unit detection”. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 2890–2898.
[210] Jianfeng Zhao, Xia Mao, and Lijiang Chen. “Speech emotion recognition using deep 1D & 2D CNN LSTM networks”. In: Biomedical Signal Processing and Control 47 (2019), pp. 312–323.
[211] Jinming Zhao, Ruichen Li, Jingjun Liang, Shizhe Chen, and Qin Jin. “Adversarial Domain Adaption for Multi-Cultural Dimensional Emotion Recognition in Dyadic Interactions”. In: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop. 2019, pp. 37–45.
[212] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. “Deep region and multi-label learning for facial action unit detection”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 3391–3399.
[213] Sicheng Zhao, Guiguang Ding, Jungong Han, and Yue Gao. “Personality-Aware Personalized Emotion Recognition from Physiological Signals.” In: IJCAI. 2018, pp. 1660–1667.
[214] Sicheng Zhao, Amir Gholaminejad, Guiguang Ding, Yue Gao, Jungong Han, and Kurt Keutzer. “Personalized emotion recognition by personality-aware high-order learning of physiological signals”. In: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15.1s (2019), pp. 1–18.
[215] Sicheng Zhao, Xin Zhao, Guiguang Ding, and Kurt Keutzer. “EmotionGAN: Unsupervised domain adaptation for learning discrete probability distributions of image emotions”. In: Proceedings of the 26th ACM international conference on Multimedia. 2018, pp. 1319–1327.
[216] Jiahao Zheng, Sen Zhang, Zilu Wang, Xiaoping Wang, and Zhigang Zeng. “Multi-channel Weight-sharing Autoencoder Based on Cascade Multi-head Attention for Multimodal Emotion Recognition”. In: IEEE Transactions on Multimedia (2022), pp. 1–1. doi: 10.1109/TMM.2022.3144885.
Abstract
Emotions play a significant role in human creativity and intelligence, including rational decision-making. Emotion recognition is the process of identifying human emotions, e.g., happiness, sadness, anger, and a neutral state. External manifestations of emotions can be recognized by tracking emotional expressions. Recent deep learning approaches have shown promising performance for emotion recognition. However, the performance of automatic recognition methods degrades when models are evaluated across datasets or subjects, due to variations among people, e.g., culture and gender, and environmental factors, e.g., camera and background. Expression and emotion annotation requires laborious manual coding by trained annotators and is both time-consuming and expensive. Thus, the manual labels required by supervised learning methods present significant practical limitations when working with new datasets or individuals.
Throughout this thesis, we investigate methodologies aimed at enhancing the generalization capabilities of perception models with minimal human effort. Our investigation encompasses unsupervised domain adaptation, multimodal learning, generative model feature extraction, and unsupervised personalization, all of which improve the adaptability of recognition models to unseen datasets or subjects for which labels are unavailable. This thesis makes four major contributions. First, we advance the understanding of unsupervised domain adaptation by proposing approaches that learn domain-invariant and discriminative features without relying on target labels, laying a foundation for improving model generalization across diverse datasets. Second, we introduce a novel architecture for bimodal fusion that extracts meaningful representations for multimodal emotion and action recognition; notably, the approach is agnostic to specific tasks and backbone architectures, underscoring its versatility and lightweight nature. Third, our exploration of generative model feature extraction yields significant advances in data efficiency for facial action unit detection: by extracting generalizable, semantically rich features, our method achieves competitive performance with only a few training samples, demonstrating strong potential for expression recognition in real-life scenarios. Finally, we tackle unsupervised personalization for emotion recognition on unseen speakers; by leveraging domain-adapted pre-training and learnable speaker embeddings, coupled with cross-modal labeling to construct a large-scale weakly labeled emotion dataset, our work facilitates the development of personalized emotion recognition systems at scale.
Overall, these contributions not only advance the scientific understanding of perception models and representation learning but also offer practical implications for real-world applications such as affective computing and personalized systems. By addressing key challenges in model generalization, multimodal learning, data efficiency, and personalized adaptation, our work significantly pushes the boundaries of the field, paving the way for more robust and adaptable perception models in diverse contexts.
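To make the first of these contributions more concrete, the following is a minimal, hypothetical PyTorch sketch of domain-adversarial feature learning with a gradient reversal layer, the general family of techniques behind unsupervised domain adaptation discussed above; the encoder, head sizes, class count, and training loop are illustrative assumptions and do not reproduce the exact models or training recipes developed in this thesis.

# Hypothetical sketch: adversarial domain adaptation via gradient reversal.
# Not the thesis's exact architecture; dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; reverses (and scales) gradients in the backward pass.
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

# Assumed dimensions: 128-d input features, 64-d shared embedding, 4 emotion classes.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())   # shared feature encoder
emotion_head = nn.Linear(64, 4)                          # emotion classifier (source labels only)
domain_head = nn.Linear(64, 2)                           # domain classifier (source vs. target)

params = list(encoder.parameters()) + list(emotion_head.parameters()) + list(domain_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
ce = nn.CrossEntropyLoss()

def train_step(x_src, y_src, x_tgt, lambd=1.0):
    # Emotion loss is computed on labeled source data only.
    feat_src, feat_tgt = encoder(x_src), encoder(x_tgt)
    cls_loss = ce(emotion_head(feat_src), y_src)

    # Adversarial domain loss uses both domains; the target batch needs no emotion labels.
    feats = torch.cat([feat_src, feat_tgt], dim=0)
    domains = torch.cat([torch.zeros(len(x_src)), torch.ones(len(x_tgt))]).long()
    dom_loss = ce(domain_head(GradReverse.apply(feats, lambd)), domains)

    loss = cls_loss + dom_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random tensors standing in for source/target feature batches.
x_src, y_src = torch.randn(8, 128), torch.randint(0, 4, (8,))
x_tgt = torch.randn(8, 128)
print(train_step(x_src, y_src, x_tgt))

In this setup, the domain classifier learns to distinguish source from target features while the reversed gradients push the shared encoder toward domain-invariant yet discriminative representations, which is the general idea summarized in the abstract.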
Asset Metadata
Creator
Yin, Yufeng (author)
Core Title
Towards generalizable expression and emotion recognition
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
05/21/2024
Defense Date
05/06/2024
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
emotion recognition, expression recognition, generalization, machine learning, multimodal learning, OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Soleymani, Mohammad (committee chair), Mataric, Maja (committee member), Miller, Lynn (committee member)
Creator Email
yinyf1996@gmail.com,yufengy@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113950088
Unique identifier
UC113950088
Identifier
etd-YinYufeng-12974.pdf (filename)
Legacy Identifier
etd-YinYufeng-12974
Document Type
Dissertation
Rights
Yin, Yufeng
Internet Media Type
application/pdf
Type
texts
Source
20240521-usctheses-batch-1156
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu