MULTIMODAL REPRESENTATION LEARNING OF AFFECTIVE BEHAVIOR

by Sayan Ghosh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2018
Copyright 2018 Sayan Ghosh

Abstract

With the ever-increasing abundance of multimedia data available on the Internet and in crowd-sourced datasets and repositories, there has been renewed interest in machine learning approaches for solving real-life perception problems. Current research has achieved state-of-the-art performance on tasks such as image understanding and speech recognition. However, such techniques have only recently made inroads into research problems relevant to the study of human emotion and behavior understanding. The primary research challenges addressed by this dissertation are threefold. Firstly, it is an open problem to build representation learning systems for emotion recognition which provide performance comparable to or better than existing models trained with well-engineered, domain-knowledge-based features. Secondly, there has been limited investigation into applying deep neural networks to the task of building good multimodal representations of data. Such multimodal representations should not only improve performance on downstream tasks such as classification and retrieval, but also model the importance of each modality for the desired task. Thirdly, the effective integration of cues from the visual, acoustic and textual modalities still poses a challenge. In particular, the fusion of verbal information with affective and non-verbal cues from other modalities for applications such as language modeling and emotional text generation has not been studied.

In this dissertation the above research challenges are addressed in three areas. (1) Unimodal Representation Learning: In the visual modality, a novel multi-label CNN (Convolutional Neural Network) is proposed for learning AU (Action Unit) occurrences in facial images. The multi-label CNN learns a joint representation for AU occurrences, obtaining competitive detection results, and is also robust across different datasets. For the acoustic modality, denoising autoencoders and RNNs (Recurrent Neural Networks) are trained on temporal frames from speech spectrograms, and it is observed that representation learning from the glottal flow signal (the component of the speech signal with vocal tract influence removed) is effective for speech emotion recognition. An unsupervised, data-driven approach to estimation of the glottal flow signal is also proposed. (2) Multimodal Representation Learning: An importance-based multimodal autoencoder (IMA) model is introduced which can learn joint multimodal representations as well as importance weights for each modality. The IMA model achieves performance improvements relative to baseline approaches on the tasks of digit recognition and emotion understanding from spoken utterances. (3) Non-verbal and Affective Language Models: This dissertation studies deep multimodal fusion in the context of neural language modeling by introducing two novel approaches, Affect-LM and Speech-LM. These models obtain perplexity reductions over a baseline language model by integrating verbal affective and non-verbal acoustic cues with the linguistic context for predicting the next word. Affect-LM also generates text in different emotions at various levels of intensity.
The generated sentences are emotionally expressive while maintaining grammatical correctness, as evaluated through a crowdsourced perception study. To summarize, this dissertation proposes several novel representation-learning-based approaches to address problems in the emerging area of automated human behavior and affect understanding.

Acknowledgments

I was in a quandary when trying to write an acknowledgement section for this dissertation. I wanted to express my gratitude to all the individuals who had helped me, directly or indirectly, during these five years of PhD life. However, I lack enough patience these days to write a lengthy and protracted statement of thanks which would span tens of pages. The last time I wrote something like this was six years back, and it seems so different when I read it now. But some things will stay the same, such as awesome advisors, and supportive friends, assistants and family.

For a faculty member, having thousands of citations, bringing in millions of dollars in NSF or NIH grant money, teaching lots of courses or being an internationally recognized expert in an academic field are no doubt enviable accomplishments. However, they all pale compared to the more desirable quality of being a supportive mentor to a graduate student and being highly invested in his career endeavours and professional success. In a world where too much importance is given to objective metrics and not enough to the basic human virtue of empathy, I am extremely fortunate to have had Stefan and LP as my advisors. They've been very supportive of whatever decisions I have made, including those which have not followed a consistent academic agenda. I thank Stefan for encouraging me to follow my heart all throughout the program. It was LP who taught me the nuances of technical writing. I look forward to staying associated with them in the future. Special thanks to my qualifier and defense committee members, Prof. Kevin Knight, Prof. Panayiotis Georgiou and Prof. Aram Galstyan. They've provided useful inputs to strengthen my research and dissertation.

Sometimes graduate students turn out to be endowed with superhuman abilities and single-handedly finish their thesis work in a couple of years. Sometimes they have sidekicks who do grunt work for them and help them graduate sooner. Eugene helped me with feature extraction and preprocessing before I was on them. He's probably one of the biggest reasons I'm graduating in five years instead of eight. He's also been a good friend and occasional accomplice. Though in fairness, I feel I showed him a bit too much that grad school is not exactly a bed of roses. I hope that's not going to put him off from doing a PhD in the future. Also a big thanks to past lab members - Derya, Sunghyun, Moitreya, Lixing, Mathieu, as well as anyone else whose name I missed out. Alesia was very efficient and thoughtful in allocating ICT resources, promptly handling the administrative side of projects and grants. Ramesh was a wonderful friend, and I also enjoyed interacting with Eli, Zahra, Elnaz, Himanshu, Rens, Melissa and Ali.

One of the important issues I tried to deal with was the constant criticism from industrial circles that PhD students were overqualified, that they could not code their way out of a paper bag, and that they spent too much time engrossed in their specific research projects to ship anything meaningful.
I did a bunch of internships throughout my PhD, not only to gain valuable practical experience integral to my graduate studies in computer science (the coding was a part of it too), but also to find out how industry was different from academia. Now I'm convinced that there is no single solution to this rift. I'm happy that tech companies are hiring more and more PhDs, and finally appreciating the expertise and deal-with-uncertainty skills new graduates bring. I'm also happy that more grad students are exploring their options and moving to industry, and realizing that publishing is just one of many ways to have impact and contribute to society. Shiv Vitaladevuni, Jiunwei Chen and Matt Hearst were my managers at Amazon and Microsoft when I interned at those companies, and they were all awesome. Andy, Paul, Colin, Alex and Chao were my internship mentors. The good advice and help I got from my managers and mentors helped me learn several useful things such as scrum meetings, minimum viable products, all-hands, working with product managers, big data query languages and, best of all, how to avoid talking about work during lunch.

I should also thank my friends at USC, particularly the folks who graduated with Master's degrees and helped me immensely with homeworks and job interview preparation. A shout-out to Vineeth, Arul, Pooja, Ankit, Vishal, Harsha, Pranav and Pankaj for being great classmates and friends. I wish them the best for their careers. I'm also grateful to all the faculty who taught my courses. Part of my academic progress here was not only about my thesis, but also coursework which gave me a joy in rediscovering computer science. I earned a teeny-weeny bit of experience in distributed systems due to Prof. Clifford Neuman and Prof. Wyatt Lloyd. Their courses were absolutely fantastic.

Thanks to my parents for their unyielding support for my efforts, particularly with me staying away from home for years and not being able to attend to their needs in person. I'm also grateful to Gubitokaka for helping me settle in LA on my very first day and for stepping into the shoes of a family friend so far away from home. Thanks to my fiancée for being patient during the last year when I was trying to finish up. It's truly amazing to stay in a country where customs, history and people from the world over converge. Decisions may take longer to make, but the additional perspective gained from such an experience is invaluable.

Contents

Abstract  ii
Acknowledgments  iv
List of Tables  xi
List of Figures  xv

1 Introduction  1
  1.1 Modeling Human Affect and Behaviour  3
  1.2 Current Research Challenges  4
  1.3 Major Contributions  10
  1.4 Outline  13

2 Related Work  16
  2.1 Unimodal Representation Learning  16
    2.1.1 Brief Introduction to Representation Learning  16
    2.1.2 Emotion Recognition From Facial Images  18
    2.1.3 Emotion Recognition from Speech  21
  2.2 Multimodal Representation Learning  28
  2.3 Neural Language Models and Affective Text Generation  32

3 Facial Expression Recognition  37
  3.1 Introduction  37
  3.2 Multi-label CNN Model  38
  3.3 Experimental Methodology  41
    3.3.1 Datasets  41
    3.3.2 Dataset Preprocessing  42
    3.3.3 Train/Validation/Test Splits  43
    3.3.4 Hyper-parameter Validation  44
    3.3.5 Performance Metrics  45
    3.3.6 Baseline Approaches  45
  3.4 Experimental Results  47
    3.4.1 Visualization of Joint AU Representations  47
    3.4.2 Same-dataset Experiments  47
    3.4.3 Cross-dataset Experiments  49
    3.4.4 Correlations between Action Units  51
  3.5 Conclusions  52

4 Emotion Recognition from Speech  54
  4.1 Introduction  54
  4.2 Models for Representation Learning  56
    4.2.1 Stacked Denoising Autoencoders  56
    4.2.2 Deep Autoencoders  57
    4.2.3 Recurrent Autoencoder  57
    4.2.4 Pre-trained MLP (Multi-layer Perceptron) Models  59
    4.2.5 BLSTM (Bidirectional LSTM)-RNNs  60
    4.2.6 Dataset Preprocessing  61
    4.2.7 Spectrogram and Feature Extraction  61
    4.2.8 Temporal Context Windows  62
    4.2.9 Glottal Flow Extraction  62
    4.2.10 Autoencoder Training  63
    4.2.11 Emotion Classifier Training  65
    4.2.12 Visualization of Learned Representations  66
    4.2.13 Emotion Classification Results  69
  4.3 Unsupervised Glottal Inverse Filtering  71
    4.3.1 Model Description  72
    4.3.2 Datasets and Experimental Setup  74
    4.3.3 Experimental Results  76
  4.4 Conclusions  77

5 Importance-based Multimodal Autoencoder  79
  5.1 Introduction  79
  5.2 Description of IMA Model  82
    5.2.1 Multimodal Autoencoder with Alignment  82
    5.2.2 Importance Network Training  84
    5.2.3 Practical Considerations  86
  5.3 Methodology  87
    5.3.1 Autoencoder Training  87
    5.3.2 Datasets  87
    5.3.3 Digit Datasets Preprocessing  88
    5.3.4 Processing of MNIST Digits  89
    5.3.5 Processing of TIDIGITS/TI46 Spoken Digits  89
    5.3.6 Multimodal Pairing in MNIST-TIDIGITS Dataset  90
    5.3.7 IEMOCAP Dataset Processing  91
    5.3.8 Non-verbal Feature Extraction (IEMOCAP)  92
    5.3.9 Multimodal Pairing in IEMOCAP Dataset  93
  5.4 Experimental Setup  93
    5.4.1 Loss Functions and Hyper-parameter Tuning  93
    5.4.2 Competing Baseline Models  94
    5.4.3 Evaluation of Importance Network Performance  95
    5.4.4 Retrieval Experiments  96
    5.4.5 Downstream Classification Tasks  97
  5.5 Experimental Results  98
    5.5.1 Visualization of Learned Representations  98
    5.5.2 Precision and Recall for Importance Networks  100
    5.5.3 Digit Retrieval Results  101
    5.5.4 Downstream Digit Classification Performance  103
  5.6 Experimental Results on IEMOCAP  105
    5.6.1 Visualization of Learnt Embeddings  106
    5.6.2 Importance Network Performance  108
    5.6.3 Emotion Retrieval Results  109
  5.7 Conclusions  111

6 Language Modeling with Affective and Non-verbal Cues  112
  6.1 Introduction  112
  6.2 Research Questions  114
  6.3 Baseline Language Model  115
  6.4 Corpora for Language Model Training  116
  6.5 Affect-LM Language Model  118
    6.5.1 Model Formulation  119
    6.5.2 Descriptors for Affect Category Information  119
    6.5.3 Affect-LM Network Architecture  120
    6.5.4 Affect-LM Perplexity Evaluation Setup  121
    6.5.5 Affect-LM for Sentence Generation  121
    6.5.6 Sentence Generation Perception Study  122
    6.5.7 Perplexity Results  123
    6.5.8 Text Generation Results  124
    6.5.9 MTurk Perception Study Results  125
    6.5.10 Affective Word Representations  129
  6.6 Speech-LM  129
    6.6.1 Model Formulation  130
    6.6.2 Processing of the Fisher Corpus  131
    6.6.3 Baseline Models for Comparison  132
    6.6.4 Speech-LM Training Methodology  132
    6.6.5 Perplexity Results and Distributions  134
    6.6.6 Word Representations Learnt by Speech-LM Model  135
    6.6.7 Text Generation Results  137
  6.7 Conclusions  138

7 Conclusions and Future Work  139
  7.1 Main Contributions  140
  7.2 Future Work  143

Reference List  148

A Appendix 1: Probabilistic Framework for Importance-based Multimodal Autoencoder Model (IMA)  165
  A.0.1 Problem Statement  166
  A.0.2 Multimodal Variational Alignment and Fusion  166
List of Tables

3.1 Leave-one-out classification performance of the multi-label CNN on the DISFA dataset (12 Action Units). The results indicate comparable performance to the baseline approaches in Song et al. (2015) and Jiang et al. (2014). Note that the baseline approaches have not reported the same metrics on DISFA, thus necessitating comparison with both approaches.  48

3.2 Leave-one-out classification performance of the multi-label CNN on the BP4D dataset (10 Action Units). The multi-label CNN outperforms the baseline metrics reported for the FERA 2015 facial AU occurrence challenge.  49

3.3 Cross-dataset generalization performance of the multi-label CNN. Both accuracy and 2AFC metrics are reported. The network trained on DISFA generalizes well to BP4D, where a relative accuracy improvement of 1.54% is obtained over the BP4D-only baseline network, with a slight decrease of 0.02 in the 2AFC score. The performance obtained on the CK+ dataset (accuracy 88.14% and 2AFC 0.78) is similar to that reported in Jiang et al. (2014). When trained on BP4D and tested on CK+, the network obtains an average accuracy of 77.7% and a 2AFC score of 0.759.  50

4.1 Test accuracies reported for different feature sets on IEMOCAP utterances. Results for the BLSTM-RNN approach are shown in bold.  69

4.2 Correlation between IAIF and the proposed approach. Statistics are presented in the format mean, median (standard deviation).  76

5.1 Network architecture, modalities and training configurations for the model trained on each of the MNIST-TIDIGITS and IEMOCAP datasets. SSE denotes the sum-of-squared-error loss function.  91

5.2 Precision@K scores (denoted as P@K) corresponding to K = 10, 50 and 100 for the importance-based autoencoder and other baseline models on the task of retrieval of similar multimodal MNIST-TIDIGITS paired data.  103

5.3 Test-set accuracies obtained by the multimodal embeddings from the importance-based autoencoder and other baseline models. Accuracies are shown for both 2- and 50-dimensional representations. Multimodal embeddings from the proposed model outperform unimodal and fusion approaches without importance-based weighting. For each model the performance x (y, z) is reported, where x is the overall accuracy, y is the accuracy on (noise image, true spoken digit) paired samples, and z is the accuracy on (true digit image, noise speech) paired samples.  105

5.4 Precision@K scores (denoted as P@K) at K = 10 and 50, and Mean Average Precision scores, for the importance-based autoencoder (IMA model) and other baseline models on the task of retrieving utterances from IEMOCAP with the same primary emotion, which is one of happy, angry, sad or neutral.  110

6.1 Summary of corpora used in the experiments. CMU-MOSI and SEMAINE are observed to have higher emotional content than the Fisher and DAIC corpora. The type of conversation is also indicated for each corpus. The Fisher corpus is used to train the Affect-LM and Speech-LM models. The remaining corpora are not utilized for training Affect-LM due to their small size, but for adaptation from an existing model already trained on the Fisher corpus.  117
6.2 Evaluation perplexity scores obtained by the baseline and Affect-LM models when trained on Fisher and subsequently adapted on the DAIC, SEMAINE and CMU-MOSI corpora. In all cases the Affect-LM model obtains lower perplexity than the baseline LSTM-LM. The corpora with more emotional content (such as the CMU-MOSI and SEMAINE datasets) result in greater perplexity reduction than the Fisher corpus.  124

6.3 Example sentences generated by the model conditioned on different affect categories such as Happy, Angry, Sad, Negative Emotion and Positive Emotion. Affect-LM can generate sentences for each category which are perceptually affective without sacrificing grammatical correctness.  126

6.4 Summary of results from the perplexity evaluation between three competing models: the baseline LSTM model, Affect-LM and Speech-LM. Relative improvements in perplexity with respect to the baseline are reported in parentheses.  135

6.5 Average entropy H and perplexity scores (denoted by P) for different predicted word categories, for the Baseline and Speech-LM models. Speech-LM achieves a reduction in perplexity over major word categories. All differences are highly significant, and Speech-LM significantly outperforms the baseline model with observed p-values < 0.0001 using paired t-tests. The word frequencies reported in the table correspond to the test set.  135

6.6 Example sentences generated by Speech-LM for instances of spoken words obtained from the Fisher corpus for different prosody instances.  137

List of Figures

1.1 Overview of the main models and techniques introduced in this dissertation. The primary research questions are indicated in blue, data sources are in dark blue, with models in dark yellow and attributes in green. Three major focus areas - unimodal/multimodal representation learning and affective/non-verbal language modeling - are also indicated.  5

3.1 Common Action Unit (AU) labels as defined by Ekman & Friesen (1978) utilized in the recognition experiments. Images are reproduced from https://www.cs.cmu.edu/ face/facs.htm and show both upper-face (AUs 1, 2, 4, 5, 6 and 9) and lower-face (AUs 12, 15, 17 and 20) actions.  39

3.2 Architecture of the multi-label CNN. There are two convolutional stages and two fully-connected layers interspersed with max-pooling layers, with the size of the output layer equal to the number of AUs.  40

3.3 Joint AU representations learnt by the multi-label CNN: representations for AU 2 (Outer Brow Raiser) and AU 6 (Cheek Raiser) for the BP4D and DISFA datasets. The multi-label CNN is trained on BP4D and representations are obtained for both BP4D and DISFA. AU presence is indicated in red; absence in blue.  46

3.4 Pairwise correlations among AUs for ground-truth labels (left) and predicted labels (right); blue indicates a phi index of -1, red indicates +1. The similarity between the figures demonstrates that the CNN learns correlations among AUs. For example, AUs 1 (Inner Brow Raiser) and 2 (Outer Brow Raiser) are highly correlated in the AU predictions, which follows the ground-truth correlations.  51

4.1 LSTM-RNN autoencoder for encoding and obtaining latent representations from sequential data.
The hidden bottleneck layer is subsequently visualized and is expected to be discriminative of affective attributes. The RNN autoencoder is trained on frame representations obtained from the stacked denoising autoencoder. The latent representation of the entire sequence is taken to be the average of the bottleneck representations obtained at each frame.  58

4.2 Experimental setup for emotion recognition on the IEMOCAP dataset. Speech and glottal flow spectrogram representations which have been learnt by denoising autoencoders are input to supervised BLSTM-RNN classifiers for the task of emotion recognition. For comparison, the COVAREP features are also passed through the same experimental pipeline.  65

4.3 Visualization of frame- and utterance-level representations of emotion, activation and valence learned by the recurrent autoencoder for model configuration TIED-128-5. The intensity gradient for each dimension runs from blue (lowest) to red (highest). Yellow: Sadness; Red: Neutral; Blue: Happiness; Green: Angry.  67

4.4 Speech and glottal spectrogram representations learnt by the stacked denoising autoencoder. Each data point representing an utterance is the average of the frame representations in the utterance. The glottal representations lie on a manifold of lower variability than the speech representations, but are nevertheless effective for emotion discrimination.  69

5.1 Overview of the proposed IMA model, including the multimodal autoencoder, the importance networks and the main loss functions to be optimized. The IMA model is trained in two stages: the multimodal autoencoder followed by the importance networks.  82

5.2 t-SNE visualizations of embeddings learnt by the importance-based multimodal autoencoder for the MNIST-TIDIGITS dataset, including the modality-specific representations and the joint multimodal representations. The modality-specific representations are superimposable due to the alignment cost minimized during training. The model weighs the modalities before fusion; thus, for example, when a combination of an MNIST digit image and noise speech digits is provided as input, it is still mapped to the appropriate location (within the digit location) in the joint embedding. Colors denote 0: Red; 1: Green; 2: Blue; 3: Purple; 4: Orange; 5: Cyan; 6: Yellow; 7: Magenta; 8: Olive; 9: Black; Gray: Noise.  99

5.3 Regions of uncorrelated noise in each modality are colored in red, while actual digit images (MNIST) and spoken digits (TIDIGITS) are colored in blue. Figures (a) and (b) show the location of noise in each modality representation u1 (image) and u2 (speech). Figures (c) and (d) show the movement of noise data points to the respective digit clusters in the multimodal representation z. This movement happens because noise in one modality is paired with actual digit information from the other modality, resulting in the noise being suppressed during multimodal fusion.  100

5.4 t-SNE visualizations of importance network representations learnt by the model. The rows correspond to the proposed model trained on the MNIST-TIDIGITS dataset. In the left column, the ground truth refers to the presence (colored in gray) of uncorrelated noise in each representation. The right column shows data points predicted by the model as noise in red, and otherwise in blue.  102
5.5 Precision, Recall and F1-score curves for the importance-based autoencoder trained on the MNIST-TIDIGITS dataset. Only the positive category (positive indicates presence of noise) is selected for reporting metrics.  104

5.6 Visualization of the different unimodal representations learnt by the IMA model on the IEMOCAP dataset. Figure 5.6(a) shows the unimodal representations for different happy, angry and sad words. Figure 5.6(b) shows the same representations for all emotional words, showing that the locations of emotional words are also aligned with the acoustic representations in Figure 5.6(c). Note that the colors blue, red and yellow respectively denote the happy, angry and sad emotions. The importance network representations are shown in Figures 5.6(d)-(f) for words and acoustic frames. The regions learned by the network as important are in blue; the rest are in red. In the acoustic modality, neutral speech frames are in gray. Most function words in Figure 5.6(d) have been identified by IMA as not important for emotion.  107

5.7 Precision, Recall and F1-score curves for the importance-based autoencoder trained on the IEMOCAP dataset. Both positive- and negative-class F1 scores are considered, along with their average. The optimal ρ for the verbal modality is a low value (0.05), while it is around 0.6 for the acoustic modality.  108

6.1 Embeddings learnt by Affect-LM. Representative words from each emotion category (such as sad, angry, anxiety, negative emotion and positive emotion) are indicated on the scatter plot in different colors. Note how positive emotion is distinct and far away from the negative emotion categories in the plot.  125

6.2 Amazon Mechanical Turk study results for generated sentences in the target affect categories positive emotion, negative emotion, angry, sad, and anxious (a)-(e). The most relevant human rating curve for each generated emotion is highlighted in red, while less relevant rating curves are visualized in black. Affect categories are coded via different line types and listed in the legend below the figure.  126

6.3 Mechanical Turk study results for grammatical correctness for all generated target emotions. Perceived grammatical correctness for each affect category is color-coded.  128

6.4 Neural architecture of the Speech-LM model. The processing at each timestep t is shown here, which is unrolled for the entire batch. f(c_{t-1}) and g(s_{t-1}) are representations of the word and non-verbal acoustic contexts respectively.  130

6.5 t-SNE word representations learnt by Speech-LM. Positive-valence words are in blue; negative-valence words are in red.  136

A.1 Flow of information in the generation (left) and inference (right) networks for the model which performs both alignment and fusion. z is the latent multimodal representation, u_1, ..., u_M denote the unimodal representations and x_1, ..., x_M are the inputs for the M modalities.  165

Chapter 1
Introduction

The most significant product of human innovation, apart from scientific and technological marvels such as electricity, propulsion systems and the printing press, has been that of the acquisition and processing of information and its efficient utilization to obtain a better understanding of the world around us and an improvement in the quality of our lives.
Modern communication technologies, such as the telephone, the video camera and the internet, have not only brought people closer, surmounting geographical and cultural barriers, but have also been a rich source of data for gleaning insight into how the behavior and emotions we express are integral to a holistic understanding of society itself.

The role played by intelligent automated systems towards this goal, particularly in analyzing vast amounts of data, and the recent advances in machine learning cannot be emphasized enough. In addition to a surge of recent academic interest, new products such as Amazon's Alexa virtual assistant (https://www.amazon.com/Amazon-Echo-And-Alexa-Devices/b?ie=UTF8&node=9818047011), the Google search engine and Facebook's Instagram (https://www.instagram.com/) make use of such data-driven technologies. Central to these ubiquitous products are the state-of-the-art performances achieved in perception tasks such as object recognition (He et al., 2016), automatic speech recognition (Battenberg et al., 2017), machine translation/reading (Bahdanau et al., 2014) and conversational understanding (Chen et al., 2017). This has primarily been driven by the resurgence of work on deep neural networks, the abundance of data obtainable from the Internet and crowd-sourced datasets, as well as the massive increase in computational power provided by dedicated hardware modules.

While the standard perception tasks mentioned above have potentially impactful applications, they do not address the more challenging tasks involving multiple human senses, such as the auditory and visual. In fact, many of our day-to-day activities involve a combination of such senses. For example, consider human communication, which involves an exchange between multiple speakers and an interplay of spoken words, facial expressions, head motions and emotionally expressive speech (DeVito, 1995). The automated understanding of emotion and behavior necessitates building algorithms and systems which can effectively fuse verbal spoken words as well as non-verbal information in visual and acoustic cues to provide a better understanding of the intent of an ongoing conversation between individuals. This research problem has been motivated by new applications such as designing virtual interactive agents and identifying attributes such as negotiating ability (Park et al., 2013), sentiment, and mental health diagnoses such as depression and PTSD (Post-Traumatic Stress Disorder) using data analysis techniques on well-annotated datasets (Scherer et al., 2013; Ghosh et al., 2014; Chatterjee et al., 2015).

Considering the effectiveness of machine learning techniques for data-driven pattern discovery and predictive tasks, they are currently utilized for the understanding of human emotion and communication in settings which mostly involve small, carefully curated datasets, along with well-engineered feature extractors and shallow models mostly trained in a supervised manner. Considering the emerging role of deep neural networks and unsupervised representation learning approaches, the role of non-verbal cues in multimodal behavior understanding has been studied only to a limited extent in this framework.
This dissertation demonstrates the effectiveness of deep learning in this application area by addressing the following primary challenges: (1) unimodal representation learning approaches to obtain non-verbal discriminative embeddings from raw data; (2) multimodal representation learning from language and non-verbal cues for emotion understanding; and (3) integration of affective and non-verbal cues in generative models of language for modeling emotional text and spoken words. The work in this dissertation also investigates other related problems such as learning affective word representations, disentangling factors of variation, and performance evaluation on domains with unseen datasets.

In this introduction chapter, a brief description of the challenges faced when broadly conducting research in the modeling of human behavior and emotion is provided. The research contexts and challenges specifically addressed by this dissertation are then described, with a succinct list of relevant research questions. Further, the contributions of this dissertation which address these challenges are presented. The chapter concludes with an outline of the remainder of this dissertation.

1.1 Modeling Human Affect and Behaviour

Affect is a term that subsumes emotion and longer-term constructs such as mood and personality, and refers to the experience of feeling or emotion (Scherer et al., 2010). The automated analysis of emotion is integral to an understanding of human communication (DeVito, 1995). Picard (1997) provides a detailed discussion of the importance of affect analysis in human communication and interaction. Unlike tasks such as image and speech recognition, which are often well-defined with proper benchmarks, large corpora for training, and ground-truth targets/labels, tasks for problems in human behavior are subjectively evaluated and often pose significant obstacles, due to the following challenges:

• Tasks for such problems have limited training data with multiple factors of variation such as human subject identity, gender, language, and cultural differences.

• Domain differences arise for various reasons, including cultural and individual personality differences. A suitable example is acted expressions (which are intentionally created) vs. spontaneous ones (which are genuine and in response to stimuli). Models trained on acted data cannot generalize well to the spontaneous case, since the latter has more subtle cues (Zeng et al., 2009).

• The attributes being predicted (such as emotional categories or affective dimensions such as valence/activation) are often uncertain and noisy and thus have to be labeled by multiple annotators. Due to this uncertainty, the annotated labels are not always in agreement between human raters.

• Problems do not have well-defined benchmarks or datasets, which often necessitates the creation of datasets tailored to each individual problem. Consequently, evaluation, and comparison of proposed techniques with current state-of-the-art ones, is more complex. To alleviate this shortcoming, there have been efforts to organize benchmarking competitions such as AVEC (Ringeval et al., 2017) and to publicly release standardized datasets such as SEMAINE (McKeown et al., 2012) and IEMOCAP (Busso et al., 2008) for further research.
1.2 Current Research Challenges

Section 1.1 described challenges faced in general when working on research problems involving human behavior and affect, with applications such as emotion sensing and human-computer interaction. While previous literature has already addressed some of the aforementioned issues, this dissertation focuses on more emerging, unsolved problems in the area of deep learning applied to human affect and behavior understanding. The main techniques introduced in this dissertation are shown pictorially in Figure 1.1. Collectively, these approaches could eventually be combined to design a multimodal conversational agent which would sense emotion from facial images and speech (representations), integrate cues from multiple modalities (multimodal) for inference of affect, turn-taking and backchannels, and also be capable of synthesizing appropriate emotional responses (language models).

[Figure 1.1: Overview of the main models and techniques introduced in this dissertation. The primary research questions are indicated in blue, data sources are in dark blue, with models in dark yellow and attributes in green. Three major focus areas - unimodal/multimodal representation learning and affective/non-verbal language modeling - are also indicated.]

Considering these three major areas of focus, the research challenges addressed in this dissertation are:

1. Unimodal Representation Learning: Previous approaches to emotion recognition and generation have utilized existing prior domain knowledge to create engineered feature extractors. Examples of such features are facial landmark locations, such as the eyes and lip corner-points, which are useful cues for identifying facial expressions (Gizatdinova & Surakka, 2006), and pitch/loudness of voice, which are indicative of the emotion expressed in speech. While such features have been shown to be effective for classification and learning from limited amounts of annotated, well-curated data, they do not capture all the important characteristics of the input signal suitable for minimizing the objective function. In contrast, feeding the signal (such as pixels in a facial image, or frames from a speech spectrogram) directly as input to a neural network enables the model to learn features directly suited to the task. While work is being done in this area (Trigeorgis et al., 2016), the domain of representation learning for human affect is still in its infancy and several interesting open avenues remain unexplored.

Embeddings of the input data such as images, words and acoustic frames learnt through neural network training are also useful for downstream classification tasks. For example, word embeddings (Mikolov et al., 2013) have been useful for subsequent tasks such as sentiment analysis.
The representation learning problem also includes learning affective word embeddings and how word representations change when conditioned on another modality; these problems have been studied in the framework of multi-sense word embeddings (Huang et al., 2012) but not extensively investigated for the role played by non-verbal visual or acoustic cues in modifying the semantics of a word. The research questions pertinent to this challenge are:

• Q1.1: How can we learn feature representations from facial images for the task of Action Unit recognition that give comparable performance to engineered features?

• Q1.2: Is it possible to learn a shared representation across Action Units which also captures their co-occurrences?

• Q1.3: Can unsupervised models such as denoising autoencoders learn emotion-discriminative representations directly from speech spectrograms?

• Q1.4: Does the filtering of vocal tract influence from speech signals improve the task of speech emotion recognition when utilizing features learnt from spectrograms?

• Q1.5: Is it possible to construct an unsupervised generative model of the process of voiced speech production? Can the influence of the human vocal tract be removed from speech signals using this model?

2. Multimodal Representation Learning: Human behavior (expressed through conversation/dialogue or emotions) is an interplay of multiple modalities - visual, acoustic and verbal. The integration of information from multiple modalities has been shown to improve performance on different classification tasks related to the study of human behavior, such as depression detection (Ghosh et al., 2014) and sentiment analysis (Pérez-Rosas et al., 2013). Considering the effectiveness of such approaches, the automated learning of joint multimodal representations is an important problem. While different multimodal representation learning approaches (Ngiam et al., 2011; Wang et al., 2016; Wu & Goodman, 2018) have been proposed, it still remains an open problem to reliably integrate and fuse information from multiple modalities. An example is the fusion of cues from verbal messages, non-verbal acoustics and facial images to improve conversational understanding. In particular, neural network approaches to the problem of multimodal representation learning do not scale well from a model-training perspective. Previous approaches are mostly restricted to experiments on unimodal datasets, such as the MNIST digits or CelebA datasets. Other than applications such as image captioning, there has been limited research on representation learning from real-world multimodal data such as spoken utterances. Additionally, there may be uncorrelated noise present in each modality, which is not addressed by previous approaches. For example, in spoken language, function words (e.g. I, he, the), which play syntactic roles but are not relevant (i.e. uncorrelated) to emotion, are present along with emotional words. The relevant research questions posed in this dissertation are listed next (a toy sketch of the modality-importance idea follows this list):

• Q2.1: Can meaningful multimodal representations be learnt in a scalable manner by neural network approaches? Assuming the presence of uncorrelated noise, where not all data in each modality is relevant, can the importances of different modalities be learnt in this framework?

• Q2.2: Could a model which learns multimodal representations with modality importances improve on the task of emotion analysis from spoken utterances, compared to unimodal and other multimodal baseline approaches?
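To make the notion of modality importance concrete, the following is a toy sketch, not the IMA model described in Chapter 5: all layer sizes, names and inputs are illustrative assumptions. It shows how a learned scalar importance weight per modality could gate a fused embedding, so that a modality carrying only uncorrelated noise contributes little to the joint representation.

    import torch
    import torch.nn as nn

    class ImportanceWeightedFusion(nn.Module):
        # One encoder and one scalar importance head per modality; the joint
        # embedding is an importance-weighted average of the aligned embeddings.
        def __init__(self, input_dims, joint_dim):
            super().__init__()
            self.encoders = nn.ModuleList(
                [nn.Linear(d, joint_dim) for d in input_dims])
            self.importance = nn.ModuleList(
                [nn.Sequential(nn.Linear(d, 1), nn.Sigmoid()) for d in input_dims])

        def forward(self, inputs):
            # inputs: one tensor per modality, each of shape (batch, input_dim_m)
            embeddings = [enc(x) for enc, x in zip(self.encoders, inputs)]
            weights = [imp(x) for imp, x in zip(self.importance, inputs)]  # in [0, 1]
            fused = sum(w * e for w, e in zip(weights, embeddings))
            return fused / (sum(weights) + 1e-8)  # noisy modalities contribute less

    # Hypothetical usage: a flattened image paired with pooled acoustic features.
    fusion = ImportanceWeightedFusion(input_dims=[784, 200], joint_dim=64)
    z = fusion([torch.randn(32, 784), torch.randn(32, 200)])  # (32, 64) joint embedding

The design choice illustrated here is simply that importance is predicted from the raw modality input itself, before fusion; how such weights are actually trained and evaluated in this dissertation is the subject of Chapter 5.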
3. Non-verbal and Affective Language Modeling: This dissertation investigates several research challenges not only for individual modalities, but also for the problem of learning good representations of multimodal data. One important application of multimodal analysis is understanding how the emotion in an ongoing conversation, or the non-verbal cues in accompanying modalities (such as speech), interacts with verbal messages. It is commonly expected that the presence of emotionally colored words would likely result in emotional words being present in the rest of the utterance. Similarly, previous literature (Nygaard & Queen, 2008; Cutler et al., 1997) has shown that non-verbal cues in the acoustic modality (for example, stress patterns and rising/falling pitch) help a listener disambiguate the emotion and semantics of utterances. The interaction between affective or acoustic non-verbal cues and words has not previously been studied in the framework of neural language models. Neural language models can potentially be utilized for text generation conditioned on these affective or non-verbal attributes. However, previous approaches have investigated rule-based, heuristic approaches to emotional text generation, without quantitative evaluation on multiple emotionally colored corpora. Nor has there been any study on the utilization of non-verbal acoustic information in neural language models. This motivates the development of novel language models which utilize affective and non-verbal acoustic cues in the context (i.e. in previous words and speech) for predicting the next word. The relevant research questions addressed in this dissertation are:

• Q3.1: Can language modeling of conversational text be improved by incorporating affective information in the linguistic context?

• Q3.2: Are the sentences generated using the affective neural language model emotionally expressive, as evaluated in a crowd-sourced perception study, while not sacrificing grammatical correctness?

• Q3.3: Can language modeling performance be improved through integration of linguistic information from the context and non-verbal acoustic features accompanying the spoken words?

• Q3.4: How is the performance improvement distributed across different word categories when integrating non-verbal acoustic features with the linguistic context?

1.3 Major Contributions

In this section, the major contributions of the dissertation are described which address the primary research questions; these include novelties in methodology, improvements over existing and state-of-the-art approaches, and the relevant peer-reviewed publications.

1. Unimodal Representation Learning: Action Units (AUs) are facial muscle movements indicative of the underlying emotion expressed by a human subject. A novel multi-label CNN approach is introduced for AU detection to address research question Q1.1. In this model, representation learning of discriminative descriptors is performed directly from the input image, instead of from facial landmark information or hand-crafted features such as shape/color or HoG (Histogram of Gradients) features. Additionally, unlike previous approaches where each AU classifier is trained separately, in the multi-label CNN all AUs are learnt jointly, resulting in a shared representation which does not require dedicated classifiers, addressing research question Q1.2 (a toy sketch of this multi-label setup is given below). AU detection performance obtained by the multi-label CNN is competitive with existing approaches, and also exhibits cross-domain robustness between datasets.
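For illustration only, a minimal multi-label setup of this kind pairs one shared convolutional trunk with one sigmoid output per AU and a binary cross-entropy loss over all AUs jointly. The sketch below is an assumption-laden toy, not the architecture evaluated in Chapter 3: the layer sizes, input resolution and number of AUs are made up.

    import torch
    import torch.nn as nn

    class MultiLabelAUCNN(nn.Module):
        # Shared convolutional trunk; one logit per Action Unit at the output.
        def __init__(self, num_aus=12):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2))
            self.classifier = nn.Sequential(
                nn.Flatten(), nn.Linear(32 * 21 * 21, 128), nn.ReLU(),
                nn.Linear(128, num_aus))  # no softmax: AUs are not mutually exclusive

        def forward(self, x):  # x: (batch, 1, 96, 96) grayscale face crops (assumed size)
            return self.classifier(self.features(x))

    model = MultiLabelAUCNN(num_aus=12)
    faces = torch.randn(8, 1, 96, 96)
    au_labels = torch.randint(0, 2, (8, 12)).float()          # presence/absence of each AU
    loss = nn.BCEWithLogitsLoss()(model(faces), au_labels)    # joint multi-label objective

The key point the sketch captures is that a single shared representation feeds all AU outputs, so correlations between AUs (e.g. inner and outer brow raisers) can be exploited rather than training one classifier per AU.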
This work (Ghosh et al., 2015) was published at ACII 2015 (the Affective Computing and Intelligent Interaction Conference).

To investigate research questions Q1.3 and Q1.4, denoising autoencoders are trained on speech spectrograms with different feature sets (such as FFT and Mel spectrograms) and architectures. The learnt representations are not only discriminative of affective attributes such as valence and activation, but can also be input to classifiers for speech emotion recognition. An LSTM-RNN (Recurrent Neural Network) trained on spectrogram features obtains performance competitive with previous approaches. Further, filtering out the vocal tract influence from speech signals (glottal flow extraction) prior to autoencoder training obtains similar performance for speech emotion recognition, thus addressing research question Q1.4. This work has resulted in two publications (Ghosh et al., 2016a,b) at the ICLR 2016 (International Conference on Learning Representations) Workshop and the Interspeech 2016 conference.

Speech production involves an airflow signal shaped by the human glottis and the vocal tract before finally being emitted through the lips. The vocal tract induces resonant frequencies in the emitted signal, which manifest as vowels and other distinct voiced phonetic units. Previous research has shown the importance of glottal features in emotion and affect recognition (Tahon et al., 2012); hence filtering out the vocal tract influence on speech (glottal inverse filtering) is an important step in this direction. Pre-existing approaches such as compressive sensing or L1 optimization are not formulated in an unsupervised dictionary learning framework. To address research question Q1.5, an unsupervised generative model is introduced which learns vocal tract basis filters from data for glottal inverse filtering. Experiments conducted on multiple speech datasets show improved performance compared to the well-known IAIF (Alku, 1992) algorithm, with robustness across different voice qualities such as breathy, modal and tense voices. This work was published at the European Signal Processing Conference (EUSIPCO), 2016.

2. Multimodal Representation Learning: Previous literature on multimodal representation learning with deep neural networks has not sufficiently focused on scalable integration of modalities, or on the challenge of uncorrelated noise in each modality. To address research question Q2.1, an importance-based multimodal autoencoder is proposed which not only learns the latent factor across multiple modalities through a joint multimodal representation, but also learns an importance network for each modality. Evaluation experiments are performed on image and spoken digit datasets, and on the IEMOCAP dataset. This includes embedding visualizations and downstream classification for tasks such as digit and multimodal emotion recognition (research questions Q2.1 and Q2.2). Experimental results show improvement over unimodal representations and other fusion approaches. This work will be under revision at the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

3. Affective and Non-verbal Language Modeling: Previous generation approaches for emotional text in dialog applications utilize hand-crafted rules and heuristics, and are not trained on large conversational corpora. To address research question Q3.1, a novel language model, Affect-LM, is proposed in which additional information from the emotion in the word context is sensed and utilized to improve next-word prediction, as measured through perplexity reduction.
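As a sketch of the general form such an affect-conditioned model can take (the exact formulation, notation and training details are given in Chapter 6; the symbols below are illustrative), the next-word distribution can be written as a standard recurrent language model whose logits are additively biased by an affect-dependent energy term:

    P(w_t = v | c_{t-1}, e_{t-1}) \propto \exp\big( U_v^{\top} f(c_{t-1}) + \beta \, V_v^{\top} g(e_{t-1}) \big)

where f(c_{t-1}) is the recurrent (e.g. LSTM) representation of the preceding words, g(e_{t-1}) is a descriptor of the affective content of that context, U and V are output embedding matrices, and beta controls how strongly the affective cue biases word prediction; setting beta = 0 recovers the unconditioned baseline language model.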
This approach does not require manually expensive emotion labeling of utterances, and uses an affective lexicon for emotion sensing. Affect-LM obtains a reduction in test-set perplexity, learns affective word representations, and can be utilized to generate emotional text which also maintains grammatical correctness, as evaluated through a user study (research question Q3.2). This work was published at the Annual Conference of the Association for Computational Linguistics (ACL), 2017.

The acoustic modality contains not only spoken words, but also non-linguistic cues such as backchannels and speech prosody related to affect and behavior. However, there has not been much research focus on integrating these cues into conversational language modeling. To investigate research question Q3.3, a novel language model, Speech-LM, is proposed in which the acoustic non-verbal features co-occurring with the word context are utilized to predict the next word. This achieves an improvement in test-set perplexity compared to a vanilla neural language model, with significant gains when predicting end-of-sentence tokens, backchannels and emotionally colored words (Q3.4). This work is the first large-scale study (on a ≈2000-hour corpus) of fusing non-verbal information into language models and is being prepared for submission to Empirical Methods in Natural Language Processing (EMNLP), 2018.

1.4 Outline

This section concludes the introduction to this dissertation with an outline of the following chapters:

• Chapter 2 (Related Work): This chapter presents related work in the literature on the topics of emotion recognition from images and speech, neural language modeling, deep generative modeling and neural networks.

• Chapter 3 (Facial Expression Recognition): The problem of emotion recognition from facial images is discussed, particularly AU (Action Unit) detection. A multi-label CNN (Convolutional Neural Network) approach is presented, along with detection performance on multiple datasets in terms of accuracy and 2AFC (Two-Alternative Forced Choice) scores.

• Chapter 4 (Emotion Recognition from Speech): Unsupervised deep models for representation learning from spectrograms extracted from the speech and glottal flow signals are described, along with emotion classification results using supervised models such as an LSTM (Long Short-Term Memory) network. An unsupervised machine-learning approach to extract the glottal flow signal from speech is also presented.

• Chapter 5 (Importance-based Multimodal Autoencoder): In this chapter a multimodal autoencoder model is proposed which learns not only the underlying latent representation from data in multiple modalities, but also an importance network for each modality relevant to a shared latent factor. A probabilistic framework for the importance-based autoencoder is also provided, along with visualization of learnt embeddings, and retrieval and downstream classification experiments on the MNIST, TIDIGITS and IEMOCAP datasets.

• Chapter 6 (Language Modeling with Affective and Non-verbal Cues): In this chapter, two novel language models, Affect-LM and Speech-LM, are introduced which utilize additional emotional content from the verbal context and non-verbal acoustic content from the spoken context to predict the next word. The results of perplexity analysis and emotional text generation are also described.
• Chapter 7 (Conclusions and Future Work): Concluding remarks are presented, along with some exciting avenues for future research in the field of deep learning for affect and conversational understanding.

Chapter 2
Related Work

In this chapter, a comprehensive survey of the literature relevant to this dissertation is presented. The related work covers not only state-of-the-art approaches but also the corpora and analysis tools used extensively in the experiments described in this dissertation. The organization of this chapter follows the main areas and research questions stated in Chapter 1. Limitations of previous literature are also discussed, which motivate the open problems introduced in subsequent chapters and the primary research methodologies which address them. Some of the related works discussed in this chapter are relatively recent at the time of writing, cite the publications from this dissertation, and highlight the latest research trends in the field of representation learning for human affect.

2.1 Unimodal Representation Learning

2.1.1 Brief Introduction to Representation Learning

Classical machine learning approaches were previously limited to careful feature engineering (which is often problem-specific) and models such as Support Vector Machines (Chapelle et al., 1999) and Random Forests (Díaz-Uriarte & De Andres, 2006), which have few parameters and learn from limited amounts of data. However, along with the aforementioned interest in deep neural networks (Bengio et al., 2013), research has moved to end-to-end representation learning from data, which generally involves the following techniques:

• Learning without a predefined feature set: the model performs feature learning directly from the raw data. This enables it to learn variations in the raw data beyond those captured by a manually defined feature set, and also facilitates the reuse of representations across multiple tasks. Examples include feature learning from speech signal samples and image pixels.

• Training with neural networks consisting of multiple layers and often millions of parameters (Szegedy et al., 2015). This makes possible the learning of complex non-linear functions of the data, with neurons in each layer interacting with those in the layer above through activation functions such as sigmoid and ReLU (Rectified Linear Unit).

• Learning on large amounts of data, either labeled or unlabeled, often through minibatch inputs (Ioffe & Szegedy, 2015). Since deep models have a large number of parameters, they often require large datasets for training to avoid overfitting. During minibatch training, a finite batch of data is utilized for updating the network weights, which can be efficiently scaled to large datasets.

• Usage of specialized hardware for the training task, such as FPGAs (Field Programmable Gate Arrays) and GPUs (Graphics Processing Units), for maximal efficiency (Ovtcharov et al., 2015). The work in this dissertation utilizes Nvidia GPU accelerators such as the Tesla K40 and Titan X for model training.

In one well-known instance of representation learning approaches outperforming feature-engineered approaches, Krizhevsky et al. (2012) trained CNNs (Convolutional Neural Networks) with 60 million parameters using two GPUs for the ILSVRC (ImageNet Large Scale Visual Recognition Competition), outperforming the second-best entry (which was based on SIFT features) by 10.9% on the Top-5 test error rate.
Since then, several models for representation learning have been applied to perception problems for images, speech and natural language. For example, RBMs (Restricted Boltzmann Machines) (Dahl et al., 2010) and Denoising Autoencoders (Vincent et al., 2008) are multi-layered neural models which can be pre-trained in an unsupervised manner before being fine-tuned as classifiers with an additional softmax layer. LSTM-RNNs (Long Short Term Memory Recurrent Neural Networks), introduced in Hochreiter & Schmidhuber (1997), are temporal neural networks used in time-series analysis which can exploit long-range dependencies in sequences. More recently, deep generative models have been introduced, such as VAEs (Variational Autoencoders) proposed in Kingma & Welling (2013) and GANs (Generative Adversarial Networks) in Goodfellow et al. (2014). A VAE model performs variational inference through a multi-layered encoder network, while a GAN is an inference-free model which employs two networks - a generator to create high quality data samples, and a discriminator to distinguish the generator output from real data samples.

This dissertation addresses the research problem of learning representations for affective attributes directly from image pixels and speech samples, which provide performance competitive with manually engineered feature extractors. In the rest of this section an extensive literature review of the main affective applications - (1) emotion recognition in the visual modality through Action Unit (AU) detection and (2) speech emotion recognition - is provided.

2.1.2 Emotion Recognition From Facial Images

The automated recognition of emotion from facial images can be enabled by analysis of Action Units (AU), which are mid-level attributes and correspond to muscle movements on different facial regions. In Ekman & Friesen (1978) the FACS (Facial Action Coding System) was introduced, which is a standard to group and classify Action Units. It is necessary to utilize automated machine learning algorithms for AU recognition since it is expensive and time-consuming to manually annotate images. The literature addresses the problem of AU recognition from not only still images, but also videos. AU recognition can be posed as a classification task, where the presence/absence of an AU is a binary decision, or a regression task where the intensity of each AU is estimated.

Machine Learning Approaches: Li & Ji (2005) used DBNs (Dynamic Bayesian Networks) for modeling correlations among AUs and estimating their intensities. Wu et al. (2010) explored multilayer architectures using GEFs (Gabor Energy Filters) for AU recognition. Littlewort et al. (2011) introduced CERT (Computer Expression Recognition Toolbox) for estimating AU intensities using SVM classifiers. Song et al. (2015) modeled AU sparsity and co-occurrence using a Bayesian compressed sensing model, reporting 86% accuracy on the DISFA dataset, and 94% on the Cohn-Kanade extended (CK+) dataset. Jiang et al. (2014) explore the problem of which face regions features should be extracted from, and propose a decision-level fusion with LBP (Local Binary Pattern) and LPQ (Local Phase Quantization) to obtain a best performance of 0.81 2AFC (Two-alternative Forced Choice) score on the DISFA dataset. Zhang et al. (2014a) modeled AU recognition using the Lp norm MTMKL (Multi-Task Multiple Kernel Learning) framework. Baltrušaitis et al.
(2015) investigated improvements in AU recognition through additional labeled data from different datasets, utilizing HoG (Histogram of Gradients) and geometric features along with a median-based normalization scheme to account for neutral faces. Neural Network Based Approaches: Bartlett et al. (1996) is one of the earliest attempts to detect Action Units using a three-layer feed-forward neural network trained on local image features (such as wrinkle measurements) and 19 optical flow descriptors. Tian et al. (2001) employed a three-layer neural network with one hidden layer for identifying lower face AUs. Handcrafted facial features, transient features such as left nasio-labal furrow angle, and presence of nose wrin- kles were used as inputs to the neural network. Bazzo & Lamar (2004) explored Gabor wavelet features and neural networks to obtain 86.6% accuracy on upper face AUs and 81.6% on lower face AUs as evaluated on a multi-subject dataset. These methods all use a combination of hand-engineered feature extractors along with multi-layer perceptrons and do not learn representations directly from facial images for the task of AU recognition. Multi-label Approaches to AU Recognition: Most existing approaches consider the problem of detecting each AU separately, thus failing to exploit the dependencies among them. For example AU 1 (inner brow raiser) is correlated with AU 2 (outer brow raiser) and a classifier which captures this correlation can achieve improved performance in detecting both AUs. Wang et al. (2013) explored an RBM (Restricted Boltzmann Machine) based approach to integrate low-level features and global dependencies between AUs for improved recognition on CK+ (Lucey et al., 2010) and SEMAINE datasets. Eleftheriadis et al. (2015) train a MC-LVM (Multi-Conditional Latent Variable Model) through Gaussian process modeling to model AU dependencies to outperform state-of-the-art approaches on the CK+, Shoulder Pain and DISFA (Mavadati et al., 2013) datasets. Representation Learning Based Approaches: Gudi et al. (2015) proposed a deep CNN for AU detection and regression (FERA 2015 Challenge) around the same time as the ML-CNN approach described in this dissertation. Khorrami et al. (2015) trained CNNs for facial expression recognition on the CK+ and TFD (Toronto Face Dataset) and found that intermediate level features learnt by CNNs 20 are indicative of facial action units. Consequently, several researchers worked on variants of CNNs for AU recognition including temporal extensions, such as (1) Han et al. (2016) in which the authors compare the performance of their Incremental Boosted Convolutional Neural Network with the ML-CNN and (2) a hybrid CNN-LSTM approach implemented by Chu et al. (2016). Dapogny et al. (2017) also compared the ML-CNN with their proposed Confidence-Weighted LEP approach (which is a combination of Random Forests and hierarchical autoencoders), obtaining better AUC performance on the BP4D and DISFA datasets. Bishay & Patras (2017) proposed an architecture with a mixture of specialized multi-label deep networks for AU recognition along with a novel cost function to address the problem of multi-label imbalance. 2.1.3 Emotion Recognition from Speech Speech is a complex signal with multiple factors of variation which include not only phonetics and the verbal information associated with spoken word semantics, but also additional information contained in the voice signal. 
The non-verbal cues in speech are indicative of the underlying speaker state, tone, and gender. They have been shown to be effective at identification and diagnosis of emotion (Wu et al., 2010), voice quality (Gobl & Chasaide, 2003) and depression (Scherer et al., 2013). In this section related work on the role of speech signals in affect are dis- cussed, including literature on removing influence of the human vocal tract from speech signals; common acoustic feature sets; previous machine learning and rep- resentation learning approaches. Speech Production and Glottal Inverse Filtering: The knowledge of speech generation process is important to model several factors of variation in speech, 21 such as gender, speaker identity, and phonetic information, along with emotion andvoicequality. Speechproductioninvolvesanairflowshapedbyopening/closing of the human glottis, which is the space located between the vocal folds. These factors of variation are related to individual components of the speech production model. Specifically, shaping of the airflow in the vocal tract produces formant rich voiced speech (such as vowels) carrying phonetic information. In contrast, the component of the speech signal emitted at the glottal source (which is referred to as the glottal flow signal) does not contain verbal/phonetic information since it is produced before vocal tract shaping and thus might be more discriminative of the speaker’s underlying affective state rather than the phonetic information. The extraction of the glottal flow signal from speech is called glottal inverse filtering. During speech production, the glottis shapes a constant airflow input to produce a train of pulses during voiced sound generation, which is called the glottal volume velocity waveform (Alku, 1992). When the glottis closes, the glottal volume veloc- ity airflow resonates in the vocal tracts, leading to voiced sounds being produced. The source filter model assumes that the sound source and the vocal tract are independent (Fant, 1971). The final stage in speech generation is the generated acoustic wave passing through the lip openings. The glottal flow excitation can be measured using a laryngograph, however it is possible to separate the source and the vocal tract filter and estimate the excitation through a computational approach (Sondhi & Gopinath, 1971) The classic approach for glottal inverse filtering is LPC (Linear Predictive Cod- ing) based estimation (Schroeder & Atal, 1985), where various phases in the glottal excitation, such as glottal closure and opening instants are estimated from an anal- ysis of the LPC residual. Iterative Adaptive Inverse Filtering (IAIF) was proposed 22 by Alku (1992) where the glottal excitation waveform is estimated in an itera- tive filtering process by first canceling the effects of lip radiation and estimating a lower-order vocal tract model, after which the glottal excitation is obtained by inverse filtering with a higher-order model. Recently, various approaches (Gia- cobello et al., 2010; Chetupalli & Sreenivas, 2014) have been proposed in which the L 1 sparsity of the residual is optimized directly, based on a linear constraint. Bayesian methods have also been introduced, such as Giri & Rao (2014) in which the block sparsity of the glottal flow is encoded using a prior, and Casamitjana et al. (2015) where Bayesian priors are investigated in a TVLP (Time Varying Linear Prediction) framework. Kane et al. 
(2013a) proposed to estimate the OQ (open quotient) in the glottal excitation using an ANN (Artificial Neural Network) and compared it to other approaches for estimating OQ. Airaksinen et al. (2015) used a DNN (Deep Neural Network) to estimate the glottal source from robust low-level speech features. Common Features for Affect Analysis: There has been a lot of research on well-engineered features which are discriminative of affective traits, such as loudness and pitch of the voice (Koolagudi & Rao, 2012). The feature sets and datasets described here are used subsequently in the dissertation as baselines for model training and experimentation. ThefirstfeaturesetdescribedhereisCOVAREP(ACooperativeVoiceAnalysis Repository) introduced in Degottex et al. (2014) which is open-source and freely available in the Matlab and Octave languages. COVAREP consists not only of (1) MFCC (Mel-Frequency Cepstral Coefficient) features, F0 (fundamental frequency) and primary/secondary formants, but also (2) features parameterizing the glottal flow, such as NAQ (Normalized Open Quotient), QOQ (Quasi-Open Quotient) and (3)additionalnon-verbalfeaturessuchasVUV(Voiced/Unvoicedspeechindicator) 23 and PS (Peak Slope) which measures voice breathiness. COVAREP has been effectively used for research into affect understanding and voice quality analysis from speech (Scherer et al., 2013). The second feature set used in the experiments described in this dissertation is the Interspeech 2011 Para-linguistic Speaker State Challenge feature set imple- mented in the OpenSMILE toolbox (Eyben et al., 2010). It captures non-verbal prosodic information and consists of the following main categories of speech fea- tures: (1) F0 (Fundamental Frequency), (2) RMS energy, (3) Spectral Flux, (4) Spectral Slope, (5) Shimmer, and (6) Jitter. The feature set also consist of statis- tical functionals extracted over time intervals in order to capture better temporal variation of the features, such as mean, standard deviation, median, quartiles, skewness, and kurtosis. The C++ based implementation in the OpenSMILE tool- box is efficient and is the desired feature set of choice for large speech corpora such as Fisher (Cieri et al., 2004). Importance of Non-verbal Acoustic Information: The primary form of communication between humans, in particular in the context of telephone conver- sations is spoken language. Speech itself is comprised of both linguistic content (i.e., words) as well as non-verbal factors (Burgoon et al., 1978; Vargas, 1986; Ephratt, 2011) such as prosody or voice quality. While linguistic content plays a dominant role in the decoding of spoken messages and is in fact the only source of information in traditional text-based language modeling (Mikolov et al., 2010; Sundermeyer et al., 2012), non-verbal information can be crucial for the disam- biguation of affect (emotion), or semantics (meaning/interpretation) of a spoken message. In particular, the prosody of speech, which pertains to the rhythm, into- nation, loudness, and melody conveys additional information to the listener. For example, pauses, stress patterns, or rising/falling pitch, can help the listener to 24 parse a message at syllable, word, and phrase levels (Cutler & Norris, 1988; Cutler et al., 1997) or identify if a question was uttered (Hedberg & Sosa, 2002). Wang and Seneff (Wang, 2001) investigate the role of prosody in speech recognition, pitch tracking, and confidence scoring for English and Mandarin. 
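As an illustration of the functional-based feature sets described above, the following sketch computes utterance-level statistics (mean, standard deviation, median, quartiles, skewness and kurtosis) over frame-level low-level descriptors. It is a simplified stand-in written with NumPy and SciPy, not the OpenSMILE implementation; the frame matrix and descriptor names are hypothetical.

import numpy as np
from scipy.stats import skew, kurtosis

def apply_functionals(frames):
    """Summarize a (num_frames, num_features) array of frame-level
    descriptors (e.g., F0, RMS energy) with utterance-level functionals."""
    functionals = [
        ("mean", np.mean),
        ("std", np.std),
        ("median", np.median),
        ("q25", lambda a, axis: np.percentile(a, 25, axis=axis)),
        ("q75", lambda a, axis: np.percentile(a, 75, axis=axis)),
        ("skewness", lambda a, axis: skew(a, axis=axis)),
        ("kurtosis", lambda a, axis: kurtosis(a, axis=axis)),
    ]
    # Concatenate all functionals into a single fixed-length utterance vector.
    return np.concatenate([fn(frames, axis=0) for _, fn in functionals])

# Example: 500 frames of 2 low-level descriptors (hypothetical F0 and energy).
frames = np.random.rand(500, 2)
utterance_vector = apply_functionals(frames)  # length = 7 functionals x 2 descriptors

Summarizing variable-length frame sequences with such functionals is what turns frame-level descriptors into a fixed-length vector suitable for standard classifiers.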
Further, extensive research has identified a direct link between prosody and the speakers attitude, mood, or affect (Juslin et al., 2005; Bachorowski, 1999; Scherer, 2003; Nygaard & Queen, 2008). Higher loudness and pitch, for example, might convey a happy or angry affective state of the speaker, while a more quiet and low pitch voice might correspond to sadness (Scherer et al., 1991). These findings have recently been investigated in a series of end-to-end neural models for the automatic recognition of emotion in speech (Trigeorgis et al., 2016). Stolcke et al. (2000) have investigated the effect of prosodic features in dia- logue act classification and have reported significant improvements. In more recent work (Ishii et al., 2003; Nygaard & Lunders, 2002; Schirmer & Kotz, 2003; Filippi et al., 2016), prosody has also been identified to directly influence the meaning of a message. Nygaard & Lunders (2002) for example have shown that change in prosodic context of a message influences listener perception of affectively colored homophones (e.g., bridal vs. bridle or die vs. dye). While previous literature shows that non-verbal information has a direct influence on word meaning and that it is of considerable semantic relevance, previous literature has not investi- gated this in a neural language modeling framework on a real-life conversational corpus or shown that incorporating the additional information improves perfor- mance. A subsequent study by Nygaard et al. (2009) further showed that prosody can not only change the affective meaning of spoken language, but that it can convey meaning along several dimensions of antonymic pairs (e.g., small vs. tall, fast vs. slow, etc.). Overall, their work shows that non-verbal information has a 25 direct influence on word meaning and that it is of considerable semantic relevance. Machine Learning Approaches for Emotion Recognition from Speech: Most prior work on machine learning techniques applied to speech emotion recog- nition have mostly utilized off-the-shelf features such as MFCCs (Mel Fequency Cepstral Coefficients) and non-verbal features, with models such as SVMs (Sup- port Vector Machines) and HMMs (Hidden Markov Models). Schuller et al. (2004) combined acoustic and linguistic information using a hybrid SVM-belief network on the FERMUS-III corpus (Schuller, 2002). Busso et al. (2007) explored emo- tional speech models as variants of an underlying neutral speech model for emotion recognition using HMMs. Jin et al. (2015) generated feature representations using standard acoustic and lexical features for emotion recognition from IEMOCAP, obtaining 55.4% accuracy from early fusion of acoustic features (cepstrum and Gaussian supervectors). Neural networks have also been trained for this task on prosodic features. Li et al. (2013) explore emotion recognition with hybrid DNN- HMMmodels, whereRBMs(RestrictedBoltzmannMachines)withpre-trainingon MFCC features are reported to improve performance. Han et al. (2014) performed speech emotion recognition from the IEMOCAP corpus using a combination of DNN (Deep Neural Network) and ELM (Extreme Learning Machines). They obtained20%relativeaccuracyimprovementcomparedtopreviousapproaches.Lee &Tashev(2015)userecurrentneuralnetworksforspeechemotionrecognitionfrom IEMOCAP, where the label of each frame is modeled as a sequence of random variables. 
They obtained a weighted accuracy (which is computed based on actual frequencies of each emotional category in the dataset) improvement of 12% com- pared to a DNN-ELM baseline (Han et al., 2014). RepresentationLearningfromSpeech: Jaitly&Hinton(2011)proposetrans- forming autoencoders to learn acoustic events in terms of onset times, amplitudes, 26 and rates. They use the Arctic database and TIMIT for their experiments. Graves et al. (2013) investigated deep recurrent neural networks for speech recognition, achieving a test set error of 17.7% on TIMIT. Sainath et al. (2013) worked on filterbank learning in a deep neural network framework, where it was found that filterbanks similar to the Mel scale were learnt. They improved on their work in Sainath et al. (2014) by incorporating delta learning and speaker adaptation. There has been limited exploration of representation learning on lower level data, such as the time domain waveform/spectrogram for emotion recognition. Xia & Liu (2013) trained a denoising autoencoder on features from the Interspeech 2010 paralinguistic set with two mutually disentangled representations related to neutral and emotion-specific information respectively. They obtain improved per- formance on the IEMOCAP dataset compared to only using static features. Mao et al. (2014) learn emotion-discriminative features from spectrograms using sparse auto-encoders and subsequent classification with a CNN. They also disentangle the learned speech representations from nuisance factors such as speaker identity and noise through labeled data. Deng et al. (2014) perform domain adapta- tion using adaptive denoising autoencoders to transfer knowledge obtained from other speech datasets to the AIBO dataset (Batliner et al., 2008) and regularize SVM training on standard OpenSMILE acoustic features, thus improving recog- nition performance. Lim et al. (2016) proposed a novel time-distributed CNN and transformed the speech signal to its STFT representation, subsequently training a combined CNN-LSTM network on the Berlin Database, and obtaining improved F-1 score compared to a CNN only approach. Trigeorgis et al. (2016) perform rep- resentation learning for end-to-end speech arousal and valence regression on the RECOLA dataset. They train LSTM-CNN models directly on the speech wave- form, demonstrating significant improvement over standard features as evaluated 27 using the concordance correlation coefficient. Later Work on Representation Learning: Subsequent to the work in this dis- sertation being published, there has been a continuing interest in not only explor- ing deep networks for speech emotion recognition, but also on attention learning in sequential deep models and generative deep models applied to this task. For this task, attention learning is the data-driven discovery of which parts of the speech signal are relevant to the task of emotion recognition. Huang & Narayanan (2016)trainedattention-basedBLSTMnetworksonMFCCfeaturesextractedfrom IEMOCAP; they show that it is possible to learn the relevant sub-structure from the speech signal (i.e. find which speech frames are most important given the task) and obtain better performance compared to a BLSTM model with no attention mechanism. Chang & Scherer (2017) leverage unlabeled data and DCGANs (Deep Convolutional Generative Adversarial Networks) to learn discriminative emotion representations from speech on the AMI (Carletta, 2007) and IEMOCAP datasets. 
Wang & Tashev (2017) learn representations at utterance level from speech using DNNs and ELM (Extreme Learning Machines) on a Mandarin corpus; in addition to emotion recognition improvement, age/gender classification performance is also better than human judges. In Sahu et al. (2017), adversarial autoencoders origi- nally proposed in Makhzani et al. (2015) are utilized to obtain emotion discrim- inative two-dimensional representations from OpenSMILE features and generate synthetic samples which are subsequently leveraged as additional training data. 2.2 Multimodal Representation Learning The growing research interest in deep learning has been extended to a study of data involving more than one modality. There has been extensive work in 28 multimodal machine learning on current open problems, which can be categorized as described in Baltrušaitis et al. (2018) as representation, translation, alignment, fusion and co-learning. Several additional problems exist in applying deep learning techniques to multimodal machine learning, such as data insufficiency and noise in modality acquisition. A component of this dissertation work mostly involves the problem of multimodal representation learning using deep neural networks, and thus a detailed description of related work in this area is provided in this section. Theearliestapproachestomultimodalfusioncanbecategorizedas early fusion, where features from multiple modalities are concatenated, late fusion (Atrey et al., 2010), where the outputs of modality-specific classifiers are combined, such as through a voting mechanism, and intermediate approaches (Ghosh, 2015a). These have been followed up with more sophisticated model-based modality fusion approaches, such as multiple kernel learning (Sikka et al., 2013), HCRFs (Hidden Conditional Random Fields) (Song et al., 2012) and neural networks. Ngiametal.(2011)exploredRBM(RestrictedBoltzmannMachine)baseddeep autoencoders for different multimodal tasks such as fusion, cross-modality learn- ing and shared representation learning. They evaluate their models on the task of audio-visual speech classification. Srivastava & Salakhutdinov (2012) propose a multimodal DBM (Deep Boltzmann Machine) which outperforms conventional models such as SVM (Support Vector Machines) on the task of jointly modeling images and text captions, and is able to retrieve keywords which best describe input images. Kim & Provost (2013) apply deep belief networks to audiovisual emotion recognition, and show that the networks are able to capture non-linear interactions between features. Harwath et al. (2016) automatically derive basic representations of spoken language from entirely unsupervised spoken captions of images. They show that it is possible to derive correlates between visual cues and 29 acoustic information in an entirely unsupervised learning experiment. Martinez et al. (2013) train convolutional autoencoders to learn features for each modality, which are combined at the last predictor layer. They use pairwise preferences as targets (which implies that instead of assigning scores to affect attributes, a binary decision is made on which of two input samples exhibit more of the target attribute). Experiments are performed on the Maze-Ball dataset, which consists of physiological signals acquired from players during a video game. 
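The distinction between early and late fusion discussed in this section can be illustrated with a short sketch: early fusion concatenates per-modality features before a single classifier, while late fusion combines the posterior probabilities of modality-specific classifiers. The example below uses scikit-learn logistic regression on randomly generated stand-in features; the feature dimensionalities and the simple averaging rule are illustrative assumptions, not a description of any specific system cited above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-modality features for the same N samples.
N = 200
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(N, 40))   # e.g., acoustic descriptors
visual_feats = rng.normal(size=(N, 30))  # e.g., facial descriptors
labels = rng.integers(0, 2, size=N)

# Early fusion: concatenate modality features before a single classifier.
early_clf = LogisticRegression(max_iter=1000)
early_clf.fit(np.hstack([audio_feats, visual_feats]), labels)

# Late fusion: train one classifier per modality, then combine their
# posterior probabilities (here by simple averaging, i.e., soft voting).
audio_clf = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
visual_clf = LogisticRegression(max_iter=1000).fit(visual_feats, labels)
fused_posterior = 0.5 * (audio_clf.predict_proba(audio_feats)
                         + visual_clf.predict_proba(visual_feats))
late_predictions = fused_posterior.argmax(axis=1)

Model-based and neural fusion approaches can be viewed as intermediate points between these two extremes, learning the combination rule rather than fixing it by hand.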
Multimodal deep modeling approaches have not only been applied to modality translation (where information in one modality is projected to another modality), but have also incorporated fusion-based approaches, such as the integration of visual features for language modeling conditioned on images as in Kiros et al. (2014). Rajagopalan et al. (2016) propose the MV-LSTM (Multi-view LSTM), where LSTM networks are used to model not only intra-modal temporal correlations but also correlations between modalities. Results on multimodal behavior recognition and image captioning outperform state-of-the-art models such as log-bilinear language models (Kiros et al., 2014) and neural image captioning (Vinyals et al., 2015).

Several recent approaches have extended deep generative models to multimodal data, such as Wang et al. (2016), who propose a deep variational CCA (Canonical Correlation Analysis) approach. They make use of a shared-private paradigm (Jia et al., 2010) in their model, where it is possible to map between modalities. However, the drawback of their approach is that the latent shared posterior is inferred from only one modality, and they do not provide a framework for effectively fusing input modalities. Vedantam et al. (2017) introduce a TELBO (Tri-ELBO) objective function to generate images conditioned on three desirable aspects of visual imagination (correctness, coverage and compositionality), which also employs a product-of-experts decomposition with modality-specific inference networks. Suzuki et al. (2016) propose a KL-divergence minimization approach to learn the posterior inference conditioned on both modalities through a joint inference network. However, it is not scalable to more than two modalities due to exponential model complexity. Wu & Goodman (2018) propose MVAE (Multimodal Variational Autoencoders), where the problem of learning individual inference networks for each subset of modalities is addressed through a shared product-of-experts rule which facilitates parameter sharing. They show improved performance over existing approaches through test log-likelihoods on different datasets such as MNIST, Fashion MNIST and CelebA, particularly in the weakly-supervised case. However, they apply a sub-sampled training approach for unimodal inference networks which increases the number of data points and hence training time. Further, the above approaches assume that there are no dominant factors of variation in the data unrelated to the latent factor learnt by the autoencoder. Most of their experiments are also performed on standard datasets such as MNIST and CelebA, and have not been applied to the real-world problem of learning interactions between spoken words and non-verbal acoustic features. This dissertation addresses some of these shortcomings in Chapters 5 and 6, which respectively introduce an importance-based multimodal autoencoder (IMA) and two neural language models integrating linguistic and non-verbal acoustic information (Affect-LM and Speech-LM).

2.3 Neural Language Models and Affective Text Generation

In this section, the research areas of language modeling and affective text generation are introduced, along with a detailed literature review of previous work and their limitations which motivate the research in this dissertation. A review of more recent approaches to text generation, such as deep generative models for text generation and emotional conversational agents, is also provided.
Introduction to Language Modeling: Language modeling is an integral component of spoken language systems, and traditionally n-gram approaches have been used (Stolcke, 2002), with the shortcomings that (1) they are unable to generalize to word sequences which are not present in the training set, and (2) they are not able to model longer temporal dependencies between words. Bengio et al. (2003) proposed neural language models, which address these shortcomings by generalizing through word representations. Mikolov et al. (2010) and Sundermeyer et al. (2012) extend neural language models to a recurrent architecture, where a target word w_t is predicted from the context of all preceding words w_1, w_2, ..., w_{t-1} with an LSTM (Long Short-Term Memory) neural network (a minimal illustrative sketch of such a model, together with the perplexity metric used to evaluate it, appears later in this section). There has also been recent effort on building language models conditioned on other modalities or attributes of the data. For example, Vinyals et al. (2015) introduced the neural image caption generator, where representations learnt from an input image by a CNN (Convolutional Neural Network) are fed to an LSTM language model to generate image captions. Kiros et al. (2014) used an LBL model (Log-Bilinear language model) for two applications - image retrieval given sentence queries, and image captioning. Lower perplexity was achieved on text conditioned on images than with language models trained only on text.

Introduction to Affective Text Generation: It is an important research problem to build automated systems for generating text which is emotionally colored, and can be utilized in applications such as dialogue systems for more human-like responses to user inputs or queries. Previous approaches have mostly used heuristics and syntactic rules for generating emotional text, with data-driven approaches limited to small hand-crafted corpora, and without quantitative evaluation on large conversational corpora. Language models can be utilized for generating text, since they are trained to predict the next word given a context of previous words. However, it has only recently become possible to generate human-like, syntactically correct sentences through neural language models such as the LSTM-LM, which can model longer temporal dependencies between words. In this section, traditional text generation approaches are described, with their limitations, which motivates the application of neural language models to generating emotional text.

Traditional Approaches for Emotional Text Generation: Previous literature on affective language generation has not focused on customizable state-of-the-art neural network techniques to generate emotional text, nor has it quantitatively evaluated models on multiple emotionally colored corpora. Mahamood & Reiter (2011) use several NLG (natural language generation) strategies for producing affective medical reports for parents of neonatal infants undergoing healthcare. While they study the difference between affective and non-affective reports, their work is limited only to heuristic-based systems and does not include conversational text. Mairesse & Walker (2007) developed PERSONAGE, a system for dialogue generation conditioned on extraversion dimensions. Extraversion is a human personality trait related to outgoing, energetic and social behavior. They trained regression models on ground truth judges' selections to automatically determine which of the sentences selected by their model exhibit appropriate extraversion attributes.
Keshtkar & Inkpen (2011) use heuristics and rule-based approaches for emotional sentence generation. Their generation system is not trained on large corpora and they use additional syntactic knowledge of parts of speech to create simple affective sentences. The work described in this dissertation addresses these shortcomings in previous literature by training neural language models on large conversational corpora, with perceptual user studies of generated emotional sen- tences. Deep Generative Modeling for Text: Contemporary to the recent interest in deep generative modeling for images, where GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders) and their variants are trained on large image datasets and generalize well thus creating believable synthesized image samples, there have also been efforts to adapt such techniques for text gen- eration, albeit with limited effectiveness. Hu et al. (2017) describe several reasons for the difficulty in applying such models in text generation. There are multiple factors of variation in text due to the complexity of underlying semantics, and autoencoder based approaches which attempt to decode and generate text from an underlying representation generally do not work well. The manifold on which nat- urally sounding believable sentences lie might correspond to a small region of the latent representation space, so sampling uniformly throughout the latent space is unlikely to always yield naturally sounding sentences. Bowman et al. (2016) pro- posed a RNN based VAE model to generate sentences conditioned on a latent global representation (which is a shortcoming of an RNN-LM model). They show that their model can be utilized for word imputation and can generate believ- able sentences through homotopy (linear interpolation). However the model does not perform well on the language modeling task and has higher perplexity than a RNN-LM model. Rajeswar et al. (2017) apply generative adversarial networks for 34 text generation conditioned on PCFGs (Probabilistic Context Free Grammars). However the generated sentences are not naturally sounding and the model does not generalize well. Yu et al. (2017) proposed SeqGAN, which applies reinforce- ment learning approaches to the generator network, thus addressing the problem of generating discrete symbols. Hu et al. (2017) addressed the problem of control- lable text generation based on attributes such as tense and sentiment. The authors train a hybrid model with VAEs and attribute discriminators on the IMDB text corpus (for sentiment labels) and TimeBank dataset (for tense annotations). This method is semi-supervised and achieves good performance (in sentiment and tense classification accuracy) for meaningful generated sentences. Recent Approaches to Emotional Text Generation: Subsequently after the research described in Chapter 6 was published, there has been an exploration of deep learning approaches to generate emotional responses in the context of chatbots and conversational agents. Asghar et al. (2017) suggest techniques to integrate emotional information into LSTM-based neural conversational models through affective word embeddings and specialized objective functions. Zhou et al. (2017) propose the Emotional Chatting Machine, which elicits emotional responses to an input sentence, given the target emotion category. The proposed model uti- lizes memory-based approaches and an emotion classifier to decide whether affec- tive or neutral responses are suitable in the context of the conversation. 
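As a concrete reference point for the recurrent neural language models discussed in this section, the sketch below implements a small LSTM language model and computes perplexity as the exponential of the average per-word cross-entropy. This is a minimal PyTorch illustration with an arbitrary vocabulary size and random token indices; it is not the Affect-LM or Speech-LM implementation described later in this dissertation.

import math
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer word indices w_1 ... w_T
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)   # next-word logits at every position

model = LSTMLanguageModel()
loss_fn = nn.CrossEntropyLoss()

# Toy batch of token indices; the target at position t is the word at t+1.
tokens = torch.randint(0, 1000, (8, 20))
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Perplexity is the exponential of the average per-word cross-entropy.
perplexity = math.exp(loss.item())

Lower perplexity indicates that the model assigns higher probability to the held-out word sequences, which is the evaluation criterion used throughout this dissertation for comparing language models.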
They compare their model with a baseline Seq2seq conversational model, but not with Affect-LM due to the different nature of the task (response generation vs language modeling). Ni et al. (2017) employ neural network based collaborative filtering methods to learn user preferences for products and generate user-specific reviews with sentiment. Wang et al. (2017) propose the Group Linguistic Bias Aware Neu- ral Response Generation model, which can generate responses in different talking 35 styles (which they refer to as linguistic biases among population groups). Affective Lexicons for Text Analysis: Along with data-driven approaches for text generation, there also has been work in the field of linguistics and social psychology on the roles played by words, including their behavioral and affec- tive meanings. To aid this analysis, several lexicons of affective words have been made available to researchers. This dissertation utilizes affective word lexicons, which merits their discussion in this section. The role of prior knowledge (such as through lexicons) is also very important for the analysis and generation of affective language. Pennebaker et al. (2001) introduced the LIWC (Linguistic Inquiry and Word Count) tool [http://liwc.wpengine.com] for analysis of text from a social psychology perspective. The core of the tool is the LIWC dictionary, where each category (80 in total) is defined based on the social and physiological meaning of words. Words are associated with each category, on the assumption that the categories themselves are linked to social, affective and cognitive processes. They include not only function words (such as pronouns, prepositions, articles, auxiliary verbs and conjunctions), but also emotion words, which are indicative of positive and negative sentiments. LIWC includes a hierarchical categorization of words corresponding to social, affective (emotional), cognitive (related to thoughts), per- ceptual and biological processes. Similar to LIWC, other affective lexicons such as SentiWordNet(Baccianellaetal.,2010)andtheNRCemotionlexicon(Mohammad & Turney, 2010) have been introduced. The NRC lexicon is available in English and other languages, and has categories corresponding to eight major emotional categories (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). It also has a higher number of vocabulary words (14,182 words) than the LIWC lex- icon. The LIWC tool has been used extensively in the research throughout this dissertation, particularly in Chapters 5 and 6. 36 Chapter 3 Facial Expression Recognition 3.1 Introduction The problem of representation learning when applied to human emotion and behavior understanding utilizes information from the visual, acoustic and verbal modalities. Facial expressions are of great importance in human communication sincetheyconveyinformationabouttheaffectivestateofthesubject. ActionUnits are muscle movements on the subject’s face which are activated during different emotional expressions (Ekman & Friesen, 1978). The problem of Action Unit (AU) recognition from facial images, as described in Chapter 2 is of great importance to thecomputervisionandaffectivecomputingcommunitysinceitnotonlyfacilitates the building of automated emotion recognition systems but also addresses the challenge of expensive labeling of AUs manually from images. In this dissertation, a novel multi-label CNN (Convolutional Neural Network) approach is introduced to address the problem of AU recognition. 
This work is the first attempt to learn AU occurrences using CNNs; after its publication, several authors have reported their work on different representation learning approaches to this problem (Chu et al., 2016; Bishay & Patras, 2017). Figure 3.1 shows some common AUs which are considered for recognition in this work. The approach described in this chapter addresses the following research questions:

• Q1.1 How can we learn feature representations from facial images for the task of Action Unit recognition that give comparable performance to engineered features?

• Q1.2 Is it possible to learn a shared representation across Action Units which also learns their co-occurrences?

• Q1.2(b) Is it possible to achieve cross-domain robustness (through training and testing on different datasets) with shared AU representations?

This work has three main contributions: (1) the problem of AU recognition was posed as a multi-label representation learning problem, and instead of training a separate CNN for each AU, a single CNN was trained on image pixels with a multi-label softmax classification loss, which yielded competitive results on various AU datasets; (2) cross-domain robustness was achieved when training and evaluating the multi-label CNN on different datasets; (3) the multi-label CNN learns correlations between AU occurrences which mirror those obtained from ground truth labels rated by human annotators. In this chapter the model description and experimental details are described in Sections 3.2 and 3.3 respectively, including network architecture and datasets used for training/evaluation. Section 3.4 discusses experimental results and Section 3.5 concludes the chapter.

Figure 3.1: Common Action Unit (AU) labels as defined by Ekman & Friesen (1978) utilized in recognition experiments: (a) AU 1 (Inner Brow Raiser), (b) AU 2 (Outer Brow Raiser), (c) AU 4 (Brow Lowerer), (d) AU 5 (Upper Lid Raiser), (e) AU 6 (Cheek Raiser), (f) AU 9 (Nose Wrinkler), (g) AU 12 (Lip Corner Puller), (h) AU 15 (Lip Corner Depressor), (i) AU 17 (Chin Raiser), (j) AU 20 (Lip Stretcher). Images are reproduced from https://www.cs.cmu.edu/face/facs.htm and show both upper (AUs 1, 2, 4, 5, 6 and 9) and lower face (AUs 12, 15, 17 and 20) actions.

3.2 Multi-label CNN Model

Since multiple AUs might be present on a single facial image, but previous literature mostly trains a separate model for each AU occurrence, this motivates detecting AUs as a multi-label binary classification task. Assume that there are C AU categories and N data points {x_i}, where each x_i denotes the i-th image and y_{ij} ∈ {0,1} is a label denoting presence/absence of the j-th AU. Let f(x_i) be the transformation learnt by the neural network at the hidden layer just prior to computation of the loss function. The multi-label softmax classification loss is defined as:

J = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{p}_{ij}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C^{+}(i)} \log(\hat{p}_{ij})    (3.1)

where C^{+}(i) is the number of AUs actually present in image x_i, and \hat{p}_{ij} is the prediction probability from the last softmax layer for the i-th input image and the j-th Action Unit. It is to be noted that the network attempts to learn representations which are simultaneously discriminative of all AUs. Thus, instead of learning a representation independently for each AU, a joint representation is learnt.
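For clarity, the loss in Equation 3.1 can be written in a few lines of NumPy, assuming the network's final layer outputs one logit per AU category which is normalized with a softmax. This sketch is only illustrative of the objective; the actual experiments use a custom layer in the Caffe toolbox, as described below.

import numpy as np

def multilabel_softmax_loss(logits, labels):
    """Equation (3.1): mean over images of the summed negative log softmax
    probability of every AU that is present.

    logits: (N, C) raw network outputs for N images and C AU categories.
    labels: (N, C) binary matrix, labels[i, j] = 1 if AU j occurs in image i.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1).mean()

# Toy example: 4 images, 10 AUs, random activations and labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
labels = (rng.random((4, 10)) < 0.3).astype(float)
J = multilabel_softmax_loss(logits, labels)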
Subsequent to mapping each input facial image x_i to its joint AU representation, the prediction probabilities need to be thresholded to obtain presence/absence predictions for each AU. To address this problem, it was found that a simple QDA (Quadratic Discriminant Analysis) classifier can be trained to map the shared representation at the CNN top layer to presence/absence labels, while achieving good performance. Implementing a QDA classifier after the CNN stage requires separate classifier training for each AU, but is advantageous because (1) a separate CNN does not have to be trained for each AU, and (2) the QDA classifier does not require hyper-parameter validation, which is a time-consuming process during training.

The open-source Caffe toolbox (Jia et al., 2014) was used for the AU detection experiments. Since multi-label softmax regression is not built in by default, a custom multi-label classifier layer was built for this task. The CNN contains a large number of parameters, and due to limited participant data, three techniques were used to avoid over-fitting the dataset: (1) two dropout layers between fully-connected ones, (2) data augmentation by random mirroring and cropping when creating data batches for the CNN, and (3) usage of L2 regularization with a suitable validated learning rate. Figure 3.2 shows the architecture of the CNN trained on the BP4D dataset (Valstar et al., 2015) with 10 AU labels. Preliminary experiments also indicate that a Mean-Variance Normalization layer improves detection performance. This layer is implemented in the Caffe toolbox, and normalizes the input data to have zero mean and unit variance before passing it to higher layers.

Figure 3.2: Architecture of the multi-label CNN. There are two convolutional stages and two fully-connected layers interspersed by max-pooling layers, with the size of the output layer equal to the number of AUs.

3.3 Experimental Methodology

3.3.1 Datasets

The AU recognition experiments were performed on three main datasets:

• Extended Cohn-Kanade dataset (CK+): This dataset was introduced in Lucey et al. (2010) and consists of 582 fully FACS-coded image sequences of both spontaneous and nonspontaneous facial expressions from 123 subjects. Each frame containing an AU was either given an intensity degree on a 7-point ordinal scale or an 'unspecified intensity' label. The lack of an AU label indicates absence. Since AU detection is posed as a multi-label binary classification problem, any frame containing an intensity of 3 or higher was mapped to a positive presence label. The CK+ dataset is substantially smaller than DISFA or BP4D. Thus it is used strictly for the testing phase in cross-dataset experimentation.

• DISFA (Denver Intensity of Spontaneous Facial Expressions): This dataset was introduced in Mavadati et al. (2013) and contains stereo videos of 27 subjects spontaneously generating facial expressions when watching an emotive video stimulus, with a total of 54 videos. Each video consists of 4845 FACS-coded frames for 12 AUs - AUs 25 (Lip Part) and 26 (Jaw Drop) in addition to the 10 AUs shown in Figure 3.1 - with presence, absence, and intensity labels. For this dataset, only presence and absence labels were used.
Fortheexperimentsinthischapter,onlythe10com- mon AUs shared between the CK+ and DISFA datasets were used. Binary pres- ence and absence labels were available, so these were straight-forwardly parsed without thresholding the AU intensities for every frame. 3.3.2 Dataset Preprocessing The datasets described in Section 3.3.1 were preprocessed, with an OpenCV face detector (frontal) being applied on every frame to obtain the bounding box of the face. Subsequently, the face was cropped out. Though the model training and evaluation is subject-independent, it is assumed that multiple facial images of each participant are available for training and testing (such as multiple video frames), wherethemeanfacialimageofeachparticipantcanbecomputedformean normalization. This technique is similar to subtraction of a person’s face from his/her “neutral” face to obtain a subject-independent analysis of expressions. It has been widely used in the literature (Tian et al., 2005), and it is assumed that the mean face averaged over all frames is very similar to the neutral face. This also removes the need for a separate neutral facial image to aid the classification. After subject specific normalization, global mean normalization is also performed, and each pixel’s intensity is divided by 255 to keep it between 0 and 1. It is worth 42 noting that this is a simple normalization method, and usage of advanced methods such as facial landmarks could improve performance. This would be helpful for real-world videos where the subject is not always front facing the camera but is at an angle or moving his/her head, thus introducing more variability in the images. In that case, normalizing all angular movements and head rotations to a standard would be expected to improve performance. ThesparsityoftheAUoccurrencesalsoposesanadditionalchallenge. Previous work (Sandbach et al., 2012) in the literature reports that performance improve- ments are obtained by balancing the dataset prior to training. However, unlike in a single label setting, balancing one AU may result in another AU being unbalanced, since all AUs are jointly considered in the multi-label loss function. This problem was investigated by trying out different approaches to multi-label balancing, so that the fraction of positive/negative classes across all AUs was as balanced as possible. While leading to some improvement overall, this did not result in any significant performance improvements. 3.3.3 Train/Validation/Test Splits A leave-one-subject-out testing scheme was used, and the DISFA and BP4D datasets were split into training, validation and testing sets. For DISFA and BP4D in each testing fold, 75% of the subjects were used for training, and the remaining 25% for validation according to a random seed. For BP4D dataset, the train- validation split was done according to that predefined in the FERA 2015 challenge guidelines (Valstar et al., 2015). All experiments were performed in a subject independent manner for all datasets. In addition to the multi-label CNN being trained and evaluated on the same dataset, a cross-dataset evaluation scheme was also setup. The training split of the 43 DISFA dataset was used for training and validation, with subsequent testing on the CK+ and the BP4D datasets. Similarly, the training split of the BP4D dataset was used for training and validation, with testing done on the DISFA and the CK+ datasets. 
With the CK+ dataset, only testing was performed, since the number of images in the dataset are inadequate to train the large number of weights in the network (≈100,000) without risk of over-fitting. It also facilitates measurement of the generalization ability of the network, and a comparison with baseline results reported in existing literature. For the cross-dataset experiments only 10 AUs were considered which all three datasets share, which are: 1 (Inner brow raiser), 2 (Outer brow raiser), 4 (Brow lowerer), 5 (Upper Lid raiser), 6 (Cheek raiser), 9 (Nose wrinkler), 12 (Lip corner puller), 15 (Lip corner depressor), 17 (Chin raiser) and 20 (Lip stretcher). Figure 3.1 shows examples of facial images corresponding to these selected AUs. 3.3.4 Hyper-parameter Validation The relevant hyper-parameters to tune for the CNN, along with search range were (1) Optimal training iterations (the number of iterations beyond which over- fitting occurs): 5000 to 10000 (2) Base learning rate of the network: 0.0001 to 0.01 (3) Weight decay parameter : 5e− 3 to 5e− 6 (4a) Kernel size for the convolu- tional layers : 5 to 15 (4b) Kernel size for the pooling layers : 2 to 4 (5) Learning momentum: 0.5 to 0.9. A detailed description of the hyper-parameters can be found in Bergstra et al. (2011). Due to the large number of hyper-parameters involved, a random search was conducted over the hyper-parameter space in place of a grid search. It has also been reported in the literature, that random search is more beneficial in this setting (Bergstra & Bengio, 2012). After validation is 44 complete, the network is retrained with a combination of training and valida- tion data prior to leave-one-out testing. Early stopping is used for training, with the optimal model being selected 1000 iterations after the validation error starts increasing. The top layer representation from the trained CNN is extracted, and for each AU, a QDA (Quadratic Discriminant Analysis) classifier was trained to predict presence/absence of the AU in the frame of interest. 3.3.5 Performance Metrics To measure the performance of the CNN classifier on the task of AU detection, two metrics are utilized : (1) Accuracy and (2) 2AFC (Two-alternative Forced Choice) score. These metrics were used due to the ease of comparison with base- lines in existing literature, and the desired property that the choice of evaluation metrics should be insensitive to the amount of skew in the testing set, as reported by Jeni et al. (2013). The accuracy is equal to the percentage of testing exam- ples correctly classified, while the 2AFC score is the fraction of correctly classified examples in a 2AFC trial experiment, which has been shown to be a good approx- imation to the AUC (Area under ROC curve) score (Valstar et al., 2015). 3.3.6 Baseline Approaches The performance of the multi-label CNN approach is compared with the fol- lowing baseline approaches. It is to be noted that all previous approaches do not evaluate on the same datasets, which necessitates comparison with several previ- ous publications, each of which are evaluated on a subset of the CK+, DISFA and BP4D datasets. 45 1. The approach described in Song et al. (2015), which exploits the property of ActionUnitsparsity, whereoutof45AUs, onlyafewareactiveonafacialimage at any moment. They also propose a novel Bayesian graphical model which learns the co-occurrences between AUs in addition to the sparsity property. 
For comparison, the multi-label CNN's performance is compared to their results obtained on the DISFA dataset in terms of accuracy score.

2. Jiang et al. (2014), where the authors propose a decision-level fusion with LBP (Local Binary Pattern) and LPQ (Local Phase Quantization) on the DISFA dataset. The multi-label CNN's performance is compared to their results obtained on the DISFA dataset in terms of the 2AFC metric.

3. The baseline approaches mentioned in the FERA 2015 Action Unit recognition challenge (Valstar et al., 2015), where linear SVM (Support Vector Machine) classifiers are trained for the task of AU occurrence detection, utilizing geometric (facial landmark locations) and appearance (Local Binary Gabor Patterns) feature sets. They report both accuracy and 2AFC scores for their approaches.

Figure 3.3: Joint AU representations learnt by the multi-label CNN: (a) AU 2 (BP4D), (b) AU 6 (BP4D), (c) AU 2 (DISFA), (d) AU 6 (DISFA). Representations are shown for AU 2 (Outer Brow Raiser) and AU 6 (Cheek Raiser) on the BP4D and DISFA datasets. The multi-label CNN is trained on BP4D and representations are obtained for both the BP4D and DISFA datasets. AU presence is indicated in red; absence in blue.

3.4 Experimental Results

3.4.1 Visualization of Joint AU Representations

Figure 3.3 shows the joint representations learnt from the DISFA dataset, corresponding to Action Units 2 (Outer Brow Raiser) and 6 (Cheek Raiser). The representations obtained from the topmost layer were visualized. The figures show the same joint space (corresponding to the same dataset), in which different AU presence clusters occur in different regions. In this case, the multi-label CNN was trained on the BP4D dataset, and subsequently utilized in feed-forward mode for input images belonging to both the BP4D (same domain as training) and the DISFA (different domain) datasets. The representations are discriminative of AU presence, including those from the test set, of which the network has no knowledge during the training phase. Also, since there is almost no co-occurrence between these two AUs, the regions with dominance of these facial actions (colored in red) have no overlap. The joint representations are not only discriminative of AU presence, but are also indicative of co-occurrences between AUs, thus addressing research question Q1.2.

3.4.2 Same-dataset Experiments

For a quantitative evaluation of performance, two main metrics were selected to report leave-one-subject-out AU detection performance - (1) Accuracy and (2) 2AFC (Two-alternative Forced Choice) score. The 2AFC score is a suitable choice of evaluation metric which is insensitive to the amount of skew in the testing set, as reported by Jeni et al. (2013). The 2AFC score is the fraction of correctly classified examples in a 2AFC trial experiment, which has been shown to be a good approximation to the AUC score (Area under ROC curve) (Valstar et al., 2015).

Table 3.1: Leave-one-out classification performance of the multi-label CNN on the DISFA dataset (12 Action Units). The results indicate comparable performance to the baseline approaches in Song et al. (2015) and Jiang et al. (2014). Note that the baseline approaches have not reported the same metrics on DISFA, thus necessitating comparison with both approaches.

Action Unit           Accuracy   2AFC Score
1                     87.5       0.702
2                     88.6       0.715
4                     78.5       0.704
5                     90.6       0.773
6                     83.9       0.855
9                     90.2       0.829
12                    86.1       0.913
15                    86.7       0.683
17                    81.4       0.679
20                    88.4       0.639
25                    78.5       0.838
26                    74.9       0.757
Average               84.6       0.757
Song et al. (2015)    86.8       -
Jiang et al. (2014)   -          0.77
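As an illustration of the 2AFC metric reported in Table 3.1 and the following tables, the sketch below computes, for one AU, the fraction of (positive, negative) frame pairs in which the positive frame receives the higher detection score, with ties counted as one half; this pairwise fraction coincides with the area under the ROC curve. The function and the toy scores are illustrative and are not the exact evaluation code used for these experiments.

import numpy as np

def two_afc_score(scores, labels):
    """Fraction of (positive, negative) frame pairs in which the positive
    frame receives the higher detection score (ties count as 0.5).
    This pairwise fraction equals the area under the ROC curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Pairwise comparison of every positive frame against every negative frame.
    diff = pos[:, None] - neg[None, :]
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Toy example: detection scores for one AU over six frames.
print(two_afc_score([0.9, 0.2, 0.7, 0.4, 0.8, 0.1], [1, 0, 1, 0, 1, 0]))  # 1.0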
Tables 3.1 and 3.2 show results for single dataset experiments, where the multi-label CNN is trained and subsequently evaluated on the same dataset. When trained on DISFA, the average accuracy and 2AFC scores obtained are 84.6% and 0.757 respectively. Further, the average 2AFC score (0.757) is comparable to the performance reported in Jiang et al. (2014), where the authors use region-specific features, along with prior knowledge of AU muscle contractions, to obtain average 2AFC scores of 0.77 for feature-level fusion, and 0.805 for decision-level fusion. The corresponding scores for BP4D training are 75.69% and 0.751, which outperforms the FERA 2015 reported baselines of 72.0% accuracy and 0.427 2AFC. These results address research questions Q1.1 and Q1.2, showing that a shared representation can be learnt for AU occurrences which obtains comparable performance to manually engineered feature sets.

Table 3.2: Leave-one-out classification performance of the multi-label CNN on the BP4D dataset (10 Action Units). The multi-label CNN performance outperforms the baseline metrics reported for the FERA 2015 Facial AU occurrence challenge.

Action Unit             Accuracy   2AFC Score
1                       70.85      0.643
2                       76.11      0.675
4                       73.53      0.720
5                       85.49      0.619
6                       73.82      0.839
9                       81.36      0.763
12                      79.49      0.878
15                      65.97      0.642
17                      65.25      0.665
20                      86.08      0.728
Average                 75.80      0.717
7-AU Avg                75.69      0.751
Valstar et al. (2015)   72.00      0.427

3.4.3 Cross-dataset Experiments

Cross-domain AU detection experiments are also conducted, and the results are shown in Table 3.3. Multi-label CNN and QDA classifiers were each trained on the DISFA and BP4D datasets, and subsequently tested on each of three datasets - DISFA, BP4D and CK+.

Table 3.3: Cross-dataset generalization performance of the multi-label CNN. Both accuracy and 2AFC metrics are reported. The network trained on DISFA generalizes well to BP4D, where a relative accuracy improvement of 1.54% is obtained over the BP4D-only baseline network with a slight decrease of 0.02 in the 2AFC score. The performance obtained on the CK+ dataset (accuracy 88.14% and 2AFC 0.78) is similar to that reported in Jiang et al. (2014). When trained on BP4D and tested on CK+, the network obtains an average accuracy of 77.7%, and a 2AFC score of 0.759.

Source        BP4D                BP4D                DISFA               DISFA
Target        CK+                 DISFA               CK+                 BP4D
AU        Accuracy  2AFC      Accuracy  2AFC      Accuracy  2AFC      Accuracy  2AFC
1         86.05     0.791     83.81     0.66      85.21     0.739     75.91     0.676
2         85.11     0.86      81.17     0.717     86.88     0.786     79.69     0.637
4         79.94     0.73      83.26     0.74      75.58     0.669     70.95     0.678
5         93.11     0.723     92.22     0.762     93.73     0.79      92.57     0.706
6         61.87     0.723     70.33     0.87      85.64     0.729     69.78     0.818
9         85.37     0.949     90.45     0.841     93.42     0.974     83.87     0.794
12        55.42     0.889     62.36     0.873     91.07     0.887     63.63     0.838
15        75.59     0.493     75.17     0.617     85.13     0.618     76.67     0.597
17        63.08     0.636     68.85     0.585     88.14     0.803     66.92     0.634
20        92.16     0.798     95.79     0.635     94.59     0.807     93.35     0.525
Average   77.77     0.759     80.34     0.73      88.14     0.78      77.34     0.69

The CNN/QDA trained on DISFA achieves an average 2AFC score of 0.690 on BP4D, and 0.780 on the CK+ dataset. When trained on BP4D, the model achieves an average 2AFC of 0.73 on DISFA, and 0.759 on the CK+ dataset. The network trained on DISFA generalizes well to BP4D, where a relative accuracy improvement of 1.54% is obtained over the BP4D-only baseline network with a slight decrease of 0.02 in the 2AFC score. The performance obtained on the CK+ dataset (accuracy 88.14% and 2AFC 0.78) is similar to that reported in Jiang et al.
(2014) (accuracy 80.56% and 2AFC 0.80), which is also in a cross dataset setting, training on MMI database (Pantic et al., 2005) and testing on CK+. This shows that the multi-label CNN approach is robust across datasets. When trained on BP4D and tested on CK+, the network does not generalize as well, obtaining an average accuracy of 77.7%, and a 2AFC score of 0.759. These results address research question Q1.2(b), showing that the multi-label CNN is robust when evaluated on datasets with differences in domain, such as subject identity and lighting conditions. 50 Figure 3.4: Pairwise correlations among AUs for ground truth(left figure) and pre- dicted labels(right figure) (blue indicates -1 phi-index, red indicates +1). The sim- ilarity between the figures demonstrates that the CNN learns correlations among AUs. For example, AUs 1 (Inner brow raiser) and 2 (Outer brow raiser) are highly correlated in the AU predictions, which follows the ground truth correlations. 3.4.4 Correlations between Action Units A visualization of AU co-occurences is also performed to obtain an insight into how the multi-label CNN learns correlations between AU occurrences. Figure 3.4 shows a heat map of pairwise correlations among the 12 AUs for the development partition of the DISFA dataset, constructed using a 75%-25% split. The label set is augmented with a 13th label to indicate neutral facial expressions, before performing training and validation of the network. Since label presence/absence is binary, the Phi-coefficient is used as a measure of AU correlation. The figure shows the respective correlations measured from the ground truth labels, and the prediction labels generated by the proposed approach. The heat-maps are very similar, showing that the CNN is able to learn a good approximation of the ground truth correlations. For example, AUs 1 (Inner brow raiser) and 2 (Outer brow raiser) are highly correlated in the AU predictions, which follows the ground truth 51 correlations. AUs 6 and 12 are correlated which corresponds to the Duchenne smile. Further, the 13th label (neutral) is negatively correlated with the other AUs, which is evident in Figure 3.4. 3.5 Conclusions In this chapter, a multi-label convolutional neural network based approach is described for the task of AU detection from facial images. In contrast to previous approaches, image features are learnt directly from image pixels instead of hand- engineered feature extraction techniques. The approach is inherently multi-task, where a network learns a shared representation among different AUs, and each AU- specific classifier can utilize the learnt features for effective prediction. Both same- dataset and cross-dataset classification experiments are performed on the CK+, DISFA and BP4D datasets and results indicate that convolutional neural networks are not only good at learning discriminative features for the same dataset, but that the learnt features are also robust when evaluated in a cross-dataset setting. Per- formances reported in terms of accuracy and 2AFC scores achieve performance competitive with previous approaches. This addresses research questions Q1.1 andQ1.2, showing the effectiveness of this approach in learning representations of AUs from facial images, which achieve comparable performance to manually engi- neered feature sets. Further, the network also learns correlations among the AUs, thus obviating the need to design and train multiple networks for each AU. 
This approach could be extended to temporal modeling of AU occurrences in a multi- label setting, and regression of AU intensities from facial images. After this work was published, several authors propose other representation learning approaches for AU recognition as discussed in Chapter 2, including temporal models such as 52 LSTM (Long Short Term Memory) in Chu et al. (2016) and models for handling multi-label imbalance in AU datasets in Dapogny et al. (2017). 53 Chapter 4 Emotion Recognition from Speech 4.1 Introduction Theanalysisofbehaviorandaffectcanbestudiedforallmodalitiesofdatarele- vant to human communication including verbal messages, facial images and speech signals. Multi-labelCNNs(ConvolutionalNeuralNetworks)havebeenshowntobe effectiveatlearningvisualexpressionsfromfacialimagesasdescribedinChapter3. Human communication also makes use of non-verbal information which are present in the speech signal and backchannels such as um, hmm and huh. This provides additional cues about gender, emotion and or whether the speaker has a breathy, modal or tense voice (Kane et al., 2013b). These non-verbal cues, while not related to the syntactic or linguistic content of spoken utterances, are very important for an effective analysis of spoken conversations. This has motivated a lot of research work into building automated systems which additionally utilize this additional information for the analysis of conversational speech, with applications such as dialogue systems (Cassell et al., 2001) and voice quality classification (Degottex et al., 2014). The research work described in this chapter is on investigating different rep- resentations of the acoustic signal and their application to the problem of emo- tion classification from speech. The task of emotion recognition has relied on standard acoustic features such as MFCCs (Mel-Frequency Cepstral Coefficients), fundamental frequency or zero-crossing rate to train classifier models. However 54 there has not been much focus on open problems such as learning representations directly from the speech signal, analyzing the effect of glottal flow extraction from the speech signal for emotion recognition or unsupervised models to estimate the glottal flow signal itself. The primary research questions addressed in this chapter are: Q1.3. Can discriminative representations be learnt from speech spectrograms for the task of emotion recognition ? Q1.4 Doesremovingtheinfluenceofthevocaltractfromthespeechsignalthrough glottal flow extraction improve performance on the task of speech emotion recog- nition ? Q1.5 Can the glottal flow waveform be extracted from the speech signal through an unsupervised data-driven framework ? Spectrograms are frame-based spectra (obtained from the energy of each time frame and frequency bin) which provide useful abstractions in the field of signal processing, and are suitable for representation learning. The work in this disserta- tion also focuses on the process of estimating the glottal flow waveform from the speech signal, and its utilization in the task of speech emotion recognition. During speech generation, the airflow at the glottal source is shaped by resonances in the vocal tract before being emitted as the speech signal. It has been observed that glottal features are useful for the task of non-verbal speech analysis (Scherer et al., 2013), which motivates representation learning experiments on the glottal flow signal in this dissertation. 
55 4.2 Models for Representation Learning In this section, the following machine learning models are described in detail. They are used for representation learning from speech and glottal flow spectro- grams in the experiments described in this chapter. Unsupervised models (namely autoencoders) learn representations directly from spectrograms, and supervised models utilize the representations of the speech signal learnt by the autoencoders for the task of speech emotion recognition. 4.2.1 Stacked Denoising Autoencoders The autoencoder is a neural network typically trained to learn a lower- dimensional distributed representation of the input data. The input dataset of N data points{x i }∀i∈{1,2,3...N} is passed into a feedforward neural network of one hidden layer, where the hidden layer is a bottleneck layer with activations y i ∀i∈{1,2,3...N}. The activations are obtained as (for the purposes of this work, tanh activations are assumed) y i = tanh(Wx i +b) (4.1) The output of the autoencoder z i is obtained from the autoencoder activations as: z i = W 0 y i +b 0 (4.2) For an autoencoder with tied weights, W = W 0 T . This would require less param- eters to be trained, thus acting as a regularizer. The autoencoder is trained using backpropagation, much as in an ordinary feedforward neural network. The loss function employed for training the autoencoder is the SSE (Sum of Squared Error 56 Loss) L = P i=N i=1 kx i − z i k 2 . Vincent et al. (2008) introduce denoising autoen- coders, where the data point x i is corrupted (by randomly setting a fraction of the elements to zero) to produce ˜ x i , from which the original clean data point x i is reconstructed by the autoencoder. This enables the autoencoder to learn not only latent information in the data, but also robust dependencies among the elements in x i . When training denoising autoencoders in a greedy stacked fashion, there are multiple layers with weights W k−1 and W k for the k-th hidden layer, where the autoencoder activations at the layer are (where y i0 = ˜ x i ) : y ik = tanh(W k−1 y ik−1 +b k−1 ) (4.3) 4.2.2 Deep Autoencoders In a deep autoencoder, the network is trained to reconstruct the input cor- rupted with noise much like a normal stacked autoencoder, however, the encoder and decoders themselves have three or four layers. As opposed to greedy pre- training of the stacked denoising autoencoder, in a deep autoencoder the weights for all layers in the network are trained jointly through backpropagation. Deep autoencoders were investigated in Feng et al. (2014) for noisy reverberant speech recognition, and Shao et al. (2015) for missing modality face recognition. They are known to often yield more interesting features than greedily trained stacked autoencoders, which motivates this work on training deep autoencoders along with stacked architectures for learning affective representations. 4.2.3 Recurrent Autoencoder A recurrent neural network is widely used for learning from temporal data, where the k-th hidden layer of the network at time t is a function of the k− 1-th 57 Figure 4.1: LSTM-RNN autoencoder for encoding and obtaining latent representa- tions from sequential data. The hidden bottleneck layer is subsequently visualized and expected to be discriminative of affective attributes. The RNN autoencoder is trained on frame representations obtained from the stacked denoising autoencoder. The latent representation of the entire sequence is considered to be the average of bottleneck representations obtained at each frame. 
hidden layer at time t, and the k-th layer itself at the previous time step t− 1, as shown below: h k,t =f(h k−1,t ,h k,t−1 ) (4.4) The output of the network at time t, y t is a function of the hidden layer below it. When the recurrent neural network is trained as an autoencoder, the tar- get sequence is set to be equal to the input sequence. An RNN can be trained using backpropagation through time, however the training suffers from the vanish- ing gradient problem (Pascanu et al., 2013). To address this issue, Hochreiter & Schmidhuber (1997) proposed LSTM networks, which can model long-term tempo- ral dependencies, and do not suffer from vanishing gradients. In the experiments described here, single-hidden layer recurrent BLSTM-autoencoders are used to learn dimensionally reduced representations at utterance level. BLSTM (Bidirec- tional LSTM) networks can model temporal dependencies in both directions rela- tive for each time-step compared to an LSTM which can model only one direction. 58 The utterance representation can be obtained from the BLSTM cell activation for the last frame of the utterance, or by averaging the hidden layer BLSTM cell acti- vations across all frames. Utterance level representations are generated through an average over all frames, as visualization experiments indicate that this leads to better discrimination between emotion categories. Similar to the denoising autoen- coder, the SSE loss is used for training the recurrent autoencoder. An architectural diagram of an LSTM recurrent autoencoder is shown in Figure 4.1, along with cells and the input/output sequences. In this section unsupervised models for learning representations from speech have been presented. However, for a quantitative evaluation of the effectiveness of these representations, the learnt features are subsequently utilized for training super- vised neural network models, which are described in this sub-section. While it is possible to perform representation learning by training supervised models directly on the raw data, unsupervised models have the advantage of leveraging unlabeled data, which would be costly to manually annotate with the ground truth. The supervised neural networks used in this work are described in the following sub- sections. 4.2.4 Pre-trained MLP (Multi-layer Perceptrons) models Intheemotionclassificationexperimentsdescribedhere, themodelsaretrained both at frame and at utterance level considering emotion as a target. Previ- ous research (Hinton & Salakhutdinov, 2006) has shown the benefit of training a stacked autoencoder in an unsupervised manner on data, and then subsequently adapting to supervised target labels. This process is referred to as "pre-training", having the effect of model regularization and being specifically suitable for limited amounts of data. Motivated by this potential for pre-training, the trained stacked 59 autoencoder models described above are augmented with a softmax classification layer on top to transform them into supervised MLP networks. 4.2.5 BLSTM (Bidirectional LSTM)-RNNs BLSTM-RNNs are used for sequence classification with a target replication scheme in this work, where the target class for each time step is assigned to the emotion category of the entire utterance. For prediction of an input sequence x ={x 1 ,x 2 ,x 3 ,...,x T } the predicted emotion category is obtained by a majority voting scheme on the predictions of individual time steps in the sequence, which correspond to context frames in the utterance. 
The BLSTM-RNN has two lay- ers, each of cell size 30, with a four-dimensional softmax layer on top for emotion classification. Validation experiments are performed on the RNN to find the best model, where each hyper-parameter setting is obtained from random sampling on the hyper-parameter ranges: (1) Learning Rate : [6e-6,8e-6,1e-5,2e-5,4e-5] (2) Momentum : [0.7,0.8,0.9] (3) Input noise variance : [0.0,0.1,0.2,0.3] (4) Weight noise variance : [0.0,0.05,0.1,0.15,0.2] (5) Batch size: 1300 utterances (6) Maxi- mum Epochs : 100. To improve generalizability of the BLSTM, random noise is addedtotheinputsequencesandthemodelweightsineveryepoch, andcanbecon- trolled by the noise variance hyperparameters. The BLSTM-RNN implementation is from the CURRENNT toolbox (Weninger et al., 2015). In the following sub- sections, the methodology of the experiments is described which address Research Questions Q1.3 and 1.4 introduced earlier. Firstly, pre-processing and feature extraction steps are described for the IEMOCAP dataset (Busso et al., 2008), a standard multimodal dataset for emotion recognition. Further, the experimental setup is also discussed, including unsupervised training from speech spectrograms 60 and classification for the emotion recognition task utilizing the learned represen- tations. 4.2.6 Dataset Preprocessing The IEMOCAP dataset is split into seven subjects (training set), and three subjects (validation set). Each utterance is annotated by three annotators for emotion labels and affective dimensions. The primary emotions being considered are Happy, Angry, Sad and Neutral. These emotions are not only widespread throughout the corpus, but also have been considered in previous work on speech emotion recognition (Han et al., 2014). For the supervised classification task, only utterances which have at least two annotators in agreement about the utterance emotionareconsidered. Inadditiontotheprimaryemotioncategories,theaffective dimensions valence and activation are also considered. Activation is a measure of a subject’s excitability when communicating emotion. High activation would correspond to emotions such as excitement and anger, while low activation would be displayed in sadness. Valence is the measure of the subject’s sentiment, where anger and sadness correspond to negative valence, while happiness corresponds to positive valence. The affective dimensions are annotated on a Likert scale of 1-5, where the dimension ratings across all three annotators have been averaged. 4.2.7 Spectrogram and Feature Extraction Spectrograms are extracted from the utterances, with a frame width of 20 ms and a frame overlap of 10 ms. In two different settings, 513 and 128 FFT (Fast Fourier Transform) bins are extracted, to investigate whether fine-grained frequency information is necessary to produce better affective representations. A 61 log-scale is used in the frequency domain, since a higher emphasis in lower fre- quencies has been shown to be more significant for auditory perception. Log- Mel spectrograms (Ezzat & Poggio, 2008) are also extracted from the utterances, with 40 filters in the Mel filterbank. Standard features from the COVAREP tool- box (Degottex et al., 2014) are also used; not for visualization but for comparison with the learned representations in an emotion recognition task. 4.2.8 Temporal Context Windows Each frame of the spectrogram is 20 ms in duration, which is generally accepted in the speech recognition community (Huang et al., 2001). 
However, for learning emotions and other affective attributes, prior work has shown that longer temporal windows of the order of a hundreds of milliseconds is useful (Kim & Provost, 2013; Han et al., 2014). Consequently, the concept of temporal context windows is explored, where each input feature to the autoencoder is a window of consecutive spectral frames. If K is a parameter denoting the length of past/future context, and X t is a spectrogram frame at time t, the input temporal context window at timet is [X t−K ,...,X t−2 ,X t−1 ,X t ,X t+1 ,X t+2 ,...,X t+K ]. Experiments in this work have been conducted with K = 2. 4.2.9 Glottal Flow Extraction Paralinguistic and affective attributes such as emotion, valence and activation should be speaker and phonetic invariant, and not sensitive to changes in the speaker’s identity or verbal content (phoneme or words being uttered). The effect of filtering out the factors of variation (speaker identity and phonetic informa- tion) from the speech signal prior to training of the denoising autoencoder and the 62 BLSTM-RNN could improve classification performance. The glottal source wave- form has this property and is obtained by glottal inverse filtering of the speech sig- nal using the Iterative Adaptive Inverse Filtering (IAIF) algorithm (Alku, 1992). The interested reader is referred to Chapter 2 for an introduction to the glottal flow signal. While the signal obtained through inverse filtering may be an approx- imate of the actual glottal waveform and potentially result in experimental bias, the IAIF algorithm is chosen because it is widely used in the literature. Besides, glottal flow based features such as Normalized Open Quotient (NAQ) and Quasi- open Quotient (QOQ) have shown great success at being discriminative features when classifying tasks such as depression assessment (Ghosh et al., 2014), and voice quality classification (Scherer et al., 2013). 4.2.10 Autoencoder Training Research Question 1.3 is addressed by training unsupervised representation learning models (denoising autoencoders with different architectures) on spectro- gram frames obtained from the training set in the IEMOCAP dataset. The dis- criminativeness of these representations is inspected not only from visualization experiments, but also subsequent supervised emotion classification experiments conductedontheIEMOCAPdataset. Experimentswereperformedusingtemporal context windows of 1(single frames) and 5, while the feature sets and representa- tions are utilized for training are: (1) 513-bin FFT spectrograms (2) 128-bin FFT spectrograms and (3) Log-Mel features. The following autoencoder architectures are utilized for training: • Stacked denoising autoencoder described in Section 4.2.1. Both Tied and Untied configurations are trained. For the Tied configuration, the encoder and decoder have shared weights, which is not followed in the Untied configuration. 63 • Deep autoencoder described in Section 4.2.2. For the deep autoencoder, the Tied and Untied configurations were not considered for simplicity. Similartomostpriorworkinthisdomain(Mascietal.,2013), apyramidalstacking approach is utilized for the autoencoders, where the number of neurons is halved for the next higher layer. For example, the model trained on 513-bin FFT spec- trograms with context window of 1 frames has the architecture 513-256-128-64. Similarly the model trained on Log-Mel spectrograms with a context window of 5 frames has the configuration 200-100-50. 
For all architectures, the autoencoder is pretrained with batched stochastic gradient descent for 5 epochs, with a learning rateof1e-4, aweightdecayof1e-4, andbatchsizeof500. 20%ofthefeaturesinthe frame are randomly dropped when training the denoising autoencoders. The com- bination of the best features and autoencoder models are obtained through empir- ical classification experiments. In each configuration, a simple softmax classifier (refer to Section 4.2.4) is trained on the extracted bottleneck features till conver- gence. It is observed that the best emotion classification accuracy (50.39%) on the validationsetisobtainedusing128-binFFTspectrograms,withatemporalcontext window of 5 frames, and a stacked denoising architecture with tied weights. For subsequent reference, this model can be labeled as TIED-128-5. For more details aboutthecorrespondingperformanceobtainedwithdifferentautoencoderarchitec- tures, the interested reader is referred to Ghosh (2015b). On the best autoencoder architecture, the recurrent autoencoder from Section 4.2.3 is also trained, where theentirefeaturesequenceforeachutteranceisreconstructedbyarecurrentneural network. A BLSTM (Bidirectional LSTM)-RNN model (Hochreiter & Schmidhu- ber, 1997) is trained as an autoencoder over the activations obtained for selected spectrogram and autoencoder configurations, and the resulting representations are also visualized in Section 4.2.12. 64 Figure 4.2: Experimental setup for emotion recognition on IEMOCAP dataset. Speech and glottal flow spectrogram representations which have been learnt by denoising autoencoders are input to supervised BLSTM-RNN classifiers for the task of emotion recognition. For comparison, the COVAREP features are also passed through the same experimental pipeline. 4.2.11 Emotion Classifier Training Research Question 1.4 concerns the emotion recognition performance obtained by temporal classifiers, such as the BLSTM-RNN model described in Section 4.2.5, when trained on glottal flow representations learned by denoising autoencoders. After the glottal flow spectrogram is obtained for each utterance using the IAIF inverse filtering algorithm (Alku, 1992), it is passed through the same pipeline for comparison with speech spectrogram representations and COVAREP features. Figure 4.2 shows the experimental pipeline. The stacked autoencoder model used here is the TIED-128-5 model with tied weights operating on temporal context windows of 5 adjacent frames. The utterances in the IEMOCAP dataset are split into five sessions, where each session consists of a dyadic conversation between a male and a female speaker. The emotion classification are performed in a leave one session out strategy, similar to Lee & Tashev (2015). Since there are 10 speakers in thedataset, eachsessionconsistsof2speakers. Foreachfold, utterancesfromeight 65 speakers (four sessions) correspond to the training set, and from the remaining ses- sion, hyper-parameter validation is performed on one speaker, testing is performed on the other speaker, and vice-versa. Both weighted and unweighted classification accuracies are reported for the entire testing set (scripted+improvised) and a sub- set consisting only of improvised utterances. Weighted accuracy is the accuracy over all testing utterances in the dataset, and unweighted accuracy is the average accuracy over each emotion category (Happy, Angry, Sad and Neutral). Foreachsession, stackedautoencoderpre-trainingisperformedoveralltraining set utterances as described previously. 
The BLSTM-RNN training for the emotion classification task is done only for utterances which have been labeled with the primary emotion categories. The competing baselines in this experiment are : (1) The DNN-ELM approach in Han et al. (2014) where the authors train a Deep Neural Network (DNN) with an ELM (Extreme Learning Machine). (2) Jin et al. (2015) where acoustic and lexical features are fused to create higher level representations. They also use standard features, along with techniques such asBoW(BagofWordsModeling). Forfairnessofcomparisontheirresultsobtained with acoustic features are reported. (3) COVAREP-RNN where features from the COVAREP toolbox are extracted and stacked to create context window descriptors of five frames, instead of obtain- ing spectrogram representations. 4.2.12 Visualization of Learned Representations Figure 4.3 show the representations learnt by the best autoencoder model TIED-128-5, which is the best combination of unsupervised model and spectro- gram set, as obtained through validation. Representations shown are learnt both at frame level (by the stacked denoising autoencoder) and sequence level (by the 66 (a) Emotion:Frame level (b) Activation:Frame level (c) Valence:Frame level (d) Emotion:Utterance level (e) Activation:Utterance level (f) Valence:Utterance level Figure 4.3: Visualization of frame and utterance-level representations of emotion, activation and valence learned by the recurrent autoencoder for model configura- tion TIED-128-5. The intensity gradient for each dimension is from blue (lowest) to red (highest). Yellow - Sadness; Red - Neutral; Blue - Happiness; Green - Angry BLSTM-RNN recurrent autoencoder). Figure 4.3 shows scatter plots of the frame and utterance-level representations for four primary emotions - (1) Happy (2) Sad (3)Angryand(4)Neutral. ThevisualizationsqualitativelyaddressResearchQues- tion 1.3 by providing an insight into the acoustic characteristics of the speech, as well as the discriminative ability of the autoencoder to learn affective characteris- tics from the data. All frames within an utterance are assigned the emotion label of the entire utterance. The unsupervised representations can clearly separate out anger (in green) from sadness (in yellow), which gets grouped into tight clusters. Happiness (in blue) and anger (in green) have low-variance clusters, while the high variance neutral category does not have a well-defined cluster. This is intu- itive, since from prior psychological studies (Scherer, 2005), the neutral emotion is 67 not well-defined in terms of spectral characteristics, and also depends largely on speaker specific attributes. It is also interesting to examine if the autoencoder can learn affective attributes from the speech such as activation (intensity of emotions) and valence (positive or negative sentiments). The figures also show scatter plot of learned representations, colored for each attribute according to intensity (blue for lowest and red for high- est). The autoencoder is most discriminative of activation, followed by valence. The sensitivity of the representation to activation explains why anger and sadness are clustered and well separated. On examining the scatterplots for valence, it is found that low valence utterances (sadness and anger) are more towards the edges, comparedtomediumandhighvalenceutterances. Figure4.3showsutterance-level scatter plots of emotions and other affective attributes such as valence and acti- vation. 
From the plots a similar correlation with affective traits is observed as in the frame based plots. This shows that unsupervised feature learning from speech spectrograms learns representations discriminative of emotion as well as valence and activation. In the next sub-section, emotion classification results utilizing these representations as features are described. The best autoencoder configuration TIED-128-5 is also trained on FFT spec- trograms obtained from the speech and glottal flow signals for utterances in the validation set. The representation for each utterance is computed as the average of frame representations obtained from the autoencoder bottleneck layers. Fig 4.4 shows the representations learnt from the speech and glottal flow spectrograms respectively, where each data point corresponds to a distinct utterance. It is appar- ent that the glottal flow signal does not exhibit as much variability as speech (due to verbal information filtered out), and since it is essentially a train of pulses, the 68 resulting manifold exhibits lower variance than the speech representation. How- ever, emotion discriminativeness is also apparent from the scatter plots for the glottal flow signal, thus motivating a study of it’s effectiveness for the task of emotion recognition. (a) Utterance representations (Speech) (b) Utterance representations (Glottal) Figure 4.4: Speech and glottal spectrogram representations learnt by stacked denoising autoencoder. Each data point representing an utterance is the average of frame represen- tations in the utterance. The glottal representations lie on a manifold of lower variability than the speech representations; but nevertheless effective at emotion discriminativeness. Table 4.1: Test Accuracies reported for different feature sets on IEMOCAP utterances. Results for the BLSTM-RNN approach are shown in bold Feature Set Wt. Unwt. Happy Angry Sad Neutral COVAREP-RNN 48.19 50.26 36.13 56.98 65.57 42.38 Spectrogram (Speech)-RNN 48.01 49.82 28.6 47.9 70.7 52.11 Spectrogram (Glottal)-RNN48.91 49.74 32.4 45.0 64.6856.9 Spectrogram (Speech)-MLP 45.15 47.14 63.68 24.85 65.68 34.36 COVAREP-MLP 40.41 42.02 49.63 34.85 57.60 26.00 4.2.13 Emotion Classification Results Emotion classification experiments are performed to obtain an insight into whether the removal of phonetic variations from the speech signal (through glottal 69 inverse filtering) improves emotion recognition performance. Table 4.1 compares the classification performance on IEMOCAP utterances between different feature sets and models such as the BLSTM-RNN and the Multi-layer Perceptron (MLP). On an examination of results, it is evident that representations learnt by the stacked autoencoder are comparable to COVAREP features, showing that repre- sentationlearningprovidescomparableperformancetooff-the-shelfspeechfeatures (Research Question Q1.3). Classification accuracy obtained by the BLSTM-RNN is higher than the MLP (which is expected), with the RNN trained on glottal flow havinghighestweightedaccuracyoverall, andbestperformanceonthe Neutral cat- egory. The approach introduced in Han et al. (2014) trains the DNN-ELM model on improvised utterances from IEMOCAP, though the BLSTM-RNN approach in this work trains and tests both on scripted and improvised utterances. The unweighted accuracy obtained by the DNN-ELM is around 50%, which is compa- rable to the results obtained by the BLSTM-RNN. The acoustic fusion approach in Jin et al. 
(2015) report 10 fold leave-one-speaker-out validation accuracies on all utterances (scripted as well as improvised), but do not explicitly evaluate on a testing set. The performances they report (weighted accuracy of 49% for cepstral BoW to 55.4% for feature fusion) are comparable to the validation accuracy of 57% which the BLSTM-RNN model obtains in the leave-one-session-out scenario. These results address Research Question Q1.4, and thus it can be concluded that representationslearntfromtheglottalflowspectrogramobtainsimilarperformance to speech spectrograms. 70 4.3 Unsupervised Glottal Inverse Filtering In this chapter, representations obtained from the speech and glottal flow spec- trograms have been investigated for the task of speech emotion recognition. The glottal flow signal is of primary interest here since: (1) it has been shown in Section 4.2.13 that representations obtained from the glottal flow spectrogram are effective for speech emotion recognition. Previous literature (Degottex et al., 2014) has also shown that descriptions from the glottal pulse (such as Nor- malized Amplitude Quotient and Quasi-Open Quotient) are useful for non-verbal behavior analysis from speech (2) The glottal flow signal is the component of the acoustic signal which is not influenced by the vocal tract and has limited pho- netic information (such as formants) in it, making it potentially indicative of the speaker’s state. Most previous approaches to glottal flow extraction from speech have relied on signal processing andL 1 minimization through convex optimization approaches (Chetupalli & Sreenivas, 2014). Apart from limited Bayesian treat- ment (Casamitjana et al., 2015), there has been no focus on data-driven dictionary learning approaches to this problem. A machine learning framework for this task could potentially lay the framework for such neural network approaches in the future and improve performance for tasks such as voice-quality classification. In this section, an unsupervised data driven approach to glottal inverse filtering is presented. This addresses Research Question Q1.5, and a data-driven unsuper- vised approach for glottal flow extraction from speech is presented. The process of voiced speech generation, which has been described in Chapter 2 and is described here more formally, can be interpreted as an impulse train t(n) passing through several components such as: (1) Glottis, which has an impulse response g(n) ; (2) Vocal Tract with impulse response v(n) ; (3) Lip Radiation (which is essentially the derivative operation) to produce the output speech signalx(n). Glottal inverse 71 filtering involves estimation of v(n) from x(n) to obtain g(n)∗t(n), which is the derivative of the glottal flow signal. While describing the speech production process formally, the z-transform is used here, which is common in digital signal processing to describe discrete sig- nals. More details of the z-transform can be obtained from Ogata (1995). It is assumed that there is a latent dictionary of vocal tracts (capturing typical varia- tions such as gender, speaker identity and vowel formants as individual dictionary atoms) which is learnt from training data, and that for each speech frame, the optimal vocal tract atom most suitable for it is selected and used for inverse fil- tering to produce the glottal flow derivative. 
This data-driven approach warrants investigation since the process of inverse filtering removes the effect of the vocal tract, and facilitates future work on representation learning to remove verbal infor- mation in an end-to-end manner from speech signals. The approach described here assumes an all-pole model of the vocal tract, similar to current literature (Fujisaki & Ljungqvist, 1986) and is described in more detail in Section 4.3.1. For evalua- tion, this approach is compared to the IAIF algorithm, which is commonly used for glottal flow extraction (Alku, 1992). 4.3.1 Model Description The well-known all-pole vocal tract model assumes the presence of P poles, wherex(n) is a speech sample at timen,{a 1 ,a 2 ,...,a P } are the vocal tract param- eters, w(n) is the sparse glottal excitation derivative to be estimated, and e(n) is white noise: x(n) = P X p=1 a p x(n−p) +w(n) +e(n) (4.5) 72 Following a vector representation of the speech samples for a window size of T samples, and assuming the quasi-stationary nature of the vocal tract throughout the frame, then the speech production can be described by y = Xa+w+e where y is the vector of samples [x(P +1)x(P +2)x(P +3)...x(P +T )], X is the Toeplitz matrix constructed from the speech samples. This expression makes use of the property that a convolution can be represented as multiplication with a Toeplitz matrix (Giri & Rao, 2014). w and e are vectors representing the sparse glottal excitation derivative and white noise respectively. Considering a collection of N frames, then for thei−th frame y i = X i a i +w i +e i . The white noise e i for thei-th frame is modeled by a zero-mean Gaussian distribution with identity covariance matrix, thus: y i |X i ,a i ,w i ∼N (X i a i +w i ;σ 2 I) (4.6) Due to the spiky nature of the derivative w i for the i-th frame, a multi-variate Laplacian prior is imposed on w i with location 0 and scale b defined as w i ∼ Laplace(0,b). Assume independence of w i , thus P(w i |X i ,a i ) =P(w i ) and by the chain rule: P(y i ,w i |X i ,a i ) =P(y i |X i ,a i ,w i )·P(w i ) (4.7) Expanding the probability distributions, and collapsing the parameters σ and b into a sparsity factor λ, the NLL (negative log-likelihood) for all frames is: NLL = N X i=1 ky i −X i a i −w i k 2 +λkw i k 1 (4.8) Assume that the all-pole coefficients a i are selected from a combination ofK basis vocal tract filters h 1 ,h 2 ,...,h K , so that a i = P K j=1 c ij h j = Hc i , where H is a matrix of basis filters. From a probabilistic interpretation of sparse coding (Mairal et al., 73 2009), fortheentiredatasetofN framestheL 1 -normconstrainedlossisformulated as follows: L = N X i=1 ky i −X i Hc i −w i k 2 +λkw i k 1 (4.9) where λ is a hyper-parameter controlling the amount of sparsity in the residual. For simplicity assume a ‘winner-takes-all’configuration, where the vocal tract filter for each frame is contributed to by only one basis filter. This is enforced by a one-hot encoding scheme in c i for the i-th frame. Denoting M i as the cluster to which the i-th frame belongs, then c ij = 1 when j =M i , and zero otherwise. While this assumption greatly aids interpretability, this model can be extended to a sparse combination of basis filters, where an additional regularization term can be introduced to control the sparsity in c i . Training the model corresponds to an optimization of the loss L over the unknown glottal excitation derivatives {w i } and the dictionary H, given the signal information in{y i ,X i } for a training set of N frames. 
A full description of the alternating minimization algorithm for training the model can be found in Algorithm 1. 4.3.2 Datasets and Experimental Setup Three sets of data are considered for training the model - (1) Cereproc train- ing data (five speakers) (Kane et al., 2013b) (2) CMU Arctic single speaker dataset (Kominek & Black, 2004) and (3) Finnish Vowels dataset (Airas & Alku, 2007). Evaluation of the quality of the extracted glottal source waveforms is done by comparing their similarity to the IAIF estimates. The following metrics are chosen for measurement: (1) Log-spectral distortion (LSD) (Giri & Rao, 2014) and (2) Pearson’s correlation coefficient (computed over each frame) and there is 74 Algorithm 1 :Iterative procedure for parameter training in the unsupervised glottal inverse filtering approached proposed in this chapter. Notations are included. N: Number of frames K: Size of dictionary T: Frame duration in samples P: Number of poles ~ y i : Signal vector for i-th frame ~ X i : Toeplitz data matrix for i-th frame H : [ ~ h 1 ~ h 2 ... ~ h K ] Dictionary W : [~ w 1 ~ w 2 ... ~ w N ] Excitation matrix Initialize Dictionary : H← rand(P,K) Frame memberships: M i ← None , i∈ {1,2,...,N} Initialize Clusters: C(j)← {}, j∈ {1,2,...,K} Initialize Excitation Matrix: W← zeros(T,N) while loss not converged do 1. Assign to Clusters for i∈{1,2,...,N} do M i ← argmin j k~ y i − ~ X i h j − ~ w i k C(M i )←C(M i )∪{i} end for 2. Update Basis H for j∈{1,2,...,K} do ~ h j ← ( P i∈C(j) ~ X i T ~ X i ) −1 P i∈C(j) ~ X i T (~ y i − ~ w i ) end for 3. Update Excitation ~ w for i∈{1,2,...,N} do ~ w i ← argmin w k~ w + ~ X i h j −~ y i k 2 +λk~ wk 1 end for end while also an analysis of correlation across three phonation types: breathy, modal and tense voices. 75 Table 4.2: Correlation between IAIF and the proposed approach. Statistics are presented in the format mean, median (standard deviation) Train Val Metric Breathy Modal Tense Finnish Finnish LSD(dB) 9.39,8.77 10.15,9.52 11.52,10.78 Train Val (3.08) (3.22) (3.58) Pearson 0.71,0.74 0.71,0.78 0.68,0.72 (0.199) (0.197) (0.188) Cereproc Finnish LSD(dB) 9.51, 9.27 10.27,9.73 11.18,10.53 Train Val (3.02) (3.09) (3.31) Pearson 0.56,0.66 0.62,0.71 0.59,0.67 (0.319) (0.304) (0.294) Arctic Finnish LSD(dB) 8.87,8.64 9.40,9.94 10.75,10.14 Val (2.36) (3.02) (3.09) Pearson 0.75,0.79 0.71,0.75 0.65,0.69 (0.19) (0.19) (0.20) Cereproc Cereproc LSD(dB) 11.44,10.52 11.85,10.94 12.14,11.26 Train Val (4.27) (4.44) (4.47) Pearson 0.29,0.43 0.26,0.38 0.29,0.38 (0.38) (0.37) (0.39) 4.3.3 Experimental Results Table 4.2 presents the mean, median and standard deviation of the evalu- ation metrics (LSD and correlation coefficient) for each dataset and phonation combination. From the table it is observed that there is a high correlation (and consequently low distortion) between the glottal excitations and the IAIF estimates, particularly when validated on the Finnish vowel dataset. Correlation is higher for breathy and modal voices, compared to tense voices. This could be related to the degree of sparsity in the glottal flow derivative for different phonation types and the model has to be tuned accordingly. This addresses Research Question Q1.5, and shows that an unsupervised dictionary-learning approach can be formulated for the task of glottal inverse filtering, and the resulting glottal flow waveforms are highly correlated with those obtained from the IAIF approach across a range of voice qualities. 
It is to be noted that the 76 main focus is the development of an unsupervised data-driven framework for glot- talinversefiltering, ratherthanoutperformingexistingstate-of-the-artapproaches. 4.4 Conclusions In this chapter, three primary research questions are posed - (Q1.3) whether machine learning models trained on representations learnt from speech obtain comparable performance to standard feature-sets for speech emotion recognition; (Q1.4) if removing the influence of the vocal tract from the speech signal through glottal inverse filtering improves recognition performance; and (Q1.5) whether it is possible to design an unsupervised data-driven approach for the task of glottal inverse filtering. Experiments on representation learning for speech emotion recog- nition are performed, where features learnt using stacked denoising autoencoders from spectrograms are visualized and found to be discriminative of emotion and affective attributes such as valence and activation. Emotion classification exper- iments are performed using MLP (Multi-layer Perceptron) models and BLSTM- RNN sequential models, with performance comparable to standard non-verbal fea- tures being obtained. The effect of removing phonetic variations from the speech signal through glottal inverse filtering is also quantitatively evaluated. Represen- tations obtained from the glottal flow spectrogram are also subsequently utilized in speech emotion classification and shown to have comparable performance as representations learnt from the original speech spectrogram. This implies that the glottal flow signal removes factors of variation due to the vocal tract, but does not provide any additional discriminative information for this task. 77 Further, unsupervised models are proposed for the process of voiced speech generation, and a data-driven dictionary learning based approach to glottal flow extraction is also described. The unsupervised approach is similar in performance to IAIF (Iterative Adaptive Inverse Filtering) approach, and provides a framework for subsequent adaptation of deep generative neural network approaches to this task. The experiments in this chapter demonstrate the effectiveness of represen- tation learning approaches for speech emotion recognition, including the ability to train end-to-end models on abundant unlabeled data without the requirement of manually designing feature sets. 78 Chapter 5 Importance-based Multimodal Autoencoder 5.1 Introduction This dissertation has investigated the problem of unimodal representation learning for facial expression and speech emotion recognition respectively in Chap- ter 3 and 4. While representation learning from individual modalities provide useful information about affective states of a human subject, real-world modeling of human behavior and affect would involve multiple modalities. During human conversations, there is an exchange of information between multiple participants involving not only verbal messages which convey information about grammar and semantics of utterances, but also cues from other modalities, such as visual (facial images) and acoustic (human speech). To properly understand the intent and the underlying speaker state (for example, emotion) for an utterance, it in necessary to integrate data from these modalities in a machine learning framework. 
The abundance of unlabeled data and the potential of integrating probabilistic models with deep neural networks have led to a resurgence of interest in deep generative models such as VAEs (Variational Autoencoders) in Kingma & Welling (2013) and GANs (Generative Adversarial Networks) in Goodfellow et al. (2014). In a deep generative model, the conditional probability distributions of observed and latent random variables are modeled with complex non-linear functions. The 79 inference of the latent variable given the observed data enables the learning of meaningful representations. For the scope of this chapter, VAE models are of primary interest, where the latent variable is inferred through an encoder, which is generally modeled using a neural network. The recent work on multimodal VAE models (Wang et al., 2016; Wu & Good- man, 2018; Suzuki et al., 2016) address the problem of efficiently combining infor- mation from multiple modalities in a scalable fashion without (1) requiring an inference network for every subset of input modalities, or (2) a straightforward approach by concatenating data from all modalities and providing them to a large network. However the efficient fusion of information from multiple modalities in a VAE framework still remains an open challenge. There are also other shortcomings in related work. Firstly, the datasets used in previous papers (such as MNIST, Fashion MNIST and CelebA) do not have any uncorrelated noise in modalities. Uncorrelated noise refers to the presence of data in any modality which is uncorre- lated with the underlying latent factor. For example, consider the MNIST dataset where digit images and their labels serve as paired inputs, but some image dig- its are replaced with other images (for example, grayscale images of trees) which are not associated with the digit identity, which is the latent factor in this case. Proper multimodal representations should not only learn the unimodal digit rep- resentations, but also infer that the tree images are uncorrelated noise (and hence unimportant to the task). The presence of uncorrelated noise does occur in real- world data such as emotional spoken utterances as explained later in the chapter. Secondly, prior work has mostly been restricted to image datasets such as MNIST and CelebA, treating associated labels (such as digit labels and celebrity facial attributes for example) as a second modality. While this works for toy datasets, they do not investigate their approaches to truly multimodal data (such 80 as images+speech, or speech+vocabulary words). Thirdly, prior work has mostly focused on image synthesis and cross-modal translation quality, and not on the quality of the learned multimodal representations. In this chapter, a novel model, IMA (Importance-based Multimodal Autoen- coder) is introduced for multimodal representation learning. The main research questions addressed in this work are : • Q2.1: Can meaningful multimodal representations be learnt in a scalable man- ner by neural network approaches ? Assuming the presence of uncorrelated noise where all data in each modality is not relevant, can the importances of different modalities be learnt in this framework ? • Q2.2: Could a model which learns multimodal representations with modality importances improve on the task of emotion analysis from spoken utterances compared to unimodal and other multimodal baseline approaches ? 
The rest of this chapter is organized as follows: in Section 5.2, the proposed IMA model is described; the methodology of the experiments, including datasets and network architecture are discussed in Section 5.3. A detailed description of the experimental setup (including retrieval and classification experiments) is in Sec- tion 5.4. Experimental results, including visualization of representations and per- formance on downstream retrieval and classification tasks are presented in Sec- tions 5.5. Results on the real-life task of emotion understanding are presented in Section 5.6. The chapter is concluded in Section 5.7. 81 ( ) ( ) Latent Representation (z) + 1 1 2 2 1 2 ( ) ( ) 1 2 ( ) ( ) ( ) ( ) Decoder Networks Encoder Networks Importance Network Modality (1) Importance Network Modality (2) Figure 5.1: Overview of the proposed model IMA, including the multimodal autoencoder, theimportancenetworksandthemainlossfunctionstobeoptimized. The IMA model is trained in two stages - multimodal autoencoder followed by the importance networks. 5.2 Description of IMA Model A brief description of the IMA model is provided in this section, including the subnetworks, loss functions and an introduction to notation used through- out this chapter. As shown in Figure 5.1, the IMA model consists of two main components-(1) the multimodal autoencoder with alignment and (2) the modality- specific importance networks. In Section 5.2.1, the multimodal autoencoder is described and Section 5.2.2 discusses training of the importance networks. 5.2.1 Multimodal Autoencoder with Alignment Assume that the input training examples consist of multimodal paired data, where each data point is denoted as x ={x 1 ,x 2 ,...x M }. There are M modalities 82 andthetrainingsetconsistsofN paireddatapoints. Theinputdata x j ineachj-th modalityispassedthroughanencoderforthatmodality. Theoutputoftheencoder for the j-th modality is denoted as u j and the multimodal joint representation which aggregates all modality representations is given by: z = 1 M M X j=1 u j (5.1) The multimodal representation z is passed through the decoder networks to obtain the reconstruction in thej-th modality as ˆ x j .L (j) rec is the reconstruction loss for the j-th modality which could be the sigmoid cross-entropy for Bernoulli distributed data (such as binarized image digits), or SSE (Sum of Squared Errors) for Gaus- sian distributed data. While more details are provided in the appendix, using a similar formulation as a VAE (Kingma & Welling, 2013) model,L glob is a global KL-divergence term which forces z to be Gaussian distributed with zero mean and unit diagonal covariance. This term is for regularization of the multimodal representation z. An additional term for each modality is introduced in this framework, which is denoted byL (j) align . For the j-th modality, it is the SSE error between the multimodal representation z and the unimodal encoder outputs u j given by: L (j) align (z,u j ) = z−u j 2 (5.2) This alignment term as a part of the regularization, and it’s effect can be visualized in the representations presented in Section 5.5.1. In Wu & Goodman (2018), the authors propose a subsampled training paradigm to ensure that the individual inference networks for each modality are also trained. They observe the absence of modality-specific inference terms in their training loss. In the case of the IMA, this 83 issue is addressed through the alignment regularization termL align for individual training of each modality’s inference network. 
This requires only M additional loss terms instead of a random subset of 2 M losses for sub-sampled training as in the MVAE model (Wu & Goodman, 2018) or 2 M subnetworks as in the JMVAE- KL model (Suzuki et al., 2016). During training of this model the below loss is optimized (where each term has associated hyper-parameter weights): L auto :=λ glob N X i=1 L glob (z i )+ N X i=1 M X j=1 λ (j) rec L (j) rec (x ij ,ˆ x ij ) + N X i=1 M X j=1 λ (j) align L (j) align (z i ,u ij ) (5.3) 5.2.2 Importance Network Training The description above assumes that there is no uncorrelated noise in the observed data, which is independent of the latent factor z. This is not always valid in real-world tasks. For example, in emotional spoken utterances, words are often present which merely play a syntactic role and do not specifically relate to emotion. These are called function words and should be weighed less during infer- ence of z from the observed data x. More generally, assuming the presence of a regionR j corresponding to uncorrelated noise inside eachj-th modality’s manifold it holds that x j ∈R j ⇒ u j ⊥ z. Weightsy ij ∈ [0,1] can be learned for thei-th data point in thej−th modality, so that y ij denotes the importance of each data point x ij (i.e. the degree to which x ij does not belong toR j ). A modality-specific neural network is trained to map from x ij to y ij ; this is called the importance network. Importance network training makes use of a loss function explicitly capturing the uncorrelated relationship between z and u j , weighted by y j . Similar work has also applied more rigorous independence criterion, such as MMD (Maximum Mean Discrepancies) as auxiliary losses in a variational framework such as the 84 VFAE (Variational Fair Autoencoder) (Louizos et al., 2015). Since the latent variables are modeled as Gaussians, independence and uncorrelated properties are equivalent, and the following alternative loss term can be defined based on the Frobenius norm of the cross-covariance between z and u j : L (j) corr = B P i=1 (1−y ij ) h (z−μ z )(u j −μ u ) i T B P i=1 (1−y ij ) 2 F (5.4) μ z = B P i=1 (1−y ij )z i B P i=1 (1−y ij ) μ u = B P i=1 (1−y ij )u ij B P i=1 (1−y ij ) (5.5) Minimizing this cost is equivalent to enforcing zero-correlation between u j and z based on mini-batch statistics during training, where size of a mini-batch is B. This cost can be derived from the definition of cross-covariance, where each i-th sample in the mini-batch is not weighted equally but according to y ij . When learning to predict y ij ≈ 1.0∀i,j trivially decreasesL corr down to zero, and thus it is needed to regularize the importance network training through an additional loss function referred to asL (j) local . This term utilizes the hyper-parameterρ j which serves as prior about how much of the j-th modality is corrupted by uncorrelated noise, and is defined by: L (j) local = ¯ y j log ¯ y j ρ j + (1− ¯ y j )log 1− ¯ y j 1−ρ j (5.6) ¯ y j = 1 B P i y ij is the average value of y ij as computed over a mini-batch of size B. The importance network for each modality minimizesL imp , which is the weighted 85 sum of the cross-covariance based loss and the regularization term, as defined below. λ (j) local and λ (j) corr are hyper-parameter weights for each loss term. 
L imp = M X j=1 λ (j) local L (j) local (¯ y j ) +λ (j) corr L (j) corr (5.7) The final multimodal joint representation is obtained by a weighted average of the unimodal representations, as shown below: z(x) = M P j=1 y ij f θ j (x j ) M P j=1 y ij (5.8) 5.2.3 Practical Considerations Figure 5.1 shows an overview of the proposed IMA model, including the loss functionsL glob ,L (j) align ,L (j) rec (for training the multimodal autoencoder with align- ment), andL (j) corr ,L (j) local (for training the importance networks). Some practical considerations during training are: • The model is named as an importance-based multimodal autoencoder, since the actual implementation is not variational and the encoder/decoder/importance networks are all deterministic; with all variance terms considered as very small. Please refer to appendix for more details. • While it is possible to optimize all cost functions in a common framework, it is observed that a phase of training the alignment and fusion model, followed by importance networks works well in practice. For the experiments in this chapter, the multimodal autoencoder is trained for around 100 epochs, followed by importance network training for 100 more epochs. This is also dependent on 86 the choice of training data. The autoencoder tries to reconstruct all data; even the uncorrelated factors, while the importance network filters them out. • The purpose of the importance networks is to learn a gating of the input modalities based on the correlation loss, similar to that of a mixture-of-experts model. Shazeer et al. (2017) recommend large batch sizes for training sparse mixture-of-experts models through backpropagation, and for IMA it is also observed that a mini-batch size of 5000 enables faster convergence of the pro- posed model. 5.3 Methodology 5.3.1 Autoencoder Training Training of the IMA model is performed through a two-stage minimization of the constituent loss functions. Firstly, the multimodal autoencoder is updated to infer z and u j , after which the importance networks are trained through minimiza- tion of the cross-covariance losses. The reconstruction loss functions are dependent on the nature of the data, for example the SSE (Sum of Squared Errors) for Gaus- sian data and cross-entropy loss for sparse binarized data. 5.3.2 Datasets The MNIST and TIDIGITS datasets are considered for the experiments to evaluate performance of the proposed IMA model on a multimodal dataset which is not limited to only digit image-label combinations, as is done in previous lit- erature (Vedantam et al., 2017; Wu & Goodman, 2018). The multimodal digit 87 dataset is constructed through pairing and helps obtain an insight into the oper- ation of the model. The IEMOCAP dataset, previously described in Chapter 2 is also used for a real-world application. The MNIST-TIDIGITS Audio-Visual Digits Dataset is synthesized from three datasets for digit recognition from images and speech respectively - MNIST dataset (Salakhutdinov & Hinton, 2007); along with the TIDIGITS connected spoken digit sequence corpus (Leonard, 1984) and TI46 (Liberman, 1993) digits respectively. The TIDIGITS corpus consists of con- necteddigitsequences(whicharesubsequentlysegmented)from326speakers, each pronouncing77sequences. TheTI46corpusconsistsofisolateddigitsutteredby16 speakers, each pronouncing 26 tokens. These two spoken digit datasets are merged to create a dataset as large as the number of samples in MNIST. 
5.3.2 Datasets
The MNIST and TIDIGITS datasets are considered for the experiments to evaluate performance of the proposed IMA model on a multimodal dataset which is not limited to only digit image-label combinations, as is done in previous literature (Vedantam et al., 2017; Wu & Goodman, 2018). The multimodal digit dataset is constructed through pairing and helps obtain an insight into the operation of the model. The IEMOCAP dataset, previously described in Chapter 2, is also used for a real-world application. The MNIST-TIDIGITS Audio-Visual Digits Dataset is synthesized from three datasets for digit recognition from images and speech: the MNIST dataset (Salakhutdinov & Hinton, 2007), the TIDIGITS connected spoken digit sequence corpus (Leonard, 1984), and the TI46 digits corpus (Liberman, 1993). The TIDIGITS corpus consists of connected digit sequences (which are subsequently segmented) from 326 speakers, each pronouncing 77 sequences. The TI46 corpus consists of isolated digits uttered by 16 speakers, each pronouncing 26 tokens. These two spoken digit datasets are merged to create a dataset as large as the number of samples in MNIST. The motivation underlying this is not only to obtain more examples for training, but also to pair image and spoken digits to create a multimodal dataset, as subsequently described in Section 5.3.3.

5.3.3 Digit Datasets Preprocessing
The datasets considered for the experiments are further pre-processed to create feature sets amenable to model training. These steps involve not only feature extraction, but also modality pairing along with the incorporation of uncorrelated noise. Note that the noise samples included in the synthesized datasets not only serve to demonstrate how the model works, but also help quantify performance (due to the availability of ground truth for the presence of noise in each sample). Further, the composition of each noise image or speech sample is not important here; rather it is their co-occurrence pattern with the rest of the modalities which the model learns from during training. For example, if the noise images were selected to be grayscale images of cats instead of synthetic noise, the performance would be similar. The only constraint imposed on the composition of noise samples is the locality of their representations in the joint embedding manifold, so that a hyperplane learnt by the importance networks can separate them out from the rest of the samples.

5.3.4 Processing of MNIST digits
Each MNIST digit is of 28*28 dimensionality, and the pixel intensities are in the range [0,255]. Each intensity is divided by 255 to normalize the pixel values to the range [0,1]. Images are also constructed with the same dimensionality (28*28) as the MNIST digits, but consisting of white Gaussian noise at a predefined level of sparsity K. For all experiments, K = 0.8. Synthesis is performed by creating a 28*28 grayscale image of white Gaussian noise at each pixel with an intensity mean of 0.5 and a standard deviation of 0.12, which ensures that almost all the image pixels are in the range [0,1], and then setting a K-fraction of those pixels to zero. Negative intensity pixels are also set to zero. All images (MNIST and noise) are subsequently flattened, resulting in 784-dimensional feature vectors.

5.3.5 Processing of TIDIGITS/TI46 spoken digits
The speech waveforms from the TIDIGITS and TI46 datasets are first resampled to 12.5 kHz, and the TIDIGITS dataset is further segmented by speaker and digit to generate audio files, each corresponding to a single speaker uttering one digit. Noise speech samples are generated by sampling white Gaussian noise at each time step with a mean of 0 and a standard deviation of 1, which corresponds to the same numeric amplitude range as the digit audio files. Each synthesized noise waveform is of one second duration and has a sampling rate of 12.5 kHz. Subsequently, MFCC (Mel Frequency Cepstral Coefficient) features are extracted from each audio file (both the synthesized noise and the spoken digits), with 20 ms windows, 10 ms shift and 12 cepstral coefficients not including the energy term. Each digit utterance also contains recording pauses at the start and end. To filter these pauses, 29 frames from either side (left and right) of the utterance midpoint are selected and concatenated to form a 58*12 = 696-dimensional feature vector; the remaining frames are discarded.
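As an illustration of the fixed-length acoustic feature construction just described, the following sketch selects 29 frames on either side of the utterance midpoint from a precomputed MFCC matrix. The MFCC extraction itself is assumed to have been done elsewhere, and the function name is illustrative rather than the dissertation's implementation.

```python
# Minimal sketch of the 696-D acoustic feature construction of Section 5.3.5,
# assuming MFCCs are already available as a (T, 12) array (20 ms windows, 10 ms shift).
import numpy as np

def midpoint_mfcc_vector(mfcc_frames, n_side=29):
    """Keep 29 frames on each side of the utterance midpoint (at least 58 frames
    assumed) and flatten to a 58 * 12 = 696-dimensional vector."""
    mid = mfcc_frames.shape[0] // 2
    selected = mfcc_frames[mid - n_side: mid + n_side]   # (58, 12)
    return selected.reshape(-1)                          # (696,)

# Example with a synthetic utterance of 120 frames:
dummy = np.random.randn(120, 12)
print(midpoint_mfcc_vector(dummy).shape)   # (696,)
```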
5.3.6 Multimodal Pairing in MNIST-TIDIGITS dataset
A multimodal dataset is constructed by pairing the noise/MNIST digit images in the visual modality with the noise/TIDIGITS spoken digits based on the common digit label. After the digit image and speech features are obtained using the steps above, a parameter R called the noise factor is introduced to control the amount by which each true image or speech feature co-occurs with the noise feature vectors. Here, "true" refers to a feature vector corresponding to an actual digit instead of synthesized noise. More formally, assume there are N true image-speech feature pairs with a noise factor of R. N*R noise image and noise speech vectors are synthesized in each case and considered one by one. For each noise image feature, a true spoken digit feature vector is chosen uniformly to create a (noise image, TIDIGITS speech) pair. Similarly, for each noise speech feature, a true MNIST digit is chosen uniformly to create a (MNIST image, noise speech) pair. This would correspond to a physical process where digits are both shown and uttered, but in some samples either the image or the speech modality is completely corrupted and replaced with uncorrelated data. The composition of the final dataset after this pairing operation is N*(1−R) (true MNIST image, true TIDIGITS speech), N*R (noise image, true TIDIGITS speech) and N*R (true MNIST image, noise speech) pairs. For all experiments in this chapter, R = 0.1. Note that the actual digit label is not known at model training time.

Table 5.1: Network architecture, modalities and training configurations for the model trained on each of the MNIST-TIDIGITS and IEMOCAP datasets. SSE denotes the Sum of Squared Errors loss function; LR denotes the learning rate.

Dataset          Modalities                      Enc/Dec                                    Loss functions              LR
MNIST-TIDIGITS   1: Images (784 D)               1: 500 neurons, 2 layers                   1: Binary Cross-Entropy     Stage 1: 1e-3
                 2: Speech (696 D)               2: 100 neurons, 2 layers                   2: Binary Cross-Entropy     Stage 2: 1.0
IEMOCAP          1: Words one-hot (1215 D)       1: Weight matrix (1215*D) / 1-layer net    1: Softmax Cross-Entropy    Stage 1: 1e-3
                 2: Non-verbal features (55 D)   2: 50 neurons, 2 layers                    2: SSE                      Stage 2: 1e-1

5.3.7 IEMOCAP Dataset Processing
The IEMOCAP dataset consists of spoken utterances, with both text data and speech waveforms in a dialogue setting. The speech data in IEMOCAP correspond not only to emotion, but also to other factors of variation such as speaker identity, gender, and topic of conversation. Moreover, even for the text data, each word not only conveys a semantic meaning but also has lexical and syntactic roles. For example, function words such as I, the and of do not carry affective information, but dominantly play a syntactic role. Further, when considering bimodal paired data consisting of words and the speech features they co-occur with, the main correlation between these streams of data will be based on word phonetics, since each word is pronounced similarly in the same language across different speakers, emotions and genders. While phonetic and verbal information would be useful for applications such as ASR (Automatic Speech Recognition), for the purposes of this study the phonetic information is filtered out from the speech features before model training. This still retains non-verbal features which correlate with spoken emotionally colored words. Thus, the processed dataset consists of word-prosodic speech pairs, which mostly correlate through affect, the shared latent factor; but each modality also has information not related to this factor, such as the function words present in text. Note that, just as in the synthesized MNIST-TIDIGITS digits dataset, the framework is unsupervised; the autoencoder has no access to the actual emotion or valence labels.
5.3.8 Non-verbal Feature Extraction (IEMOCAP) The COVAREP toolbox is used to extract features from speech waveforms in the IEMOCAP dataset with a sliding window of 20 ms. duration and 10 ms. shift. The COVAREP toolbox is an open-source toolbox (Degottex et al., 2014), which is commonly used in applications such as emotion and depression analysis from voice. To capture additional temporal information, five adjacent frames are stacked together as described in Ghosh et al. (2016a) to form a 50-dimensional vector. The dataset also has transcriptions with token-level timestamps. For each spoken token, the COVAREP features extracted from the waveform co-occurring with it are averaged to create a single 50-dimensional prosodic feature vector. Each acoustic vector consists of features useful in voice analysis such as NAQ (Normal- ized Amplitude Quotient), QOQ (Quasi-Open Quotient) and F0 (Fundamental Frequency). It is important to note that the MFCC features are omitted from this set, as they contain phonetic information. 92 5.3.9 Multimodal Pairing in IEMOCAP Dataset The set of (word, non-verbal acoustic) vector pairs are grouped based on the utterance they appear in. The process of averaging the stacked COVAREP vectors at token level reduces the total number of feature vectors in the dataset to around 150000; and it would be desirable to expand this number for more training data. Since the MFCC features are not considered, some of the feature variability within the utterance is removed resulting in factors of variation (such as emotion) which are more global in nature. The scope of the acoustic features co-occurring with a word is expanded to the entire utterance.Thus if an IEMOCAP utterance initially has N word-acoustic feature pairs, it is expanded to N 2 pairs for model training. This is analogous to techniques such as skip-gram word embeddings (Mikolov et al., 2013), where a word context is not restricted to the immediate adjacent words but to a window surrounding it. 5.4 Experimental Setup 5.4.1 Loss Functions and Hyper-parameter Tuning Table 5.1 summarizes the network architecture and training configurations for models trained on each dataset. From Algorithm 1, there are differ- ent hyper-parameters associated with different cost functions in this model - λ glob for the global KL-divergence term, and modality specific hyper-parameters λ (j) local ,λ (j) align ,λ (j) rec and λ (j) corr . For the experiments on MNIST-TIDIGITS dataset in this chapter, it is found that setting all the hyper-parameter values to 1.0 is suffi- cient for acceptable performance; while not requiring any additional tuning. The experiments on IEMOCAP however are set to the hyper-parameter values λ (j) corr = 93 1.0,λ (j) align = 1000.0,λ glob = 1.0,λ (j) rec = 1.0,λ (speech) local = 100.0 and λ (words) local = 1.0. The most important hyper-parameter is ρ j , which is called the importance prior for thej-th modality. ρ j in the interval [0,1] sets a predetermined level of importance for each modality, which serves to regularize training and is related to domain knowledge. For example, when training on IEMOCAP, it could be hypothesize that function words are not correlated with the underlying latent factor such as emotion, and hence ρ for text can be set to a very low value, since such func- tion words comprise a large fraction in terms of word frequency. 
Similarly for the MNIST-TIDIGITS dataset, the amount of uncorrelated noise (which is not related to the digit labels) in the paired multimodal dataset is 10%, requiring a high value of ρ j since most data points in each modality are important. 5.4.2 Competing Baseline Models Performance of the following models are evaluated in this study: 1. Joint multimodal representations from proposed importance-based multimodal autoencoder (IMA) model. 2. Unimodal VAE models: The data from each modality is separately used for training a VAE model with the mean parameter μ z for the latent variable z considered as the representation. 3. Unimodal representations from IMA model, which just considers the output u j of each of thej-th modality-specific encoders as the final representation. While this representation is unimodal, during training the IMA model has access to all modalities, so u j is influenced by other modalities through the alignment and reconstruction losses. 94 4. Unweighted multimodal representation z where the unimodal representations u j from above are combined without any weighting as z = 0.5u 1 + 0.5u 2 . 5. MVAE model proposed in Wu & Goodman (2018) with a sub-sampled training paradigm and shared product of experts for the multimodal inference. 6. JMVAE-KL model proposed in Suzuki et al. (2016) which obtains a bimodal representation through KL-divergence minimization with each unimodal repre- sentation. 7. Early fusion model which concatenates the embeddings learnt in (1) to create a multimodal embedding. 5.4.3 Evaluation of Importance Network Performance Since the model is unsupervised where the MNIST labels are not available for training, there is no quantitative way to determine the best value of the hyper-parameters for each problem or dataset. However to evaluate performance of the model, the ground truth labels from the MNIST-TIDIGITS and IEMO- CAPdatasetsareutilized, notonlythroughdownstreamclassificationperformance which utilizes the learnt multimodal embeddings, but also through evaluating per- formance of the modality-specific importance networks. For the i-th data point in the j-th modality, the importance network maps x ij to a score y ij ∈ [0,1]. Given the ground truth assignment of x ij to a positive or negative class depending on whether it corresponds to uncorrelated noise, precision, recall and F1 scores can be computed to determine the optimal importance prior parameter, as well as quan- tify importance network performance with varying values ofρ j . The F1 scores can be computed either based on the minority class (which is uncorrelated noise for the MNIST-TIDGITS dataset), or both classes (as in the IEMOCAP dataset). 95 FortheIEMOCAPdataset, experimentsarealsoconductedtoobtainaninsight intohowtheparameterρchangestheF1scoresforbothnoiseandnon-noiseclasses. For the vocabulary words, the ground truth positive category is defined to be the class of all vocabulary words belonging to the LIWC negative emotion and posi- tive emotion categories. LIWC is a text-analysis tool developed by Pennebaker et al. (2001), which groups English words based on their linguistic, social and affec- tive meaning. From an examination of co-occurrence of (word, non-verbal speech features) paired data from IEMOCAP utterances, the function words occur irre- spective of the shared latent variable (emotion), while the presence of emotionally coloredwordsarecorrelatedwiththislatentfactor. 
Similarly for the speech modality, the positive category is defined to be the class of non-verbal acoustic feature vectors belonging to the neutral emotion as annotated in the IEMOCAP dataset. For both modalities, the positive classes are minority classes in occurrence.

5.4.4 Retrieval Experiments
An intrinsic evaluation of the multimodal representation quality is conducted through retrieval experiments. Even if a modality is corrupted by uncorrelated noise, the multimodal representation does not take it into account, and the representation of a (true MNIST digit, speech noise) pair will be similar to the representation of an (image noise, spoken TIDIGITS digit) pair if the underlying digit labels are the same. Consider the test set of the MNIST-TIDIGITS dataset, and obtain the multimodal representations for each paired data point. Relevance is defined as follows: two (image, speech) paired samples are relevant if they correspond to the same digit label, even with noise in any one of the modalities. For each i \in \{1,2,...,N\}, the K nearest neighbors of the i-th sample are obtained in terms of the Euclidean distance, and the number of relevant neighbors is counted as C_i. The Precision@K score is computed as:

Precision@K = \frac{\sum_{i} C_i}{NK}   (5.9)

Results are reported for two and 50-dimensional representations at K \in \{10, 50, 100\}.
IEMOCAP Dataset: Retrieval experiments are also conducted on the IEMOCAP dataset to evaluate the multimodal representations. The model maps a (word, non-verbal acoustic feature) pair to a representation, and this should take into account the complementarity of multimodal data. Specifically, it is expected that even if the word is a function word, or the non-verbal acoustic feature vector is from an uninformative region (which is considered to be neutral) in the acoustic representation, the other modality should provide meaningful information about the overall emotion. Thus, if f(.) denotes the IMA output given the word-speech pairs, the following relation is expected: f(angry word, neutral speech) \approx f(neutral word, angry speech). For the retrieval experiments, two data points are defined to be relevant if they occur in the same IEMOCAP primary emotion category, one out of: happy, angry, sad and neutral. The same metric, Precision@K, is used for this experiment as for the MNIST-TIDIGITS dataset, with identical baselines and experimental settings.

5.4.5 Downstream Classification Tasks
The quality of the joint multimodal embedding z learnt by the model is quantitatively evaluated through a downstream digit classification task on the MNIST-TIDIGITS dataset. A two-layer MLP (Multi-Layer Perceptron) classifier with ReLU (Rectified Linear Unit) activation functions is utilized for the supervised task of digit classification, given the digit representations learnt by the importance-based autoencoder and other baseline models. The classifier has a 10-class softmax layer on top, and is trained until convergence of the loss function. A validation set is also constructed for hyper-parameter tuning of the learning rate, which is tuned over the range \eta \in \{0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0\}. Each dataset is first split into a training, validation and test set with 50000, 15000 and 10000 samples respectively. The test set is used only for evaluating performance, and the validation set is used for hyper-parameter tuning.
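The dissertation does not specify how this downstream classifier is implemented; the following scikit-learn sketch is only one possible stand-in for the two-layer ReLU MLP and learning-rate grid described above.

```python
# Illustrative downstream evaluation for Section 5.4.5; an assumed scikit-learn
# stand-in, not the dissertation's actual classifier code.
from sklearn.neural_network import MLPClassifier

def evaluate_embeddings(z_train, y_train, z_val, y_val, dim=50):
    """Train a two-layer ReLU MLP (10-class softmax output for digits) on the
    learnt embeddings and tune the learning rate on the validation split."""
    best_acc, best_lr = 0.0, None
    for lr in [0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]:   # grid from the text
        clf = MLPClassifier(hidden_layer_sizes=(dim, dim), activation='relu',
                            solver='sgd', learning_rate_init=lr, max_iter=200)
        clf.fit(z_train, y_train)
        acc = clf.score(z_val, y_val)
        if acc > best_acc:
            best_acc, best_lr = acc, lr
    return best_acc, best_lr
```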
It is to be noted that all unsupervised models (the importance-based autoencoder and the baselines) are trained only on the training split of 50000 samples. The experiments are performed in two scenar- ios corresponding to 2-dimensional (2 neurons/layer in MLP) and 50-dimensional (50 neurons/layer in MLP) representations respectively. Further, the classification performances are categorized depending on whether the MNIST-TIDIGIT paired data point had both true labels, or if one of the modalities had uncorrelated noise. 5.5 Experimental Results 5.5.1 Visualization of Learned Representations In Figure 5.2, the unimodal representations u j for j∈{1,2} are presented, along with the joint multimodal representation of the paired samples. For the MNIST-TIDIGITS dataset, j = 1,2 correspond to image and speech respectively. While the global KL-divergenceL global centers all representations at the origin as in an ordinary VAE model, the term L align not only makes the unimodal 98 (a) MNIST (Image) (b) TIDIGITS (Speech) (c) Multimodal Figure 5.2: t-SNE visualizations of embeddings learnt by the importance-based multi- modal autoencoder for the MNIST-TIDIGITS dataset, including the modality-specific representations and the joint multimodal representations. The modality specific repre- sentations are superimposable due to the alignment cost minimized during training. The model weighs the modalities before fusion; thus for example in the case when a combi- nation of a MNIST digit image and noise speech digits is provided as input, it is still mapped to the appropriate location (within the digit location) in the joint embedding. Colors denote 0:Red; 1:Green; 2:Blue; 3:Purple; 4:Orange; 5:Cyan; 6:Yellow; 7:Magenta; 8:Olive; 9:Black; Gray: Noise. representations super-imposable, but can also improve unimodal representations throughalignment. ForexamplethisisvisibleintheMNIST-TIDIGITSdatasetfor speech representations in Figure 5.2(b), where disentanglement of spoken digits is improveddueto alignmentwiththe MNISTrepresentations. Themultimodal joint representation improves the representation quality by combining data from both modalities. For example, in Figure 5.2(c), digits 2, 3 and 7 are more disentangled in the multimodal representation compared to that for MNIST. Figure 5.3 shows how the samples in the noise region of either modality migrate to the appropriate digit clusters, due to pairing with a "clean" additional modality. This demon- strates the effectiveness of the IMA model at filtering uncorrelated noise. The top layer of each importance network is also visualized. They implement y ij for the j-th modality given x ij . Figure 5.4(a)-(d) shows the t-SNE representations for the MNIST-TIDIGITS dataset, with both the ground truth label clusters and the importance network outputs. It is observed that the network correctly learns the 99 (a) MNIST (Image) (b) TIDIGITS (Speech) (c) Image noise (Multimodal) (d) Speech noise (Multimodal) Figure 5.3: Regions of uncorrelated noise in each modality are colored in red, where actual digit images (MNIST) and spoken digits (TIDIGITS) are colored in blue. Figures (a) and (b) show the location of noise in each modality representationu 1 (image) andu 2 (speech). Figures (c) and (d) show the movement of noise data points to the respective digit clusters for the multimodal representation z. This movement happens because noise in one modality is paired with actual digit information from the other modality, resulting in noise being suppressed during multimodal fusion. 
regions of the manifold which correspond to uncorrelated factors. Around 5% of the data is misclassified by the importance network, which manifests in the small red region in Figures 5.4(b,d). 5.5.2 Precision and Recall for Importance Networks The termL (j) local acts as a regularizer and the hyper-parameter ρ j plays an important role - it provides prior knowledge about how many samples in each modality are estimated to be uncorrelated with the shared latent factor. Increas- ingρ j implies that the less uncorrelated samples are expected, whereas decreasing 100 that value would indicate the presence of more "true" samples. For the MNIST- TIDIGITS datasets, the level of corrupted samples is set at 10% for both modal- ities, which means that a low value of ρ j would be satisfactory for training. A quantitative evaluation of the performance of the importance network is performed for the MNIST-TIDIGITS dataset by recording precision, recall and F1 scores for values of ρ∈ [0.05− 0.95] (ρ=0.0 and 1.0 are omitted due to numerical computa- tion issues). The ground truth labels are based on presence of uncorrelated noise in each modality, where positive label indicates presence of uncorrelated noise, negative class indicates otherwise. The scores are computed with respect to the minorityclass(whichispositive). FromanexaminationofFigure5.5, itisobserved that when ρ is low, most samples are predicted as positive by the network. This increases the false-positive rate, resulting in low precision. When ρ is high, most samples are predicted as negative, thus increasing false-negative rate and decreas- ingrecall. Thus, generallyanincreaseinprecisionanddecreaseinrecallisexpected with increasingρ, with the F1 score peaking somewhere midway. From the figures, the best F1 score occurs atρ =0.8 and 0.3, which indicates that these would be the optimal values for best performance in image and speech modalities respectively. 5.5.3 Digit Retrieval Results Retrieval experiments are conducted to understand intrinsic properties of the learnt multimodal representations, particularly how they differ from baselines. As previously described, the IMA model would map a paired combination of a noise sample in one modality and a true digit in the other modality to a cluster corre- sponding to the true digit. This property may not be observed in other approaches to multimodal representations such as early fusion, for example and can be evalu- ated through retrieval experiments as explained in Section 5.4.4. Table 5.2 shows 101 (a) MNIST (Image) Truth (b) MNIST (Image) Pred (c) TIDIGITS (Speech) Truth (d) TIDIGITS (Speech) Pred Figure 5.4: t-SNE visualizations of importance network representations learnt by the model. The rows correspond to the proposed model trained on the MNIST-TIDIGITS dataset. In left columns, the ground truth refers to presence (colored in gray) of uncor- related noise in each representation. The column to the right has data points predicted by model as noise in red; otherwise in blue. Precision@K scores (which are in the range [0,1]) atK=10, 50 and 100 for the test set of the MNIST-TIDIGITS dataset. The importance-based autoencoder outper- forms all other fusion approaches except the early-fusion representation in 2D as observed in the table. In 50D, the IMA model outperforms all other baselines. This addresses research question Q3.1 related to learning good multimodal repre- sentations and filtering uncorrelated noise. 
This property of filtering uncorrelated factors is of actual practical significance in tasks such as perception of spoken words and utterances. It would be desirable to construct representations which for example, would map both a function word spoken in an angry tone and an angry word spoken in a neutral tone to a cluster denoting angry emotion. 102 Table 5.2: Precision@K scores (denoted as P@K) corresponding to K=10, 50 and 100 for the importance-based autoencoder and other baseline models for the task of retrieval of similar multimodal MNIST-TIDIGITS paired data. Embedding Model 2D Pr@10 2 D Pr@50 2 D Pr@100 50D Pr@10 50D Pr@50 50D Pr@100 Importance-based Autoencoder (Weighted Fusion) 0.8042 0.7812 0.7718 0.9489 0.9326 0.9229 Importance-based Autoencoder (Unweighted Fusion) 0.7649 0.7354 0.7236 0.9414 0.8988 0.8635 Importance-based Autoencoder (Early Fusion) 0.8468 0.7997 0.7630 0.9471 0.8989 0.8553 JMVAE-KL (Suzuki et al., 2016) 0.7382 0.7007 0.6785 0.8986 0.7554 0.6707 MVAE (Wu & Goodman, 2018) 0.6108 0.5598 0.5406 0.8872 0.7338 0.6711 5.5.4 Downstream Digit Classification Performance Table 5.3 reports the classification accuracies on the digit classification task for both 2-D and 50-D multimodal representations for the importance based autoen- coder and other baselines reported in Section 5.4.2. The performance results are divided into three categories: (1) All multimodal samples with both true and noise data points (2) Multimodal samples with image noise but true spoken dig- its and (3) Multimodal samples with true MNIST image digits but speech noise. From an examination of the accuracies, the improvement due to multimodal rep- resentations and increase in representation dimensionality are notable. For the 2D experiments, the IMA model outperforms all other approaches, including the MVAE (Wu & Goodman, 2018) and JMVAE-KL (Suzuki et al., 2016) models. For 50Dexperiments, theIMAmodeloutperformstheunimodalandunweightedfusion approaches, but slightly less than the JMVAE-KL and MVAE approaches. This 103 occurs because in high dimensions, a strong classifier can estimate the importance of modalities directly when provided target digit labels, without being dependent on an input multimodal representation. The unimodal classifier trained only on image samples performs poorly when classifying image noise since the samples do not correspond to proper digits. However the multimodal representations can uti- lize information from the other modality even in presence of noise. The accuracy scores are not comparable to state-of-the-art scores (>99.5%) obtained on MNIST in contemporary literature (Wan et al., 2013), since a simple classifier on top of the multimodal representations is trained without convolutional layers. Also, all models are trained on a combination of true and noise samples, which reduces classification performance. (a) MNIST (Image Scores) (b) TIDIGITS (Speech Scores) Figure 5.5: Precision, Recall and F1 score curves for the importance-based autoencoder trained on the MNIST-TIDGITS dataset. Only the positive category (positive indicates presence of noise) is selected for reporting metrics. 104 Table 5.3: Test set accuracies obtained by the multimodal embeddings from the importance-based autoencoder and other baseline models. Accuracies are shown both for 2 and 50-dimensional representations. Multimodal embeddings from the proposed model outperform unimodal and fusion approaches without importance- based weighting. 
For each model the performance x(y,z) is reported, where x is the overall accuracy. y is the accuracy on (noise image, true spoken digit) paired samples. z is the accuracy on (true digit image, noise speech) paired samples. Embedding Model 2D 50D Unimodal VAE (MNIST Image) 50.82 (10.30,56.77) 89.04 (11.18, 98.09) Unimodal VAE (TIDIGIT speech) 46.06 (50.65, 9.11) 90.18 (98.46, 11.65) Importance-based Autoencoder (Weighted Fusion) 83.88 (69.29, 82.20) 96.94 (86.99, 94.46) Importance-based Autoencoder (Unweighted Fusion) 79.88 (59.86, 63.13) 96.84 (86.54, 93.61) Importance-based Autoencoder (Image View) 76.90 (12.06,83.05) 86.92 (9.41, 95.31) Importance-based Autoencoder (Speech View) 62.26 (66.00, 10.80) 82.88 (90.80, 11.70) Importance-based Autoencoder (Early Fusion) 80.68 (52.63,70.33) 97.4 (88.11, 92.97) MVAE (Wu & Goodman, 2018) 41.32 (39.69, 10.38) 98.18 (94.95,92.37) JMVAE (Suzuki et al., 2016) 53.06 (23.24, 14.40) 97.9 (93.85, 93.22) 5.6 Experimental Results on IEMOCAP The IMA model is trained on the IEMOCAP dataset, where paired data con- sisting of English vocabulary words and non-verbal acoustic features are provided as inputs, and the model not only learns the underlying latent factor z generating this data, but also the region in each unimodal representation u j which is ideally 105 uncorrelated with z. Experimental results show that the latent factor z learnt by the autoencoder bottleneck is discriminative of the primary emotions happy, angry and sad. Function words such as I, the and of play syntactic roles and are hence not related to z, likewise in the acoustic modality, the neutral tone is expected to co-occur with a different range of emotions. For example, even if an angry word is spoken in a neutral tone, the resulting emotion is expected to be angry, and the model should learn that the acoustic modality does not convey any important information in this case. Experiments show that while IMA is successful in iden- tifying non-important words and acoustic frames, it is still trained on a limited amount of data (around 10 hours of speech) in an unsupervised manner, and thus there is a significant amount of noise during training due to sparsity of words. For model training only vocabulary words which appear at least 5 times in the corpus are considered, resulting in a total vocabulary size of 1215 words. 5.6.1 Visualization of Learnt Embeddings Figure 5.6 shows unimodal representations learnt by the IMA model on the IEMOCAP dataset in two dimensions. The alignment between the vocabulary word representations and acoustic representations is evident from Figure 5.6(a-c). The happy, angry and sad words lie in their respective emotion regions. Note the higherfrequencyofsadandhappywordsatthetop-leftareaandangrywordsatthe bottom right in Figure 5.6(b). For the importance representations in Figure 5.6(d), the function words (at, there, for) are separated out from the emotionally colored words. For the acoustic importance representations, it is observed that the neutral framesarespread outinthemanifold andtheIMAmodel isabletodetect acertain fraction of these frames (≈30%). 106 (a) Word Representations (b) Distribution of emotions (c) Acoustic Representation (d) Importances (Word) (e) True Importances (Speech) (f) Pred Importances (Speech) Figure 5.6: Visualization of different unimodal representations learnt by the IMA model on the IEMOCAP dataset. Figure 5.6(a) shows the unimodal representations for dif- ferent happy, angry and sad words. 
Figure 5.6(b) shows the same representations for all emotional words, showing that locations of emotional words are also aligned with the acoustic representations in Figure 5.6(c). Note that the colors blue, red and yellow respectively denote the happy, angry and sad emotions. The importance network rep- resentations are shown in Figures 5.6(d)-(f) for words and acoustic frames. The regions learned by the network as important are in blue; rest are in red. In acoustic modal- ity, neutral speech frames are in gray. Most function words in Figure 5.6(d) have been identified by IMA as not important for emotion. 107 (a) Word Scores (b) Non-verbal Acoustic Scores Figure 5.7: Precision, Recall and F1 score curves for the importance-based autoencoder trained on the IEMOCAP dataset. Both positive and negative class F1 scores are con- sidered, with their average. The optimal ρ for the verbal modality is a low value (0.05), while it is around 0.6 for the acoustic modality. 5.6.2 Importance Network Performance Figure 5.7 shows the variation in the F1 scores for positive and negative cat- egories for two configurations: (1) The importance prior hyper-parameter ρ for words is set to 0.2, andρ for the acoustic non-verbal features is varied in the range [0.1-0.9], and (2) ρ for the acoustic modality is set to 0.6, and for words varied in the range [0.01-0.09]∪[0.1-0.9]. The positive category is defined as (1)verbal: vocabulary words labeled as negative emotion and positive emotion in the LIWC lexicon and (2)speech: acoustic non-verbal feature vectors labeled as belonging to a neutral category in the IEMOCAP dataset. The evaluation is performed on the validation set. The F1 scores are computed for both the positive and negative classes in each modality, along with the average F1 scores. This is done to provide adequate performance when evaluated on both categories. From Figure 5.7, the average F1 score for words peaks atρ = 0.05 (value of 0.5631), and for the acoustic modality asρ = 0.6 (value of 0.5176). Since a lot of vocabulary words do not have 108 emotional information, ρ is thus set to a small value. The F1 scores are also com- pared to a random classifier in both modalities. The random classifier labels each data point as 0 with probability r and 1 with probability 1−r, where r∈ [0,1]. For vocabulary words, the best average F1 score obtained was 0.5476 at r = 0.91; and in the acoustic modality the best average F1 score obtained was 0.4891 at r = 0.4. In both cases, the importance network performance outperforms that of the random baseline. 5.6.3 Emotion Retrieval Results As described in Section 5.4.4, utterances in the validation set of the IEMOCAP dataset are input to the IMA model at word level, and the multimodal represen- tation is obtained for each utterance. Table 5.4 shows the Precision@K scores for the multimodal representation representation z obtained from the IMA model and the following baselines : (1) early fusion of the unimodal representations u 1 and u 2 .(2) Unweighted average of the unimodal representations u 1 and u 2 . For com- parison, we have also reported the Precision@K scores obtained when using the average unimodal representation for each utterance, both for verbal and acoustic modalities. For all approaches, the verbal modality has produced lower precision scores than acoustic non-verbal modality, which is due to the high prevalence of function words in verbal utterances which do not consist of affective information. 
The IMA model weighs the acoustic information higher compared to text, and thus the weighted fusion scores are closer to the performance obtained from speech. The advantage of importance computation is apparent from first and second rows of Table 5.4, where the weighted retrieval performance is higher than unweighted retrieval performance. The JMVAE-KL model achieves better performance com- pared to the IMA model for both 2 and 50D, while the IMA model outperforms 109 Table 5.4: Precision@K scores (denoted as P@K) at K=10 and 50 (Mean Average Precision) scores for the importance-based autoencoder (IMA model) and other baseline models for the task of retrieval of utterances from IEMOCAP with the same primary emotion which is one of happy, angry, sad, neutral. Embedding Model 2D Pr@10 2 D Pr@50 50D Pr@10 50D Pr@50 Importance-based Autoencoder (Weighted Fusion) 0.4187 0.3980 0.4908 0.4406 Importance-based Autoencoder (Unweighted Fusion) 0.4117 0.3931 0.4896 0.4506 Importance-based Autoencoder (Early Fusion) 0.4378 0.3982 0.4960 0.4295 Importance-based Autoencoder (Audio Only) 0.4242 0.4005 0.4901 0.4425 Importance-based Autoencoder (Text Only) 0.3912 0.3516 0.4340 0.3620 JMVAE-KL Suzuki et al. (2016) 0.4259 0.4054 0.4960 0.4518 MVAE Wu & Goodman (2018) 0.3874 0.3626 0.4979 0.4561 MVAE in two-dimensions. These results address research question Q2.2 and show that there is a performance improvement obtained by learning importances in each modality. 110 5.7 Conclusions In this chapter, a multimodal representation learning model namely IMA (Importance-basedMultimodalAutoencoder)isproposedwhichcanintegratemul- tiple modalities to learn joint representations. Additional unimodal-specific align- ment terms are proposed in the variational lower bound to enable training of unimodal inference networks in a computationally efficient manner and perform inference in the absence of modalities. Further, the IMA model also investigates the problem of detecting uncorrelated noise through a cross-covariance based loss function which is minimized by modality-specific importance networks. Exper- iments are performed on standard datasets such as MNIST/TIDIGITS as well as real-world conversational datasets such as IEMOCAP. Retrieval and classifi- cation experiments are performed to evaluate quality of the learned multimodal representations for digit recognition and emotion understanding. The IMA model outperforms unimodal and other multimodal fusion approaches on the task of digit retrieval and classification, and is competitive with state-of-the-art approaches at learning representations of spoken utterances. Future work could focus on exten- sions to more than two modalities (such as visual), and sequence models such as RNNs (Recurrent Neural Networks) in the encoder-decoder networks for utterance level analysis. 111 Chapter 6 Language Modeling with Affective and Non-verbal Cues 6.1 Introduction In Chapters 3 and 4 of this dissertation, the research focus was on unimodal analysis from the visual modality (for facial expression recognition) and the acous- tic modality (for the task of speech emotion recognition). While cues from these modalities are of paramount importance for an understanding of human affect and conversations, the primary exchange of information between participants in a conversation is through verbal messages. 
The information contained in spoken utterances not only consist of semantics and other speech acts such as statements andquestions, butalsonon-verbalinformationinspokenwordswhichexpressover- all sentiment of the speaker in addition to the verbal cues already present in the conversation. The importance of affective and non-verbal acoustic cues for conver- sational understanding is described in Chapter 2 Section 2.1.3, along with prior work in this domain. Considering the importance of affective/behavioral cues to augment information present in verbal messages, it is justified to investigate the hypothesis that the integration of affective information, and/or non-verbal infor- mation in modalities accompanying verbal messages could improve the task of predicting the next word in a spoken utterance. For example, given the sentence I am crying and I feel so..., the next word could be influenced by not only the 112 emotion contained in the previous words (which is sad in this example), but also the negative tone in the accompanying voice. Thus the experiments described this chapter investigate the aforementioned hypothesis for the task of neural language modeling. Language modeling is already widely studied in existing literature, and is uti- lized for applications such as automatic speech recognition (Stolcke, 2002), cap- tioning of images and video (Kiros et al., 2014), as well as dialogue understanding. The interested reader is referred to Chapter 2 for a detailed discussion of prior work in this research area. This chapter proposes extensions to neural language models which incorporate the following additional cues: • Emotional information derived from the linguistic word context to better pre- dict the next word. This model is subsequently referred to as Affect-LM and described in Section 6.5. • Non-verbal cues present in the acoustic context (as extracted from prosodic speech features co-occurring with spoken words) to improve prediction of the next word. This model is subsequently referred to as Speech-LM and described in Section 6.6. The remainder of this chapter is described as follows - in Section 6.2 the primary researchquestionsaddressedbythisworkarestated, alongwiththemaincontribu- tions. In Section 6.3 and 6.4, a general framework for neural language modeling is introduced, along with corpora for training the Affect-LM and Speech-LM models. Sections 6.5 and 6.6 describe these language models respectively; these sections also include the experimental setup and primary results for perplexity evaluation and text generation. The chapter is concluded in Section 6.7. 113 6.2 Research Questions Motivated by the importance of affective and non-verbal cues for the task of language modeling, as described in the introduction, the main research questions addressed in this chapter are as follows: • Q3.1: Can language modeling of conversational text be improved by incorpo- rating affective information in the linguistic context ? • Q3.2: Are the sentences generated using the affective neural language model emotionally expressive as evaluated in a crowd-sourced perception study, while not sacrificing grammatical correctness ? • Q3.3: Can language modeling performance be improved through integration of linguistic information from the context and non-verbal acoustic features accom- panying the spoken words ? • Q3.4: How is the performance improvement distributed across different word categories when integrating non-verbal acoustic features with the linguistic con- text ? 
The research questions stated above are novel, and addressing them would significantly contribute to the research frontier in the area of neural language modeling of conversational text and affective language generation. The shortcomings in related work which are addressed in this chapter are:
• While the integration of verbal content and affective information (through emotion sensing from text or an accompanying modality) has already been investigated for applications such as depression recognition (Ghosh et al., 2014) and multimodal sentiment analysis (Rosas et al., 2013), this integration has not been investigated in the data-driven framework of neural language modeling. Further, though neural language models conditioned on other modalities such as images have been proposed, as in Vinyals et al. (2015), they have not been integrated with affect or non-verbal acoustic cues.
• In the field of conversational text generation, previous literature has not focused sufficiently on customizable state-of-the-art neural network techniques to model emotional text, nor has it quantitatively evaluated such models on multiple emotionally colored corpora. In contrast to the techniques discussed in this chapter, previous approaches such as those described in Mahamood & Reiter (2011) and Mairesse & Walker (2007) use syntactic prior knowledge and utilize heuristics for language generation.

6.3 Baseline Language Model
In this section, a neural language model which utilizes an LSTM (Long Short-Term Memory) network for next word prediction is described. This model is chosen as a baseline for subsequent experiments since it has been reported to achieve state-of-the-art perplexities compared to other approaches, such as n-gram models with Kneser-Ney smoothing (Jozefowicz et al., 2016). Unlike an ordinary recurrent neural network, an LSTM network does not suffer from the vanishing gradient problem, which is more pronounced for very long sequences (Hochreiter & Schmidhuber, 1997). Formally, by the chain rule of probability, for a sequence of M words w_1, w_2, ..., w_M, the joint probability of all words is given by:

P(w_1, w_2, ..., w_M) = \prod_{t=1}^{M} P(w_t \mid w_1, w_2, ..., w_{t-1})   (6.1)

If the vocabulary consists of V words, the conditional probability of word w_t as a function of its context c_{t-1} = (w_1, w_2, ..., w_{t-1}) is given by:

P(w_t = i \mid c_{t-1}) = \frac{\exp(U_i^{T} f(c_{t-1}) + b_i)}{\sum_{k=1}^{V} \exp(U_k^{T} f(c_{t-1}) + b_k)}   (6.2)

f(.) is the output of an LSTM network which takes in the context words w_1, w_2, ..., w_{t-1} as inputs through one-hot representations, U is a matrix of word representations which on visualization corresponds to POS (Part of Speech) information, while b_i is a bias term capturing the unigram occurrence of word i. Equation 6.2 expresses the word w_t as a function of its context for an LSTM language model which does not utilize any additional affective or non-verbal acoustic information.

6.4 Corpora for Language Model Training
The Fisher English Training Speech Corpus is the main corpus used for training the Affect-LM and Speech-LM models, in addition to which three emotionally colored conversational corpora are selected for training Affect-LM. A brief description of each corpus is given below, and in Table 6.1 relevant statistics are reported, such as the total number of words, along with the fraction of emotionally colored words (those belonging to the LIWC (Pennebaker et al., 2001) affective word categories) in each corpus.
• Fisher English Training Speech Parts 1 & 2: The Fisher dataset (Cieri et al., 2004) consists of speech from telephonic conversations of 10 minutes each, along with their associated transcripts. Each conversation is between two strangers who are requested to speak on a randomly selected topic from a set. Examples of conversation topics are Minimum Wage, Time Travel and Comedy. 116 Table 6.1: Summary of corpora used in the experiments. CMU-MOSI and SEMAINE are observed to have higher emotional content than Fisher and DAIC corpora. The type of conversations are also indicated for each corpus. The Fisher corpus is used to train the Affect-LM and Speech-LM models. The remaining corpora are not utilized for training Affect-LM due to their small size; but for adaptation from an existing model already trained on Fisher corpus. Corpus Name Conversations Words % Colored Words Content Fisher 11700 21167581 3.79 Telephone Conversations DAIC 688 677389 5.13 Distress Assessment Interviews SEMAINE 959 23706 6.55 Emotional Conversations (with SAL) CMU-MOSI 93 26121 6.54 Video Monologues • Distress Assessment Interview Corpus (DAIC): The DAIC corpus intro- duced by Gratch et al. (2014) consists of 70+ hours of dyadic interviews between a human subject and a virtual human, where the virtual human asks questions designed to diagnose symptoms of psychological distress in the subject such as depression or PTSD (Post Traumatic Stress Disorder). • SEMAINE dataset: SEMAINE (McKeown et al., 2012) is a large audiovisual corpus consisting of interactions between subjects and an operator simulating a SAL (Sensitive Artificial Listener). There are a total of 959 conversations which are approximately 5 minutes each, and are transcribed and annotated with affective dimensions. • Multimodal Opinion-level Sentiment Intensity Dataset (CMU- MOSI): (Zadeh et al., 2016) This is a multimodal annotated corpus of opinion videos where in each video a speaker expresses his opinion on a commercial product. The corpus consist of speech from 93 videos from 89 distinct speakers (41 male and 48 female speakers). This corpus differs from the others since it contains monologues rather than conversations. While all corpora contain spoken language, they have the following characteris- tics different from the Fisher corpus: (1) More emotional content as observed 117 in Table 6.1, since they have been generated through a human subject’s sponta- neous replies to questions designed to generate an emotional response, or from conversations on emotion-inducing topics (2) Domain mismatch due to recording environment (for example, the DAIC corpus was created in a mental health set- ting, while the CMU-MOSI corpus consisted of opinion videos uploaded online). (3) Significantly smaller than the Fisher corpus, which is 25 times the size of the other corpora combined. Training of the Affect-LM model is not directly performed on the emotionally colored corpora (MOSI, SEMAINE and DAIC) due to their smaller size. Though for future work, combined corpora could be utilized for training, it is expected to not significantly alter experimental results and is beyond the scope of this work. Only the Fisher corpus is used for Speech-LM training due to the usage of prosodic word-level acoustic features by the model, which would increase annotation and alignment costs if done on other corpora. 
Further, a difference in microphone recording characteristics leads to a mismatch in acoustic features with the Fisher corpus, and the author considers the 2000 hours of spoken data in the Fisher corpus sufficiently large for training the Speech-LM model.

6.5 Affect-LM Language Model
In this section, the first language model, namely Affect-LM, is introduced to address research questions Q3.1 and Q3.2. The primary contributions of this model are (1) the integration of emotional information from the word context to improve language modeling in terms of perplexity, and (2) the generation of conversational text conditioned on emotional categories which is also grammatically correct, as measured in a user perception study.

6.5.1 Model Formulation
The Affect-LM model has an additional energy term in the word prediction compared to Equation 6.2, and can be described by the following expression:

P(w_t = i \mid c_{t-1}, e_{t-1}) = \frac{\exp(U_i^{T} f(c_{t-1}) + \beta V_i^{T} g(e_{t-1}) + b_i)}{\sum_{k=1}^{V} \exp(U_k^{T} f(c_{t-1}) + \beta V_k^{T} g(e_{t-1}) + b_k)}   (6.3)

e_{t-1} is an input vector which consists of affect category information obtained from the words in the context during training, and g(.) is the output of a network operating on e_{t-1}. V_i is an embedding learnt by the model for the i-th word in the vocabulary and is expected to be discriminative of the affective information conveyed by each word. Figure 6.1 presents a visualization of these affective representations. The parameter \beta defined in Equation 6.3, which is subsequently called the affect strength, defines the influence of the affect category information (frequency of emotionally colored words) on the overall prediction of the target word w_t given its context. \beta is an important parameter in this work, and as shown subsequently in Section 6.5.8, adjusting \beta changes the amount of emotional content in generated text. The formulation can be considered as an energy-based model (EBM), where the additional energy term captures the degree of correlation between the predicted word and the affective input (Bengio et al., 2003).

6.5.2 Descriptors for Affect Category Information
The Affect-LM model learns a generative model of the next word w_t conditioned not only on the previous words w_1, w_2, ..., w_{t-1} but also on the affect category e_{t-1}, which is additional information about emotional content. During model training, the affect category is inferred from the context data itself. Thus a suitable feature extractor is defined which can utilize an affective lexicon to infer emotion in the context. For the experiments here, the Linguistic Inquiry and Word Count (LIWC) text analysis program (Pennebaker et al., 2001) is utilized. LIWC has been previously described in Chapter 2. In this work, all word categories of LIWC corresponding to affective processes are considered: positive emotion, angry, sad, anxious, and negative emotion. Thus the descriptor e_{t-1} has five features, each denoting the presence or absence of a specific emotion, obtained by binary thresholding of the features extracted from LIWC. For example, the affective representation of the sentence i will fight in the war is e_{t-1} = {"sad":0, "angry":1, "anxiety":0, "negative emotion":1, "positive emotion":0}.

6.5.3 Affect-LM Network Architecture
A baseline LSTM language model is implemented in Tensorflow (Abadi et al., 2016), which follows the non-regularized implementation described in Zaremba et al. (2014), and to which a separate energy term is added for the affect category in implementing Affect-LM.
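As an illustration of the prediction rule in Equation 6.3, the following NumPy sketch computes the affect-conditioned word distribution for a single time step. The parameter shapes are placeholders, and the LSTM output f(c_{t-1}) and affect-network output g(e_{t-1}) are assumed to be precomputed vectors rather than real network activations.

```python
# Toy illustration of Eq. 6.3; shapes and values are assumptions, not the
# dissertation's TensorFlow implementation.
import numpy as np

def affect_lm_distribution(f_ctx, g_affect, U, V, b, beta):
    """f_ctx: (d,) LSTM context output, g_affect: (d,) affect-network output,
    U, V: (vocab, d) word and affect embeddings, b: (vocab,) bias, beta: affect strength."""
    energy = U @ f_ctx + beta * (V @ g_affect) + b   # one energy per vocabulary word
    energy -= energy.max()                           # numerical stability
    p = np.exp(energy)
    return p / p.sum()                               # softmax over the vocabulary

# beta = 0 recovers the baseline LSTM-LM of Eq. 6.2; larger beta lets the affect
# category e_{t-1} dominate the choice of the next word.
vocab, d = 10000, 200
rng = np.random.default_rng(0)
p = affect_lm_distribution(rng.standard_normal(d), rng.standard_normal(d),
                           rng.standard_normal((vocab, d)), rng.standard_normal((vocab, d)),
                           np.zeros(vocab), beta=1.75)
print(p.shape, p.sum())   # (10000,), sums to ~1.0
```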
The vocabulary size is set to 10000 words with an LSTM network with 2 hidden layers and 200 neurons per hidden layer. The network is unrolled for 20 time steps, and the size of each minibatch is 20. The affect category e t−1 is processed by a multi-layer perceptron with a single hidden layer of 100 neurons and sigmoid activation function to yield g(e t−1 ). The output layer size is set to 200 for both f(c t−1 ) and g(e t−1 ). The network architecture is kept constant throughout for ease of comparison between the baseline and Affect-LM. 120 6.5.4 Affect-LM Perplexity Evaluation Setup To address research question Q3.1 and evaluate whether additional emotional information could improve the prediction performance, the corpora detailed in Section 6.4 are used for training in two stages as described below: (1)TrainingandvalidationofthelanguagemodelsonFisherdataset-The Fisher corpus is split in a 75:15:10 ratio corresponding to the training, validation and evaluation subsets respectively, and following the implementation in Zaremba et al. (2014), the language models (both the baseline and Affect-LM) are trained on the training split for 13 epochs, with a learning rate of 1.0 for the first four epochs, and the rate decreasing by a factor of 2 after every subsequent epoch. The learning rate and neural architecture are the same for all models. The model is validated over the affect strength β∈ [1.0,1.5,1.75,2.0,2.25,2.5,3.0]. The best performing model on the Fisher validation set is chosen and used as a seed for subsequent adaptation on the emotionally colored corpora. (2)Fine-tuning the seed model on other corpora- Each of the three corpora - CMU-MOSI, DAIC and SEMAINE are split in a 75:15:10 ratio to create individual training, validation and evaluation subsets. For both the baseline and Affect-LM, the best performing model from Stage 1 (the seed model) is fine-tuned on each of the training corpora, with a learning rate of 0.25 which is constant throughout, and a validation grid ofβ∈ [1.0,1.5,1.75,2.0]. For each model adapted on a corpus, the perplexities obtained by Affect-LM and the baseline model are compared when evaluated on that corpus. 6.5.5 Affect-LM for Sentence Generation Affect-LM can be used to generate sentences conditioned on the input affect category, the affect strengthβ, and the context words. For experiments to address 121 research question Q3.2, the following affect categories have been chosen - positive emotion, anger, sad, anxiety, and negative emotion (which is a superclass of anger, sad and anxiety). As evident from Equation 6.3, the affect strength β defines the degree of dominance of the affect-dependent energy term on the word prediction in the language model, consequently after model training β can be changed to control the degree of how “emotionally colored" a generated utterance is, varying fromβ = 0(neutral; baselinemodel)toβ =∞(thegeneratedsentencesonlyconsist of emotionally colored words, with no grammatical structure). 6.5.6 Sentence Generation Perception Study Affect-LM’s abilitytogenerateemotionallycoloredtextofvaryingdegreeswith- out severely deteriorating grammatical correctness (as stated in research question Q3.2) was evaluated by conducting an extensive perception study on Amazon’s Mechanical Turk (MTurk) platform. The MTurk platform has been successfully used in the past for a wide range of perception experiments and has been shown to be an excellent resource to collect human ratings for large studies (Buhrmester et al., 2011). 
Specifically, more than 200 sentences for four sentence beginnings were generated (namely the three sentence beginnings listed in Table 6.3, as well as an end-of-sentence token indicating that the model should generate a new sentence) in five affect categories: happy (positive emotion), angry, sad, anxiety, and negative emotion. The Affect-LM model trained on the Fisher corpus was used for sentence generation. Each sentence was evaluated by two human raters with a minimum approval rating of 98% who are located in the United States. The human raters were instructed that the sentences should be considered to be taken from a conversational rather than a written context: repetitions and pause fillers (e.g., um, uh) are common and no punctuation is provided. The human raters evaluated each sentence on a seven-point Likert scale for the five affect categories, overall affective valence as well as the sentence's grammatical correctness, and were paid 0.05 USD per sentence. Inter-rater agreement was measured using Krippendorff's α, and considerable agreement was observed between raters across all categories (e.g., for valence α = 0.510 and grammatical correctness α = 0.505).
For each target emotion (i.e., intended emotion of generated sentences) an initial MANOVA was conducted, with human ratings of affect categories as the DVs (dependent variables) and the affect strength parameter β as the IV (independent variable). This was followed up by univariate ANOVAs to identify which DV changes significantly with β. A total of 5 MANOVAs and 30 follow-up ANOVAs were conducted, which necessitated updating the significance level to p<0.001 following a Bonferroni correction.

6.5.7 Perplexity Results
In Table 6.2, research question Q3.1 is addressed by presenting the perplexity scores obtained by the baseline model and Affect-LM when trained on the Fisher corpus and subsequently adapted on three emotional corpora (each adapted model is individually trained on CMU-MOSI, DAIC and SEMAINE). The models trained on Fisher are evaluated on all corpora, while each adapted model is evaluated only on its respective corpus. For all corpora, Affect-LM achieves lower perplexity on average than the baseline model, implying that affect category information obtained from the context words does improve language model prediction. The average perplexity improvement is 1.44 (relative improvement 1.94%) for the model trained on Fisher, while it is 0.79 (1.31%) for the adapted models. Larger improvements in perplexity are observed for corpora with higher content of emotional words. This is supported by the results in Table 6.2, where Affect-LM obtains a larger reduction in perplexity for the CMU-MOSI and SEMAINE corpora, which respectively consist of 2.76% and 2.75% more emotional words than the Fisher corpus.

Table 6.2: Evaluation perplexity scores obtained by the baseline and Affect-LM models when trained on Fisher and subsequently adapted on the DAIC, SEMAINE and CMU-MOSI corpora. In all cases the Affect-LM model obtains lower perplexity than the baseline LSTM-LM. The corpora with more emotional content (such as the CMU-MOSI and SEMAINE datasets) show a greater perplexity reduction than the Fisher corpus.

Corpus      Training (Fisher)          Adaptation
            Baseline    Affect-LM      Baseline    Affect-LM
Fisher      37.97       37.89          -           -
DAIC        65.02       64.95          55.86       55.55
SEMAINE     88.18       86.12          57.58       57.26
CMU-MOSI    104.74      101.19         66.72       64.99
Average     73.98       72.54          60.05       59.26
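The average improvements quoted above can be verified from the "Average" row of Table 6.2 with a few lines of arithmetic (shown here only as a sanity check; values are copied from the table):

```python
# Sanity check of the average perplexity improvements reported in Section 6.5.7.
base_fisher, affect_fisher = 73.98, 72.54
base_adapt,  affect_adapt  = 60.05, 59.26

print(base_fisher - affect_fisher,                          # ~1.44 absolute
      100 * (base_fisher - affect_fisher) / base_fisher)    # ~1.9% relative
print(base_adapt - affect_adapt,                            # ~0.79 absolute
      100 * (base_adapt - affect_adapt) / base_adapt)       # ~1.3% relative
```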
6.5.8 Text Generation Results

Table 6.3 shows three sentences generated by the model for the input sentence beginnings I feel so ..., Why did you ... and I told him to ... for each of five affect categories: happy (positive emotion), angry, sad, anxiety, and neutral (no emotion). They have been selected from a pool of 20 generated sentences for each category and sentence beginning. Affect-LM is trained on conversational transcripts rather than grammatically vetted text such as news articles or Wikipedia entries, and thus there can be grammatical imperfections in the generated text. It is evident that Affect-LM can generate conversational text conditioned on an input emotion category, which satisfactorily addresses the research question Q3.2 stated in the introduction.

Table 6.3: Example sentences generated by the model conditioned on different affect categories such as Happy, Angry, Sad, Anxious and Neutral. Affect-LM can generate sentences for each category which are perceptually affective without sacrificing grammatical correctness.

Beginning        Affect     Completed sentence
I feel so        Happy      good because i think that it's important to have a relationship with a friend
                 Angry      bad that i hate it and i hate that because they they kill themselves and then they fight
                 Sad        sad to miss because i i miss the feelings of family members who i lost feelings with
                 Anxious    horrible i mean i think when we're going to you know war and alert alert
                 Neutral    bad if i didn't know that the decision was going on
I told him to    Happy      be honest and i said well i hope that i'm going to be a better person
                 Angry      see why he was fighting with my son
                 Sad        leave the house because i hurt one and i lost his leg and hurt him
                 Anxious    be afraid of him and he he just he just didn't care about the death penalty
                 Neutral    do this position i think he is he's got a lot of money he has to pay himself a lot of money
Why did you      Happy      have a best friend
                 Angry      say it was only a criminal being killed at a war or something
                 Sad        miss your feelings
                 Anxious    worry about fear factor
                 Neutral    believe in divorce

Figure 6.1: Embeddings learnt by Affect-LM. Representative words from each emotion category (such as sad, angry, anxiety, negative emotion and positive emotion) are indicated on the scatter plot in different colors. Note how positive emotion is distinct and far away from the negative emotion categories in the plot.

6.5.9 MTurk Perception Study Results

In the following, research question Q3.2 is addressed by reporting the main statistical findings of the MTurk perception study, which are visualized in Figures 6.2 and 6.3.
Positive Emotion Sentences: The multivariate result was significant for positive emotion generated sentences (Pillai's Trace=.327, F(4,437)=6.44, p<.0001). Follow-up ANOVAs revealed significant results for all DVs except angry with p<.0001, indicating that both affective valence and happy DVs were successfully manipulated with β, as seen in Figure 6.2(a). Grammatical correctness was also significantly influenced by the affect strength parameter β, and results show that the correctness deteriorates with increasing β (see Figure 6.3). However, a post-hoc Tukey test revealed that only the highest β value shows a significant drop in grammatical correctness at p<.05.

Figure 6.2: Amazon Mechanical Turk study results for generated sentences in the target affect categories positive emotion, negative emotion, angry, sad, and anxious (a)-(e). The most relevant human rating curve for each generated emotion is highlighted in red, while less relevant rating curves are visualized in black. Affect categories are coded via different line types and listed in the legend below the figure. (Each panel plots emotion ratings on the 1-7 scale against the emotional strength β from 0 to 5; rated dimensions include happy, angry, sad, anxious and affective valence.)
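The statistical analysis of Section 6.5.6 (one MANOVA per target emotion with the affect ratings as DVs and β as the IV, followed by univariate ANOVAs under a Bonferroni-corrected threshold) can be sketched with standard tools. The snippet below is illustrative only: it uses the statsmodels package and synthetic random ratings in place of the collected MTurk data, and the column names are hypothetical.

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm

    rng = np.random.default_rng(0)
    n = 240  # stand-in for the number of rated sentences for one target emotion
    ratings = pd.DataFrame({"beta": rng.choice([0, 1, 2, 3, 4, 5], size=n)})
    for dv in ["happy", "angry", "sad", "anxious", "valence"]:
        ratings[dv] = rng.integers(1, 8, size=n)   # seven-point Likert ratings

    # MANOVA: affect-category ratings as DVs, affect strength beta as the IV.
    mv = MANOVA.from_formula("happy + angry + sad + anxious + valence ~ beta",
                             data=ratings)
    print(mv.mv_test())        # reports Pillai's trace, F statistic and p-value

    # Follow-up univariate ANOVAs at the Bonferroni-corrected level p < 0.001.
    alpha = 0.001
    for dv in ["happy", "angry", "sad", "anxious", "valence"]:
        table = anova_lm(ols(f"{dv} ~ C(beta)", data=ratings).fit())
        print(dv, "significant:", table.loc["C(beta)", "PR(>F)"] < alpha)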
Negative Emotion Sentences: The multivariate result was significant for negative emotion generated sentences (Pillai's Trace=.130, F(4,413)=2.30, p<.0005). Follow-up ANOVAs revealed significant results for affective valence and happy DVs with p<.0005, indicating that the affective valence DV was successfully manipulated with β, as seen in Figure 6.2(b). Further, as intended, there were no significant differences for the DVs angry, sad and anxious, indicating that the negative emotion DV refers to a more general affect related concept rather than a specific negative emotion. This finding is in concordance with the intended LIWC category of negative affect, which forms a parent category above the more specific emotions, such as angry, sad, and anxious (Pennebaker et al., 2001). Grammatical correctness was also significantly influenced by the affect strength β, and results show that the correctness deteriorates with increasing β (see Figure 6.3). As for positive emotion, a post-hoc Tukey test revealed that only the highest β value shows a significant drop in grammatical correctness at p<.05.
Angry Sentences: The multivariate result was significant for angry generated sentences (Pillai's Trace=.199, F(4,433)=3.76, p<.0001). Follow-up ANOVAs revealed significant results for affective valence, happy, and angry DVs with p<.0001, indicating that both affective valence and angry DVs were successfully manipulated with β, as seen in Figure 6.2(c). Grammatical correctness was not significantly influenced by the affect strength parameter β, which indicates that angry sentences are highly stable across a wide range of β (see Figure 6.3). However, it seems that human raters could not successfully distinguish between angry, sad, and anxious affect categories, indicating that the generated sentences likely follow a general negative affect dimension.
Sad Sentences: The multivariate result was significant for sad generated sentences (Pillai's Trace=.377, F(4,425)=7.33, p<.0001). Follow-up ANOVAs revealed significant results only for the sad DV with p<.0001, indicating that the sad DV can be successfully manipulated with β, as seen in Figure 6.2(d). The grammatical correctness deteriorates significantly with β. Specifically, a post-hoc Tukey test revealed that only the two highest β values show a significant drop in grammatical correctness at p<.05 (see Figure 6.3). A post-hoc Tukey test for sad reveals that β = 3 is optimal for this DV, since it leads to a significant jump in the perceived sadness scores at p<.005 compared to β ∈ {0, 1, 2}.

Figure 6.3: Mechanical Turk study results for grammatical correctness for all generated target emotions. Perceived grammatical correctness for each affect category is color-coded. (The plot shows grammatical correctness ratings on the 1-7 scale against emotional strength β from 0 to 5, with curves for happy, angry, sad, anxious and negative affect.)

Anxious Sentences: The multivariate result was significant for anxious generated sentences (Pillai's Trace=.289, F(4,421)=6.44, p<.0001).
Follow-up ANOVAs revealed significant results for affective valence, happy and anxious DVs with p<.0001, indicating that both affective valence and anxiety DVs were successfully manipulated with β, as seen in Figure 6.2(e). Grammatical correctness was also significantly influenced by the affect strength parameter β, and results show that the correctness deteriorates with increasing β. As for sad, a post-hoc Tukey test revealed that only the two highest β values show a significant drop in grammatical correctness at p<.05 (see Figure 6.3). Again, a post-hoc Tukey test for anxious reveals that β = 3 is optimal for this DV, since it leads to a significant jump in the perceived anxiety scores at p<.005 compared to β ∈ {0, 1, 2}.

6.5.10 Affective Word Representations

In Equation 6.3, Affect-LM learns a weight matrix V which captures the correlation between the predicted word w_t and the affect category e_{t-1}. Thus, each row V_i of the matrix is an emotionally meaningful embedding of the i-th word in the vocabulary. Figure 6.1 presents a visualization of these embeddings, where each data point is a separate word, and words which appear in the LIWC dictionary are colored based on which affect category they belong to (labeling is done only for words in the categories positive emotion, negative emotion, anger, sad and anxiety, since these categories contain the most frequent words). Words colored grey are those not in the LIWC dictionary. In Figure 6.1, it is observed that the embeddings contain affective information, with positive emotion highly separated from the negative emotions (sad, angry, anxiety), which are clustered together.

6.6 Speech-LM

In this section, the second language model, Speech-LM, is introduced to address research questions Q3.3 and Q3.4. The primary contributions of this model are (1) the integration of non-verbal information from the acoustic context to improve language modeling in terms of perplexity, and (2) a perplexity analysis of the improvements over a baseline LSTM language model across word categories, which shows that the majority of the improvements are for prediction of end-of-utterance tokens, backchannels and emotionally colored vocabulary words.

Figure 6.4: Neural architecture of the Speech-LM model. The processing at each timestep t is shown here, which is unrolled for the entire batch. f(c_{t-1}) and g(s_{t-1}) are representations of the word and non-verbal acoustic contexts respectively.

6.6.1 Model Formulation

In the Speech-LM model, the network is trained towards predicting the next word using both the word context c_{t-1} as well as the non-verbal acoustic context s_{t-1} as inputs. Hence, the conditional probability of word w_t is given by:

P(w_t = i \mid c_{t-1}, s_{t-1}) = \frac{\exp(U_i^T f(c_{t-1}) + \beta V_i^T g(s_{t-1}) + b_i)}{\sum_{j=1}^{N} \exp(U_j^T f(c_{t-1}) + \beta V_j^T g(s_{t-1}) + b_j)}    (6.4)

where s_{t-1} is provided as input to the LSTM network modeling g(.). V is a non-verbal output word embedding matrix and models the correlation between the next predicted word w_t and the representation g(s_{t-1}) learnt by the model from the non-verbal acoustic context. The non-verbal strength β defines the dominance of the acoustic context s_{t-1} on the next word prediction compared to the verbal information c_{t-1} coming in from the language model. Figure 6.4 shows the neural architecture of the Speech-LM model. The non-verbal strength β (which is a hyper-parameter) is automatically optimized using the validation procedure described in Section 6.6.4.
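Equation 6.4 amounts to a softmax over the sum of a verbal energy term U_i^T f(c_{t-1}) and a β-scaled non-verbal energy term V_i^T g(s_{t-1}). A minimal PyTorch-style sketch of this output layer is shown below. It is not the exact implementation used in this work: the layer sizes of the acoustic network g(.) are assumptions, while the remaining dimensions follow the settings reported in this chapter (vocabulary of 10,000 words, 2-layer word LSTM with 200 units, 1408-dimensional acoustic features, β = 0.5).

    import torch
    import torch.nn as nn

    class SpeechLMSketch(nn.Module):
        # Logits = U^T f(c_{t-1}) + beta * V^T g(s_{t-1}) + b, as in Equation 6.4.
        def __init__(self, vocab_size=10000, emb_dim=200, hidden=200,
                     acoustic_dim=1408, beta=0.5):
            super().__init__()
            self.beta = beta
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.word_lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)   # f(.)
            self.acoustic_lstm = nn.LSTM(acoustic_dim, hidden, batch_first=True)        # g(.)
            self.U = nn.Linear(hidden, vocab_size)               # verbal output embeddings + bias b
            self.V = nn.Linear(hidden, vocab_size, bias=False)   # non-verbal output embeddings

        def forward(self, word_ids, acoustic_feats):
            f_c, _ = self.word_lstm(self.embed(word_ids))        # (batch, time, hidden)
            g_s, _ = self.acoustic_lstm(acoustic_feats)          # (batch, time, hidden)
            return self.U(f_c) + self.beta * self.V(g_s)         # unnormalized log-probabilities

    # Hypothetical usage: next-word cross-entropy over aligned word/acoustic sequences.
    # loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 10000),
    #                                    word_ids[:, 1:].reshape(-1))

The softmax normalization of Equation 6.4 is applied implicitly by the cross-entropy loss during training.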
6.6.2 Processing of the Fisher Corpus

Prior to non-verbal feature extraction from the Fisher corpus, the following processing steps are performed:
1. Forced alignment: The Penn Phonetic Forced Aligner (P2FA) (Yuan & Liberman, 2008) was used for aligning the transcriptions with the speech signal at the word level. The outputs of the forced aligner were the timestamps for the start and end of each word in the corpus.
2. Processing of non-verbal speech units: Certain speech units convey useful communicative information about the conversation but do not have any lexical role. The Fisher corpus is already annotated with the presence of such non-verbal information (laughter, nasal sounds, lip smacks, coughing and breaths). All such non-verbal tokens are included in the vocabulary and considered in model training. Moreover, all occurrences of short pauses in the speech were ignored, and hence all the acoustic information pertaining to these events was discarded as well.
3. Processing of end-of-sentence tokens: The conversations are split into turn-based utterances, and the end of each speaker turn is denoted by a special <eos> token. A feature vector of the same dimensionality as the non-verbal feature set (with every element set to zero) is inserted to co-occur with the end of utterances.

To extract non-verbal speech information from the Fisher corpus, the OpenSMILE feature extraction toolbox (Eyben et al., 2010) was used. A detailed description of this toolbox is provided in Chapter 2, Section 2.1.3. Feature extraction was performed at the word level (i.e., a feature vector was extracted from the speech signal corresponding to the start and end timestamps for every token in the corpus). The MFCC (Mel Frequency Cepstral Coefficient) and other spectrogram based features were removed from the considered features in this work, since the goal is to model the non-verbal information of the speech signal rather than the uttered words themselves. Subsequent to feature extraction, the acoustic context vectors had a dimensionality of 1408, which is constant for the remainder of this study.

6.6.3 Baseline Models for Comparison

To address research question Q3.3 regarding improvements in language modeling performance achieved over a baseline LM, the performance of the Speech-LM model is compared with a baseline LSTM language model, where only information from the word context is utilized for prediction of the next word, and the Affect-LM model which was previously described in Section 6.5. It is expected that Speech-LM is potentially more powerful and can model additional sources of variation present in the non-verbal acoustic context, in contrast to Affect-LM, which considers additional emotional cues only from the affective words in the linguistic context. For all models, a vocabulary of N = 10,000 words is used with a mini-batch size of 20, and the sequence models are unrolled over 20 time-steps.

6.6.4 Speech-LM Training Methodology

The Fisher corpus is partitioned into training, testing and validation sets for model training, hyper-parameter tuning and evaluation. The corpus consists of approximately 12,000 speakers, and speaker-independent splits of the corpus were created, which implies that there are no speakers common between the training, validation and testing sets. This ensures that the model does not learn non-verbal representations from the corpus which are speaker dependent. The corpus split is 75:15:10 between training, validation and test set speakers respectively.
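A minimal sketch of the word-level acoustic feature extraction described in Section 6.6.2 is given below. It is illustrative only: the function and variable names are hypothetical, it assumes frame-level acoustic descriptors have already been computed at a fixed frame rate (e.g., with OpenSMILE), and it simply averages the frames falling inside each word's forced-alignment interval (whether the actual pipeline pools frames in this way or computes segment-level functionals is not specified here).

    import numpy as np

    def word_level_features(frame_feats, frame_rate_hz, word_times):
        # frame_feats: (num_frames, feat_dim) acoustic descriptors for one recording.
        # word_times: list of (token, start_sec, end_sec) tuples from the forced aligner.
        # Returns one feature vector per token; <eos> tokens receive an all-zero vector.
        feat_dim = frame_feats.shape[1]
        vectors = []
        for token, start, end in word_times:
            if token == "<eos>":
                vectors.append(np.zeros(feat_dim))
                continue
            lo = int(start * frame_rate_hz)
            hi = max(lo + 1, int(end * frame_rate_hz))
            vectors.append(frame_feats[lo:hi].mean(axis=0))
        return np.stack(vectors)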
The models Baseline and Speech-LM are both validated over a range of learning rates η ∈ {0.25, 0.5, 1.0}. Additionally, Speech-LM is validated over the non-verbal strength β ∈ {0.5, 1.0, 1.5, 2.0}. Affect-LM is validated with a constant learning rate of η = 1.0 (the best configuration reported in Ghosh et al. (2017)) and affect strength α ∈ {1.0, 1.5, 2.0, 2.5}. In each case, the best performing model (in terms of perplexity) on the validation set is chosen. Experimentally, it is found that the best performance is obtained with β = 0.5 and a learning rate η = 0.5 for Speech-LM.
Additionally, it is also of interest to obtain a more detailed insight into the perplexity reduction, particularly how it is distributed among the different tokens in the corpus, as stated in research question Q3.4. For example, the non-verbal acoustic features appearing in the context might help improve prediction of utterance-ending tokens, backchannels, or emotionally colored words. The test split of the corpus, which consists of 2,386,000 tokens and 9,805 unique vocabulary words, is selected for this experiment. The vocabulary words and tokens occurring in the Fisher corpus are split into seven categories: (1) negative valence words and (2) positive valence words, both listed in the LIWC lexicon (Pennebaker et al., 2001); (3) stop words such as a, the and of, as defined in the NLTK (Bird & Loper, 2004) stopword list; (4) para-linguistic sounds such as laughter, breaths, coughs and lip smacks; (5) backchannels such as ah, mhm and er; (6) the end-of-utterance <eos> symbol, which is the most frequent token in the corpus; and (7) a special category consisting of words which do not belong to the above categories.
For the compared models, the average prediction entropy Ĥ is computed for each unique word category over all (word and speech context, predicted word) tuples in the corpus test split, and the corresponding perplexities are obtained. This experimental setup provides an understanding of how any improvements in test perplexity are distributed over different word categories compared to a baseline LSTM neural language model.
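A minimal sketch of this category-wise analysis is shown below; the names are hypothetical. It assumes the per-token log-probabilities (natural logarithms) assigned by a trained model on the test split are available, so that the average prediction entropy Ĥ for a category is the mean negative log-probability over its tokens and the corresponding perplexity is exp(Ĥ), consistent with the values reported later in Table 6.5 (e.g., exp(1.675) ≈ 5.34).

    import math
    from collections import defaultdict

    def category_perplexities(token_log_probs, token_categories):
        # token_log_probs: natural-log probability of each test token under the model.
        # token_categories: the category label assigned to each test token.
        sums, counts = defaultdict(float), defaultdict(int)
        for lp, cat in zip(token_log_probs, token_categories):
            sums[cat] += -lp
            counts[cat] += 1
        return {cat: (sums[cat] / counts[cat],                # average entropy H_hat
                      math.exp(sums[cat] / counts[cat]))      # perplexity exp(H_hat)
                for cat in sums}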
6.6.5 Perplexity Results and Distributions

Table 6.4 shows the perplexities computed on the validation and test sets for a baseline LSTM language model, the Affect-LM model, and the Speech-LM model. The percentage improvement over the baseline perplexity for each model is also reported. The results show that the Speech-LM model achieves a 2.22% relative improvement in test perplexity over the baseline model, and a 2.09% improvement in perplexity over the Affect-LM model. This shows that it is beneficial to also consider non-verbal information from the acoustic signal in addition to the word context during language modeling.

Table 6.4: Summary of results from the perplexity evaluation between three competing models: the baseline LSTM model, Affect-LM and Speech-LM. Relative improvements in perplexity with respect to the baseline are reported in parentheses.

Language Model       Validation Perplexity    Test Perplexity
Baseline LSTM-LM     50.556                   50.877
Affect-LM            50.539 (0.03%)           50.813 (0.12%)
Speech-LM            49.360 (2.36%)           49.746 (2.22%)

Research question Q3.4 is investigated in Table 6.5, which shows the stated next predicted word categories along with their frequency, and the average prediction entropy Ĥ and perplexity for both the Baseline and Speech-LM models. Note that all the observed improvements using Speech-LM are highly significant, with p-values < 0.0001 using pairwise t-tests. The p-values also withstand strict Bonferroni corrections for multiple comparisons.

Table 6.5: Average entropy Ĥ and perplexity scores (denoted by P) for different predicted word categories, for the Baseline and Speech-LM models. Speech-LM achieves a reduction in perplexity over major word categories. All differences are highly significant and Speech-LM significantly outperforms the baseline model, with observed p-values < 0.0001 using paired t-tests. The word frequencies reported in the table correspond to the test set.

Predicted Word       Frequency   Ĥ (Baseline)   Ĥ (Speech-LM)   P (Baseline)   P (Speech-LM)   Reduction (P)
<eos>                9.22%       1.675          1.591           5.338          4.908           8.05%
Backchannels         3.95%       4.070          3.996           58.556         54.380          7.13%
Non-verbal sounds    1.74%       3.959          3.933           52.404         51.059          2.56%
Negative valence     0.90%       7.194          7.169           1331.418       1298.545        2.46%
Positive valence     2.96%       5.360          5.336           212.724        207.680         2.38%
Remaining words      37.97%      4.972          4.958           144.315        142.308         1.39%
Stop words           43.59%      3.321          3.309           27.688         27.357          1.19%

From an examination of Table 6.5, it is observed that the largest reduction in perplexity occurs when predicting the <eos> symbol, followed by backchannels and non-verbal sounds. This finding confirms that the non-verbal information indeed improves prediction of end-of-sentence tokens relevant to turn-taking (Cassell et al., 2001) in conversational interactions, by modeling changes in acoustics observed before speech pauses and backchannels. Additionally, modest reductions in perplexity are also obtained in affective word categories (i.e., positive and negative affect), which indicates that emotional or affect relevant information is conveyed through the non-verbal acoustic context. Lastly, stop words exhibit the lowest reduction in perplexity, which demonstrates that these word categories have only minimal interaction with non-verbal information in the speech context due to their ubiquitous nature.

6.6.6 Word Representations Learnt by the Speech-LM Model

The parameter V in Equation 6.4 is learnt during training and models the correlation between the non-verbal acoustic information and the next predicted word. Each row of the matrix V corresponds to a unique non-verbal word representation learned by the language model. These learned representations may be influenced by a number of factors of variation in the speech, such as gender, prosody, as well as affect. Figure 6.5 shows these representations, which have been reduced to two dimensions using t-SNE (t-Stochastic Neighbor Embedding) (Maaten & Hinton, 2008), from the Speech-LM. Major clusters corresponding to semantic groupings such as sports, drugs/addiction, as well as stop words (such as all and we) appear, indicating that speakers generally use similar prosody in the acoustic context of words belonging to these groupings. A special backchannel grouping is notable; it consists of fillers and pause tokens such as yeh, ugh, and yum. Thus the Speech-LM model learns backchannels and pause fillers in an unsupervised manner without explicit knowledge of them.

Figure 6.5: t-SNE word representations learnt by Speech-LM. Positive valence words are in blue; negative valence words are in red. (Cluster annotations in the figure include finance, backchannels, drugs and stop words, with example words such as insurance, employees, commission, economy; ouch, gotcha, wow, whoa, yep, huh; alcoholic, ban, ads, restaurants; and most, even, all, we, under.)
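A two-dimensional visualization of the kind shown in Figure 6.5 can be produced in outline with standard tools. The sketch below is illustrative only: random data stands in for the trained non-verbal output embedding matrix V, and the LIWC lookup is reduced to two placeholder word sets.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    vocab = [f"word{i}" for i in range(500)]          # placeholder vocabulary
    V = rng.normal(size=(len(vocab), 200))            # stand-in for the learned matrix V

    positive, negative = {"word1", "word2"}, {"word3", "word4"}   # stand-in LIWC lists
    colors = ["blue" if w in positive else "red" if w in negative else "grey"
              for w in vocab]

    coords = TSNE(n_components=2, random_state=0).fit_transform(V)
    plt.scatter(coords[:, 0], coords[:, 1], s=4, c=colors)
    plt.savefig("speech_lm_tsne.png")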
Further, data points are colored based on their membership in the LIWC affective categories described by Pennebaker et al. (2001) (blue for positive valence words and red for negative valence words). Words colored grey are those not in the LIWC dictionary. It is observed that any emotion discriminativeness in the model is primarily through these semantic groupings at the periphery of the embedding space; for example, the drugs cluster would mostly consist of sad words, while backchannels often indicate affirmative (and hence positive) responses.

6.6.7 Text Generation Results

The Speech-LM model utilizes additional non-verbal information from the acoustic signal to predict the next word given the context; hence a trained Speech-LM model could be utilized to generate text conditioned on a spoken sentence beginning. Thus, experiments are conducted to obtain an insight into this generation property of the model. Three sentence beginnings are considered: I feel so, We went to and It was a. For each of the three seeds, examples of spoken utterances in the test split of the Fisher corpus are obtained, where words in the seed also appear in the utterance; the non-verbal speech features co-occurring with the words are also extracted. Table 6.6 shows some examples of sentences generated by Speech-LM for the given starting seed. For each spoken instance of a seed, the speaker and prosody/tone will be different, and this will influence the generation of the next word. While it is evident that this model does generate conversational text for various instances of input prosody, a detailed investigation of the generation properties of this model could potentially be considered as future work.

Table 6.6: Example sentences generated by Speech-LM for instances of spoken words obtained from the Fisher corpus for different prosody instances.

Sentence Seed   Prosody instance   Completed sentence (β = 0.5)
I feel so       1                  good well that's good that's what i think
                2                  bad for a lot of science work in the market
                3                  good for jobs and then the first one was just the second
We went to      1                  penn state so you know they're they're not going to have
                2                  but we got to go to the war and we went to the united states
                3                  um the the first year of high school it was a lot of it
It was a        1                  lot of fun to see it was just in the back of the car
                2                  little bit more fun in the world and i think that
                3                  good one to get the little kids in the house

6.7 Conclusions

In this dissertation chapter, the problem of neural language modeling conditioned on affective and non-verbal acoustic information is addressed. Two novel models for conversational text are introduced, Affect-LM and Speech-LM, where the emotional information in the linguistic word context and the non-verbal cues in the acoustic context are respectively utilized for improving language modeling. The research questions stated in the introduction to this chapter are effectively addressed through experiments on these models, including perplexity evaluation on Fisher (a large conversational corpus) and MTurk perception studies on the emotional content and grammatical correctness of generated text. It is demonstrated that both Affect-LM and Speech-LM achieve a reduction in evaluation perplexity over a vanilla LSTM baseline, and that Affect-LM can be used to generate text conditioned on emotional categories at various levels of intensity without sacrificing grammatical correctness.
For future work, these models can be used in different applications, such as the integration of Speech-LM in a multimodal conversational agent which would utilize not only the linguistic content from words but also non-verbal information from the acoustic channel as additional cues. Integrating both these sources of information would not only help the system infer the next word, but also improve prediction of backchannels, non-verbal sounds and emotional words. The system would thus be able to take turns and also utilize Affect-LM to generate emotional responses interspersed with appropriate backchannels/paralinguistic units such as laughter and sounds like erm and huh for more human-like responses.

Chapter 7
Conclusions and Future Work

This dissertation has addressed research questions introduced in Chapter 1 which broadly span the areas of: (1) unimodal representation learning to achieve affect recognition performance comparable to manually-engineered feature sets; (2) multimodal representation learning which combines diverse sources of data such as the verbal modality and emotional speech; and (3) integration of affective and non-verbal information for modeling language generation. The work done in this dissertation has contributed to the first area through feature learning from facial images and speech spectrograms for the tasks of Action Unit and speech emotion recognition respectively. Further, this dissertation also contributes to the other two areas by introducing (1) an importance-based multimodal autoencoder for weighted multimodal fusion, and (2) two novel language models, Speech-LM and Affect-LM, which improve language modeling perplexity and can be utilized to generate emotionally colored text. In a unified manner, these contributions can aid the study of intelligent interaction through the creation of affect sensing systems which can exploit all available cues in data, as well as the generation of affect by virtual agents. The research questions described in Section 1.1 are revisited in Section 7.1 along with the contributions of this dissertation and potential application areas. Section 7.2 then describes some upcoming challenges and research problems which can be posed as a follow-up to the work done in this dissertation.

7.1 Main Contributions

In this section, the contributions of this dissertation to address the main research questions previously introduced in Chapter 1 are discussed, along with emerging application areas.

• Unimodal Representation Learning: This dissertation has addressed the problem of learning affective attributes directly from the raw data in individual modalities such as vision, language and speech. Chapter 3 describes a novel multi-label Convolutional Neural Network to learn directly from image pixels for the AU (Action Unit) recognition task, and in Chapter 4 denoising autoencoders are trained directly on temporal frames from speech spectrograms. It is shown through evaluation on well-defined classification benchmarks that the learned representations obtain comparable performance to standard feature extractors for AU and speech emotion recognition. The multi-label CNN approach also learns a shared representation across AUs, achieving competitive AU recognition performance without the requirement of separately trained networks for each Action Unit.
For the task of speech emotion recognition, it is also observed that representation learning from the glottal flow waveform extracted from speech obtains similar performance to representations learned from the speech signal. Further, an unsupervised dictionary learning approach is proposed for glottal inverse filtering which is effective across speech datasets and voice qualities. In Chapter 6, the integration of the verbal affective context of a word with neural language models also facilitates the learning of affective word embeddings, which can be utilized for downstream classification tasks such as sentiment analysis or conversational understanding.
The experiments described in this research area not only show the effectiveness of representation learning approaches but are also well-situated among contemporary research. Subsequently, there has been an increase in neural network based approaches for AU recognition (Chu et al., 2016; Bishay & Patras, 2017), and approaches have been proposed to learn end-to-end networks for valence and activation from speech waveforms (Trigeorgis et al., 2016). After the work on Affect-LM was published, researchers have investigated affective word embeddings in applications such as conversational agents (Asghar et al., 2017).

• Multimodal Representation Learning: While the approaches described in this dissertation are effective for unimodal representation learning, the research problem of efficiently integrating information from multiple modalities to learn meaningful representations of data is an open challenge. This dissertation addresses the problem of learning multimodal representations with weighted modalities in Chapter 5, where a novel model, the Importance-based Multimodal Autoencoder (IMA), is proposed. From an architecture perspective, the contributions of the IMA model are: (1) a probabilistic variational encoding framework in which multimodal representation learning through alignment and fusion of modalities is theoretically derived, and (2) modality-specific importance networks which learn whether the data in each modality is related to the underlying shared latent factor. Experiments are performed on both an MNIST-TIDIGITS image/spoken paired digit dataset and the IEMOCAP dataset, and the IMA model outperforms baseline approaches, including state-of-the-art models such as JMVAE (Suzuki et al., 2016) and MVAE (Wu & Goodman, 2018), on downstream classification and retrieval tasks related to digit and emotion recognition.

• Neural Language Modeling with Affective and Non-verbal Information: The dissertation work in this research area addresses the problems of whether the inclusion of affective and non-verbal cues can improve performance on the task of neural language modeling, and whether the resulting language model can generate emotionally colored sentences. It is shown that for the Affect-LM and Speech-LM models, the respective integration of affective and non-verbal information from a co-occurring modality (acoustic information for Speech-LM) is effective, resulting in perplexity reduction on this task when evaluated on Fisher, a large conversational speech corpus. It is observed that the perplexity reduction obtained by the Speech-LM model occurs mostly when predicting end-of-sentence tokens, backchannels and emotional words.
Additionally, the interaction of spoken words with affective cues is also learnt by the Affect-LM model, thus facilitating the generation of emotionally expressive conversational text which is also grammatically coherent, as evaluated through MTurk user studies. The Affect-LM and Speech-LM models can be utilized for different applications, for example the design of chatbots which can integrate cues from multiple modalities to generate emotional responses to user queries. Similar applications are being studied in subsequent work (Zhou et al., 2017; Asghar et al., 2017).
Besides the three major research areas described above, the approaches proposed in the dissertation are also related to problems such as disentangling factors of variation and cross-domain robustness. Information unrelated to affect is often filtered out as a pre-processing step to subsequent modeling. Examples include subject normalization (for the multi-label CNN model) and removal of linguistic information from the acoustic modality through glottal inverse filtering, as explained in Chapter 4. Further, this dissertation has also investigated the robustness of representation learning models in a cross-domain setting, such as with different source and target datasets. The multi-label CNN for AU recognition has been trained on the DISFA and BP4D datasets and also evaluated in a cross-dataset setting as described in Chapter 3, demonstrating the robustness of the approach. Similarly, the Affect-LM model discussed in Chapter 6 has also been evaluated in a cross-domain setting, on different types of conversational corpora such as SEMAINE and MOSI.

7.2 Future Work

In this section, three interesting problem statements are presented, which can be investigated as extensions to the approaches already discussed in this dissertation. The interested reader should note that while many open problems do exist in the research area of representation learning for human affect understanding, the problems stated below have been selected not only for their research worthiness, but also for their potential societal impact and value.

• Multimodal Learning for Affective Knowledge Mining: A large fraction of multimodal conversational data resides not only in datasets such as the IEMOCAP and Fisher corpora, which are used for model training in this dissertation work, but also in other repositories such as YouTube, movies and user comments on social media websites. Most of these data are unlabeled, and thus a potential source of data for unsupervised and semi-supervised learning algorithms. The multimodal autoencoder model described in Chapter 5 showed how alignment and fusion can be performed in a similar framework; thus an interesting extension of this work would be to automatically learn the co-occurrence and similarity between spoken utterances, facial expressions and acoustic prosodic features from the vast (and potentially ever increasing) amounts of data available online. This would also enable the mapping of text, video and audio into the same modality, and in a data mining scenario, this could be integrated with structured sources such as knowledge graphs and ontologies, thus providing affective meaning to real-world entities in addition to semantics. For example, the entities Hurricane Harvey and The Pink Panther carry distinct emotional information, which can be learnt from the sentiment expressed by individuals in relevant tweets, or from facial expressions when people talk about these entities in videos.
The design of an intelligent tutoring system or an interactive search engine could benefit from such an affective multimodal knowledge graph. Such systems can utilize the affective meaning of entities to generate more anthropomorphic responses to queries during interactions with human users.

• Deep Generative Models for Speech Production: There have been impressive strides by the deep learning community in the field of speech synthesis, for example Lyrebird for voice mimicry 1 and Tacotron 2 released by Google 2. Such systems employ large sequential neural networks which have been learnt end-to-end using vast amounts of training speech data with accompanying transcriptions. However, there has been limited progress towards building generative models for affective speech, particularly because of the limited availability of large emotionally expressive voice repositories. Besides, none of these systems leverage the well-known speech production mechanism which has been described in Chapter 2. Modern deep generative models such as VAEs (Variational Autoencoders) and GANs (Generative Adversarial Networks) are grounded in a probabilistic interpretation, which raises the research question of whether the unsupervised framework for glottal inverse filtering described in Chapter 4 can be integrated with these models. This would also enable the disentanglement of factors of variation in the acoustic modality, such as separating speaker identity and verbal information from the non-verbal information contained in the glottal flow signal. Other potential applications of such a model include emotional speech synthesis and unsupervised non-verbal representation learning.

• Towards Anthropomorphic Animated Conversational Agents: A human-like conversational agent should not only be able to sense verbal information and non-verbal cues from a human user in all modalities (visual, acoustic and spoken words), but should also be able to generate appropriate responses. This would leverage models not only for emotion sensing, as described in Chapters 3 and 4, but also language models such as Affect-LM/Speech-LM and multimodal fusion through representation learning as in Chapter 5. While this is an ambitious goal, it is noteworthy that research is already being performed not only in deep conversational modeling and chatbots (Asghar et al., 2017; Vinyals & Le, 2015), but also on the desirable anthropomorphic properties of such agents, such as in virtual humans (Bailenson et al., 2005; Astrid et al., 2010). For example, NADiA, a basic conversational agent currently under development in the author's research group, is deployed as an Android smartphone app and can sense the user's facial expressions and input utterances in real-time. After detecting the user's primary emotion, the inferred emotion category is provided as input to a facial mimicry module for virtual human animation, with Affect-LM also being utilized to generate emotionally expressive responses. MTurk studies conducted so far have indicated that NADiA achieves an improved level of human-like performance compared to existing chatbots like CleverBot 3 and produces comparable behavior to human-generated reference outputs.

1 https://lyrebird.ai/
2 https://research.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html
Future work could focus not only on more accurate sensing of an user’s emotion and intents, but also the construction of perceptually meaningful gener- ative models, which would necessitate evaluation not only in terms of objective metrics such as accuracy, AUC (Area Under Curve) and F-1 scores, but also per- ceptual user studies to rate how anthropomorphic the responses are in terms of generation/mimicry quality. Another important challenge in this direction is the implementation of such models under resource constrained environments such as mobile GPUs (Graphics Processing Units) or the absence of large-scale cloud infrastructure. Hardware such as the NVIDIA Jetson TX2 4 and the Amazon AWS DeepLens 5 have also been introduced with support for specialized deep learning frameworks such as Tensorflow and PyTorch 2 and the deployment of conversational agents on such platforms would be of significant research interest. This dissertation has addressed the problem of representation learning of affect from facial images and speech, achieving performance competitive with standard feature extractors. IMA (Importance-based Multimodal Autoencoder) is proposed for multimodal representation learning, which learns modality-specific importance networks for weighted multimodal fusion and alignment. Representations learnt by IMA outperform baseline fusion approaches on classification and retrieval tasks for digit and emotion recognition. Two novel neural language models, Affect-LM and Speech-LM are also introduced which improve language modeling performance by integrating the linguistic context with verbal affective and non-verbal acoustic cues. Affect-LM also generates emotionally expressive text at acceptable levels 3 http://www.cleverbot.com/ 4 https://developer.nvidia.com/embedded/buy/jetson-tx2 5 https://aws.amazon.com/deeplens/ 146 of grammatical correctness. As future work, the approaches in this dissertation can be extended to research problems such as the design of anthropomorphic ani- mated conversational agents, deep generative models of speech production, and the integration of multimodal representation learning with knowledge bases and ontologies. At the time of writing this dissertation, some of these emerging ideas are not only ripe for being investigated in the machine perception and affective computing research communities, but also have the potential to be developed into socially impactful technological products. 147 Reference List Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M et al. (2016) Tensorflow: A system for large-scale machine learning In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Savannah, Georgia, USA. Airaksinen M, Raitio T, Alku P (2015) Noise robust estimation of the voice source using a deep neural network In Acoustics, Speech and Signal Process- ing (ICASSP), 2015 IEEE International Conference on, pp. 5137–5141. AirasM,AlkuP(2007) Comparisonofmultiplevoicesourceparametersindifferent phonation types In Proceedings of Interspeech 2007. Alku P (1992) Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech communication 11:109–118. Asghar N, Poupart P, Hoey J, Jiang X, Mou L (2017) Affective neural response generation. arXiv preprint arXiv:1709.03968 . Astrid M, Krämer NC, Gratch J, Kang SH (2010) âĂIJit doesnâĂŹt matter what you are!âĂİ explaining social effects of agents and avatars. Computers in Human Behavior 26:1641–1650. 
Atrey PK, Hossain MA, El Saddik A, Kankanhalli MS (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia systems 16:345–379. Baccianella S, Esuli A, Sebastiani F (2010) Sentiwordnet 3.0: an enhanced lex- ical resource for sentiment analysis and opinion mining. In LREC, Vol. 10, pp. 2200–2204. Bachorowski JA (1999) Vocal expression and perception of emotion. Current directions in psychological science 8:53–57. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 . 148 Bailenson JN, Swinth K, Hoyt C, Persky S, Dimov A, Blascovich J (2005) The independent and interactive effects of embodied-agent appearance and behavior on self-report, cognitive, and behavioral markers of copresence in immersive vir- tual environments. Presence: Teleoperators & Virtual Environments 14:379–393. Baltrušaitis T, Ahuja C, Morency LP (2018) Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence . Baltrušaitis T, Mahmoud M, Robinson P (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 6, pp. 1–6. IEEE. Bartlett MS, Viola PA, Sejnowski TJ, Golomb BA, Larsen J, Hager JC, Ekman P (1996) Classifying facial action In Advances in neural information processing systems, pp. 823–829. Batliner A, Steidl S, Nöth E (2008) Releasing a thoroughly annotated and pro- cessed spontaneous emotional database: the fau aibo emotion corpus In Proc. of a Satellite Workshop of LREC, Vol. 2008, p. 28. Battenberg E, Chen J, Child R, Coates A, Gaur Y, Li Y, Liu H, Satheesh S, Seetapun D, Sriram A et al. (2017) Exploring neural transducers for end-to-end speech recognition. arXiv preprint arXiv:1707.07413 . Bazzo JJ, Lamar MV (2004) Recognizing facial actions using gabor wavelets with neutralfaceaveragedifference InAutomatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pp. 505–510. IEEE. Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelli- gence 35:1798–1828. BengioY,DucharmeR,VincentP,JauvinC(2003) Aneuralprobabilisticlanguage model. Journal of machine learning research 3:1137–1155. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13:281–305. Bergstra JS, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper- parameter optimization In Advances in neural information processing systems, pp. 2546–2554. 149 Bird S, Loper E (2004) Nltk: the natural language toolkit In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, p. 31. Association for Computational Linguistics. Bishay M, Patras I (2017) Fusing multilabel deep networks for facial action unit detection In Automatic Face & Gesture Recognition (FG 2017), 2017 12th IEEE International Conference on, pp. 681–688. IEEE. Bishop CM (2006) Pattern recognition and machine learning . Bowman SR, Vilnis L, Vinyals O, Dai A, Jozefowicz R, Bengio S (2016) Gener- ating sentences from a continuous space In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21. Buhrmester M, Kwang T, Gosling SD (2011) Amazon’s mechanical turk a new source of inexpensive, yet high-quality, data? 
Perspectives on psychological science 6:3–5. Burgoon JK, Saine T et al. (1978) The unspoken dialogue: An introduction to nonverbal communication Houghton Mifflin Boston. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation 42:335. Busso C, Lee S, Narayanan SS (2007) Using neutral speech models for emotional speech analysis In Eighth Annual Conference of the International Speech Com- munication Association. Carletta J (2007) Unleashing the killer corpus: experiences in creating the multi- everything ami meeting corpus. Language Resources and Evaluation 41:181–190. Casamitjana A, Sundin M, Ghosh P, Chatterjee S (2015) Bayesian learning for time-varying linear prediction of speech In Signal Processing Conference (EUSIPCO), 2015 23rd European, pp. 325–329. Cassell J, Nakano YI, Bickmore TW, Sidner CL, Rich C (2001) Non-verbal cues for discourse structure In Proceedings of the 39th Annual Meeting on Associa- tion for Computational Linguistics, pp. 114–123. Association for Computational Linguistics. Chang J, Scherer S (2017) Learning representations of emotional speech with deep convolutional generative adversarial networks In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 2746–2750. IEEE. 150 Chapelle O, Haffner P, Vapnik VN (1999) Support vector machines for histogram- based image classification. IEEE transactions on Neural Networks 10:1055–1064. Chatterjee M, Park S, Morency LP, Scherer S (2015) Combining two perspectives on classifying multimodal data for recognizing speaker traits In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 7–14. ACM. Chen YN, Celikyilmaz A, Hakkani-Tür D (2017) Deep learning for dialogue sys- tems. Proceedings of ACL 2017, Tutorial Abstracts pp. 8–14. Chetupalli SR, Sreenivas TV (2014) Time varying linear prediction using sparsity constraints In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6290–6293. Chu WS, De la Torre F, Cohn JF (2016) Modeling spatial and temporal cues for multi-label facial action unit detection. arXiv preprint arXiv:1608.00911 . Cieri C, Miller D, Walker K (2004) The fisher corpus: a resource for the next generations of speech-to-text. In LREC, Vol. 4, pp. 69–71. Cutler A, Dahan D, Van Donselaar W (1997) Prosody in the comprehension of spoken language: A literature review. Language and speech 40:141–201. Cutler A, Norris D (1988) The role of strong syllables in segmentation for lexi- cal access. Journal of Experimental Psychology: Human perception and perfor- mance 14:113–121. Dahl G, Mohamed Ar, Hinton GE et al. (2010) Phone recognition with the mean- covariance restricted boltzmann machine In Advances in neural information processing systems, pp. 469–477. Dapogny A, Bailly K, Dubuisson S (2017) Confidence-weighted local expres- sion predictions for occlusion handling in expression recognition and action unit detection. International Journal of Computer Vision pp. 1–17. Degottex G, Kane J, Drugman T, Raitio T, Scherer S (2014) CovarepâĂŤa collaborative voice analysis repository for speech technologies In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 960–964. IEEE. Deng J, Zhang Z, Eyben F, Schuller B (2014) Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Processing Let- ters 21:1068–1072. 
151 DeVito J (1995) The interpersonal communication book. IEEE Intelligent Sys- tems . Díaz-Uriarte R, De Andres SA (2006) Gene selection and classification of microar- ray data using random forest. BMC bioinformatics 7:3. Ekman P, Friesen WV (1978) Manual for the facial action coding system Consult- ing Psychologists Press. Eleftheriadis S, Rudovic O, Pantic M (2015) Multi-conditional latent variable model for joint facial action unit detection In Proceedings of the IEEE Interna- tional Conference on Computer Vision, pp. 3792–3800. Ephratt M (2011) Linguistic, paralinguistic and extralinguistic speech and silence. Journal of pragmatics 43:2286–2307. Eyben F, Wöllmer M, Schuller B (2010) Opensmile: the munich versatile and fast open-source audio feature extractor In Proceedings of the 18th ACM interna- tional conference on Multimedia, pp. 1459–1462. ACM. Ezzat T, Poggio T (2008) Discriminative word-spotting using ordered spectro- temporal patch features. In SAPA@ INTERSPEECH, pp. 35–40. Citeseer. Fant G (1971) Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, Vol. 2 Walter de Gruyter. Feng X, Zhang Y, Glass J (2014) Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 1759–1763. IEEE. Filippi P, Ocklenburg S, Bowling DL, Heege L, Güntürkün O, Newen A, de Boer B (2016) More than words (and faces): evidence for a stroop effect of prosody in emotion word processing. Cognition and Emotion pp. 1–13. Fujisaki H, Ljungqvist M (1986) Proposal and evaluation of models for the glottal source waveform In Acoustics, Speech, and Signal Processing, IEEE Interna- tional Conference on ICASSP’86., Vol. 11, pp. 1605–1608. IEEE. Ghosh S (2015a) Challenges in deep learning for multimodal applications In Proceedings of the 2015 ACM on International Conference on Multimodal Inter- action, pp. 611–615. ACM. Ghosh S (2015b) Challenges in deep learning for multimodal applications In Proceedings of the 2015 ACM on International Conference on Multimodal Inter- action, ICMI ’15, pp. 611–615, New York, NY, USA. ACM. 152 Ghosh S, Chatterjee M, Morency LP (2014) A multimodal context-based approach for distress assessment In Proceedings of the 16th International Conference on Multimodal Interaction, pp. 240–246. ACM. Ghosh S, Chollet M, Laksana E, Morency LP, Scherer S (2017) Affect-lm: A neural language model for customizable affective text generation In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Accepted). Ghosh S, Laksana E, Morency LP, Scherer S (2016a) Learning representations of affect from speech. arXiv preprint arXiv:1511.04747, International Conference on Learning Representations (ICLR) 2016 Workshop . Ghosh S, Laksana E, Morency LP, Scherer S (2016b) Representation learning for speech emotion recognition. Interspeech 2016 pp. 3603–3607. Ghosh S, Laksana E, Scherer S, Morency LP (2015) A multi-label convolutional neural network approach to cross-domain action unit detection In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on, pp. 609–615. IEEE. GiacobelloD,ChristensenMG,MurthiMN,JensenSH,MoonenM(2010) Enhanc- ing sparsity in linear prediction of speech by iteratively reweighted 1-norm min- imization In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 4650–4653. 
Giri R, Rao B (2014) Block sparse excitation based all-pole modeling of speech In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 3754–3758. Gizatdinova Y, Surakka V (2006) Feature-based detection of facial landmarks from neutral and expressive facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28:135–139. Gobl C, Chasaide AN (2003) The role of voice quality in communicating emotion, mood and attitude. Speech communication 40:189–212. GoodfellowI,Pouget-AbadieJ,MirzaM,XuB,Warde-FarleyD,OzairS,Courville A, Bengio Y (2014) Generative adversarial nets In Advances in neural informa- tion processing systems, pp. 2672–2680. Gratch J, Artstein R, Lucas GM, Stratou G, Scherer S, Nazarian A, Wood R, Boberg J, DeVault D, Marsella S et al. (2014) The distress analysis interview corpus of human and computer interviews. In LREC, pp. 3123–3128. Citeseer. 153 Graves A, Mohamed Ar, Hinton G (2013) Speech recognition with deep recurrent neural networks In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE. Gudi A, Tasli HE, den Uyl TM, Maroulis A (2015) Deep learning based facs action unit occurrence and intensity estimation In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 6, pp. 1–5. IEEE. Han K, Yu D, Tashev I (2014) Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of INTERSPEECH, ISCA, Singapore pp. 223–227. Han S, Meng Z, KHAN AS, Tong Y (2016) Incremental boosting convolutional neural network for facial action unit recognition In Advances in Neural Infor- mation Processing Systems, pp. 109–117. Harwath D, Torralba A, Glass J (2016) Unsupervised learning of spoken language with visual context In Advances in Neural Information Processing Systems, pp. 1858–1866. HeK,ZhangX,RenS,SunJ(2016) Deepresiduallearningforimagerecognition In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Hedberg N, Sosa JM (2002) The prosody of questions in natural discourse In Speech Prosody 2002, International Conference. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. science 313:504–507. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computa- tion 9:1735–1780. HuZ,YangZ,LiangX,SalakhutdinovR,XingEP(2017) Towardcontrolledgener- ation of text In International Conference on Machine Learning, pp. 1587–1596. Huang CW, Narayanan SS (2016) Attention assisted discovery of sub-utterance structure in speech emotion recognition. In INTERSPEECH, pp. 1387–1391. Huang EH, Socher R, Manning CD, Ng AY (2012) Improving word representations via global context and multiple word prototypes In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers- Volume 1, pp. 873–882. Association for Computational Linguistics. 154 Huang X, Acero A, Hon HW, Foreword By-Reddy R (2001) Spoken language processing: A guide to theory, algorithm, and system development Prentice Hall PTR. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 . Ishii K, Reyes JA, Kitayama S (2003) Spontaneous attention to word content versus emotional tone: Differences among three cultures. Psychological Sci- ence 14:39–46. Jaitly N, Hinton GE (2011) A new way to learn acoustic events. 
Advances in Neural Information Processing Systems 24. JeniLA,CohnJF,DeLaTorreF(2013) Facingimbalanceddata–recommendations for the use of performance metrics In Affective Computing and Intelligent Inter- action (ACII), 2013 Humaine Association Conference on, pp. 245–251. IEEE. Jia Y, Salzmann M, Darrell T (2010) Factorized latent spaces with structured sparsity In Advances in Neural Information Processing Systems, pp. 982–990. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embed- ding In Proceedings of the 22nd ACM international conference on Multimedia, pp. 675–678. ACM. Jiang B, Martinez B, Valstar MF, Pantic M (2014) Decision level fusion of domain specificregionsforfacialactionrecognition In Pattern Recognition (ICPR), 2014 22nd International Conference on, pp. 1776–1781. IEEE. Jin Q, Li C, Chen S, Wu H (2015) Speech emotion recognition with acoustic and lexical features In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4749–4753. IEEE. Jozefowicz R, Vinyals O, Schuster M, Shazeer N, Wu Y (2016) Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410 . Juslin PN, Scherer KR, Harrigan J, Rosenthal R, Scherer K (2005) Vocal expression of affect. The new handbook of methods in nonverbal behavior research pp. 65–135. Kane J, Scherer S, Morency LP, Gobl C (2013a) A comparative study of glot- tal open quotient estimation techniques In Proceedings of Interspeech 2013, pp. 1658–1662. ISCA. 155 Kane J, Scherer S, Aylett M, Morency LP, Gobl C (2013b) Speaker and language independent voice quality classification applied to unlabelled corpora of expres- sive speech In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 7982–7986. Keshtkar F, Inkpen D (2011) A pattern-based model for generating text to express emotion InAffective Computing and Intelligent Interaction, pp. 11–21.Springer. Khorrami P, Paine T, Huang T (2015) Do deep neural networks learn facial action units when doing expression recognition? In Proceedings of the IEEE Interna- tional Conference on Computer Vision Workshops, pp. 19–27. Kim Y, Provost EM (2013) Emotion classification via utterance-level dynamics: A pattern-based approach to characterizing affective expressions In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 3677–3681. IEEE. Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 . Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models In Proceedings of the 31st International Conference on Machine Learning (ICML- 14), pp. 595–603. Kominek J, Black AW (2004) The cmu arctic speech databases In Fifth ISCA Workshop on Speech Synthesis. Koolagudi SG, Rao KS (2012) Emotion recognition from speech: a review. Inter- national journal of speech technology 15:99–117. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks In Advances in neural information processing systems, pp. 1097–1105. Lee J, Tashev I (2015) High-level feature representation using recurrent neural network for speech emotion recognition In Sixteenth Annual Conference of the International Speech Communication Association. Leonard R (1984) A database for speaker-independent digit recognition In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’84., Vol. 9, pp. 328–331. IEEE. 
Li L, Zhao Y, Jiang D, Zhang Y, Wang F, Gonzalez I, Valentin E, Sahli H (2013) Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion recognition In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pp. 312–317. IEEE.
Li X, Ji Q (2005) Active affective state detection and user assistance with dynamic bayesian networks. IEEE transactions on systems, man, and cybernetics - part A: systems and humans 35:93–105.
Liberman M (1993) Ti46-word spoken digits corpus In LDC (Linguistic Data Consortium), https://catalog.ldc.upenn.edu/ldc93s9.
Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks In Signal and information processing association annual summit and conference (APSIPA), 2016 Asia-Pacific, pp. 1–4. IEEE.
Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, Bartlett M (2011) The computer expression recognition toolbox (cert) In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pp. 298–305. IEEE.
Louizos C, Swersky K, Li Y, Welling M, Zemel R (2015) The variational fair autoencoder. arXiv preprint arXiv:1511.00830.
Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010) The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 94–101. IEEE.
Maaten Lvd, Hinton G (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9:2579–2605.
Mahamood S, Reiter E (2011) Generating affective natural language for parents of neonatal infants In Proceedings of the 13th European Workshop on Natural Language Generation, pp. 12–21. Association for Computational Linguistics.
Mairal J, Bach F, Ponce J, Sapiro G (2009) Online dictionary learning for sparse coding In Proceedings of the 26th annual international conference on machine learning, pp. 689–696. ACM.
Mairesse F, Walker M (2007) Personage: Personality generation for dialogue.
Makhzani A, Shlens J, Jaitly N, Goodfellow IJ (2015) Adversarial autoencoders. CoRR abs/1511.05644.
Mao Q, Dong M, Huang Z, Zhan Y (2014) Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia 16:2203–2213.
Martinez HP, Bengio Y, Yannakakis GN (2013) Learning deep physiological models of affect. IEEE Computational Intelligence Magazine 8:20–33.
Masci J, Meier U, Fricout G, Schmidhuber J (2013) Multi-scale pyramidal pooling network for generic steel defect classification In Neural Networks (IJCNN), The 2013 International Joint Conference on, pp. 1–8. IEEE.
Mavadati SM, Mahoor MH, Bartlett K, Trinh P, Cohn JF (2013) Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4:151–160.
McKeown G, Valstar M, Cowie R, Pantic M, Schroder M (2012) The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing 3:5–17.
Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In Interspeech, Vol. 2, p. 3.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems, pp. 3111–3119.
Mohammad SM, Turney PD (2010) Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon In Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pp. 26–34. Association for Computational Linguistics.
Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY (2011) Multimodal deep learning In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696.
Ni J, Lipton ZC, Vikram S, McAuley J (2017) Estimating reactions and recommending products with generative models of reviews In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 783–791.
Nygaard LC, Herold DS, Namy LL (2009) The semantics of prosody: Acoustic and perceptual evidence of prosodic correlates to word meaning. Cognitive Science 33:127–146.
Nygaard LC, Lunders ER (2002) Resolution of lexical ambiguity by emotional tone of voice. Memory & cognition 30:583–593.
Nygaard LC, Queen JS (2008) Communicating emotion: Linking affective prosody and word meaning. Journal of Experimental Psychology: Human Perception and Performance 34:1017.
Ogata K (1995) Discrete-time control systems, Vol. 2 Prentice Hall Englewood Cliffs, NJ.
Ovtcharov K, Ruwase O, Kim JY, Fowers J, Strauss K, Chung ES (2015) Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper 2.
Pantic M, Valstar M, Rademaker R, Maat L (2005) Web-based database for facial expression analysis In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pp. 5–pp. IEEE.
Park S, Scherer S, Gratch J, Carnevale P, Morency LP (2013) Mutual behaviors during dyadic negotiation: Automatic prediction of respondent reactions In Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pp. 423–428. IEEE.
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks In International Conference on Machine Learning, pp. 1310–1318.
Pennebaker JW, Francis ME, Booth RJ (2001) Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates 71:2001.
Pérez-Rosas V, Mihalcea R, Morency LP (2013) Utterance-level multimodal sentiment analysis In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 973–982.
Picard R (1997) Affective computing, Vol. 252 MIT press Cambridge.
Rajagopalan SS, Morency LP, Baltrusaitis T, Goecke R (2016) Extending long short-term memory for multi-view structured learning In European Conference on Computer Vision, pp. 338–353. Springer.
Rajeswar S, Subramanian S, Dutil F, Pal C, Courville A (2017) Adversarial generation of natural language. arXiv preprint arXiv:1705.10929.
Ringeval F, Schuller B, Valstar M, Gratch J, Cowie R, Scherer S, Mozgai S, Cummins N, Schmitt M, Pantic M (2017) Avec 2017: Real-life depression, and affect recognition workshop and challenge In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pp. 3–9. ACM.
Rosas VP, Mihalcea R, Morency LP (2013) Multimodal sentiment analysis of spanish online videos. IEEE Intelligent Systems 28:38–45.
Sahu S, Gupta R, Sivaraman G, AbdAlmageed W, Espy-Wilson C (2017) Adversarial auto-encoders for speech based emotion recognition. Proc. Interspeech 2017 pp. 1243–1247.
Sainath TN, Kingsbury B, Mohamed Ar, Ramabhadran B (2013) Learning filter banks within a deep neural network framework In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 297–302. IEEE.
Sainath TN, Kingsbury B, Mohamed Ar, Saon G, Ramabhadran B (2014) Improvements to filterbank and delta learning within a deep neural network framework In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6839–6843. IEEE.
Salakhutdinov R, Hinton G (2007) Learning a nonlinear embedding by preserving class neighbourhood structure In Artificial Intelligence and Statistics, pp. 412–419.
Sandbach G, Zafeiriou S, Pantic M (2012) Local normal binary patterns for 3d facial action unit detection In Image Processing (ICIP), 2012 19th IEEE International Conference on, pp. 1813–1816. IEEE.
Scherer KR (2003) Vocal communication of emotion: A review of research paradigms. Speech communication 40:227–256.
Scherer KR (2005) What are emotions? and how can they be measured? Social science information 44:695–729.
Scherer KR, Banse R, Wallbott HG, Goldbeck T (1991) Vocal cues in emotion encoding and decoding. Motivation and emotion 15:123–148.
Scherer KR, Bänziger T, Roesch E (2010) A Blueprint for Affective Computing: A sourcebook and manual Oxford University Press.
Scherer S, Stratou G, Gratch J, Morency LP (2013) Investigating voice quality as a speaker-independent indicator of depression and ptsd. In Interspeech, pp. 847–851.
Scherer S, Stratou G, Morency LP (2013) Audiovisual behavior descriptors for depression assessment In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 135–140. ACM.
Schirmer A, Kotz SA (2003) Erp evidence for a sex-specific stroop effect in emotional speech. Journal of cognitive neuroscience 15:1135–1148.
Schroeder MR, Atal BS (1985) Code-excited linear prediction (celp): High-quality speech at very low bit rates In IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985., pp. 937–940.
Schuller B (2002) Towards intuitive speech interaction by the integration of emotional aspects In Systems, Man and Cybernetics, 2002 IEEE International Conference on, Vol. 6, pp. 6–pp. IEEE.
Schuller B, Rigoll G, Lang M (2004) Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture In Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP'04). IEEE International Conference on, Vol. 1, pp. I–577. IEEE.
Shao M, Ding Z, Fu Y (2015) Sparse low-rank fusion based deep features for missing modality face recognition In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1, pp. 1–6. IEEE.
Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, Dean J (2017) Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
Sikka K, Dykstra K, Sathyanarayana S, Littlewort G, Bartlett M (2013) Multiple kernel learning for emotion recognition in the wild In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 517–524. ACM.
Sondhi M, Gopinath B (1971) Determination of vocal-tract shape from impulse response at the lips. The Journal of the Acoustical Society of America 49:1867–1873.
Song Y, McDuff D, Vasisht D, Kapoor A (2015) Exploiting sparsity and co-occurrence structure for action unit recognition In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1, pp. 1–8. IEEE.
Song Y, Morency LP, Davis R (2012) Multimodal human behavior analysis: learning correlation and interaction across modalities In Proceedings of the 14th ACM international conference on Multimodal interaction, pp. 27–30. ACM.
Srivastava N, Salakhutdinov RR (2012) Multimodal learning with deep boltzmann machines In Advances in neural information processing systems, pp. 2222–2230.
Stolcke A (2002) Srilm - an extensible language modeling toolkit. In Interspeech, Vol. 2002, p. 2002.
Stolcke A, Ries K, Coccaro N, Shriberg E, Bates R, Jurafsky D, Taylor P, Martin R, Van Ess-Dykema C, Meteer M (2000) Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics 26:339–373.
Sundermeyer M, Schlüter R, Ney H (2012) Lstm neural networks for language modeling. In Interspeech, pp. 194–197.
Suzuki M, Nakayama K, Matsuo Y (2016) Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Tahon M, Degottex G, Devillers L (2012) Usual voice quality features and glottal features for emotional valence detection In Speech Prosody 2012.
Tian YI, Kanade T, Cohn JF (2001) Recognizing action units for facial expression analysis. IEEE Transactions on pattern analysis and machine intelligence 23:97–115.
Tian YL, Kanade T, Cohn JF (2005) Facial expression analysis In Handbook of face recognition, pp. 247–275. Springer.
Trigeorgis G, Ringeval F, Brueckner R, Marchi E, Nicolaou MA, Schuller B, Zafeiriou S (2016) Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5200–5204. IEEE.
Valstar MF, Almaev T, Girard JM, McKeown G, Mehu M, Yin L, Pantic M, Cohn JF (2015) Fera 2015 - second facial expression recognition and analysis challenge In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 6, pp. 1–8. IEEE.
Vargas MF (1986) Louder than words: An introduction to nonverbal communication Iowa State Pr.
Vedantam R, Fischer I, Huang J, Murphy K (2017) Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762.
Vincent P, Larochelle H, Bengio Y, Manzagol PA (2008) Extracting and composing robust features with denoising autoencoders In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. ACM.
Vinyals O, Le QV (2015) A neural conversational model. CoRR abs/1506.05869.
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wan L, Zeiler M, Zhang S, Le Cun Y, Fergus R (2013) Regularization of neural networks using dropconnect In International Conference on Machine Learning, pp. 1058–1066.
Wang C (2001) Prosodic modeling for improved speech recognition and understanding Ph.D. diss., Massachusetts Institute of Technology.
Wang J, Wang X, Li F, Xu Z, Wang Z, Wang B (2017) Group linguistic bias aware neural response generation In Proceedings of the 9th SIGHAN Workshop on Chinese Language Processing, pp. 1–10.
Wang W, Lee H, Livescu K (2016) Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454.
Wang ZQ, Tashev I (2017) Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 5150–5154. IEEE.
Wang Z, Li Y, Wang S, Ji Q (2013) Capturing global semantic relationships for facial action unit recognition In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 3304–3311. IEEE.
Weninger F, Bergmann J, Schuller B (2015) Introducing currennt: The munich open-source cuda recurrent neural network toolkit. The Journal of Machine Learning Research 16:547–551.
Wu D, Parsons TD, Narayanan SS (2010) Acoustic feature analysis in speech emotion primitives estimation In Eleventh Annual Conference of the International Speech Communication Association.
Wu M, Goodman N (2018) Multimodal generative models for scalable weakly-supervised learning. arXiv preprint arXiv:1802.05335.
Wu T, Bartlett MS, Movellan JR (2010) Facial expression recognition using gabor motion energy filters In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pp. 42–47. IEEE.
Xia R, Liu Y (2013) Using denoising autoencoder for emotion recognition. In Interspeech, pp. 2886–2889.
Yu L, Zhang W, Wang J, Yu Y (2017) Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858.
Yuan J, Liberman M (2008) Speaker identification on the scotus corpus. Journal of the Acoustical Society of America 123:3878.
Zadeh A, Zellers R, Pincus E, Morency LP (2016) Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems 31:82–88.
Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE transactions on pattern analysis and machine intelligence 31:39–58.
Zhang X, Mahoor MH, Mavadati SM, Cohn JF (2014a) A l_p-norm mtmkl framework for simultaneous detection of multiple facial action units In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pp. 1104–1111. IEEE.
Zhang X, Yin L, Cohn JF, Canavan S, Reale M, Horowitz A, Liu P, Girard JM (2014b) Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32:692–706.
Zhou H, Huang M, Zhang T, Zhu X, Liu B (2017) Emotional chatting machine: emotional conversation generation with internal and external memory. arXiv preprint arXiv:1704.01074.
Appendix A
Appendix 1: Probabilistic Framework for Importance-based Multimodal Autoencoder Model (IMA)
In this appendix, a variational probabilistic framework for the proposed model IMA is derived. It is shown that with a proper selection of the generative and inference networks, the loss functions in the ELBO (Evidence Lower Bound) also include alignment terms for each modality.
These enable only M additional loss functions to handle missing modalities, instead of sub-sampled training with O(2^M) losses (Wu & Goodman, 2018) or 2^M separate networks (Suzuki et al., 2016). A loss function is also introduced in Section 5.2.3 for training importance networks to detect uncorrelated noise in the observed data. The framework described can be seamlessly extended to more than two modalities. Further, these approaches are entirely unsupervised, with no requirement of labeled attributes. Figure A.1 shows the flow of information between observed and latent variables for generation and inference respectively.
Figure A.1: Flow of information in generation (left) and inference (right) networks for the model which performs both alignment and fusion. z is the latent multimodal representation, u_1, ..., u_M denote the unimodal representations, and x_1, ..., x_M are the inputs for the M modalities.
A.0.1 Problem Statement
Assume there are N data-points and M modalities. In the j-th modality, the observed variable x_j denotes the observed data; the variable z denotes the shared latent variable, and u_j is a latent variable specific to the j-th modality. The dimensionality of z and of all u_j, for all j in {1, 2, ..., M}, is equal and denoted by K. For convenience of representation, notation is utilized to denote these variables collectively across modalities. Thus, while z is shared across all modalities, x = {x_1, x_2, ..., x_M} and u = {u_1, u_2, ..., u_M}. These variables factorize into modality-specific terms as seen subsequently in Section A.0.2.
A.0.2 Multimodal Variational Alignment and Fusion
From Figure A.1, it is apparent that the flow of information during inference does not have exact correspondence with that during generation. Specifically, x_j is generated by z and not by u_j. This configuration is necessary to maintain the autoencoder's bottleneck structure and also to derive the alignment losses, as shown below. The joint probability of all shared and latent variables over all modalities is given by:
\[ P(x, u, z) = P(z)\,P(u \mid z)\,P(x \mid z) \tag{A.1} \]
Imposing a standard prior on the shared latent variable as a multivariate Gaussian distributed random variable with zero mean and unit variance, the following holds:
\[ z \sim \mathcal{N}(0, I) \tag{A.2} \]
The conditional pdf (probability density function) of the private variable u_j for the j-th modality conditioned on the shared variable z is given by:
\[ P(u \mid z) = P(u_1 \mid z)\,P(u_2 \mid z)\cdots P(u_M \mid z) \tag{A.3} \]
\[ u_j \mid z \sim \mathcal{N}\big(u_j;\, z,\, \Sigma_{u_j}(z)\big), \quad \forall j \in \{1, 2, \ldots, M\} \tag{A.4} \]
Thus u, while being a separate latent variable, is obtained through a replication of the value contained in z (through the mean parameter), corrupted by Gaussian noise of covariance Σ_{u_j}(z). The generative pdf of the observed data conditioned on the latent factors for the j-th modality is given by:
\[ P(x \mid z) = P(x_1 \mid z)\,P(x_2 \mid z)\cdots P(x_M \mid z) \tag{A.5} \]
\[ x_j \mid z \sim \mathcal{N}\big(x_j;\, g_j(z),\, \Sigma_{x_j}(z)\big), \quad \forall j \in \{1, 2, \ldots, M\} \tag{A.6} \]
Following a standard generation process for a multimodal variational autoencoder, g_j(z) is a generative network which maps from the latent variable z to the observed data in the j-th modality. The observation is also corrupted by Gaussian noise of covariance matrix Σ_{x_j}(z). The posterior of the latent variable z conditioned on u = {u_1, u_2, ..., u_M} is modeled using an encoder network, as in VAEs (Kingma & Welling, 2013). By the mean-field assumption (Bishop, 2006), the posterior can be factorized into separate terms over the latent variables (each corresponding to a unimodal subnetwork). If each posterior term q(z | u_j) is modeled as a Gaussian distribution, then the overall posterior is a product of Gaussians, which has a closed-form expression for the resulting mean and covariance.
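Before turning to that posterior, the generative process of Equations A.1–A.6 can be made concrete with a minimal NumPy sketch of ancestral sampling. The dimensionalities, noise scales, and the linear decoders standing in for the generative networks g_j are hypothetical placeholders, not the networks used in this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 2, 8                      # number of modalities and latent dimensionality (hypothetical)
D_x = [20, 12]                   # per-modality observation sizes (hypothetical)

# Hypothetical linear decoders standing in for the generative networks g_j(z).
G = [rng.standard_normal((D_x[j], K)) for j in range(M)]

def sample_generative(sigma_u=0.1, sigma_x=0.05):
    """Ancestral sampling following Equations A.1-A.6 with isotropic covariances."""
    z = rng.standard_normal(K)                                   # z ~ N(0, I)              (A.2)
    u = [z + sigma_u * rng.standard_normal(K)                    # u_j | z ~ N(z, s_u^2 I)  (A.4)
         for _ in range(M)]
    x = [G[j] @ z + sigma_x * rng.standard_normal(D_x[j])        # x_j | z ~ N(g_j(z), s_x^2 I)  (A.6)
         for j in range(M)]
    return z, u, x

z, u, x = sample_generative()
```

Each modality observation x_j is thus a noisy decoding of the same shared z, while each u_j is a noisy copy of z, matching the factorization above.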
Such a product-of-Gaussian-experts posterior has been used in other work, for example in Wu & Goodman (2018).
\[ q(z \mid u) \sim q(z \mid u_1)\,q(z \mid u_2)\cdots q(z \mid u_M) \tag{A.7} \]
\[ q(z \mid u_j) = \mathcal{N}\big(z;\, u_j,\, \Sigma_{z_j}(u_j)\big) \tag{A.8} \]
\[ q(z \mid u) = \mathcal{N}(z;\, \mu_z,\, \Sigma_z) \tag{A.9} \]
\[ \mu_z = \frac{\sum_j \Sigma_{z_j}^{-1} u_j}{\sum_j \Sigma_{z_j}^{-1}}, \qquad \Sigma_z = \Big(\sum_j \Sigma_{z_j}^{-1}\Big)^{-1} \tag{A.10} \]
The expressions in Equation A.10 are obtained from the standard mean and covariance parameters for a product of Gaussian distributions. The posterior distribution of u_j conditioned on x_j is given by:
\[ q_{\theta_j}(u_j \mid x_j) = \mathcal{N}\big(u_j;\, f_{\theta_j}(x_j),\, \Sigma_{u_j}(x_j)\big) \tag{A.11} \]
f_{θ_j}(x_j) is a function operating on the input x_j, which is modeled by a unimodal subnetwork. Utilizing the factorizations for the generative and posterior distributions, Equation A.1 is divided by Equation A.3. The resulting ratio of the likelihood of all data P(x, z, u) to the posterior of the latent variables q_θ(z, u | x) conditioned on the observed data is given by:
\[ \left[\frac{P(z)}{q(z \mid u)}\right] \prod_{j=1}^{M} P(x_j \mid z) \prod_{j=1}^{M} \frac{P(u_j \mid z)}{q_{\theta_j}(u_j \mid x_j)} \]
The variational lower bound, commonly referred to as the ELBO, is standard in the variational inference literature (Kingma & Welling, 2013). For the framework described, it is given by the following expression:
\[ \mathcal{L} := \int_{z,u} q(z, u \mid x)\,\log\frac{P(z, u, x)}{q(z, u \mid x)}\,dz\,du \tag{A.12} \]
Expanding the logarithm and grouping terms, the variational lower bound L_auto is obtained as a sum of three main loss-function terms: (1) L_glob, which enforces a zero-mean unit-variance Gaussian on the overall multimodal representation; (2) L_rec^(j), which is the reconstruction loss for each modality; and (3) L_align^(j), which is the additional modality-specific alignment term, trying to bring each unimodal representation u_j closer to the overall multimodal representation z. This term also has the effect of bringing the modalities closer to each other through minimization. Mathematically,
\[ \mathcal{L}_{auto} := \mathcal{L}_{glob} + \sum_{j=1}^{M} \mathcal{L}^{(j)}_{rec}(x) + \sum_{j=1}^{M} \mathcal{L}^{(j)}_{align}(x). \]
The global latent KL regularization term for the multimodal embedding z is given by:
\[ \begin{aligned} \mathcal{L}_{glob} &= \int_{z,u} q(z, u \mid x)\,\log\frac{P(z)}{q(z \mid u)}\,dz\,du \\ &= \int_{u} q_{\theta}(u \mid x)\left[\int_{z} q(z \mid u)\,\log\frac{P(z)}{q(z \mid u)}\,dz\right]du \\ &= \int_{u} q_{\theta}(u \mid x)\left[-\int_{z} q(z \mid u)\,\log\frac{q(z \mid u)}{P(z)}\,dz\right]du \\ &= -\mathbb{E}_{q(u \mid x)}\big[D\big(q(z \mid u)\,\|\,P(z)\big)\big] \end{aligned} \]
D(q(z|u) || P(z)) is the KL-divergence between the posterior distribution q(z|u) and the prior P(z), and E_{q(u|x)}[D(q(z|u) || P(z))] is the expectation of this term computed according to the posterior q(u|x). Since both distributions are assumed to be Gaussian with dimensionality K, from Equations A.2 and A.9 the KL-divergence term is given by:
\[ \begin{aligned} D\big(q(z \mid u)\,\|\,P(z)\big) &= \frac{1}{2}\left( \operatorname{tr}\big(I^{-1}\Sigma_z\big) + (0 - \mu_z)^{T} I^{-1} (0 - \mu_z) - K + \ln\frac{\det I}{\det \Sigma_z} \right) \\ &= \frac{1}{2}\left( \operatorname{tr}(\Sigma_z) + \mu_z^{T}\mu_z - K + \ln\frac{\det I}{\det \Sigma_z} \right) \\ &= \frac{1}{2}\|\mu_z\|^{2} + \frac{1}{2}\operatorname{tr}(\Sigma_z) - \frac{K}{2} - \frac{1}{2}\ln(\det \Sigma_z) \end{aligned} \tag{A.13} \]
For simplicity, the covariance matrices are assumed to be diagonal, and the outputs of the network modeling q(z|u) are restricted only to the diagonal terms. While this may not be ideal from a theoretical standpoint, it reduces computational time by having fewer minimization terms. The reconstruction term L_rec^(j) for the j-th modality is given by:
\[ \begin{aligned} \mathcal{L}^{(j)}_{rec}(x) &= \int_{z,u} q(z, u \mid x)\,\log P(x_j \mid z)\,dz\,du \\ &= \int_{z,u} q(z \mid u)\,q_{\theta}(u \mid x)\,\log P(x_j \mid z)\,dz\,du \\ &= \int_{z,u} q(z \mid u)\left[\prod_{j'=1}^{M} q_{\theta_{j'}}(u_{j'} \mid x_{j'})\right]\log P(x_j \mid z)\,dz\,du \end{aligned} \]
The reconstruction term in each modality can be approximated through Monte Carlo sampling by first sampling each u_j from x_j and then subsequently obtaining the shared latent variable z. Backpropagation is possible through application of the reparameterization trick (Kingma & Welling, 2013).
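To make the fusion and the global regularizer concrete, the following is a minimal NumPy sketch of the product-of-Gaussians fusion in Equation A.10 and the closed-form KL term of Equation A.13, assuming diagonal covariances as stated above. The unimodal means and variances are toy placeholders; in practice they would come from the unimodal subnetworks, which are not shown here.

```python
import numpy as np

def poe_fuse(means, variances):
    """Precision-weighted fusion of unimodal posteriors q(z|u_j), Equation A.10.
    `means` and `variances` are lists of K-dimensional arrays (diagonal covariances)."""
    precisions = [1.0 / v for v in variances]            # Sigma_{z_j}^{-1} (diagonal)
    total_precision = np.sum(precisions, axis=0)         # sum_j Sigma_{z_j}^{-1}
    fused_var = 1.0 / total_precision                    # Sigma_z
    fused_mean = fused_var * np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return fused_mean, fused_var

def kl_to_standard_normal(mu_z, var_z):
    """Closed-form D(q(z|u) || N(0, I)) for diagonal Sigma_z, Equation A.13."""
    K = mu_z.shape[0]
    return (0.5 * np.sum(mu_z ** 2) + 0.5 * np.sum(var_z)
            - 0.5 * K - 0.5 * np.sum(np.log(var_z)))

# Toy usage with M = 2 modalities and K = 4 latent dimensions.
mu = [np.zeros(4), np.ones(4)]
var = [np.full(4, 0.5), np.full(4, 2.0)]
mu_z, var_z = poe_fuse(mu, var)
kl_glob = kl_to_standard_normal(mu_z, var_z)
```

The reparameterization trick mentioned above would then draw z_0 = mu_z + sqrt(var_z) * eps with eps sampled from N(0, I), so that gradients flow through mu_z and var_z into the unimodal subnetworks.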
The alignment loss between the representation u_j and the multimodal joint representation z is given by:
\[ \begin{aligned} \mathcal{L}^{(j)}_{align}(x) &= \int_{z,u} q(z, u \mid x)\,\log\frac{P(u_j \mid z)}{q_{\theta_j}(u_j \mid x_j)}\,dz\,du \\ &= \int_{z,u} q(z \mid u)\,q(u \mid x)\,\log\frac{P(u_j \mid z)}{q_{\theta_j}(u_j \mid x_j)}\,dz\,du \end{aligned} \]
To tractably evaluate this integral, a Monte Carlo approximation can be performed for the integration over z. A sample of z is drawn from the encoder network and then utilized for the approximation. Assuming that this sample of z is z_0, the alignment term can be written as:
\[ \mathcal{L}^{(j)}_{align}(x) \approx \int_{u} q(u \mid x)\,\log\frac{P(u_j \mid z_0)}{q_{\theta_j}(u_j \mid x_j)}\,du = \int_{u_j} q(u_j \mid x_j)\,\log\frac{P(u_j \mid z_0)}{q_{\theta_j}(u_j \mid x_j)}\,du_j \tag{A.14} \]
\[ = -D\big[q_{\theta}(u_j \mid x_j)\,\|\,P(u_j \mid z_0)\big] \tag{A.15} \]
Applying Equation A.4 and the same properties of the KL-divergence between multivariate Gaussians as in deriving the global regularization term, the term D[q_θ(u_j|x_j) || P(u_j|z_0)] can be expanded as follows:
\[ \begin{aligned} D\big[q_{\theta}(u_j \mid x_j)\,\|\,P(u_j \mid z_0)\big] = {} & \frac{1}{2}\operatorname{tr}\big[\Sigma_{u_j}^{-1}(z_0)\,\Sigma_{u_j}(x_j)\big] + \frac{1}{2}\big(z_0 - f_{\theta_j}(x_j)\big)^{T}\,\Sigma_{u_j}^{-1}(z_0)\,\big(z_0 - f_{\theta_j}(x_j)\big) \\ & - \frac{K}{2} + \frac{1}{2}\ln\frac{\det \Sigma_{u_j}(z_0)}{\det \Sigma_{u_j}(x_j)} \end{aligned} \tag{A.16} \]
This loss function appears for each modality, and its goal is to indirectly perform alignment between modalities, where each modality representation u_j = f_{θ_j}(x_j) is forced to be similar to z_0. In practice, the implementation assumes that Σ_{u_j}(z_0) ≈ Σ_{u_j}(x_j) ≈ Σ_{z_j} = ε²I for all j in {1, 2, ..., M}. This assumption of quasi-deterministic encoding with very small covariances (ε = 10⁻⁵) provides adequate performance with a simpler implementation.
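Under the quasi-deterministic assumption just stated, Σ_{u_j}(z_0) ≈ Σ_{u_j}(x_j), so the trace and log-determinant terms of Equation A.16 reduce to constants and the alignment penalty is dominated by the scaled squared distance between z_0 and f_{θ_j}(x_j). Below is a minimal NumPy sketch of this simplified per-modality alignment term; the unimodal encodings are hypothetical placeholders for the outputs of the unimodal subnetworks.

```python
import numpy as np

def alignment_kl(z_0, u_j, eps=1e-5):
    """Simplified per-modality alignment term from Equation A.16 under the
    assumption Sigma_{u_j}(z_0) = Sigma_{u_j}(x_j) = eps^2 * I.
    The trace term contributes K/2 and cancels the -K/2 term, the log-det ratio
    is zero, leaving ||z_0 - f_{theta_j}(x_j)||^2 / (2 * eps^2)."""
    diff = z_0 - u_j                     # z_0 - f_{theta_j}(x_j)
    return 0.5 * np.dot(diff, diff) / (eps ** 2)

# Toy usage: two hypothetical unimodal encodings compared against one fused sample z_0.
z_0 = np.array([0.10, -0.20, 0.05, 0.30])
u_audio = np.array([0.12, -0.18, 0.04, 0.28])
u_video = np.array([0.40, 0.10, -0.20, 0.00])
per_modality_losses = [alignment_kl(z_0, u) for u in (u_audio, u_video)]
```

With such a small ε the constant 1/(2ε²) only rescales the squared distance, so minimizing this term simply pulls every unimodal encoding toward the shared multimodal sample z_0, which is the alignment behavior described above.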