MULTIMODALITY, CONTEXT AND CONTINUOUS DYNAMICS FOR RECOGNITION AND ANALYSIS OF EMOTIONAL STATES, AND APPLICATIONS IN HEALTHCARE by Angeliki Metallinou A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) August 2013 Copyright 2013 Angeliki Metallinou This thesis is dedicated to my family. ii Acknowledgements I would like to thank my advisor Prof. Shrikanth Narayanan for his continuous advice and support that has always guided me in my research. Also, I am grateful to Dr. Nassos Katsamanis and my former labmate Prof. Carlos Busso for their advice and discussions that motivated me to explore new directions and ideas, and were an invaluable help for progressing my work. Thanks also go to Prof. Sungbok Lee and Prof. Ruth Grossman for their advice and collaboration. Finally, I would like to thank all my labmates at SAIL lab, for being great labmates, always eager to oer help and advice, but also for being great friends. I am really happy to have been part of the supportive and collaborative research environment of SAIL lab. I will always be grateful to my good friends here in LA, back in Greece and in other places of the world, for being there for me during both the happy and the stressful times of my PhD years. Finally and most importantly, I would like to thank my parents, my sister Margarita and my partner Aditya, for their love and continuous support, without which this work would not have been possible. iii Table of Contents Acknowledgements iii List of Figures vii List of Tables xi Abstract xiv Chapter 1: Introduction 1 1.1 Motivation and Problem Statement . . . . . . . . . . . . . . . . . . . . . 1 1.2 Context-Sensitive Emotion Recognition . . . . . . . . . . . . . . . . . . 6 1.3 Continuous Emotional Dynamics . . . . . . . . . . . . . . . . . . . . . . 9 1.4 Multimodality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.5 Applications in Healthcare: Computational Approaches for Autism Re- search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.6 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Chapter 2: Emotional Representations 19 2.1 Categorical Representations . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Continuous Dimensional Representations . . . . . . . . . . . . . . . . . . 20 Chapter 3: The USC CreativeIT Database: Collection and Annotation 24 3.1 Multimodal Emotional Databases and CreativeIT . . . . . . . . . . . . . 24 3.2 Theatrical Techniques for Eliciting Naturalistic Emotions . . . . . . . . 27 3.2.1 Active Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.2 Design of Data Collection . . . . . . . . . . . . . . . . . . . . . . 27 3.3 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3.1 Session Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3.2 Equipment and Technical Details . . . . . . . . . . . . . . . . . . 29 3.4 Emotional Data Annotation . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.4.1 Why Continuous Annotations? . . . . . . . . . . . . . . . . . . . 30 3.4.2 Challenges and Design Decisions . . . . . . . . . . . . . . . . . . 31 iv 3.4.3 Discrete Data annotation . . . . . . . . . . . . . . . . . . . . . . 33 3.5 Data annotation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.1 Annotator Agreement . . . . . . . . . . 
. . . . . . . . . . . . . . 35 3.5.2 Comparing Continuous and Discrete Annotations . . . . . . . . . 39 3.6 Open Questions and Research Directions . . . . . . . . . . . . . . . . . . 40 3.6.1 Improving tools for continuous annotations . . . . . . . . . . . . 40 3.6.2 Absolute vs Relative Ratings . . . . . . . . . . . . . . . . . . . . 41 3.6.3 Which attributes can be continuously rated? . . . . . . . . . . . 41 3.6.4 Multiple Subjective Ratings and Crowdsourcing . . . . . . . . . 42 3.6.5 Annotator-specic delays . . . . . . . . . . . . . . . . . . . . . . 43 3.6.6 Continuous ratings and saliency detection . . . . . . . . . . . . . 43 Chapter 4: Multimodal Feature Extraction 44 4.1 Face, Speech and Head Movement Feature Extraction . . . . . . . . . . 45 4.1.1 The IEMOCAP Database . . . . . . . . . . . . . . . . . . . . . . 45 4.1.2 Facial Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.3 Head Movement Features . . . . . . . . . . . . . . . . . . . . . . 50 4.1.4 Vocal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2 Body Language Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Body Language Feature Extraction . . . . . . . . . . . . . . . . . 54 4.2.2 Feature Selection Approaches . . . . . . . . . . . . . . . . . . . . 57 4.2.3 Analysis of Informative Body Language Features . . . . . . . . . 58 Chapter 5: Context-Sensitive, Hierarchical Approaches for Emotion Recognition 65 5.1 Context and Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2 Context-Sensitive Frameworks . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.1 Hierarchical Context Sensitive Frameworks . . . . . . . . . . . . 70 5.2.2 Hierarchical HMM classiers . . . . . . . . . . . . . . . . . . . . 72 5.2.3 BLSTM and RNN architectures . . . . . . . . . . . . . . . . . . . 75 5.2.4 Combination of HMM and BLSTM classiers . . . . . . . . . . . 77 5.3 Emotions and Emotion Transitions . . . . . . . . . . . . . . . . . . . . . 78 5.3.1 Valence and Activation . . . . . . . . . . . . . . . . . . . . . . . 79 5.3.2 Clusters in the Emotion Space . . . . . . . . . . . . . . . . . . . 81 5.3.3 Emotional Grammars . . . . . . . . . . . . . . . . . . . . . . . . 84 5.4 Feature Extraction and Fusion . . . . . . . . . . . . . . . . . . . . . . . 86 5.4.1 Audio-Visual Frame-level Feature Extraction . . . . . . . . . . . 86 5.4.2 Utterance-level Statistics of Audio-Visual Features . . . . . . . . 87 5.4.3 Audio-Visual Feature Fusion . . . . . . . . . . . . . . . . . . . . 89 5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.5.2 Context-Free vs Context-Sensitive Classiers . . . . . . . . . . . 91 5.5.3 Context-Sensitive Neural Network Classiers . . . . . . . . . . . 95 v 5.6 BLSTM Context Learning Behavior . . . . . . . . . . . . . . . . . . . . 96 5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.8 Extensions: Multimodality and Dialog Modeling . . . . . . . . . . . . . 103 5.8.1 Framework Overview . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.8.2 Experimental Setup and Results . . . . . . . . . . . . . . . . . . 107 5.8.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 Chapter 6: Tracking Continuous Emotional Trends Through Time 113 6.1 Continuous Tracking of Emotions and Emotional Trends . . . . . . . . . 113 6.2 Related Work . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . 115 6.3 Overview of our Emotion Tracking Experiments . . . . . . . . . . . . . . 116 6.3.1 Statistical Mapping Framework . . . . . . . . . . . . . . . . . . . 117 6.3.2 Database and Features . . . . . . . . . . . . . . . . . . . . . . . . 120 6.4 Tracking Emotion Trends at Multiple Resolutions . . . . . . . . . . . . . 121 6.4.1 GMM-based tracking at frame and window level . . . . . . . . . 121 6.4.2 Using LSTM neural networks for regression . . . . . . . . . . . . 123 6.4.3 Baseline based on simple functions of informative features . . . . 123 6.5 Experiments, Results and Discussion . . . . . . . . . . . . . . . . . . . . 125 6.5.1 Frame-level tracking using audio-visual information . . . . . . . . 127 6.5.2 Window-level tracking using audio-visual information . . . . . . 132 6.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 134 Chapter 7: Healthcare Applications: Analysis of Emotional Facial Expressions of Children with Autism Spectrum Disorders 137 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 7.3 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 7.4 Data Preparation and Transformation into Functional Data . . . . . . . 142 7.5 Analysis of Global Characteristics of Aective Expressions . . . . . . . . 144 7.6 Quantifying Expression-Specic Atypicality through fPCA . . . . . . . . 148 7.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . 152 Chapter 8: Conclusions and Future Work 154 8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 8.2 Current and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Bibliography 160 List of Publications 176 .1 Journal Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 .2 Conference Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 vi List of Figures 2.1 The Self Assessment Manikin used to rate the aective dimensions of valence (top row), activation (middle row) and dominance (bottom row), taken from [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.2 Example of a Feeltrace display during a tracking session, taken from [2]. The horizontal axis descibes the valence dimension and the vertical axis the activation dimension. The annotator can move the cursor (indicated by a circle) across this interface, to rate an emotional manifestation that is being diplayed in parallel on a separate window. . . . . . . . . . . . . 22 3.1 Acting continuum: From fully predetermined to fully undetermined [3] 26 3.2 Snapshots of an actor during data collection and the marker positions . 30 3.3 Screenshot of the modied Feeltrace interface. . . . . . . . . . . . . . . 34 3.4 Example of activation rating by three annotators. . . . . . . . . . . . . 36 4.1 Positions of the MoCap face markers and separation of the face into lower and upper facial regions. . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.2 The positions of the Motion Capture markers (a)-(b), and denitions of the body parts used in feature extraction (c) . . . . . . . . . . . . . . . 53 4.3 Examples of extracted features from MoCap markers. . . . . . . . . . . 54 vii 5.1 A summary of our classication systems under the proposed hierarchical, context-sensitive framework. 
At the lower (utterance) level, modeling of emotional utterances U t is performed through emotion-specic HMMs, as illustrated in the lower left part of the gure, or by computing statistical properties of each emotional class, as illustrated in the lower right part. At the higher level, which represents the conversation context, emotional ow between utteraces of a conversation C is modeled by an HMM or a Neural Network (Unidirectional or Bidirectional RNN or BLSTM). The dierent combinations of the approaches at lower and higher level, lead to the three systems that we describe in this work: 2-level HMM, neural networks (NNs) trained with feature functionals, and hybrid HMM/NN. 71 5.2 Sequential Viterbi Decoding passes. Viterbi Decoding is performed in se- quential subsequences (of length w+1) of the total utterance observation sequence. The labeling decision for utterance U t at time t is aected by the labeling decisions of w past and w future utterances. . . . . . . . . . 74 5.3 LSTM memory block consisting of one memory cell: the input, out- put, and forget gates collect activations from inside and outside the block which control the cell through multiplicative units (depicted as small cir- cles); input, output, and forget gates scale input, output, and internal states respectively; a i and a o denote activation functions; the recurrent connection of xed weight 1.0 maintains the internal state. . . . . . . . . 76 5.4 Structure of a bidirectional network with input i, output o, and two hidden layers (h f and h b ) for forward and backward processing. . . . . . 77 5.5 Analysis of activation and valence classes in terms of categorical labels for all utterances of the database: anger (ang), happiness (hap), excite- ment (exc), sadness (sad), frustration (fru), fear (fea), surprise (sur), disgust (dis), neutral (neu), other (oth), and no agreement (n.a.). We notice that the categorical tags are generally consistent with the activation and valence tags. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 5.6 Analysis of classes in the 3 cluster and 4 cluster tasks in terms of cat- egorical labels. The bars and the error bars correspond to the mean and standard deviation computed across the 10 folds. We notice that the data- driven clusters tend to contain dierent categorical emotional manifesta- tions according to their position in the emotional space. For example, for the 3 cluster task, clusters 1,2 and 3 roughly contain categorical emo- tions of `anger or frustration', `happiness or excitement' and `neutrality or sadness or frustration' respectively. . . . . . . . . . . . . . . . . . . . 82 viii 5.7 Positions of the MoCap face markers and separation of the face into lower and upper facial regions. . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.8 Derivatives of the network outputs at time t = 16 with respect to the dierent network inputs at dierent timesteps t 0 ; randomly selected ses- sion consisting of 30 utterances (BLSTM network for the discrimination of ve emotional clusters). . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.9 Average number of relevant past and future utterances dependent on the position in the sequence when using a BLSTM network for the discrimi- nation of ve emotional clusters (3 % sensitivity-threshold). . . . . . . . 100 5.10 Average number of relevant past utterances dependent on the sensitivity- threshold; straight lines: utterances in correct order; dashed lines: ran- domly shued data. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 101 5.11 An overview of the proposed hierarchical framework. . . . . . . . . . . . 104 6.1 An overview of our work on emotion tracking. From left to right we depict the data collection setting (described in Sec. 3.3), the audio-visual feature extraction (described in Sec. 4.2) and data annotation process (described in Sec. 3.4), as well as the GMM-based statistical mapping approach that we follow for estimating the emotional curves. . . . . . . . . . . . . . . 117 6.2 Frame-level tracking using body language features: Performance of the various tracking approaches and feature selection algorithms (in terms of median correlation with ground truth) as a function of the number of body language features used. . . . . . . . . . . . . . . . . . . . . . . . . 126 6.3 Examples of activation and dominance annotations and the corresponding ground truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.4 Results of GMM-based mapping, LSTM regression and the simple base- line (mean), for the examples of Figs.6.3(a)-(c), for frame-level tracking 131 6.5 Examples of activation, valence and dominance annotations and the cor- responding ground truth . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.6 Results of GMM-based mapping, LSTM regression and the simple base- line (mean), for the examples of Figs. 6.5(a)-(c), for window-level tracking135 7.1 Placement of facial markers and denition of facial distances (left) and facial regions (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 ix 7.2 MDS visualization of similarities across subjects for left-right synchrony and facial region motion roughness metrics. . . . . . . . . . . . . . . . . 147 7.3 Examples of the expression of two consecutive smiles. (a) Plots of dis- tanceD1 during a typical (blue, TD) and atypical (red, HFA) expression evolution, before landmark registration. (b) Plots of distance D1 after landmark registration (subjects with HFA in red, TD in blue lines . . . 149 7.4 Analysis of the expression of two consecutive smiles, after performing fPCA. Plots of the rst 3 fPCA harmonics, covering 42%, 28% and 12% of total variability respectively, and scatterplots of corresponding fPCA scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 x List of Tables 3.1 Measures of agreement of the selected continuous ratings for activation, valence and dominance, at dierent levels of annotation detail . . . . . . 38 3.2 Cronbach's and ICC of global discrete ratings for activation, valence and dominance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3 Mean squared error between the discrete ratings and dierent functions of continuous ratings over all actor-recordings. . . . . . . . . . . . . . . . 40 4.1 Body language features extracted from actor A during his interaction with actor B. Features are denoted as individual when they describe only A's movement and posture information, and as interaction features when they describe the relative movement and posture of A with respect to his inter- locutor B. Norm indicates that the corresponding feature has been nor- malized per actor recording. . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.2 Statistical analysis of the top 25 activation features, according to the F value criterion (each feature's rank according to the mRMR C criterion is included in the second column). 
The feature descriptions under the statistical tests column are describing high activation behavior compared to low activation behavior of a subject A. The statistical test performed is dierence of means of the feature values between high and low activation classes (t-test) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.3 Statistical analysis of the top 25 dominance features, according to the F value criterion (each feature's rank according to the mRMR C criterion is included in the second column). The feature descriptions under the statistical tests column are describing high dominance behavior compared to low dominance behavior of a subject A. The statistical test performed is dierence of means of the feature values between high and low dominance classes (t-test) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 xi 4.4 Statistical analysis of the top 25 valence features, according to the F value criterion (each feature's rank according to the mRMR C criterion is in- cluded in the second column). The feature descriptions under the sta- tistical tests column are describing positive valence behavior compared to negative valence behavior of a subject A. The statistical test performed is dierence of means of the feature values between positive and negative valence classes (t-test) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.1 Emotional transition bigrams for the valence and activation classication tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2 Emotional transition bigrams for the 3 and 4 cluster classication tasks. For the 3 cluster case the most frequent categorical emotions per cluster are : c1 = 'ang/fru', c2 = 'hap/exc', c3 = 'neu/sad'. For the 4 cluster case the most frequent emotions per cluster are: c1 = 'hap/exc', c2 = 'sad/fru', c3 = 'ang/fru' and c4 = 'neu'. . . . . . . . . . . . . . . . . . 83 5.3 Statistical functionals used for turnwise processing. . . . . . . . . . . . . 88 5.4 Distribution of the features selected via CFS for the classication of va- lence (VAL) and activation (ACT) as well as for the discrimination of 3 and 4 clusters in emotional space (see section 5.3.2). . . . . . . . . . . . 89 5.5 Comparing context-free and context-sensitive classiers for discriminat- ing three levels of valence and activation, and three and four clusters in the valence-activation space, using face (f) and voice (v) features: mean and standard deviation of F1-measure and unweighted Accuracy across the 10 folds (10 speakers). . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.6 Confusion matrices of the HMM, the hierarchical HMM, the hybrid HMM/BLSTM and the BLSTM classiers for the activation and 3 cluster classication tasks. For the 3 cluster case the correspondence between emotions and clusters is: c1 = 'ang/fru', c2 = 'hap/exc', c3 = 'neu/sad'. . . . . . . . 93 5.7 Comparing context-sensitive Neural Network classiers for discriminat- ing three levels of valence and activation, and three and four clusters in valence-activation space using face (f) and voice (v) features: mean and standard deviation of F1-measure and unweighted Accuracy across the 10 folds (10 speakers). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.8 Recognition performances (%) of BLSTM networks when training on the original sequence of utterances compared to when the utterances are ran- domly shued: mean and standard deviation of F1-measure and un- weighted Accuracy across the 10 folds (10 speakers). 
. . . . . . . . . . . 97 xii 5.9 Classication performances (F1 measures) for three levels of valence and activation using face (f), voice (v), head (h) and hand (ha) features: mean and standard deviation of F1-measure across the 5 folds (5 dyadic sessions).108 5.10 Classication performances (F1 measures) for three levels of valence and activation by fusing using face (f), voice (v) and head (h) modalities at various levels: mean and standard deviation of F1-measure across the 5 folds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 5.11 Classication performances (F1 measures) for three levels of valence for speaker and dialog modeling. We use the best multimodal fusion approach from the previous section. . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.1 Continuous tracking at the frame-level of activation, valence and dom- inance using body language and speech cues. We present the median correlation value between the computed emotional curve and the ground truth. Parentheses indicate the number of selected body features (K), or body and speech features (K+L). . . . . . . . . . . . . . . . . . . . . . . 128 6.2 Continuous tracking at the frame-level of activation, valence and dom- inance using speech cues only. We present the median correlation value between the computed emotional curve and the ground truth, computed only on speech regions. . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.3 Continuous tracking at thewindow-level of activation, valence and dom- inance using body language and speech cues. We present the median cor- relation value between the computed emotional curve and the ground truth 133 7.1 Results of Statistical Tests of Global Facial Characteristics (t-test, dier- ence of means) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xiii Abstract Human expressive communication is characterized by the continuous ow of multimodal information, such as facial, vocal and bodily gestures, which may convey the partici- pant's aect. Additionally, the emotional state of a participant is typically expressed in context, and generally evolves with variable intensity and clarity over the course of an interaction. This thesis explores methodologies for recognition and analysis of emo- tional human states, that address such complex aspects of emotion: multimodality, the use of contextual information, and the continuous dynamics of emotions during aec- tive experiences. Furthermore, the computational approaches proposed in this thesis are applied in the healthcare domain for the analysis of aective expressions of children with autism spectrum disorders Firstly, we investigate the use of contextual information for improving emotion recog- nition performance. We focus on the notion of temporal context, for example the evolu- tion of internal states of other participants or the subject's own previous internal states. Inspired from speech recognition ideas, such as language modeling, we explore typical emotion evolution patterns and propose a hierarchical, audio-visual emotion recognition framework that considers temporal context. We experimentally demonstrate the utility of our proposed approaches for improving emotion recognition performance. Secondly, extending this notion of emotional evolution, we represent emotions as continuous ran- dom variables, such as the degree of intensity or positivity of a person's emotion. 
Using this detailed and flexible emotion state representation, we describe methods for continuously estimating emotional states of participants during dyadic interactions, at various time resolutions, based on speech and body language information. Such continuous estimates could highlight emotionally salient regions in long interactions. The systems described are multimodal and combine a variety of information such as speech, facial expressions and body language in the context of dyadic settings.

We particularly focus on emotional body language, which is less researched computationally, but is a highly informative modality. We investigate how body language is modulated to express particular emotional states. This allows us to revisit qualitative psychological observations regarding emotion and body language, such as gestures, approach/avoidance behaviors, body posture and orientation, from a quantitative perspective.

Finally, this thesis explores applications of our computational approaches in the healthcare domain. Specifically, we focus on the analysis of facial expressions of children with High Functioning Autism (HFA), which are typically reported in the autism literature to be perceived as awkward. This work aims to computationally quantify this impression of awkwardness. Our findings indicate that aspects of asynchrony, roughness of facial motion, and atypicality in the dynamic evolution of facial expressions differentiate expressions produced by children with HFA from expressions of typically developing children. This work sheds light on the nature of facial expression awkwardness in autism, and demonstrates the potential of computational modeling approaches to further our understanding of human behavior.

Chapter 1: Introduction

1.1 Motivation and Problem Statement

Emotions are omnipresent in everyday life, usually happening in the context of a situation or interaction, and shaping our interpretation of the events that we experience. They are expressed overtly or covertly, with an intensity and clarity that may evolve over time. Our underlying emotional states tend to modulate our multimodal communicative channels, such as facial, vocal or bodily manifestations, which in turn can be used to infer the expressed emotion. Accurate computational models of emotion need to account for such complex aspects of emotional expression: multimodality, the context in which an emotional manifestation occurs, and the continuous emotional dynamics during an affective expression or affective interaction. Developing computational methods to understand `typical' emotional human behavior could also provide us with the tools to explore `atypical' emotional behavior, for example the behavior of individuals with mental or psychological disorders. The link between computational approaches and potential healthcare applications is an exciting path for exploration.

Regarding `typical', real-life experiences, consider the situation of an airplane passenger being informed by an airport employee that her luggage was lost and that she will not be reimbursed for it. This situation already sets a context for the emotional response that one might expect, possibly biasing towards negative emotions. A further notion of emotional history can come into play; given that the passenger and the airport employee have expressed negative emotions up to the current point in the conversation, it might be more likely that negative emotions will follow.
If typical emotion evolution tends to follow specific patterns, then modeling such emotional dynamics could enhance our ability to interpret emotional manifestations. Following the Automatic Speech Recognition (ASR) paradigm, where algorithms exploit temporal context in the phonetic or word sequence, one could model structure in an emotional sequence, e.g., where transitions from anger to anger might be more likely than transitions from anger to happiness.

In the same lost luggage scenario, the passenger might express her anger through various communicative channels. Raising the speech intensity and pitch, the facial expression of a frown, rapid hand and body movements conveying distress, as well as corresponding language content could be used in isolation or in combination to convey anger. Our subject might also resort to sarcasm by combining conflicting multimodal cues, for example a smiling face with an angry voice. This multimodal nature of emotion makes it important to correctly encode and model the information conveyed by the various modalities in order to obtain a more complete picture of the expressed emotion. One will not manage to recognize happiness in speech if that emotion is solely conveyed through a bright smile, nor hope to interpret a smile as an expression of sarcastic anger without also taking the corresponding angry speech into account.

Returning to the emotional state of our airport passenger, the clarity, intensity and degree of unpleasantness of her anger might evolve through the course of the interaction. For example, covert anger at the beginning of the conversation might escalate to fury when she is confronted with the indifference of the airport employee. This suggests that emotions are not always well represented by simple categorical descriptions, but could be thought of as continuous variables evolving through time. Hence an emotional state could be represented in terms of continuous attributes, for example its degree of intensity or unpleasantness. Such continuous representations could also be seen as a more generic way to describe emotions, especially for emotional manifestations that are complex, subtle or vague, and may not have a clear categorical description. The use of continuous representations paves the way for building systems that pose the emotion recognition problem in an alternative way: as emotional curve estimation. Such curves provide emotional descriptions through time and could highlight regions of abrupt emotional change, hinting at the problem of emotional event and emotional saliency detection.

This thesis proposes computational frameworks that aim to model emotional expression more accurately by taking into account the context in which it takes place and the evolving continuous attributes of emotion during an affective manifestation or an affective interaction. Our goal is to describe and experimentally validate methodologies that lead to better performing emotion recognition systems, with more diverse capabilities. At the same time, we hope to shed light on the way multimodal expressions are modulated by the underlying emotional state, and revisit qualitative psychological observations from a quantitative perspective. The last part of this thesis explores applications of such computational methods in the healthcare domain, particularly for the analysis of affective expressions of individuals with Autism Spectrum Disorders (ASD).
Our goal is to use our computational methods to investigate and quantify dierences between `typical' and `atypical' emotional expressions, in order to further the under- standing of this complex psychological condition, approaching it from a computational point of view. 3 There are three basic building blocks in this work; rstly our research focuses on de- veloping and analyzing context-sensitive, hierarchical, multimodal emotion recognition systems. We dene context as temporal emotional context, that is observations which are around the current observation and could help disambiguate the emotional content of the current observation. This work is introduced in Section 1.2 and described in de- tail in Chapter 5. In the second part of our work we are moving away from traditional emotion recognition paradigms, and pose the problem of emotion recognition as emotion tracking of continuous emotional attributes through time (Section 1.3 and Chapter 6). Fundamental issues for both these directions of work are the problems of extraction of emotionally informative features, and of modeling and fusion of the various multimodal cues through which an emotional state is expressed. The subject of multimodality is introduced in Section 1.4 and further elaborated in Chapter 4. We are particularly interested in the links between body language and emotion, which is a less researched subject in engineering, despite the fact that psychological research has shown that body language carries rich emotional information, e.g., [4, 5]. Finally, the third part of this thesis explores applications of computational methodologies in the healthcare domain, specically for the analysis of aective expressions of individuals with ASD. Here, our aim is to understand and quantify expressive dierences between typically developing populations and populations aected by ASD. This direction is introduced in Section 1.5 and described in Chapter 7. Putting this work in perspective, one could view it as part of the recent eorts in the Aective Computing community to better understand the challenges involved in recognizing complex and naturalistic emotional expressions and to propose computa- tional solutions [6, 7]. Our approach aims to to enrich emotion recognition systems by including multimodal information, by looking not only at the current observation but also at the information around it, and by modeling dynamics of emotional expression. 4 Furthermore, through the ideas of context and saliency we move towards the use of higher level information for emotion recognition systems, in the sense of highlighting emotionally salient events of an interaction or considering the context and setting of an emotional conversation. Despite the many challenges, the implications of the Aective Computing domain are many-fold for both science and technology [7, 8]. Indeed, from a technological point of view, this domain has the potential to change our concept of what a machine is and the way we interact with them. One could think of truly personalized computers that could recognize your mood, e.g stressed vs joyful, from audio-visual signals and context, such as time and day of the week or number of appointments in your online calendar, and suggest music, screensaver, rank the news of your incoming newsfeed, or make dinner outing suggestions. 
Even if current emotion recognition performance is far from excellent, a rough understanding of the emotional state of the user, e.g posi- tive vs negative and degree of intensity, could inform an appropriate response from the computer that would bring Human Computer Interaction (HCI) closer to producing a human-like experience. From a synthesis perspective, an analysis of how facial, bodily and vocal expressions are modulated by the underlying emotion could enable creat- ing aect-sensitive virtual agents that express emotions, which could have numerous applications in entertainment, education and research (analysis by synthesis). As an example, part of our ongoing and future directions focus on emotional body language generation for animating aect-sensitive virtual agents [9]. From a scientic point of view, computational studies that apply engineering methodologies for understanding emotional expression through quantitative observa- tions, could bring new insights about the way emotion is produced and perceived. Per- haps more importantly, such an understanding could inform social, psychological and health domains in a way that could make a positive dierence in everyday life. The use 5 of computational methodologies to further our understanding of `atypical' or distressed human behavior has been discussed and applied for the analysis of marital couples ther- apy interactions [10], and of individuals with autism spectrum disorders ([11, 12] and our own work [13]), among others. Such directions can be seen as part of the emerging Behavioral Signal Processing domain that explores the role of engineering in developing health-oriented methods and tools [14]. 1.2 Context-Sensitive Emotion Recognition Interpreting an emotional manifestation might require information other than the one contained in the currently displayed modalities, an observation which highlights the importance of contextual information. Indeed, psychology research suggests that human emotional understanding is in uenced by context, which can broadly refer to linguistic structural information, discourse information, cultural background and gender of the participants, knowledge of the general setting in which an emotional interaction is taking place etc. For instance, psychology literature indicates that facial information viewed in isolation might not be sucient to disambiguate the expressed emotion and humans tend to use context, such as past visual information [15], general situational understanding [16], past verbal information [17] and cultural background [18] to make an emotional decision. Furthermore, emotions are usually slowly varying states, typically lasting from under a minute to a few minutes [19]. Therefore an emotion may span several consecutive utterances of a conversation and emotional transitions tend to be smooth. In the emotion recognition literature, relatively few works make use of contextual information and generally use diverse context denitions. In [20] the author describes a framework for building a tutoring virtual agent, which considers a variety of contextual variables such as the student's personality and the tutor's goal. In [21] the authors 6 propose a unimodal framework for short-term temporal context modeling in dyadic in- teractions, where speech cues from the past utterance of the speaker and his interlocutor are taken into account during emotion recognition. In [22] lexical and dialog act features are used in addition to acoustic (segmental/prosodic) features. 
In [23] the authors make use of prosodic, lexical and dialog act features from the past two turns of a speaker for recognizing the speaker's current emotional state.

In our work, we have explored the use of context in the sense of the temporal evolution of emotion, and have proposed systems that utilize emotional information around the current observation to inform the emotion classification of that observation. A variety of technical challenges are addressed in this study. Firstly, we look at an emotional expression not in isolation, but as part of a sequence, which means that our system needs to represent and process all input emotional manifestations, rather than characteristic examples of certain emotions. A further question is what type of emotional context we want to exploit; in other words, is there some temporal structure in how emotion typically evolves? Furthermore, the design of a context-sensitive system needs to address issues concerning the amount of past, and possibly future, context that should be included. A further issue of interest is the possibly multimodal nature of emotional observations, which brings the problem of multimodal modeling and fusion into play. Finally, to simulate real-life scenarios, we aim for a speaker-independent system that does not assume prior knowledge of a test speaker.

We propose hierarchical frameworks that model emotional evolution within and between emotional utterances of a dialog, i.e., at the utterance and at the dialog level respectively. Specifically, if we assume that an utterance is homogeneous in terms of emotional content, then dynamic modeling at the utterance level captures fluctuations within an emotional state, while dynamic modeling between utterances captures emotional evolution through the dialog. The latter corresponds to considering temporal emotional context: we consider the recent past, and possibly future, utterances of the speaker and his interlocutor, in order to inform our decision about the emotional content of the current observation (utterance). This work is inspired by state-of-the-art approaches in Automatic Speech Recognition (ASR), where algorithms exploit context at multiple levels: from phonetic details including coarticulation in speech production to word transitions reflecting language-based statistics [24, 25].

To address the multimodal nature of emotional observations, our proposed system is flexible in terms of the fusion strategies that can be applied to combine the various multimodal cues, such as facial expressions, speech, head and hand movement [26, 27]. Moreover, the proposed hierarchical framework is flexible in terms of the classifiers that can be applied for modeling. We have experimented with approaches assuming a Markov model structure in emotional evolution, including discriminatively trained HMMs and coupled HMMs for combining multimodal information, as well as Neural Network (NN) based approaches. Regarding the NN classification approaches, we have specifically focused on the behavior of a recently proposed variation of Recurrent Neural Networks, the Long Short-Term Memory (LSTM) neural network, which is able to model an arbitrarily long amount of history in the observation sequence [28]. This characteristic makes LSTM neural networks well suited for recognition tasks where modeling a long range of observation history is often found to be beneficial, such as handwriting recognition [29], continuous speech recognition [30], and emotion recognition.
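As a concrete illustration of this idea, the following is a minimal sketch of a bidirectional LSTM that labels every utterance in a dialog from a sequence of utterance-level feature vectors, so that each decision can draw on both past and future utterances. It is not the exact architecture or configuration used in this thesis; the feature dimension, hidden size and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DialogBLSTM(nn.Module):
    """Minimal sketch: classify every utterance in a dialog while using
    bidirectional (past and future) context over the utterance sequence.
    Sizes are illustrative, not the thesis configuration."""
    def __init__(self, feat_dim=120, hidden=64, n_classes=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):               # x: (batch, n_utterances, feat_dim)
        h, _ = self.blstm(x)            # h: (batch, n_utterances, 2 * hidden)
        return self.out(h)              # per-utterance class scores

# Toy usage: one dialog of 30 utterances, each a 120-dim feature-functional vector.
model = DialogBLSTM()
dialog = torch.randn(1, 30, 120)
scores = model(dialog)                  # (1, 30, 3): one decision per utterance
labels = scores.argmax(dim=-1)
print(labels.shape)                     # torch.Size([1, 30])
```

Because the recurrence runs in both directions over the utterance sequence, the label assigned to one utterance is informed by its neighbors, which is the kind of temporal emotional context exploited in the chapters that follow.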
In terms of emotion classication performance, our ndings suggest that incorporat- ing temporal context in emotion recognition systems generally leads to a performance increase. Both HMM and NN based approaches perform comparably, with the latter reaching a higher recognition performance while the former producing more consistent (less variable) recognition results across speakers. Moreover, analysis of the emotional 8 ow of improvised aective dialogs suggests that there are indeed some common pat- terns underlying typical emotional evolution, motivating the use of emotional grammars to capture such patterns. For example emotional states generally tend to remain of pos- itive or negative nature over long periods of time while the intensity of the expressed emotions tends to change more rapidly, with states of very low or high intensity being transient [26]. Finally, we have particularly studied the context learning behavior of LSTM net- works, proposing approaches to eectively measure the amount of bidirectional context that such networks consider when classifying an emotional observation. We also study the eect that incorporating such context has on the nal emotion recognition result. We nd that the good emotion classication performance of LSTM networks heavily re- lies on their ability to capture temporal context information in the observation sequence [31]. 1.3 Continuous Emotional Dynamics The problem of representing an emotional state and modeling its evolution through time could be posed in a more generic way; rstly by looking at emotions as contin- uous random variables, and secondly by avoiding the segmentation of the time axis into units, such as utterances. Therefore an emotional state could be described by di- mensional attributes which take continuous values and which are unfolding through the course of time. Perhaps the most commonly used such attributes are activation, valence and dominance; activation describes how active or intense is the emotional experience, valence describes the level of pleasure related to an emotion, and takes positive and negative values for pleasant and unpleasant emotions respectively, and dominance de- scribes the level of dominance (control) of a person during an emotional experience. This approach for describing emotions was introduced in psychology research based on 9 evidence that humans may perceptually use such a representation in order to evaluate emotional situations [32, 33]. These representations enable building emotion recognition systems that estimate continuous emotional attributes rather than perform classication into discrete emo- tions. However, few works in the literature presently follow this direction. In [34, 35] the authors use regression approaches, such as Support Vector Regression (SVR), to es- timate continuous dimensional attributes from speech cues of presegmented utterances. Some works have avoided segmenting the temporal dimension, e.g. into utterances, and have addressed the problem of continuously tracking emotions through time. In [36] the authors continuously recognize the emotional content of movies from audio and video cues, using a Hidden Markov Model (HMM) which classies dimensional attributes into discrete levels. Few works have used continuous representations for both emotion and time dimen- sions. In [37, 38], the authors build unsupervised models that map the emotional content of movies in the valence-activation space using low-level audio and video cues. 
In [39, 40] the authors use SVR and Long Short-Term Memory (LSTM) neural networks for regression to continuously estimate valence and activation values from emotional speech. In [41] the authors describe a multimodal system to continuously track the valence and activation of a speaker using SVR and LSTM regression.

In this work, the problem of emotion recognition is reformulated as an emotion tracking problem; the goal is to track a set of underlying continuous emotional curves given a set of observed continuous audio-visual features, here body language and speech cues. Our setting is very generic: the emotional states of participants in dyadic affective interactions are tracked throughout their interaction, while they may be speaking, listening or neither. The technical challenges associated with this work include emotion representation and annotation issues: obtaining a reliable emotional curve annotation as ground truth for further experiments is a challenging problem. The extraction of emotion-informative body language features and the dynamic modeling of the evolving audio-visual cues and emotional states are further issues of interest. Moreover, our system should address the fact that although body language cues are available throughout the interaction, audio cues are only available when the subject is speaking. Finally, the system design should decide at what level of detail (resolution) across time emotion tracking is to be performed.

In this work, we have applied a generative Gaussian mixture model-based approach, which computes a statistical mapping from the set of observed feature trajectories to the set of underlying emotional curves that we want to track. This scheme has been successfully applied for voice conversion [42] and acoustic-to-articulatory speech inversion [43], among others. Speech inversion refers to the problem of recovering the underlying articulation during speech production from just the observed speech acoustics. In a similar way, we are trying to recover the underlying emotional state, as represented by activation, valence and dominance, from the observed body language and speech observations. In our experiments, we perform tracking at various levels of detail and use detailed body language descriptions obtained from full-body motion capture, as well as speech features. Our feature sets are tailored to the target emotional attribute by selecting informative feature subsets through feature selection.

Our findings indicate that we are more successful at tracking emotional trends, e.g., increase, decrease or stability of an emotional state, rather than the actual emotional values. According to our experimental results, we achieve higher tracking performance for activation and dominance trends, compared to valence. This suggests that body language conveys information about activation and dominance, while other modalities such as facial expressions may better reflect valence. The inclusion of speech information generally leads to a performance increase. The GMM mapping approach that we follow is shown to outperform other regression-based approaches that have been applied to this problem. Overall, we obtain promising tracking performance for activation and dominance trends, which for certain cases is close to human evaluator agreement [44].
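The core of this mapping is the conditional expectation of the emotional attribute given the observed features under a joint GMM, as in GMM-based voice conversion. Below is a minimal sketch of that general technique, assuming scikit-learn for the mixture fit; it omits the dynamic features and other refinements of the actual system, and the feature dimensions and toy data are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Joint frame-level vectors z_t = [x_t, y_t]: x_t holds the audio-visual
# features (dimension D) and y_t the emotional attribute (e.g., activation).
rng = np.random.default_rng(0)
D, T = 10, 2000
X_train = rng.standard_normal((T, D))
y_train = X_train[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(T)  # toy target

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X_train, y_train[:, None]]))

def track(X):
    """E[y | x] under the joint GMM: mix the per-component linear
    regressions from x to y with the posterior component probabilities."""
    K = gmm.n_components
    mu_x, mu_y = gmm.means_[:, :D], gmm.means_[:, D:]      # (K, D), (K, 1)
    S_xx = gmm.covariances_[:, :D, :D]                     # (K, D, D)
    S_yx = gmm.covariances_[:, D:, :D]                     # (K, 1, D)
    # Posterior p(k | x) from the marginal mixture over x.
    lik = np.stack([gmm.weights_[k] * multivariate_normal.pdf(X, mu_x[k], S_xx[k])
                    for k in range(K)], axis=1)            # (n_frames, K)
    post = lik / lik.sum(axis=1, keepdims=True)
    # Per-component conditional mean of y given x, then mix.
    cond = np.stack([mu_y[k, 0] +
                     (S_yx[k] @ np.linalg.solve(S_xx[k], (X - mu_x[k]).T)).ravel()
                     for k in range(K)], axis=1)           # (n_frames, K)
    return (post * cond).sum(axis=1)                       # estimated emotional curve

y_hat = track(X_train[:200])
print(np.corrcoef(y_hat, y_train[:200])[0, 1])  # agreement with the toy "ground truth"
```

Evaluating the estimated curve against the annotation by correlation, as in the last line, mirrors how tracking performance is reported later in this thesis.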
For example, our methodologies could enable naturalistic HCI that can continuously process a variety of multimodal information from the user(s) as they unfold, monitor the users' internal emotional state and respond appropriately when needed. Additionally, this work could be extended towards examining the pro- duced emotional curves to detect regions of emotional saliency, and study the events that occur in such regions. Such vocal, bodily or interaction-based events could give us insights of what constitutes the emotional content of an interaction. 1.4 Multimodality Emotions are complex dynamic processes that are expressed by multiple modalities which may be carrying complementary, supplementary or even con icting information [45]. Speech may express rich emotional information: emotion is shown to in uence pitch, intensity and spectral characteristics of the speech signal as well as articulation, voice quality and speech rate [46, 47, 48]. The study of the relations between face and emotion has been an active research eld in psychology [49, 50, 51], where certain pro- totypical facial expressions have been argued to re ect universally recognized emotional states, e.g., anger, sadness [52]. Furthermore, research on non-prototypical expressions has described phenomena of mixed expressions to convey mixed emotions, as well as modulation or falsication of facial expressions, e.g by controlling the number of facial regions involved, adjusting the strength of muscle movement and masking of certain expressions [53]. Finally, psychological research argues that body language behaviors, such as body gestures and posture as well as proxemics, gaze and touching, convey 12 rich emotional information and facilitate aective social interactions [4, 54]. Recent work has argued that body postures are more informative than facial expressions for discriminating extreme emotional states [5]. Following these directions, the aective computing community has increasingly un- derscored the importance of multimodal information, moving towards multimodal emo- tion recognition approaches [7]. Many works can be found in the literature that combine facial and speech data for emotion recognition, using a variety of approaches including Hidden Markov Model (HMM) based systems [55], multi-stream HMMs [56], Support Vector Machines (SVM) [57], Ada-Boost algorithms [58], adaptive Neural Network (NN) classiers [59], NN and Support Vector Regression (SVR) approaches [60]. A small but increasing amount of engineering works focus on the capture and analysis of body lan- guage information for emotion recognition. In [61, 62] the authors use upper body language information along with facial expressions to recognize emotions, while in [41] shoulder movement cues were used along with facial and vocal cues for continuous emotion tracking. Works that examine aective full body language include [63] where authors advantageously use full body motion cues, alongside facial and vocal informa- tion, and [64] where authors use features describing movement quality to classify basic emotional states. Lexical content information can also be combined with audio-visual cues, as in the work of [65] where a NN architecture is applied using facial, vocal and lexical features. Finally, researchers have also studied the use of physiological signals for emotion recognition. 
In [66] authors use measurements of subject's body temperature, galvanic skin response and heart rate, while in [67] the authors combine psysiological signals (electrocardiogram, skin conductivity, temperature, etc) with speech informa- tion. Both [66, 67] consider simple k-Nearest Neighbour (kNN) or Linear Discriminant (LDA) approaches for classication. A detailed survey of multimodal emotion recogni- tion systems can be found in [68]. 13 In our work, we have considered the problems of capturing, extracting and model- ing multimodal information, such as facial expressions, speech and full body language gestures, as well as the problem of multimodal fusion. The visual cues (facial and body language information) are obtained through motion capture and are further processed to extract descriptions of emotion-related gestures. Out feature extraction work shows a preference for features that are knowledge-based and interpretable, rather than purely data-driven and optimized for specic classication tasks. We have examined a variety of approaches for modeling the evolution and interplay of speech and facial cues dur- ing an emotional utterance and as a front-end for emotion recognition systems, such as dynamic modeling through HMMs and coupled-HMMs, and statistical descriptions through the use of statistical functionals over features [69, 26]. This work also focuses on extracting information about human expressive body language in the context of aective dyadic interactions. We have examined a variety of psychology-inspired features describing both body posture and movement as well as relative body behavior of a person with respect to his interlocutor, such as proxemics, looking, touching etc. Such features are applied for emotion recognition tasks but are also analyzed to shed light into the body gestures that seem most re ective of certain emotional states. This enables us to revisit qualitative psychological observations concerning body language and emotion, from a quantitative perspective. Indeed, some of our ndings ([44]), agree with intuition and with past psychological observations; for example highly activated subjects generally display higher velocities, and more leaning and body orientation towards the interlocutor, while dominant subjects tend to touch the interlocutor, which brings to mind psychological observations relating touching with dominant behavior [70]. Finally, multimodal data processing starts from multimodal emotional data collec- tion and data annotation, which are challenging problems in themselves. This thesis 14 discusses methodological approaches to elicit naturalistic emotional data in a controlled lab environment. The large, multimodal CreativeIT database [71], collected at SAIL lab, can be used as an example of how theatrical techniques, such as Active Analysis [72], can be applied for emotional data collection. Moreover, summarizing our experience from the CreativeIT data annotation design and data processing, we discuss challenges, design decisions, lessons learned and open problems in the annotation and processing of continuous emotional attributes [73]. 1.5 Applications in Healthcare: Computational Ap- proaches for Autism Research Autism and Autism Spectrum Disorders (ASD) are a range of neural development dis- orders that are characterized by diculties in social interaction and communication, reduced social reciprocity, as well as repetitive and stereotyped behaviors [74]. 
ASD affects a large and increasing number of children in the US; according to the Centers for Disease Control and Prevention (CDC), 1 in 88 children in the US was diagnosed with ASD in 2012, a number that was 1 in 110 until recently (CDC 2010).

This has motivated technology research to work towards providing computational methods and tools to improve the lives of autism practitioners and patients, and potentially further the understanding of this complex disorder. For example, technological possibilities include the development of human-computer interfaces to elicit, encourage and capture the behavior of individuals with ASD, as well as the development of methods to analyze their communication and social patterns. Work in [75] describes eye-tracking glasses to be worn by the practitioner to track the gaze patterns of children with ASD during therapy sessions, while [76] introduces an expressive virtual agent that is designed to interact with children with ASD. Computational analyses have mostly focused on atypical prosody, where certain prosodic properties of subjects with ASD are shown to correlate with the severity of the autism diagnosis [77, 78, 79].

This work ([13]) focuses on the analysis of atypical facial expressions, which is a relatively unexplored topic. Specifically, we examine affective facial expressions of children with High Functioning Autism (HFA). Individuals with HFA have average intelligence and language skills, but often struggle in social settings because of difficulty in interpreting [80] and producing [81, 82] facial expressions. Their expressions are often perceived as awkward or atypical by typically developing (TD) observers [83]. Although this perception of awkwardness is used as a clinically relevant measure, it does not shed light on the specific facial gestures that may have elicited that perception. This work aims to computationally quantify this impression of awkwardness. For this purpose, we use Motion Capture (MoCap) technology and statistical methods like Functional Data Analysis (FDA, [84]) that allow us to capture, mathematically quantify and visualize atypical characteristics of facial gestures.

Starting from these qualitative notions of atypicality, our goal is to derive quantitative descriptions of the characteristics of facial expressions using appropriate statistical analyses. Through these, we can discover differences between TD and HFA populations that may contribute to a perception of atypicality. The use of detailed MoCap information enables quantifying overall aspects of facial gestures, such as synchrony and smoothness of motion, that may affect the final expression quality. Dynamic aspects of facial expressions are of equal interest, and the use of FDA techniques such as functional PCA (fPCA) provides a mathematical framework to estimate important patterns of temporal variability and explore how such variability is employed by the two populations. Finally, given that children with HFA may display a wide variety of behaviors [85], it is important to understand child-specific expressive characteristics. The use of multidimensional scaling (MDS) addresses this point by providing a principled way to visualize differences in facial expression behavior across children.

Our methods are applied to a database of expressions from 37 children aged 9-14, which includes typically developing children and children with ASD. The children perform a variety of facial expressions, including smiles, frowns, and expressions of surprise.
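As a rough illustration of this kind of analysis, the sketch below resamples a set of variable-length facial-distance trajectories (for example, a lip-corner distance traced over a smile) onto a common time grid and extracts the dominant modes of temporal variability with ordinary PCA. This discretized approximation stands in for the landmark registration and functional PCA used in Chapter 7; the data, array names and sizes are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fpca_scores(trajectories, n_grid=100, n_modes=3):
    """Crude discretized stand-in for functional PCA: resample each
    variable-length distance trajectory onto a common [0, 1] grid and
    run ordinary PCA.  The components approximate the main patterns of
    temporal variability; the scores place each recording along them."""
    grid = np.linspace(0.0, 1.0, n_grid)
    resampled = np.stack([np.interp(grid, np.linspace(0.0, 1.0, len(tr)), tr)
                          for tr in trajectories])            # (n_curves, n_grid)
    pca = PCA(n_components=n_modes)
    scores = pca.fit_transform(resampled)                     # (n_curves, n_modes)
    return pca.components_, scores, pca.explained_variance_ratio_

# Toy data: hypothetical lip-corner distance curves of varying length.
rng = np.random.default_rng(1)
curves = []
for _ in range(20):
    n = int(rng.integers(80, 160))
    curves.append(np.sin(np.linspace(0, np.pi, n)) + 0.05 * rng.standard_normal(n))

modes, scores, explained = fpca_scores(curves)
print(explained)  # fraction of variability captured by each temporal mode
```

Comparing the distributions of such scores between groups of curves is one simple way to ask whether two populations deploy the same patterns of temporal variability.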
MoCap markers were attached to the subjects' faces, to provide detailed facial expression information. The data was collected by our psychology collaborator, Prof. R.B. Grossman. According to our results, subjects with HFA are characterized on average by lower synchrony of movement between facial regions, rougher head and facial motion, and a larger range of facial region motion. Expression-specific analysis of smiles indicates that children with HFA display a larger variability in their smile evolution, and may display idiosyncratic facial gestures unrelated to the expression. Overall, children with HFA consistently display a wider variability of facial behaviors compared to their TD counterparts, which corroborates existing psychological research [85]. These results shed light on the nature of expression atypicality, and certain findings, e.g., asynchrony, may suggest an underlying impairment in the facial expression production mechanism that is worth further investigation. Our work proposes the use of a variety of statistical approaches to uncover and interpret characteristics of behavioral data, and demonstrates their potential to bring new insights into the autism domain. Our ongoing efforts focus on the analysis of a larger dataset, which is currently being collected and includes more subjects and a larger variety of facial expressions.

1.6 Overview

The rest of this document is organized as follows: Chapter 2 presents an overview of emotional representations, and Chapter 3 discusses the collection and annotation of multimodal databases, focusing on continuous emotional descriptions and using the CreativeIT database as an example. Chapter 4 describes multimodal feature extraction and analysis, focusing on emotional body language, while Chapter 5 presents research on context-sensitive emotion recognition systems. Chapter 6 describes our research on continuous emotion tracking and Chapter 7 presents our computational analysis of healthcare data from individuals with ASD. Finally, Chapter 8 concludes this thesis and discusses ongoing and future research directions.

Chapter 2: Emotional Representations

2.1 Categorical Representations

After decades of research, there is still lively debate in the psychology community regarding the definition and properties of emotion. Certain researchers argue that there is a small set of basic or prototypical emotions that have been developed through human evolution and are universally recognized. Plutchik [86] describes a list of 8 basic emotions: fear, anger, happiness, sadness, acceptance, disgust, surprise and anticipation, which can be combined to create more complex emotions. Each of those emotions stems from a basic evolutionary need, e.g., fear to escape a threat, happiness when obtaining a desired object, anger when confronting an obstacle, etc. Ekman and Friesen [87] defined a set of 6 basic emotions: anger, disgust, fear, joy, sadness and surprise, which were argued to be universally recognized from facial expressions across cultures. Later, Ekman expanded this set to include more subtle emotional states, such as amusement, embarrassment, guilt, etc. [52]. Other researchers have proposed alternative sets of basic emotions, ranging from as few as two emotions (e.g., pain and pleasure proposed by Mowrer [88]) to many more (e.g., see the table on p. 2 of [89]). Usually such sets include fear, anger, sadness and joy.
However, the theory of basic emotions is strongly disputed by other researchers, e.g., Ortony and Turner [89], who argue that such emotions are simply more common. Following these directions, emotion recognition systems in the literature are often designed to classify emotional manifestations into such emotional categories, e.g., [90, 71, 56]. However, from a computational perspective, the emotions that we choose to consider when designing an emotion recognition system may be application dependent. For example, for the call-center application described in [22], it is sufficient to recognize positive vs. negative emotion of a caller, while the affect-sensitive educational system proposed in [8] would classify an observation into one of three emotional categories: interest, pleasure or frustration. Interpretability is a strong point of such categorical representations.

2.2 Continuous Dimensional Representations

An alternative approach for describing emotions is to represent them in terms of continuous attributes, called dimensions, rather than with discrete emotional categories. The most commonly used such dimensions are activation, valence, and dominance. Activation describes how active or intense the emotional experience is; valence describes the level of pleasure related to an emotion, and takes positive and negative values for pleasant and unpleasant emotions respectively; and dominance describes the level of dominance (control) of a person during an emotional experience. These emotional dimensions were introduced in psychology based on experimental evidence that humans may perceptually use such a representation for evaluating emotions [32, 33]. This representation does not presuppose pre-selecting a number of discrete emotional categories for classification and can be seen as a more generic way to classify emotions, especially for emotional manifestations that are complex, subtle, or may not have a clear categorical description. Indeed, various recognition systems in the affective computing literature use dimensional representations, and often consider only the activation and valence dimensions, e.g., [91, 92].

Figure 2.1: The Self Assessment Manikin used to rate the affective dimensions of valence (top row), activation (middle row) and dominance (bottom row), taken from [1].

Systems that use dimensional representations may not require pre-defining a set of emotional classes; however, they usually require a choice of how many levels of activation and valence will be used for classification. For example, systems may use three levels, indicating low, medium and high values, or require greater detail, such as 5-point or 9-point scales. To facilitate annotation of emotional manifestations into scales of activation, valence and dominance, psychology researchers introduced the Self Assessment Manikin (SAM) [1]. SAM consists of intuitive pictorial representations describing the three dimensional attributes. An example of SAM for a 5-point scale for rating activation, valence and dominance is depicted in Fig. 2.1 (taken from [1]). In the affective computing community, SAM has been used for rating emotional databases, e.g., in [93].

Figure 2.2: Example of a Feeltrace display during a tracking session, taken from [2]. The horizontal axis describes the valence dimension and the vertical axis the activation dimension. The annotator can move the cursor (indicated by a circle) across this interface, to rate an emotional manifestation that is being displayed in parallel on a separate window.
Alternatively, researchers have tried to take full advantage of the continuous nature of such representations by collecting continuous ratings of emotional dimensions and developing systems that estimate continuous emotional attributes, e.g., [34, 35]. A further step is to collect continuous ratings of dimensional attributes through the course of time, and to build systems that represent emotional attributes not as points but as curves that take continuous values and evolve through time. In order to collect such ratings, the authors of [2] introduced the Feeltrace annotation software, which is depicted in Fig. 2.2. Feeltrace enables the user to provide real-time emotional annotation by moving the cursor across an interface that represents the valence-activation space, while the emotional manifestations are being displayed on a separate video player. The end product is an emotional curve for each emotional attribute.

Irrespective of the representation that is chosen, a challenge concerning emotional ratings is the inherent subjectivity of the task. Individuals often perceive emotions in a person-specific and subjective manner; hence there may be disagreement as to the actual emotional label of an emotional manifestation, especially for subtle or ambiguous emotional manifestations. Naturally, the greater the level of detail of the emotional annotations, e.g., more categorical labels or more levels of emotional attributes, the less multiple annotators are expected to agree. This underlines a fundamental challenge of the emotion recognition problem, namely the fact that there is uncertainty regarding the label associated with an example, in contrast to traditional supervised pattern recognition problems where there is a clear association between an example and the corresponding label.

Chapter 3 discusses the challenges associated with obtaining continuous emotional annotations, and describes guidelines and design decisions that we followed during the data annotation of the CreativeIT database, in order to alleviate the problem of annotator disagreement. We also highlight open questions for research regarding the annotation and processing of continuous emotional attributes.

Chapter 3: The USC CreativeIT Database: Collection and Annotation

3.1 Multimodal Emotional Databases and CreativeIT

This chapter describes the design, collection and annotation process, as well as the annotation results, of a novel, multimodal and multidisciplinary interactive database, the USC CreativeIT database [71]. The author was actively involved in the collection and annotation of CreativeIT, and the data collected are used for a substantial portion of this thesis. The database is a result of the collaborative work between the USC Viterbi School of Engineering and the USC School of Theater. The database is collected using cameras, microphones and motion capture, and contains detailed audiovisual information of the actors' body language and speech cues. It serves two purposes. First, it provides insights into the creative and cognitive processes of actors during theatrical improvisation. Second, the database offers a well-designed and well-controlled opportunity to study expressive behaviors and natural human interaction.

The significance of studying creativity in theater performance is that improvisation is a form of real-time dynamic problem solving [94]. Improvisation is a creative group performance where actors collaborate and coordinate in real time to create a coherent viewing experience [95].
Improvisation may include diverse methodologies with variable levels of rules, constraints and prior knowledge concerning the script and the actors' activities. Active Analysis, introduced by Stanislavsky, proposes a goal-driven performance to elicit natural affective behaviors and interaction [72], and is the primary acting technique utilized in the database. It provides a systematic way to investigate the creative processes that underlie improvisation in theater.

Acting has been considered a viable research methodology for studying human emotions and communication. Theater has been suggested as a model for believable agents; agents that may display emotions, intents and human behavioral qualities [96]. Researchers have advocated the use of improvisation as a tool for eliciting naturalistic affective behavior for studying emotions, and argue that improvised performances resemble real-life decision making (Fig. 3.1, [3]). Furthermore, it has been suggested that experienced actors, engaged in roles during dramatic interaction, may provide a more natural representation of emotions, avoiding exaggeration or caricatures [97].

A variety of acted or elicited emotional/behavioral databases exist in the literature. As argued in [98], valuable emotional databases can be recorded from actors using theatrical techniques. Examples of databases which explore acting techniques include the audiovisual IEMOCAP database [93] (also described in Section 4.1.1), and the speech Geneva Multimodal Emotion Portrayal (GEMEP) database [99]. In [100], the authors describe the collection of a multimodal database where contextualized acting is used. Other naturalistic, audiovisual databases include SEMAINE [101], which uses emotion elicitation techniques and does not contain actors, and the Belfast Naturalistic Database [102], which contains acted emotional portrayals as well as extracts of emotional TV clips of a more spontaneous nature.

Figure 3.1: Acting continuum: from fully predetermined to fully undetermined [3].

The USC CreativeIT database is a novel, multimodal database that is distinct from and complements most of the existing ones. Its theoretical design is based on the well-established theatrical improvisation technique of Active Analysis and results from a close collaboration of theater experts, actors and engineers. We utilize Motion Capture technology to obtain detailed body language information of the actors, in addition to microphones, video and carefully designed post-performance interviews of the participants. The database aims to facilitate the study of creative theatrical improvisation qualitatively and provides a valuable resource to study human-human communicative interaction.

3.2 Theatrical Techniques for Eliciting Naturalistic Emotions

3.2.1 Active Analysis

In Active Analysis, the actors play conflicting forces that jointly interact. The balance of the forces determines the direction of the play. The scripts used serve as a skeleton that guides the events. The course of the play can stay close to or diverge from the script. This degree of freedom provides the flexibility to work at different levels of the improvisation spectrum. A key element in Active Analysis is that actors are asked to keep a verb in their mind while they are acting, which drives their actions. As a result, the interaction and behavior of the actors may be more expressive and closer to natural, which is crucial in the context of emotion recognition.
For instance, if the play suggests a confrontation between two actors, one of them may choose the verb inquire while the other may choose evade. If the verbs are changed (e.g., persuade, confront), the play will have a different development. By changing the verbs, the intensity of the play can be modified as well (e.g., ask versus interrogate). As a result, different manifestations of communication goals, emotions and non-verbal behaviors can be elicited through the course of the interaction. This flexibility allows us to explore the improvisation spectrum at different levels and makes Active Analysis a suitable technique to elicit emotional manifestations.

3.2.2 Design of Data Collection

The USC CreativeIT database utilizes two different theatrical techniques, the two-sentence exercise and the paraphrase, both of which originate from the Active Analysis methodology.

In the two-sentence exercise, each actor is restricted to saying one given sentence with a given verb. For example, one actor may say "Marry me" with the verb confront, and another one may say "I'll think about it" with the verb deflect. Given the lexical constraint, the expressive behaviors and the flow of the play will be primarily based on the prosodic and non-verbal behaviors of the actors. This type of controlled interaction can bring insights into how humans/actors use their expressive behaviors, such as body language and prosody, to reach a communication goal. Also, this approach is suitable for studying emotion modulation at a semantic level, since the same sentences are repeated multiple times with different emotional connotations.

In the paraphrase, the actors are asked to act out a given script with their own words and interpretation. Examples of plays that are used are "The Proposal" by Chekhov or "The Taming of the Shrew" by Shakespeare. In this set of recordings, actors are not lexically constrained. As a result, the performance is characterized by a more natural and free-flowing interaction between the actors, which bears more resemblance to real-life scenarios compared to the two-sentence exercise. Therefore, behavioral analysis and findings on such sessions could possibly be extrapolated to natural human interaction and communication.

3.3 Data Collection

3.3.1 Session Protocol

An expert on Active Analysis (Prof. Carnicke, professor at the USC School of Theater) directed the actors during the rehearsal and the recording of the sessions. Prior to the scheduled data collection date, the actors had to go through a rehearsal with the director to become familiar with Active Analysis and the scene. Just before the recording of the paraphrase, there was another 5-minute session to refresh the actors' memory and give the director a chance to remind the actors of the essence of the script. A snapshot of an actor during the data collection is shown in Figure 3.2(a).

Each data collection consists of six performances from a pair of actors: four two-sentence exercises and two paraphrases. Verbs are chosen either by the actors or the director prior to each performance. Some of the commonly chosen verbs are to shut him out, to seduce, to deflect, to confront, to force the issue, etc., which introduce a large variety of communication goals.

3.3.2 Equipment and Technical Details

The following equipment is utilized in the data collection:

Vicon Motion Capture System: 12 motion capture cameras record the (x, y, z) positions of 45 markers for each actor. The markers are placed according to Figure 3.2(b).
HD Sony Video Camcorder: 2 Full HD cameras are placed at corners of the room to capture the performance of the actors.

Microphones: Each actor wears a close-up microphone that records speech at 48 kHz with 24-bit resolution.

3.4 Emotional Data Annotation

Human emotional states evolve with variable intensity and clarity through the course of social interactions and experiences, and they are continuously influenced by a variety of multimodal input information from the environment and the interaction participants. This has motivated the development of a new area within affective computing that treats emotions as continuous variables and examines their representation, annotation and modeling. The emotional annotation of the CreativeIT database focuses on continuous annotations, and was discussed in [73].

Figure 3.2: Snapshots of an actor during data collection and the marker positions: (a) actor wearing microphone and markers, (b) positions of markers.

3.4.1 Why Continuous Annotations?

The CreativeIT database contains a variety of multimodal expressions and interaction dynamics that continuously unfold during the improvisation. Therefore, it is difficult to define precise starting and ending times of expressions, since those are produced multimodally, or to segment interactions into units of homogeneous emotional content. In unimodal databases, or databases that are spoken/dialog-centric such as IEMOCAP [93] and VAM [103], it seems natural to segment a conversation into utterances as basic units for examining emotional content. In contrast, the CreativeIT database contains many nonverbal emotional expressions that happen asynchronously to speech or while the participant is silent. Such observations motivate the use of continuous attributes as a natural way to describe the emotional flow of an interaction.

The perceived emotional state of each participant was annotated in terms of the widely used dimensional attributes of activation, valence and dominance. This representation is well-suited to describe the complex and ambiguous manifestations of the CreativeIT database, which do not always have clear categorical descriptions. For our annotations, we used the Feeltrace software [2] and collected annotations of perceived activation, valence and dominance for each actor in each performance, taking continuous values in [-1, 1].

3.4.2 Challenges and Design Decisions

Annotation of emotional content is an inherently subjective task that depends, among other factors, on the individual's perception, experiences and cultural background. The use of continuous descriptors seems to increase the level of complexity of the emotional annotation task, as it requires a higher amount of attention and cognitive processing compared to non-real-time, discrete annotation tasks. Apart from being a strenuous and time-consuming process, continuous annotation poses challenges in terms of obtaining inter-annotator agreement.

We performed two annotation efforts on the CreativeIT data; the first one was a pilot annotation of a subset of the data, which helped us identify problems and propose improvements for the second annotation effort. In our pilot annotation, used in [104], we identified problems of low annotator agreement leading to noisy ground truth. Specifically, we obtained median inter-annotator correlations of around 0.5 for the three dimensional attributes. Similar problems have been reported by other researchers that make use of continuous attributes.
In [105] the authors report that for 70% of continuous valence annotations of TV clips, the inter-annotator correlations are above a 0.5 threshold, a percentage that reduces to 55% and 34% for activation and dominance (power), respectively. In [36] the authors report mean annotator correlations of 0.3 and 0.4 for valence and activation, respectively, for continuous self-annotations of felt emotion while watching movies.

The continuous rating of the SEMAINE database shows a more optimistic picture, where Cronbach's α values of continuous valence ratings are reported to be higher than 0.75 for 86% of the ratings (indicating acceptable consistency) [101]. The authors also report high α values (0.8-0.9) for the consistency of functionals over continuous dimensional attributes, e.g., the mean. However, such functionals should be interpreted with caution, as they do not necessarily correspond to a well-defined user perception. For example, the mean of a continuous rating over a clip does not generally correspond to the rater's perceived global rating of the clip, as discussed in Sec. 3.5.2.

For the annotation of the CreativeIT data, we identified a number of potential factors that could be sources of annotation noise, and could increase the level of inter-annotator variability over what is naturally expected because of the challenging and subjective nature of the task. Below, we describe such noise factors and the resulting practical design decisions that we adopted in order to address them.

Annotator Motivation and Experience: We recruited psychology students, most of whom had previous experience in emotional annotation, and who were committed to weekly working requirements.

Definition of Emotional Attributes: Although people typically learn to assess emotional content through social experiences, the definition of dimensional emotional attributes may be less intuitive for some annotators. The definitions of the activation, valence and dominance attributes were explained through examples. We clarified that ratings are subjective; however, annotators should be able to rationally explain their decisions based on verbal or nonverbal characteristics of the interaction.

The Annotation Instrument: We observed a learning curve until annotators became comfortable with the use of Feeltrace (see also Sec. 3.6.1). Annotators were trained on how to use Feeltrace, they performed their first annotations multiple times to familiarize themselves with the software, and were later encouraged to perform each annotation as many times as needed until they were satisfied with the result.

Understanding the Type and Range of Emotional Content in the Dataset: In order to facilitate the annotation process, we wanted annotators to be familiar with the type and range of emotional manifestations that appear in the database. Therefore, as part of their training, they had to watch in advance about a fifth of the recorded performances, randomly selected across different performances.

Person-specific Annotation Delays: Since continuous annotations are performed in real time, we expect person-specific delays, due to perceptual processing, between the time that an event happens and when its emotional content is annotated. In order to reduce such delays, we modified the Feeltrace interface so that annotators can focus their attention on one attribute at a time, rather than the two attributes of the original Feeltrace design. The modified Feeltrace interface for activation annotation is presented in Fig. 3.3.
The annotation is performed by moving the mouse cursor, shown as a full circle, along the horizontal line, while watching the performance video in a separate window. It is interesting to note that a one-dimensional version of the Feeltrace interface later became available (the Gtrace software [106]), indicating the need for such a one-dimensional annotation tool. Finally, to further reduce delays due to perceptual processing, we also instructed annotators to watch each video multiple times and have a clear idea of the emotional content before starting the real-time annotation.

3.4.3 Discrete Data Annotation

We also collected discrete annotations of the global emotional content of each performance. Emotional content was rated in terms of perceived activation, valence and dominance for each actor on a 9-point scale. Rating 1 denotes the lowest possible activation level, the most negative valence level, and the most submissive dominance level. Annotators were asked to give an overall rating that summarizes the particular attribute over the total recording. They were instructed to perform the overall rating right after completing the corresponding continuous annotation, so that they would have a recent impression of the annotated performance and of their continuous assessment of its emotional content.

Figure 3.3: Screenshot of the modified Feeltrace interface.

The reason for collecting global annotations is two-fold: first, we wanted to enrich our annotation with more standard discrete labels for potential future use; second, we want to study relations between global discrete and detailed continuous ratings provided by the same person, in order to shed light on the way humans summarize an emotional experience.

3.5 Data Annotation Results

The database contains 50 recordings, each rated for both actors in the dyad; therefore we have 100 actor-recordings. Seven annotators participated in total, rating overlapping portions of the database, so that each actor-recording would be rated by three or four
This suggests that peo- ple agree more when describing emotions in relative terms, e.g., whether there has been an increase or decrease, rather than in absolute terms. Rating emotions in absolute terms seems more challenging because of each person's internal scale when assessing an emotional experience (similar arguments are made in [107]). This motivates us to focus on the annotation trends, and to use correlation metrics, such as Pearson's correlation, and Cronbach's to measure evaluator agreement. 35 Figure 3.4: Example of activation rating by three annotators. To compute the emotional `ground truth' for each recording (especially for facilitat- ing subsequent computational modeling), we need to aggregate the multiple annotators' decisions. However, some ratings might appear inconsistent with the ratings of the majority of annotators. This issue is common in emotional labeling with categorical labels, where the emotional ground truth is often computed based on majority voting and minority labels are ignored, e.g. [93]. Here, we extend this notion on continu- ous ratings, using correlations as a basis for agreement. Specically, we set a cut-o threshold for dening acceptable annotator agreement and for each actor-recording, we take the union of all annotator pairs with linear correlations greater than the threshold. Only this annotator subset is used to compute the ground truth for the corresponding actor-recording. If no annotators are selected then we assume that there is no agree- ment for that recording. Our threshold is empirically set to 0.45, which is similar to the correlation threshold used in [105] for dening agreement. This results in ground truth agreement in 80, 84 and 73 actor-recordings for the activation, valence and dominance class respectively, out of 100 in total. Interestingly, a comparable percentage of ground truth agreement (about 75%) was reported for annotation into categorical labels using this majority voting scheme, for the IEMOCAP database [93], an emotional database of improvised acting. 36 This process selects consistent annotations and allows for aggregation methods such as averaging. Developing methodologies for combining multiple annotators' subjective judgements in a more informed way than averaging, potentially considering disagree- ing annotators, is an important research problem, e.g., see [108, 109]. However, the continuous nature of our ratings makes such existing methodologies not directly appli- cable. An approach for weighted averaging of continuous annotations is proposed in [41], where annotator weights are computed based on their correlation with the rest of the annotators. This is a correlation-based, soft-selection scheme, which instead of ignoring uncorrelated annotators, it assigns them low weights. A more principled ap- proach to uncover a continuous ground truth from multiple noisy continuous emotional annotations, in a way that also handles potential delays between the annotations, is proposed in [110], by combining Probabilistic Canonical Correlation Analysis (PCCA) and Dynamic Time Warping (DTW) approaches. To get an impression of annotator agreement over the database, we rst compute the mean of the correlations between the selected annotators per actor-recording, and then compute the mean over all actor-recordings. Similarly, we also compute the Cron- bach's coecient of the selected annotators per actor-recording, and then compute the overall mean. These measures are presented in Table 3.1 (third line). 
The agreement measures above were computed from detailed annotations with 100 values per second. However, emotional states are slowly varying, so this degree of temporal detail may not be necessary and might be capturing annotation noise unrelated to emotion. We also examine the effect of lower information detail by first averaging the selected annotations over windows of 3 sec., with 1 sec. overlap (1 value per sec.). These measures are presented in the second data row of Table 3.1 (1 value per sec.). We notice only a slight increase in annotator consistency, which indicates that our annotation pre-processing through lowpass filtering has removed most of the high-frequency noise.

Table 3.1: Measures of agreement of the selected continuous ratings for activation, valence and dominance, at different levels of annotation detail.

                          mean Pearson's correlation           mean Cronbach's α
                          activation  valence  dominance       activation  valence  dominance
    100 values per sec.      0.60       0.63     0.59             0.75       0.77     0.74
    1 value per sec.         0.63       0.64     0.61             0.76       0.77     0.75

The consistency of the discrete global annotations was also examined by computing the Cronbach's α coefficient of the global activation, valence and dominance ratings from all annotators (no annotator selection is performed here). Those coefficients are presented in Table 3.2. Overall, we notice that annotator consistency is at acceptable levels (around and above 0.7), except for dominance, which is slightly lower. We also report consistency in terms of intra-class correlation (ICC), a related metric that accounts for the fact that different recordings are rated by different annotator subsets. Specifically, we use ICC Case 1 from [111], which assumes that each target is rated by a different set of annotators, randomly selected from a larger annotator pool.

Table 3.2: Cronbach's α and ICC of global discrete ratings for activation, valence and dominance.

                          Cronbach's α                          Intra-class correlation (Case 1 [111])
                          activation  valence  dominance       activation  valence  dominance
                             0.72       0.78     0.67             0.69       0.78     0.65

3.5.2 Comparing Continuous and Discrete Annotations

The availability of both global discrete and detailed continuous ratings from the same annotator and for the same actor-recording allows us to examine how annotators summarize continuous information to produce an overall judgment. We applied several functions to summarize each continuous rating into a number and examined how close the resulting functional is to the global rating given by the annotator. The functions include the mean, median, maximum, minimum, and first and third quartiles (q1 and q3) of the recording. The discrete ratings were first shifted and rescaled to match the range of the Feeltrace annotations. Table 3.3 shows the mean squared error (MSE) between the discrete ratings and the different functionals over all actor-recordings. The last line is the MSE when we choose whichever of q1 and q3 is closest to the discrete rating.

We notice that the discrete rating is generally closer to either q1 or q3 than to the other functionals, although it varies per rating which of the two will be closer. Hence, the global rating is more influenced by either the highest or the lowest values of a rating during a recording. Specifically, for 66% of the activation ratings the discrete rating is closer to q3, for 59% of the valence ratings the discrete rating is closer to q1, while for dominance there is an almost equal split. This suggests that global judgments of activation tend to be more influenced by the more highly activated events of a recording, while global judgments of valence tend to be more influenced by the more negatively valenced events.
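As a rough illustration of this comparison, the sketch below computes the candidate functionals of a continuous curve and accumulates the squared error against the annotator's global rating. The linear mapping of the 9-point scale onto the [-1, 1] Feeltrace range and the function names are assumptions made for illustration, not the exact rescaling used in the thesis.

    import numpy as np

    def functionals(curve):
        # Candidate summaries of a continuous rating curve with values in [-1, 1].
        return {"mean": np.mean(curve), "median": np.median(curve),
                "max": np.max(curve), "min": np.min(curve),
                "q1": np.percentile(curve, 25), "q3": np.percentile(curve, 75)}

    def rescale_discrete(rating, low=1, high=9):
        # Map a 9-point global rating linearly onto the [-1, 1] Feeltrace range.
        return 2.0 * (rating - low) / (high - low) - 1.0

    def mse_per_functional(pairs):
        # pairs: list of (continuous_curve, global_rating) tuples for one attribute.
        errors = {}
        for curve, global_rating in pairs:
            target = rescale_discrete(global_rating)
            for name, value in functionals(curve).items():
                errors.setdefault(name, []).append((value - target) ** 2)
        return {name: float(np.mean(v)) for name, v in errors.items()}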
It also seems that different raters weight the same recording differently when making an overall decision; for example, looking at the clips that were rated by 3 people (the large majority), in only about 40% of them were all annotators consistent as to the quartile that they weighted more, whether that was q1 or q3 (this percentage is similar for the three attributes). These findings illustrate the complexity of human cognitive processing when summarizing emotional content; this processing is influenced by the emotional aspect to be evaluated, the events that are being observed, as well as person-specific characteristics.

Table 3.3: Mean squared error between the discrete ratings and different functions of the continuous ratings over all actor-recordings.

    function           activation  valence  dominance
    mean                  0.13       0.10     0.06
    median                0.14       0.10     0.07
    max                   0.31       0.37     0.24
    min                   0.71       0.26     0.40
    q1                    0.22       0.12     0.12
    q3                    0.14       0.13     0.08
    either q1 or q3       0.07       0.06     0.03

3.6 Open Questions and Research Directions

3.6.1 Improving tools for continuous annotations

Performing continuous annotations is a challenging task; therefore the availability of suitable annotation tools is important to facilitate this process. Annotation software should ideally be customizable in terms of the number and type of attributes to be annotated, portable and easy to install for users with little technical experience, and easy to learn and use. Feeltrace, as one of the first freely available software tools for continuous annotation, has been a useful resource to the community. However, this early version was lacking in terms of portability and customization functionality; e.g., it only supported 2-dimensional annotation. Those issues were addressed by the development of its successor, Gtrace. Also, we noticed that users were sometimes distracted by the two separate windows in Feeltrace (one for annotation and one for video viewing); in Gtrace those windows are integrated into a common interface. Further improvements could focus on usability; for example, Feeltrace and Gtrace are mouse operated and require continuously pressing the mouse button to perform annotation. This can be tiring, especially for the annotation of long videos. Joystick-based options would be more natural, and even more enjoyable for the user. Regarding other continuous annotation tools, one could also look at the joystick-operated and freely available CMS [112].

3.6.2 Absolute vs Relative Ratings

Our annotation results suggest that humans are better at rating emotions in relative rather than absolute terms. Indeed, humans seem to have individual internal rating scales that are culture and personality dependent, among other factors. Therefore, it is easier for multiple people to agree that, for example, there has been an activation increase, than on the absolute values of activation. This issue was also discussed in [107], where the authors propose a rating-by-comparison method for the annotation of emotional content in terms of discrete dimensional labels. It would be interesting to study how such rating-by-comparison ideas could be reformulated in the context of continuous annotation, in an attempt to increase inter-annotator consistency.

3.6.3 Which attributes can be continuously rated?
Here, we focus on emotional content; however, a variety of other attributes could be annotated in a continuous way depending on the application and the analysis focus, e.g., engagement, enjoyment, frustration, hostility, etc. Given that some attributes seem easier to rate than others, e.g., valence generally achieves higher consistency than activation and dominance, it would be interesting to examine to what extent different attributes can be rated consistently in a continuous way. This is also discussed in [105], where the authors compare rating consistencies for different emotions and cognitive states. It seems, however, that this effect could be data dependent. For example, data collected for emotional studies are usually rich in emotional manifestations, which may facilitate consistency in emotional annotation, while data collected to examine user experience, e.g., with an interface, could be more appropriate for engagement or frustration annotation.

3.6.4 Multiple Subjective Ratings and Crowdsourcing

Studies that compute statistical properties over user populations may require a large number of annotators per recording. This brings forward the relevant discussion in the literature regarding selecting a few expert annotators or many naive annotators. In this work, we have followed the former approach, as a small number of expert annotators was relatively easy to recruit in a university environment, and to coordinate. A similar approach was followed for the SEMAINE database annotation [101], while for the Belfast Naturalistic database [102] the large-number approach was adopted instead, with some clips being rated over 160 times. Obtaining a large number of annotations per recording can be greatly facilitated by the use of modern crowdsourcing tools, like Amazon Mechanical Turk (MTurk), which has proven useful for various user studies [113, 114] and translation tasks [115]. One caveat when using MTurk for subjective ratings is that the researcher should devise a method for assessing the attention of the rater and the quality of the annotations, in order to prevent potentially careless annotators from contaminating the results, as discussed in [115, 113]. For the case of continuous ratings, there is as yet no available online tool that would enable performing such annotations online and linking them to the MTurk service. Such a tool would allow researchers to harness the capabilities of crowdsourcing for large-scale data annotation projects.

3.6.5 Annotator-specific delays

As discussed in Sec. 3.4.2, because the continuous annotations are done in real time, and due to cognitive processing, we expect a person-specific delay between the time that an event happens and when its emotional content is annotated. Here, we have tried to reduce such delays by asking the annotators to perform an emotional assessment of each recording before starting the real-time annotation. Other researchers have addressed this issue by post-processing the obtained annotations, e.g., by slightly shifting them in time such that the correlation between multiple annotators' continuous ratings is empirically maximized [41]. A more sophisticated method for warping and combining multiple annotations with variable delays is proposed in [110], by combining Probabilistic Canonical Correlation Analysis (PCCA) and Dynamic Time Warping (DTW).
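A crude version of the shift-based alignment of [41] is sketched below: one annotation is advanced frame by frame and the lag that maximizes its Pearson correlation with a reference annotation is kept. The frame rate, maximum delay and function name are illustrative assumptions, and this is far simpler than the PCCA/DTW approach of [110].

    import numpy as np
    from scipy.stats import pearsonr

    def best_shift(reference, rating, fs=100.0, max_delay=5.0):
        # Advance `rating` by up to max_delay seconds (one frame at a time) and keep
        # the lag that maximizes Pearson correlation with `reference`.
        best_r, best_lag = -np.inf, 0
        for lag in range(int(max_delay * fs) + 1):
            shifted = rating[lag:]
            ref = reference[:len(shifted)]
            r, _ = pearsonr(ref, shifted)
            if r > best_r:
                best_r, best_lag = r, lag
        return best_lag / fs, best_r   # estimated delay in seconds and its correlation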
More detailed studies of annotator delays would shed light on the timing of human cognitive processing with respect to events in the environment, and how it is affected by factors such as fatigue or interest. Such studies would require a careful experimental design, where, for example, each rater's response could be measured with respect to certain predefined emotional events in the videos.

3.6.6 Continuous ratings and saliency detection

Continuous ratings could reveal regions of an interaction that are characterized by abrupt changes or extreme values of emotional content. Those could be regions where interesting events happen, in the sense that they catch the viewer's attention or they are prototypical or salient examples of a certain emotional or cognitive state. Such regions may also be weighted more heavily when a person summarizes an emotional experience. The availability of continuous annotations would help us understand what constitutes the salient content of an interaction, and would pave the way towards emotional event detection and summarization in social interactions.

Chapter 4: Multimodal Feature Extraction

Emotional expression is a multimodal process. The affective state may be transmitted by one or more of various channels such as the face, voice, speech content, body movement and posture. Moreover, emotion is perceived by combining those channels, which may carry complementary, supplementary or even conflicting information (as in the case of sarcasm) [45]. Therefore, an emotion recognition system that takes into account multiple modalities may be able to achieve robust emotion recognition performance, even under noisy conditions or when subtle emotions are expressed.

Following these directions, the affective computing community has increasingly underscored the importance of multimodal information, moving towards multimodal emotion recognition approaches [7]. The state of the art and the challenges faced in developing an affect-sensitive multimodal human-computer interface have been discussed in [116]. Multimodal research on emotion recognition has focused mostly on combining the face and voice modalities [56, 117, 60]. Researchers have also used face images with markers to minimize the noise introduced by automatic facial feature detection [118]. Recent works make use of multimodal information such as body language gestures combined with speech [63, 41], as well as physiological signals (such as the subject's body temperature, galvanic skin response and heart rate) [66], for emotion recognition.

In this chapter we describe feature extraction methods for facial, vocal and head movement cues, as well as full body language cues, in order to extract the multimodal emotion-related features which are used in this work. Facial expression information is extracted from Motion Capture (MoCap) markers across the subject's face. Speech information is extracted from the microphone signal, and head movement features are also extracted via MoCap markers. Those features are extracted from improvised affective dyadic interactions from the IEMOCAP database, a naturalistic emotional database described in detail in Section 4.1.1. Moreover, this chapter describes extensive work on body language feature extraction from full-body MoCap markers, extracted from the CreativeIT database, described in Chapter 3. In general, we have a preference for features that are intuitive and interpretable, rather than purely data-driven features that are optimized for specific classification tasks.
Apart from classification performance, our goal is to shed light on how multimodal expressions, such as facial and body language cues, are modulated by the underlying emotional states.

4.1 Face, Speech and Head Movement Feature Extraction

The features described in this section are used in the experiments of Section 5 and the corresponding publications [119, 26, 27, 31], as well as other related publications [120, 69].

4.1.1 The IEMOCAP Database

For a large portion of this work, we use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which is a large multimodal and multisubject emotional database collected at the SAIL lab at USC [93]. It contains approximately 12 hours of audio-visual data from five mixed-gender pairs of actors. Each recorded session lasts approximately 5 minutes and consists of two actors interacting with each other in scenarios that encourage emotional expression. During each recording, both actors wore microphones and one of them had facial Motion Capture (MoCap) markers.

The IEMOCAP database has been carefully designed to elicit complex emotional manifestations and to avoid caricatures by exploiting acting techniques, such as improvisation. The importance of such acting techniques in collecting naturalistic, acted databases has been affirmed elsewhere in the emotion literature, as a useful tool for studying emotions in controlled environments [121, 122]. Here, two acting styles were used: improvisation of scripts and improvisation of hypothetical scenarios. Each improvisation was designed to convey a general emotional theme; for example, a subject is sharing the news of her recent marriage (happiness/excitement), a subject is talking about the death of a close friend (sadness), a subject who just lost her valuable luggage at the airport learns that she will receive only a small refund (anger/frustration). The scripts are taken from theatrical plays and they are generally characterized by a more complex emotional flow. The recordings, even those with a general theme, contain multiple emotional displays which vary in their intensity and clarity. The emotional content of each utterance is not pre-defined and generally depends on the interpretation of the script/improvisation by the actors and the course of their interaction. The goal was to elicit emotional displays that resemble natural emotional expression and are generated through a suitable context.

The dyadic sessions were later manually segmented into utterances, where consecutive utterances of a speaker may or may not belong to the same turn. The emotional content of each utterance was annotated by human annotators with categorical labels (annotators had to choose between the following emotions: angry, happy, excited, sad, frustrated, fearful, surprised, disgusted, neutral and "other: please specify"). Each utterance was annotated by three people and the final categorical label was decided through majority voting (at least two annotators had to agree on the categorical tag). Note that a significant portion of the utterances, around 17%, did not have a majority-consensus categorical label, which suggests that the database contains a great variety of subtle, ambiguous or vague emotional manifestations. Each utterance was also annotated in terms of dimensional descriptions of valence and activation. Value 1 denotes very low activation and very negative valence, and 5 denotes very high activation and very positive valence.
Those properties are rated on scales of 1-5 and are averaged across 2 annotators (or, for a few utterances, across 3 annotators). In general, the emotional annotations are the result of the overall impression of an utterance, since annotators considered audio, video, speech content and the interaction context. Therefore, the multimodal expression of emotion was taken into account both during the collection and during the annotation of the database.

4.1.2 Facial Features

The IEMOCAP data contain detailed facial marker coordinates from the actors during their emotional interaction. The positions of the facial markers can be seen in Figure 4.1. The markers were normalized for head rotation and the nose marker is defined as the local coordinate center of each frame. The five nose markers were excluded because of their limited movement. In total, information from 46 facial markers is used, namely their (x, y, z) coordinates. This results in a 138-dimensional facial representation, which tends to be redundant because it does not exploit the correlations of neighboring marker movements and the structure of the human face. For example, if we perform Principal Component Analysis (PCA) on this representation, the first 30 PCA dimensions explain around 95% of the total variability. Therefore, there is much room for dimensionality reduction in order to compute a lower-dimensional facial feature vector, which would be better suited for emotion recognition applications.

Figure 4.1: Positions of the MoCap face markers and separation of the face into lower and upper facial regions.

4.1.2.1 Speaker Face Normalization

When we examine various speakers, it is important to smooth out individual speaker face characteristics that are not related to emotion. Our speaker normalization approach consists of finding a mapping from the individual average face to the general average face. This is achieved by shifting the mean value of each marker coordinate of each speaker to the mean value of that marker coordinate across all speakers. Specifically, for each speaker we compute the mean of each face feature (marker coordinate) across all emotions, m_ij, where i is the speaker index and j is the feature index. Also, we compute the mean of each feature across all speakers and all emotions, M_j, where j is the marker coordinate index. Each feature is then normalized by multiplying it with the coefficient c_ij = M_j / m_ij. In practice, to get an impression of the mean facial characteristics of a test speaker, we use the data of the current test utterance, or of the total test conversation, assuming that we see the total test conversation in advance (as in the experiments of Section 5).

4.1.2.2 Principal Feature Analysis

PCA is a widely used dimensionality reduction method that finds the projection of the data into a lower-dimensional linear space such that the variance of the projected data is maximized [123]. However, the PCA transformation space has no inherent intuitive interpretation. In order to find more meaningful facial representations we use Principal Feature Analysis (PFA) [124]. This method computes the PCA transformation matrix as a first step and uses this matrix to cluster together facial marker coordinates that are highly correlated. Then it selects a representative feature from each cluster, thus performing feature selection while using similar criteria as PCA. We find experimentally that it is beneficial to perform some ad-hoc averaging of neighboring facial markers prior to applying PFA. That way, highly correlated markers are averaged and the face markers are reduced from 46 to 28. Then we perform PFA, which selects the least correlated averaged marker coordinates, and finally we normalize the selected coordinates.
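A minimal sketch of the PFA selection step is given below, assuming scikit-learn and an input matrix whose columns are the (averaged) marker coordinates. The choice of k-means on the PCA loading vectors follows the general idea of [124]; the exact clustering and parameter settings used in this work may differ.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def principal_feature_analysis(X, n_select=30, var_kept=0.95):
        # X: (n_frames, n_features) matrix of (averaged) marker coordinates.
        # Cluster the features by their PCA loading vectors and keep the feature
        # closest to each cluster center, i.e., one representative per cluster.
        pca = PCA(n_components=var_kept).fit(X)       # keep ~95% of the variance
        loadings = pca.components_.T                  # one loading vector per feature
        km = KMeans(n_clusters=n_select, n_init=10).fit(loadings)
        selected = []
        for c in range(n_select):
            members = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(loadings[members] - km.cluster_centers_[c], axis=1)
            selected.append(int(members[np.argmin(dists)]))
        return sorted(selected)                       # indices of the retained features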
The positions of the marker coordinates are affected by the facial configuration of each speaker. Normalization is important in order to smooth out individual face characteristics which are unrelated to emotion and to focus on emotional modulations. We select 30 features (the initial PCA transformation explains around 95% of the total data variability) and we append the first derivatives to the feature vector, resulting in a 60-dimensional facial representation.

An analysis of the PFA process using markers from the total face shows that the facial features are clustered together in a meaningful way. For example, the same coordinates of neighboring or mirroring markers (e.g., left and right cheek) are clustered together. When repeating PFA 100 times, we find that on average 28% x-coordinates, 39% y-coordinates and 33% z-coordinates are selected, showing that all 3 coordinates have important variability in emotional speech. The comparatively high percentage of selected y-coordinates is expected because of the jaw movements in the vertical direction. Indeed, on average 22% of the selected y-coordinates come from mouth markers, while only 14% of the initial markers are placed around the mouth. Selection of z-coordinates can be attributed to lip protrusion during articulation. The distribution of the initial markers across the face regions is (chin, mouth, cheeks, eyebrows, forehead) = (11%, 14%, 28%, 36%, 11%), while the distribution of the selected markers is (13%, 23%, 25%, 31%, 8%). This indicates a bias towards selecting lower face marker coordinates (especially the mouth), which is expected since the movement of the jaw conveys a great amount of the variability. This is an encouraging result, since the mouth can be automatically tracked more reliably than other face regions like the cheeks and forehead.

4.1.3 Head Movement Features

The head features consist of the head translation in the (x, y, z) directions as well as the head angles (yaw, pitch and roll). Translations are derived from the nose marker and head angles are computed from all the markers using a technique based on Singular Value Decomposition (SVD), as described in [93]. We also compute the first derivatives of these features, resulting in a 12-dimensional representation. The head features are normalized through z-standardization. In practice, the mean and standard deviation normalization constants are computed on the training set.

In order to gain intuition about the importance of these features for discriminating between the emotional classes of anger, happiness, sadness and neutrality, we compute the Fisher criterion values for each of the head features. The Fisher criterion maximizes the between-class variability and minimizes the within-class variability [123]. According to this analysis, the most discriminating features are the head angles, especially pitch and yaw. This agrees with our intuition that tilting or lowering of the head often conveys affect.
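One common multi-class form of this per-feature Fisher ratio is sketched below; the inputs (a matrix of head features per utterance and a vector of emotion labels) and the class weighting are assumptions made for illustration.

    import numpy as np

    def fisher_criterion(X, y):
        # Per-feature Fisher ratio: weighted variance of the class means divided by
        # the weighted average within-class variance. X: (n_samples, n_features),
        # y: numpy array of class labels (e.g., angry / happy / sad / neutral).
        overall_mean = X.mean(axis=0)
        between = np.zeros(X.shape[1])
        within = np.zeros(X.shape[1])
        for c in np.unique(y):
            Xc = X[y == c]
            weight = len(Xc) / len(X)
            between += weight * (Xc.mean(axis=0) - overall_mean) ** 2
            within += weight * Xc.var(axis=0)
        return between / within   # higher values indicate more discriminative features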
4.1.4 Vocal Features

In this work we use a variety of speech features, such as the Mel Frequency Cepstral Coefficients (MFCC), which are commonly used in the speech recognition and affective computing communities, e.g., [25, 60], as well as the Mel Frequency Band (MFB) features. The MFB features were introduced in [125] for emotion recognition and are a variation of the MFCC features: they are computed like MFCCs but without the final Discrete Cosine Transform (DCT) step. We also extract pitch (F0) and energy (intensity) values, which have been shown to convey emotion-related information [68]. For our experiments, we extract 12 MFCC coefficients, 27 Mel Frequency Band (MFB) coefficients, pitch and energy values, along with their first derivatives. All the audio features are computed using the Praat software [126] and are normalized using z-standardization. In practice, the mean and standard deviation normalization constants are computed on the training set. For certain classification experiments, as described in Section 5.4.2, we also compute statistical functionals of these features over the total examined utterance, which is done using the OpenSmile Toolbox [127].
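For readers who want a quick approximation of this frame-level feature set without Praat or OpenSmile, the sketch below uses librosa as an assumed stand-in. The sampling rate, pitch range and frame alignment are illustrative choices and will not exactly reproduce the features used in this work.

    import numpy as np
    import librosa

    def vocal_features(wav_path):
        # Frame-level features: 12 MFCCs, 27 log mel filterbank (MFB-style)
        # coefficients, pitch and energy, plus first derivatives, z-standardized.
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
        mfb = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=27))
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)                   # rough pitch track
        energy = librosa.feature.rms(y=y)
        n = min(mfcc.shape[1], mfb.shape[1], len(f0), energy.shape[1])  # crude alignment
        feats = np.vstack([mfcc[:, :n], mfb[:, :n], f0[None, :n], energy[:, :n]])
        feats = np.vstack([feats, librosa.feature.delta(feats)])        # first derivatives
        # In practice the normalization constants are estimated on the training set.
        return (feats - feats.mean(axis=1, keepdims=True)) / feats.std(axis=1, keepdims=True)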
Body language is examined in the context of affective dyadic interactions, and our setup is generic: unlike many of the engineering works on body language, the examined subjects are not restricted to producing specific emotions or body gestures, but are encouraged to freely interact. We use the CreativeIT database [71], described in detail in Chapter 3, which contains full body MoCap data from multiple actors during dyadic improvisations. The marker placement is illustrated in Figures 4.2(a) and (b). The availability of direct motion information allows us to investigate in detail the potential role of body gestures and behaviors in emotional expression and interaction. It is interesting to note that many of the body language features described in this section could be approximated by using recently developed, less intrusive technologies, such as the Kinect sensor.

Figure 4.2: The positions of the Motion Capture markers (a)-(b), and definitions of the body parts used in feature extraction (c).

Figure 4.3: Examples of extracted features from MoCap markers: (a) hand positions, (b) relative movement, (c) looking towards or away, (d) leaning angles, (e) arm angles and description of arms crossed.

4.2.1 Body Language Feature Extraction

The features are extracted for each person, and they are either absolute descriptions of a person's posture and movement, or relative descriptions of his body behavior with respect to his interlocutor (in the latter case, data from both people are used for the feature extraction). In total, we examine 53 body language features, extracted at the MoCap framerate (60 fps) and smoothed using a median filter. This comprehensive feature set is summarized in Table 4.1, and may contain correlated or redundant features; decorrelated feature subsets will later be chosen through feature selection. Features are extracted in a geometrical manner from the positions of the MoCap markers, by defining global and local coordinate systems and measuring 3D distances, velocities and angles. The origin of the global coordinate system is roughly the center of the recording space, while local coordinate systems for each actor are defined using the four waist markers, as shown in Fig. 4.2(c). The positions of the various body parts are illustrated in Fig. 4.2(c). For example, one's center is defined as the average of the four waist markers. Certain features are particularly influenced by person-specific bodily characteristics. For example, the z-coordinates of a person's upper and lower back, which may reflect crouching and sitting, are influenced by the person's height. Therefore, features that are z-coordinate positions are normalized by dividing by the actor's median height in each recording. Additionally, features that are (x,y,z) positions of hands in one's coordinate system are normalized by dividing by the person's median arm length in each recording, measured by the median distance between the shoulder and hand markers. All normalized features are denoted as `norm' in Table 4.1.

Figure 4.3 illustrates some example features, and a sketch of this type of geometric computation is given below. For instance, as shown in Fig. 4.3(a), the position of one's center is measured in the global system to describe his location, while the positions of one's hands are measured in his local coordinate system to describe his hand gestures.
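As an illustration of this geometric feature extraction, the sketch below builds a waist-centered local coordinate frame and derives a normalized hand position and a leaning angle from it. The marker ordering, array layout and function names are hypothetical; the actual extraction in this work uses the full marker set of Fig. 4.2.

```python
import numpy as np

def local_frame(waist_markers):
    """Build a local coordinate frame from four 3D waist markers (4x3 array).
    Origin: their mean; x-axis: left-to-right waist direction; z-axis: global up."""
    origin = waist_markers.mean(axis=0)
    x_axis = waist_markers[1] - waist_markers[0]   # right minus left waist marker
    x_axis /= np.linalg.norm(x_axis)
    z_axis = np.array([0.0, 0.0, 1.0])             # keep the global vertical
    y_axis = np.cross(z_axis, x_axis)              # forward direction
    y_axis /= np.linalg.norm(y_axis)
    R = np.stack([x_axis, y_axis, z_axis])         # rows are the local axes
    return origin, R

def hand_in_local_frame(hand_marker, origin, R, arm_length):
    """Express a hand marker in the actor's local frame, normalized by arm length."""
    return R @ (hand_marker - origin) / arm_length

def leaning_angle(spine_vector):
    """Angle (degrees) between the spine direction and the vertical axis."""
    cos_a = spine_vector[2] / np.linalg.norm(spine_vector)
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
```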
An individual's absolute velocity is computed from the movement of his center, while relative velocity is computed by projecting the velocity vector onto the direction between the two participants (Fig. 4.3(b)). A description of one's looking behavior relative to his interlocutor is computed from the angle between the orientation of one's head coordinate system and the direction between the heads of the participants (Fig. 4.3(c)). A description of one's relative body orientation can be obtained similarly by looking at the people's waist coordinate systems instead. In Fig. 4.3(d), the angle between a person's spine and his local z-axis describes his leaning front/back behavior, while the angle between one's spine and the direction between the centers of the participants describes relative leaning behavior (towards/away). The angles of one's arms with his local x-axis describe hand position and indicate arms-crossing behavior (Fig. 4.3(e)).

Table 4.1: Body language features extracted from actor A during his interaction with actor B. Features are denoted as individual when they describe only A's movement and posture information, and as interaction features when they describe the relative movement and posture of A with respect to his interlocutor B. Norm indicates that the corresponding feature has been normalized per actor recording.

A's velocity (individual):
- A's velocity (see Fig. 4.3(b))
- velocity of A's right/left arm
- velocity of A's right/left foot
- relative velocity of A's right/left arm with respect to A
- relative velocity of A's right/left foot with respect to A

A's body posture (individual):
- A's body leaning angle front/back (see Fig. 4.3(d))
- A's body leaning angle right/left
- A's body position in the global coordinate system: x, y, norm z coordinates (see Fig. 4.3(a))
- A's right/left hand position in A's local coordinate system: norm x, y, z coordinates (see Fig. 4.3(a))
- distance between A's right and left hand
- angle of A's right/left hand with the x-axis in A's system (indicating arms crossed, see Fig. 4.3(e))
- head angle, looking up/down
- distance between A's right/left hand and A's chest
- distance between A's right/left hand and A's right/left hip
- angle between A's right and left hands
- norm z coordinate of A's right/left knee (indicating kneeling)
- z coordinate of A's right/left foot (indicating jumping)
- norm z coordinate of A's upper back (indicating upward vs crouched posture)
- norm z coordinate of A's lower back (indicating sitting down)

A's distance from B (interaction):
- A's distance from B
- min. distance between A's right/left hand and B's hands
- min. distance between A's right/left hand and B's torso
- min. distance between A's right/left hand and B's head
- min. distance between A's right/left hand and B's back

A's velocity with respect to B (interaction):
- A's relative velocity with respect to B (see Fig. 4.3(b))
- relative velocity of A's right/left hand with respect to B
- relative velocity of A's right/left foot with respect to B

A's orientation with respect to B (interaction):
- angle of A's face with respect to B (see Fig. 4.3(c))
- angle of A's body with respect to B (similar to Fig. 4.3(c), but for waist coordinate systems)
- A's leaning angle towards or away from B (see Fig. 4.3(d))
- position of A in B's coordinate system

4.2.2 Feature Selection Approaches

We examine a variety of feature selection approaches to select a subset of decorrelated, informative body language features, tailored to each emotional attribute. We study the three dimensional emotional attributes: activation, valence and dominance.
Continuous annotations of these attributes are obtained for the displayed emotion of each actor in each recording, according to the CreativeIT data annotation process described in Section 3.4. We examine the following feature selection criteria:

Mutual Information-based and Correlation-based criteria: The minimal redundancy maximal relevance criterion (mRMR), introduced in [136], selects features that maximize the mutual information (MI) between features and the ground truth, and minimize the MI between the selected features. Let $S_M = \{x_i\}_{i=1}^{M}$ be a set of $M$ continuous body language features, $y$ the continuous emotional attribute, and $I(\cdot;\cdot)$ represent MI. Then the mRMR measure is defined as:

$$\mathrm{mRMR}_I(i) = I(x_i;y) - \frac{1}{M-1} \sum_{x_j \in S_M,\, j \neq i} I(x_i;x_j) \qquad (4.1)$$

where

$$I(x_i;y) = \sum_{x_i \in X_i} \sum_{y \in Y} p(x_i,y) \log\left(\frac{p(x_i,y)}{p(x_i)\,p(y)}\right) \qquad (4.2)$$

Estimation of the probability distributions $p(x_i)$, $p(y)$, $p(x_i,x_j)$ and $p(x_i,y)$, which is required for computing the MI values, is performed through uniform quantization.

We also examine the selection of maximal relevance and minimal redundancy features based on correlations rather than MI values. Specifically, if we denote by $C(x_i,y)$ and $C(x_i,x_j)$ the linear (Pearson) correlations between a feature and the ground truth, and between two features, respectively, we can define the correlation-based metric as follows:

$$\mathrm{mRMR}_C(i) = C(x_i,y) - \frac{1}{M-1} \sum_{x_j \in S_M,\, j \neq i} C(x_i,x_j) \qquad (4.3)$$

Both approaches perform a ranking of features, where high values are preferred; a high value denotes that the feature shares much information, or has high correlation, with the ground truth, and shares little information, or has low correlation, with the other selected features.

Fisher Criterion: Alternatively, we select features that discriminate between regions of high, low and medium values of the emotional ground truth. Intuitively, these features reflect different body language behaviors across regions of different emotional content. Each attribute is quantized into three levels through k-means clustering, and the feature values that correspond to each level are collected. The Fisher criterion, denoted as F_value, is the ratio of the between-class variance to the within-class variance for a feature, and scores highly those features that achieve small within-class variability and large between-class variability [123]. While the previously described correlation and MI based methods favor the selection of feature sets with low redundancy, the Fisher criterion may lead to redundant feature sets. Therefore, we further reduce our feature set by excluding features such that no feature pair has a correlation higher than a threshold (here we empirically selected a threshold of 0.8). When choosing between two competing, highly correlated features, we pick the one with the largest F_value.
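A minimal sketch of these two selection criteria follows, assuming the body language features and the continuous attribute are NumPy arrays. The greedy forward-selection loop and the use of absolute Pearson correlations are simplifying assumptions made for illustration; they are not a verbatim description of the thesis implementation.

```python
import numpy as np

def mrmr_c_scores(X, y, selected):
    """Correlation-based mRMR score (in the spirit of Eq. 4.3) per candidate feature.
    X: (num_frames, num_features), y: (num_frames,), selected: chosen feature indices."""
    relevance = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
    scores = relevance.copy()
    if selected:
        for i in range(X.shape[1]):
            redund = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for j in selected])
            scores[i] = relevance[i] - redund
    return scores

def greedy_mrmr_c(X, y, num_features):
    """Greedy forward selection using the correlation-based mRMR criterion."""
    selected = []
    for _ in range(num_features):
        scores = mrmr_c_scores(X, y, selected)
        scores[selected] = -np.inf          # do not re-select features
        selected.append(int(np.argmax(scores)))
    return selected

def fisher_score(feature, labels):
    """F_value: between-class variance over within-class variance for one feature."""
    classes = np.unique(labels)
    overall_mean = feature.mean()
    between = sum((labels == c).sum() * (feature[labels == c].mean() - overall_mean) ** 2
                  for c in classes)
    within = sum(((feature[labels == c] - feature[labels == c].mean()) ** 2).sum()
                 for c in classes)
    return between / within if within > 0 else 0.0
```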
4.2.3 Analysis of Informative Body Language Features

In this section, we present and perform statistical analysis of the selected body language features, to provide insights about the body language gestures, movements and postures that are informative of the underlying emotional attributes. Our discussion focuses mostly on features selected according to the F_value criterion; however, similar observations can be made for the features selected according to mRMR_C, especially for activation and dominance.

In Tables 4.2, 4.3 and 4.4 we present the top-ranked 25 body language features for activation, valence and dominance, according to the F_value criterion. Detailed results of the mRMR_C criterion are omitted for lack of space; however, we include the mRMR_C-based rank next to each feature (notice the overlap between the features of the two criteria for activation and dominance, although not as much for valence). Each feature value represents a meaningful body posture. For performing statistical tests, we quantize each attribute value into 3 classes using the k-means algorithm, and collect the feature instances that correspond to the high and low classes over the total database. For each feature, we perform a t-test to compare the mean feature value between the low and high emotional attribute classes. We also include a description of the corresponding difference in body language that each feature value represents, always comparing high (or positive) versus low (or negative) attribute values. For example, the first line (feature of rank 1) of Table 4.2 can be interpreted as `more leaning towards the interlocutor when the subject is characterized by high activation vs no leaning when the subject is characterized by low activation'. All feature mean differences are statistically significant, although in some cases the mean differences are so small that they do not correspond to a recognizable difference in body language (e.g., see feature 8 of Table 4.2, or feature 3 of Table 4.4).

As seen in Table 4.2 for activation, many of the selected features describe absolute velocities, relative body orientation and leaning, posture and hand gestures. Highly activated subjects generally display higher arm and foot velocities (features 4, 20, 21), more leaning and body orientation towards the interlocutor (features 1, 5), and more front leaning (feature 9), among others. Many selected features also describe hand gestures; for example, hands tend to be further from the body (features 3, 6, 7, 10, 19), further from each other (features 12, 22), and raised higher (features 24, 25) for highly activated subjects. Also, body location in (x,y) coordinates reflects a tendency of activated participants to be at the center of the recording space (features 11, 13).

For the dominance task, according to Table 4.3, many of the selected features are common with the activation features; however, we notice a preference for features describing relative behaviors like velocity, leaning and orientation. For example, dominant individuals tend to lean and orient their body more towards the interlocutor (features 1, 4), and move their body, arms and feet more towards the interlocutor (features 8, 17, 20, 22, 24). This seems intuitive, since dominance essentially captures relative (interaction) behavior. Also, dominant subjects tend to touch the interlocutor (feature 10), which brings to mind psychological observations relating touching with dominant behavior [70].

Finally, for the valence task, some features from Table 4.4 stand out. For instance, positively valenced subjects tend to place their hands on their chest (features 22, 23), or touch the interlocutor's hand (feature 15), which seem to be intuitive bodily expressions of valence. Positively valenced subjects also tend to look more towards and move towards the interlocutor (features 9, 21), and move their arms and feet more (features 2, 13, 14). Also, the combination of more leaning towards others (feature 20) but less front leaning (feature 19) for positive valence indicates that positively valenced subjects tend to lean more towards the interlocutor, while negatively valenced subjects generally have a more slouched posture.
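A minimal sketch of this per-feature test follows, assuming the frame-level feature values and the continuous attribute annotation are NumPy arrays; the k-means quantization into three levels follows the procedure of Section 4.2.2, and scikit-learn/SciPy are used here purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import KMeans

def high_low_ttest(feature, attribute, random_state=0):
    """Quantize a continuous attribute into 3 levels with k-means, then compare
    the feature's mean between the high and low levels with a two-sample t-test."""
    km = KMeans(n_clusters=3, n_init=10, random_state=random_state)
    labels = km.fit_predict(attribute.reshape(-1, 1))
    order = np.argsort(km.cluster_centers_.ravel())   # low, medium, high clusters
    low, high = order[0], order[2]
    t_stat, p_value = ttest_ind(feature[labels == high], feature[labels == low],
                                equal_var=False)
    return t_stat, p_value
```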
Some of the above affective body language behaviors agree with the literature, for example arms being far from the body for high activation, or increased body motion for activated emotions such as anger ([135], Table 2). However, direct comparisons are hard to make, since most past works on body language examine pre-defined categorical emotional states rather than continuous emotional attributes. Other aspects that differentiate this work from the literature include examining dominant behaviors, which are generally less discussed, as well as the focus on interaction aspects of body language through the introduction of `relative' body features. Overall, the analysis of our body language features offers quantitative insights on the relations between an underlying emotional state and the displayed bodily behavior in the context of dyadic interaction, and highlights emotionally informative body behaviors, including approach/avoidance behaviors, hand gestures and body orientation. This enables us to draw connections with psychological observations regarding body language and emotion.

Table 4.2: Statistical analysis of the top 25 activation features, according to the F_value criterion (each feature's rank according to the mRMR_C criterion is included in the second column). The feature descriptions under the statistical tests column describe high activation behavior compared to low activation behavior of a subject A. The statistical test performed is a difference of means of the feature values between the high and low activation classes (t-test); all differences are significant (p ≈ 0).

Activation: comparison of high vs low activation classes (F_value rank | mRMR_C rank | feature | result of statistical test)
1 | 1 | A's body lean towards/away from B | more lean towards vs no leaning
2 | 9 | norm x coord of A's right hand in A's system | x coord higher (further from body towards right, see also Fig. 4.3(a))
3 | 7 | distance of A's left hand from A's hip | greater distance
4 | 6 | abs velocity of A's right arm | higher velocity
5 | 22 | relative angle of A's body towards B | body orientation more towards B vs sideways
6 | 27 | norm y coord of A's left hand in A's system | y coord higher (further from body towards front, see also Fig. 4.3(a))
7 | 12 | distance of A's right hand from A's hip | greater distance
8 | 2 | A's body leaning angle, left/right | slightly lean right vs straight (though angle in both cases is close to zero)
9 | 4 | A's body leaning angle, front/back | more lean front vs no leaning
10 | 28 | norm y coord of A's right hand in A's system | y coord higher (further from body towards front, see also Fig. 4.3(a))
11 | 3 | x coord of A's center | x abs value lower (x more towards center (0,0,0) of the recording space)
12 | 34 | distance between A's right and left hand | hands wider apart
13 | 11 | y coord of A's center | y abs value lower (y more towards center (0,0,0) of the recording space)
14 | 14 | norm z coord of A's upper back | higher, more upward posture, also indicates less sitting
15 | 48 | distance between A's right hand and B's back | smaller, more touching, could indicate hugging depending on the interlocutors' orientation
16 | 23 | norm z coord of A's left knee | lower, may indicate kneeling
17 | 8 | A's head angle, up/down | more straight vs more downwards
18 | 38 | angle of A's right hand with x-axis in A's system | hand more in front vs slightly towards left (see also Fig. 4.3(e))
19 | 37 | norm x coord of A's left hand in A's system | x coord lower (further from body towards left, see also Fig. 4.3(a))
20 | 18 | abs velocity of A's right foot | higher velocity
21 | 17 | abs velocity of A's left foot | higher velocity
22 | 20 | angle between A's hands | hands wider apart
23 | 40 | distance between A's left hand and A's chest | bigger distance, hand further from chest
24 | 41 | norm z coord of A's right hand in A's system | hand is higher, indicates raised hand (see also Fig. 4.3(a))
25 | 42 | norm z coord of A's left hand in A's system | hand is higher, indicates raised hand (see also Fig. 4.3(a))

Table 4.3: Statistical analysis of the top 25 dominance features, according to the F_value criterion (each feature's rank according to the mRMR_C criterion is included in the second column). The feature descriptions under the statistical tests column describe high dominance behavior compared to low dominance behavior of a subject A. The statistical test performed is a difference of means of the feature values between the high and low dominance classes (t-test); all differences are significant (p ≈ 0).

Dominance: comparison of high vs low dominance classes (F_value rank | mRMR_C rank | feature | result of statistical test)
1 | 7 | relative angle of A's body towards/away from B | body orientation more towards other vs sideways
2 | 1 | A's head angle, up/down | more straight vs more downwards
3 | 6 | norm z coord of A's center | higher, indicates less sitting
4 | 3 | A's body leaning angle towards/away from B | more lean towards vs no leaning
5 | 13 | distance of A's left hand from A's hip | greater distance, hand further away from hip
6 | 2 | z coord of A's right foot | lower
7 | 17 | norm x coord of A's right hand in A's system | x coord higher (further from body towards right, see also Fig. 4.3(a))
8 | 12 | relative velocity of A towards/away from B | move more towards vs away
9 | 24 | distance of A's right hand from A's hip | greater distance, further from hip
10 | 48 | min dist between A's left hand and B's torso | smaller, indicates more touching
11 | 38 | norm z coord of A's right hand in A's system | hand is lower
12 | 29 | distance between A's hands | hands wider apart
13 | 10 | A's body leaning angle, left/right | more lean right (though angle in both cases is close to zero)
14 | 28 | norm x coord of A's left hand in A's system | x coord lower (further from body towards left, see also Fig. 4.3(a))
15 | 5 | y coord of A's center | y abs value lower (y more towards center (0,0,0) of the recording space)
16 | 36 | angle of A's right hand with x-axis in A's system | hand more in front vs slightly towards left (see also Fig. 4.3(e))
17 | 14 | relative velocity of A's right hand towards/away from B | move more towards vs away
18 | 45 | norm z coord of A's left hand in A's system | hand is lower
19 | 5 | A's body leaning angle, front/back | more lean front vs slightly less lean front
20 | 21 | relative velocity of A's left hand towards/away from B | move more towards vs away
21 | 37 | distance between A's right hand and A's chest | greater, hand further from chest
22 | 15 | relative velocity of A's right foot towards/away from B | move more towards vs away
23 | 11 | norm z coord of A's upper back | higher, more upward posture, also indicates less sitting
24 | 16 | relative velocity of A's left foot towards/away from B | move more towards vs away
25 | 9 | x coord of A's center | x abs value higher (x further from center (0,0,0) of the recording space)

Table 4.4: Statistical analysis of the top 25 valence features, according to the F_value criterion (each feature's rank according to the mRMR_C criterion is included in the second column).
The feature descriptions under the statistical tests column describe positive valence behavior compared to negative valence behavior of a subject A. The statistical test performed is a difference of means of the feature values between the positive and negative valence classes (t-test); all differences are significant (p ≈ 0).

Valence: comparison of positive vs negative valence classes (F_value rank | mRMR_C rank | feature | result of statistical test)
1 | 33 | norm z coord of A's lower back | higher, indicates less sitting
2 | 42 | abs velocity of A's right arm | higher velocity
3 | 15 | A's head angle, up/down | slightly more downwards vs straight (though the two angles are almost the same)
4 | 34 | distance between A's hands | hands closer together
5 | 41 | distance of A's left hand from A's hip | greater distance, further from hip
6 | 36 | distance of A's right hand from A's hip | greater distance, further from hip
7 | 28 | norm x coord of A's left hand in A's system | x coord higher (closer to body towards right, see also Fig. 4.3(a))
8 | 20 | norm z coord of A's upper back | lower, less upward posture
9 | 11 | relative angle of A's face towards B | face orientation more towards other
10 | 31 | norm x coord of A's right hand in A's system | x coord lower (closer to body towards left, see also Fig. 4.3(a))
11 | 30 | angle of A's left hand with x-axis in A's system | left hand more towards front rather than left (see also Fig. 4.3(e))
12 | 27 | norm y coord of A's right hand in A's system | y coord higher (further from body towards front)
13 | 21 | abs velocity of A's right foot | higher velocity
14 | 23 | abs velocity of A's left foot | higher velocity
15 | 43 | distance between A's right hand and B's hand | lower, indicates more touching of B's hand
16 | 3 | norm z coord of A's left knee | higher, indicates less kneeling
17 | 4 | A's direction relative to B | slightly more towards right-front of B vs more in front
18 | 29 | norm y coord of A's left hand in A's system | y coord higher (further from body towards front)
19 | 19 | A's body leaning angle, front/back | less leaning front vs more leaning front, indicates less slouched posture
20 | 24 | A's body leaning angle, towards/away from B | more leaning towards vs less leaning towards
21 | 12 | relative velocity of A towards/away from B | more moving towards vs moving away
22 | 37 | distance between A's right hand and A's chest | lower, indicates hand touching chest
23 | 38 | distance between A's left hand and A's chest | lower, indicates hand touching chest
24 | 30 | angle of A's left hand with x-axis in A's system | hand more towards front vs towards right
25 | 1 | y coord of A's center | y abs value bigger (y further from center (0,0,0) of the recording space)

Chapter 5: Context-Sensitive, Hierarchical Approaches for Emotion Recognition

5.1 Context and Emotions

Human emotional expression is a complex process where a variety of multimodal cues interact to create an emotional display. Furthermore, emotions usually vary slowly during a conversation, and the perception of an emotional display is affected, among other factors, by recently perceived past emotional displays, which place the expressed emotion into context. Taking such contextual information into account may prove advantageous for real-life automatic emotion recognition systems that must process a great variety of complex, vague or ambiguous emotional displays. The focus of this chapter is to investigate learning frameworks for automatic, multimodal emotion recognition that exploit information about the structure of the past and future evolution of an emotional interaction.
The study also considers the flow of emotional expression by examining emotional transitions in a variety of improvised affective interactions. The work presented in this chapter has been published in [26, 27, 31].

Psychology research suggests that human perception of emotion is relative and that emotional understanding is influenced by context. Context in human communication can broadly refer to linguistic structural information, discourse information, the cultural background and gender of the participants, knowledge of the general setting in which an emotional interaction is taking place, etc. For instance, the psychology literature indicates that facial information viewed in isolation might not be sufficient to disambiguate the expressed emotion, and humans tend to use context, such as past visual information [15], general situational understanding [16], past verbal information [17] and cultural background [18], to make an emotional decision. Also, emotions are expressed through the interplay of multiple complementary, supplementary and even conflicting modalities (facial gestures, prosodic information, lexical content), and therefore such multimodal cues provide context for each other [45]. For example, discordance in the emotions expressed by the facial and vocal modalities degrades a subject's ability to correctly identify the emotion expressed by face or voice separately [137]. Furthermore, emotions are usually slowly varying states, typically lasting from under a minute to a few minutes [19]. Therefore an emotion may span several consecutive utterances of a conversation, and emotional transitions are usually smooth. For example, it seems reasonable to assume that an angry utterance is more likely to be succeeded by one displaying anger rather than happiness.

Context awareness is recognized as an important element in human-computer interfaces and can be broadly defined as an understanding of the location and identity of the user as well as the type and timing of the human-computer interaction [138]. In the emotion recognition literature, relatively few works make use of contextual information, and those that do generally use diverse context definitions. In [21] the authors propose a unimodal framework for short-term context modeling in dyadic interactions, where speech cues from the past utterance of the speaker and his interlocutor are taken into account during emotion recognition. In [22] lexical and dialog act features are used in addition to acoustic (segmental/prosodic) features. In [23] the authors make use of prosodic, lexical and dialog act features from the past two turns of a speaker for recognizing the speaker's current emotional state. In [20] the author describes a framework for building a tutoring virtual agent, where the tutor's behavior takes into account the student's recent emotional state as well as a variety of contextual variables such as the student's personality and the tutor's goal. In [139] and [140] the authors use the formalization of domain ontologies to describe generic frameworks for defining relations between emotion and context concepts such as environment, physiological cues, cultural information, etc. In our previous work, we have used neural network architectures such as Bidirectional Long Short-Term Memory (BLSTM) neural networks that take into account an arbitrary amount of past and future audio-visual emotional expressions to recognize the current emotion of a speaker [141].
In this work, we define context to be information about the emotional content of audio-visual displays that happen before or after the observation that we examine. We focus on emotion recognition at the utterance level. Here an utterance is loosely defined as a chunk of speech where the speaker utters a thought or idea. The phrases that we examine have been manually segmented from longer dyadic conversations and usually last a few seconds. In addition to the current utterance's audio-visual cues, we exploit information from an arbitrary number of neighboring utterances, ranging from one past or future utterance to all the utterances of the conversation. Apart from this definition of context, which is our primary focus, we could also interpret the use of audio-visual cues as another form of context, where the interplay within the multimodal streams provides context for one another and offers a fuller picture of the expressed emotion.

We investigate three alternative multimodal and hierarchical schemes for incorporating contextual information in emotion recognition systems, by modeling emotional evolution at two levels: within an emotional utterance and between emotional utterances of a conversation. Specifically, we examine the use of hierarchical Hidden Markov Model (HMM) classifiers [142], of Recurrent Neural Networks (RNNs), and specifically BLSTM neural networks [29, 143], as well as the use of a hybrid BLSTM/HMM approach. The HMM-based classification is inspired by the Automatic Speech Recognition (ASR) literature, where algorithms exploit context at multiple levels within a Markov model structure: from phonetic details, including coarticulation in speech production, to word transitions reflecting language-based statistics [24, 25]. We hypothesize that similar within- and across-model transitions can be advantageously used to capture the dynamics in the evolution of emotional states, including within and across emotional categories. Alternatively, RNN architectures are a powerful, discriminative framework that enables modeling the emotional flow of a conversation without making Markov assumptions about emotional transitions. Here, we apply BLSTM neural networks, which overcome the vanishing gradient problem of conventional RNNs and are able to learn from an arbitrarily large amount of past and future contextual information [141].

For our experiments we use a large multimodal and multisubject database of dyadic interactions between actors, namely the IEMOCAP database [93] (also described in Section 4.1.1), which contains detailed facial information, obtained from facial Motion Capture (MoCap), as well as speech information. The IEMOCAP database consists of dyadic conversations that are elicited so as to contain emotional manifestations that are non-prototypical and resemble real-life emotional expression. Our goal is to obtain a realistic assessment of emotion recognition performance, when our system is required to make a decision about the emotional content of all possible input utterances, including those containing subtle or ambiguous emotions.

We focus on the recognition of dimensional emotional descriptions, i.e., valence and activation levels, instead of categorical emotional tags, such as `anger' or `happiness'. We derive a dimensional label for all available utterances by averaging the decisions of multiple annotators.
In addition to classifying the degree of valence and activation separately, we also investigate their joint modeling by classifying among clusters in the two-dimensional valence-activation space [91]. Our analysis of the relation between dimensional attributes/clusters and categorical labels indicates that the classification tasks are interpretable in terms of categorical emotional content and allow us to have a meaningful description of the emotion of an utterance. Modeling of emotional transitions between utterances of an interaction could be formulated equivalently using the concept of probabilistic emotional grammars, which could inform us about the structure of emotional evolution during affective conversations.

Our experimental results show that incorporating temporal context in emotion classification systems generally leads to improvement in average performance for our classification tasks, except for the case of activation. For most of our context-sensitive classifiers we consistently observe an increase in performance compared to classifiers that do not take context into account. Such improvements are statistically significant for the valence and the three-cluster classification tasks, for classifiers such as the hierarchical HMM and the hybrid HMM/BLSTM. These results suggest that context-sensitive approaches could pave the way for better performing emotion recognition systems.

5.2 Context-Sensitive Frameworks

5.2.1 Hierarchical Context-Sensitive Frameworks

Our problem can be posed as a two-level modeling problem of an emotional conversation. At the higher level, an emotional conversation is modeled as a sequence of emotional utterances, while at the lower level, each such utterance is modeled as a sequence of audiovisual observations. We assume that an emotional utterance can be described by a single emotional label, e.g., a single level of activation, valence or a single cluster in the valence-activation space. However, an emotional conversation may contain arbitrary emotional transitions between utterances and may consist of a variety of emotional manifestations. Therefore utterance modeling captures the dynamics within emotional categories, while conversation modeling captures the dynamics across emotional categories. Let us denote a conversation $C$ as a sequence of utterances $U_t$, $t = 1,\dots,T$: $C = U_1, U_2, \dots, U_T$, and an utterance $U_t$ as a sequence of low-level observations (frames) $O_{tj}$, $j = 1,\dots,\tau_t$: $U_t = O_{t1}, O_{t2}, \dots, O_{t\tau_t}$, where $\tau_t$ is the number of frames in the utterance.

In Figure 5.1 we present a summary of our approaches for modeling utterance and conversation dynamics. At the utterance level, we examine dynamic modeling by using fully-connected HMMs, which may capture feature statistics and underlying emotional characteristics in the audio-visual feature streams. The intuition for using fully-connected HMMs is that there is no apparent left-to-right property in the dynamic evolution of facial or vocal characteristics during emotional expression (as opposed to the evolution of phonemes during speech, which is exploited in phoneme-specific left-to-right HMMs in ASR). The use of coupled- instead of simple multistream HMMs enables us to model asynchrony between the audio-visual streams.
Figure 5.1: A summary of our classification systems under the proposed hierarchical, context-sensitive framework. At the lower (utterance) level, modeling of emotional utterances $U_t$ is performed through emotion-specific HMMs, as illustrated in the lower left part of the figure, or by computing statistical properties of each emotional class, as illustrated in the lower right part. At the higher level, which represents the conversation context, the emotional flow between utterances of a conversation $C$ is modeled by an HMM or a Neural Network (unidirectional or bidirectional RNN or BLSTM). The different combinations of the approaches at the lower and higher levels lead to the three systems that we describe in this work: 2-level HMM, neural networks (NNs) trained with feature functionals, and hybrid HMM/NN.

Alternatively, we model the emotional utterance by estimating static, utterance-level statistical features, through the use of statistical functionals over the low-level frame sequence. Such an approach implicitly captures some of the observation dynamics while making fewer modeling assumptions compared to the HMM (no Markovian property, conditional independence or synchronicity assumptions on the underlying audio-visual sequences). At the dialog level, we examine the use of HMM and discriminatively trained neural network classifiers (RNN, BLSTM). The latter make fewer assumptions on the underlying sequence of emotional utterances and may potentially capture more complex patterns of emotional flow.

Our approach of combining first and second layer HMMs for dialog and utterance modeling leads to a two-level structure, along the lines of multi-level HMMs [144] and Hierarchical HMMs [142]. Alternatively, we examine the performance of discriminatively trained neural network classifiers for conversational modeling when statistical functionals are extracted at the utterance level. In the hybrid HMM/BLSTM approach, we keep the probabilistic dynamic modeling of HMMs at the utterance level and use the emotion-specific HMM log-likelihoods to form an utterance-level feature vector, which is the input of the BLSTM at the conversation level. In the next sections we elaborate on these three approaches.

5.2.2 Hierarchical HMM Classifiers

The state of the art in sequence modeling, such as in ASR, utilizes the Markov chain framework to model temporal sequence context, be it acoustic feature dependencies, phonetic symbol structure, local word structure or even dialog states [24]. Here, we adopt a two-level HMM structure for modeling the sequence of audio-visual observations at both the utterance and the conversation level (context).

At the utterance level, the HMM classifiers are fully-connected 3-state models, trained separately for each of $N$ emotional categories. Thus we have $N$ models $\lambda_i$, $i = 1,\dots,N$, denoted in Fig. 5.1 as emo$_i$. At the conversation level we have one fully-connected HMM with $N$ states modeling the $N$ emotional categories. For each testing sequence of utterances, we estimate the most likely emotional category $i$ for the current utterance $U_t$ at time $t$ by finding the HMM $\lambda_i$ with the maximum likelihood (score) $P(U_t|\lambda_i)$. These scores are utilized by the higher-level HMM to represent the observation probabilities of the hidden emotional states, while the transition probabilities between emotional states $P(j|i)$ can be computed from the training set. The most probable sequence of emotional categories $Q = q_1, q_2, \dots, q_T$ for an observed conversation $C = U_1, U_2, \dots, U_T$ can be found using the Viterbi decoding algorithm over the conversation utterances, as sketched below.
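A minimal sketch of this conversation-level decoding follows, assuming the per-utterance log-likelihoods $\log P(U_t|\lambda_i)$, the transition matrix and the initial-state probabilities are already available as NumPy arrays in the log domain; the function name is illustrative.

```python
import numpy as np

def viterbi_decode(log_lik, log_trans, log_prior):
    """Most probable emotion sequence for one conversation.
    log_lik:   (T, N) log P(U_t | lambda_i) from the utterance-level HMMs
    log_trans: (N, N) log P(j | i) estimated by counting transitions in the train set
    log_prior: (N,)   log initial state probabilities"""
    T, N = log_lik.shape
    delta = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = log_prior + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (N, N): from state i to state j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                      # backtrack
        path[t] = backptr[t + 1, path[t + 1]]
    return path
```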
Therefore, when we make a decision about the emotional content of the current utterance, we take into account all past and future utterances of the utterance sequence.

We have also investigated the effect of a variable length of bidirectional context, which could range from one past or future utterance to all the utterances in the conversation. For this purpose, we have applied a modified Viterbi decoding over shorter sequential windows, which could also be useful in real-time scenarios. Real-time systems may not afford to wait for the whole conversation to end before making a decision about the emotional content of the utterances of the conversation. More specifically, we perform Viterbi decoding in overlapping windows of length $w+1$ utterances that scan the whole sequence. The score $s_t^i$ of each emotional class $i$, $i = 1,\dots,N$, in the Viterbi trellis is initialized by the likelihood $P(U_t|\lambda_i)$. Then the sequence of scores is sequentially updated: we take into account the decisions of the previous Viterbi passes by incorporating them as weights $c_t^i$ in the current Viterbi pass. The most probable emotion according to the previous pass gets the highest weight. An utterance $U_t$ is updated by $w+1$ consecutive Viterbi passes, starting when the moving window begins at time $t-w$ and ends at $t$, and continuing until the moving window has reached time $t$ as its starting point, as illustrated in Figure 5.2. The score $s_t^i$ of each emotional class $i$ is finalized after the window has moved past time $t$, and through this update process $w$ utterances to the left and right of $U_t$ have been taken into account when making a decision about the label of $U_t$.

Figure 5.2: Sequential Viterbi decoding passes. Viterbi decoding is performed in sequential subsequences (of length $w+1$) of the total utterance observation sequence. The labeling decision for utterance $U_t$ at time $t$ is affected by the labeling decisions of $w$ past and $w$ future utterances.

The algorithm is described in Box 1, and the update function works as follows:

$$s_t^i \leftarrow \log P(U_t|\lambda_i), \qquad s_t^i \leftarrow s_t^i + c_t^i$$

$$c_t^i = \begin{cases} \log(a) & \text{if } q_t = i \text{ from the previous pass} \\ \log(b) & \text{otherwise} \end{cases}$$

where $b < a$ and $\sum_{i=1}^{N} c_t^i = 1$, for $t$ in the window.

Box 1: Sequential Viterbi Decoding (seqVD)
1: place a window of length w+1 at the beginning of the sequence of utterances
2: repeat
3:   current utterances <- utterances[window]
4:   output sequence[window] <- VD(current utterances)
5:   update(current utterances) using output sequence[window]
6:   utterances[window] <- current utterances
7:   shift window forward one utterance
8: until the end of all the utterances

The sequential Viterbi algorithm of Box 1 contains the parameter $a$, which is the weight from the previous pass, and the parameter $w$, which is the window size. In our experiments these parameters are optimized across all folds using the Nelder-Mead algorithm [145]. This sequential Viterbi decoding algorithm shares some similarities with the Viterbi decoding method proposed in [146]. However, in contrast to [146], where the sequential Viterbi passes would fix the decision of the initial observation within the current window, in our approach a Viterbi pass just places a weight on an observation utterance, to be used for the next pass.
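The windowed scheme of Box 1 can be sketched as follows, reusing the viterbi_decode function from the previous sketch. The weighting and window bookkeeping are simplified for illustration and do not reproduce the thesis implementation exactly; a and b are the free parameters of the update rule (b < a).

```python
import numpy as np

def sequential_viterbi(log_lik, log_trans, log_prior, w, a, b):
    """Windowed sequential Viterbi decoding (simplified sketch of Box 1).
    log_lik: (T, N) utterance log-likelihoods; w: context width; b < a are weights."""
    T, N = log_lik.shape
    scores = log_lik.copy()                 # s_t^i initialized to log P(U_t | lambda_i)
    labels = np.zeros(T, dtype=int)
    for start in range(T - w):              # slide an overlapping window of length w+1
        window = slice(start, start + w + 1)
        path = viterbi_decode(scores[window], log_trans, log_prior)
        labels[window] = path
        # re-weight the scores inside the window using the decisions of this pass,
        # so that they influence the next (shifted) pass
        weights = np.full((w + 1, N), np.log(b))
        weights[np.arange(w + 1), path] = np.log(a)
        scores[window] = scores[window] + weights
    return labels
```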
This can be viewed as a higher-level HMM, where hidden states correspond to emotions and the transitions describe the emotional evolution between utterances in a conversation (temporal context).

5.2.3 BLSTM and RNN Architectures

Classifiers such as neural networks are able to capture a certain amount of context by using cyclic connections. These so-called recurrent neural networks can in principle map from the entire history of previous inputs to each output. Yet, analysis of the error flow in conventional recurrent neural nets led to the finding that long-range context is inaccessible to standard RNNs, since the backpropagated error either blows up or decays over time (the vanishing gradient problem [147]). One effective technique to address the problem of vanishing gradients for RNNs is the Long Short-Term Memory architecture [28], which is able to store information in linear memory cells over a longer period of time. LSTM networks are able to overcome the vanishing gradient problem and can learn the optimal amount of contextual information relevant for the classification task. Thus, LSTM architectures seem well-suited for modeling context between successive utterances for enhanced emotion recognition.

Figure 5.3: LSTM memory block consisting of one memory cell: the input, output, and forget gates collect activations from inside and outside the block, which control the cell through multiplicative units (depicted as small circles); input, output, and forget gates scale input, output, and internal states respectively; a_i and a_o denote activation functions; the recurrent connection of fixed weight 1.0 maintains the internal state.

An LSTM layer is composed of recurrently connected memory blocks, each of which contains one or more memory cells, along with three multiplicative `gate' units: the input, output, and forget gates. The gates perform functions analogous to read, write, and reset operations. More specifically, the cell input is multiplied by the activation of the input gate, the cell output by that of the output gate, and the previous cell values by the forget gate (see Figure 5.3). The overall effect is to allow the network to store and retrieve information over long periods of time. For example, as long as the input gate remains closed, the activation of the cell will not be overwritten by new inputs and can therefore be made available to the net much later in the sequence by opening the output gate.

Another problem with standard RNNs is that they have access to past but not to future context. This can be overcome by using bidirectional RNNs [148], where two separate recurrent hidden layers scan the input sequences in opposite directions. The two hidden layers are connected to the same output layer, which therefore has access to context information in both directions. The amount of context information that the network actually uses is learned during training, and does not have to be specified beforehand. Figure 5.4 shows the structure of a simple bidirectional network. Combining bidirectional networks with LSTM gives bidirectional LSTM (BLSTM) networks [141], which have been successfully used in various pattern recognition applications such as phoneme recognition [149] and emotion recognition from speech [150].

Figure 5.4: Structure of a bidirectional network with input i, output o, and two hidden layers (h_f and h_b) for forward and backward processing.
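The networks in this work are trained with the RNNLib toolbox (see below); purely as an illustration of the architecture just described, the following is a minimal PyTorch sketch of a bidirectional LSTM sequence classifier, with layer sizes chosen arbitrarily rather than matching the thesis configuration.

```python
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    """Bidirectional LSTM over a sequence of utterance-level feature vectors;
    one emotion-class prediction per utterance (sequence position)."""
    def __init__(self, input_dim, num_classes, hidden_dim=128):
        super().__init__()
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        # forward and backward hidden states are concatenated at each time step
        self.output = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_utterances, input_dim)
        h, _ = self.blstm(x)
        return self.output(h)        # (batch, num_utterances, num_classes)

# Usage sketch: logits = BLSTMClassifier(input_dim=60, num_classes=3)(features)
```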
In this work we use unidirectional and bidirectional LSTM networks trained with utterance-level features, specifically statistical functionals of audio-visual features. The LSTM networks consist of 128 memory blocks with one memory cell per block, while the BLSTM networks consist of two LSTM layers with 128 memory blocks per input direction. The input layer has the same dimensionality as the feature vector, and the output layer has the same dimensionality as the number of emotional classes we want to classify. For training we used the RNNLib toolbox, which is available for download at [151].

5.2.4 Combination of HMM and BLSTM Classifiers

We examine a combination of the HMM and BLSTM classifiers that takes advantage of both the explicit dynamic utterance modeling of the HMM framework and the ability of the BLSTM to learn an arbitrarily long amount of bidirectional context. We utilize the BLSTM network as a second layer of computation over the HMM classifiers, as an alternative to Viterbi decoding. This combination has the advantages of a two-layer classification structure; therefore there is transparency as to the performance improvement we can gain from context modeling. Furthermore, the HMM+BLSTM combination may potentially capture more complex structure in the underlying emotional flow than that captured by an HMM.

In our implementation, we collect the log-likelihoods $\log P(U_t|\lambda_i)$ for each utterance $U_t$, $t = 1,\dots,T$, generated by the emotion-specific HMM models $\lambda_i$, $i = 1,\dots,N$, and we create an N-dimensional, utterance-level feature vector of log-likelihoods. This is used as the input to the higher-level BLSTM, as illustrated in method (3) of Figure 5.1. Therefore at each time $t$ the BLSTM has as input a feature vector containing the log-likelihoods of each emotional category at that time. The BLSTMs are trained using the output log-likelihoods produced on the training set utterances.

5.3 Emotions and Emotion Transitions

The database used in this work is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which is described in detail in Section 4.1.1. In our experiments we focus on the dimensional emotional representations, in terms of valence and activation. The dimensional tags seem to provide more general descriptions of an emotional expression. For example, a large part of the utterances of the database (approx. 17%) do not have a categorical label with a majority-vote agreement of the evaluators. These utterances may be displaying subtle or ambiguous emotions that are hard to classify with a single categorical label. However, annotators seem to have an acceptable level of agreement on the dimensional labels of such utterances. An analysis of the IEMOCAP dimensional labels shows that, for the data labeled by two evaluators, in 94% of the utterances the evaluators agreed or were one point apart in their rating of valence, and in 85% of the utterances they agreed or were one point apart in their rating of activation. Moreover, the use of dimensional labeling seems particularly suitable for analyzing temporal emotional context, which is the main focus of this work, since it enables us to have a label for every utterance in the sequence of consecutive utterances that make up a conversation. Here, this label is derived by averaging the decisions of the annotators.
Considering categorical labels would introduce `gaps' in our observation label sequence for utterances with no evaluator agreement or utterances that are labeled with rarely occurring emotions, like fear and disgust. Training emotional models for rare emotions would not be possible because of lack of data, while the treatment of a `no agreement' class does not seem straightforward. Although both categorical and dimensional representations have their merits, in the context of this work we focus our analysis on the dimensional representation, and we use the categorical labels to validate and interpret our results, as explained in the following sections.

5.3.1 Valence and Activation

The first emotion classification task that we consider in this work consists of the classification of three levels of valence and activation: level 1 contains ratings in the range [1,2], level 2 contains ratings in the range (2,4) and level 3 contains ratings in the range [4,5]. These levels intuitively correspond to low, medium and high activation, and to negative, neutral and positive valence, respectively. The class sizes are not balanced, since medium values of labels are more common than extreme values. The choice of three levels, instead of the five which was the initial resolution of the valence and activation ratings, was made to ensure that a sufficient amount of data is available in each class for emotional model training. For example, for activation, 2% of the averaged annotator labels (87 utterances) have a rating less than or equal to 1.5, which corresponds to the cases of very low activation. When we use the 3-level scale, the low activation instances are 11% of the data, or a total of 557 utterances.

Figure 5.5: Analysis of activation (a) and valence (b) classes in terms of categorical labels for all utterances of the database: anger (ang), happiness (hap), excitement (exc), sadness (sad), frustration (fru), fear (fea), surprise (sur), disgust (dis), neutral (neu), other (oth), and no agreement (n.a.). We notice that the categorical tags are generally consistent with the activation and valence tags.

The emotional information provided by the dimensional tags is still related to categorical emotions in a meaningful way. To examine this, we analyzed the available categorical tags of the utterances that belong to each of the dimensional classes. In Figure 5.5 we show how the utterances of each class break down into categorical tags. Specifically, the categorical tags that we consider in the IEMOCAP corpus are: Angry, Happy, Excited, Sad, Frustrated, Fearful, Surprised, Disgusted, Neutral, Other and No Agreement (n.a.). The annotators of the categorical and dimensional tags of an utterance are usually not the same. Specifically, in Figure 5.5(a) we show how many utterances from each activation class fall into each categorical emotional tag. Similarly, we construct the bar graph for valence, which is presented in Figure 5.5(b). For both valence and activation, the categorical labels generally agree with the dimensional tags, according to what is known in the emotion literature about the position of categorical emotions in the valence-activation space [152]. Overall, the resulting bar graphs are intuitive.
For example, in the valence plot in Figure 5.5, we notice that utterances that are annotated as having negative valence are also generally perceived to express `negative' emotions such as anger, sadness, and frustration, while utterances with positive valence are generally perceived to express `positive' emotions, such as happiness and excitement. An interesting observation can be made regarding frustration (an emotion hard to classify, since it could resemble anger, sadness or neutrality), where we can see that the valence assignment for frustration observations is almost equally divided between negative and neutral.

5.3.2 Clusters in the Emotion Space

We also examine the joint classification of the emotional dimensions by building three and four clusters in the valence-activation emotional space. The motivation for clustering the valence-activation space is to build classifiers that provide richer and more complete emotional information by combining valence and activation information. We apply data-driven clustering through K-means to automatically select clusters that fit the distribution of the emotional manifestations of our database in the emotional space (similar approaches are also followed in [91],[21]). The ground truth of every utterance is assigned to one of the clusters using the minimum Euclidean distance between its annotation and the cluster midpoints.

When abstracting our emotion classes into clusters of the valence-activation space, we also study their relation to the corresponding categorical annotations and investigate which categorical emotional manifestations tend to fall into each cluster. Since our automatically derived clusters, presented in Fig. 5.6(a) and 5.6(b), cover different areas of the emotional space, they are expected to contain different emotional manifestations. In order to provide a rough and intuitive categorical description of the clusters, we examine how these clusters break down in terms of categorical labels. Note that in this case the utterances that belong to each cluster depend on the training set of each fold, so the counts of the categorical tags may change from fold to fold. Thus, the bar graphs in Fig. 5.6(c) and 5.6(d) represent the mean over the 10 folds of our experiment (see also the experimental setup in Section 5.5.1) and the error bars represent the standard deviation over the 10 folds.

Figure 5.6: Analysis of classes in the 3 cluster and 4 cluster tasks in terms of categorical labels: (a) 3 clusters in the valence-activation space (fold 1), (b) 4 clusters in the valence-activation space (fold 1), (c) 3 clusters, (d) 4 clusters. The bars and the error bars correspond to the mean and standard deviation computed across the 10 folds. We notice that the data-driven clusters tend to contain different categorical emotional manifestations according to their position in the emotional space. For example, for the 3 cluster task, clusters 1, 2 and 3 roughly contain categorical emotions of `anger or frustration', `happiness or excitement' and `neutrality or sadness or frustration', respectively.

Table 5.1: Emotional transition bigrams for the valence and activation classification tasks.

valence:   Neg   Neu   Pos
Neg        0.72  0.27  0.01
Neu        0.23  0.65  0.12
Pos        0.02  0.26  0.72

activation: Low   Med   High
Low         0.34  0.63  0.03
Med         0.09  0.76  0.15
High        0.04  0.48  0.48

Table 5.2: Emotional transition bigrams for the 3 and 4 cluster classification tasks.
For the 3 cluster case, the most frequent categorical emotions per cluster are: c1 = `ang/fru', c2 = `hap/exc', c3 = `neu/sad'. For the 4 cluster case, the most frequent emotions per cluster are: c1 = `hap/exc', c2 = `sad/fru', c3 = `ang/fru' and c4 = `neu'.

3 clusters:  c1    c2    c3
c1           0.66  0.06  0.28
c2           0.06  0.78  0.16
c3           0.24  0.12  0.64

4 clusters:  c1    c2    c3    c4
c1           0.71  0.06  0.03  0.20
c2           0.04  0.56  0.21  0.18
c3           0.03  0.29  0.58  0.09
c4           0.19  0.25  0.08  0.49

We can see from Fig. 5.6(a) that the 3 clusters correspond to areas in the valence-activation space that are generally expected to contain emotional manifestations of `anger', `happiness' and `sadness or neutrality'. The plot corresponds to the first fold of our experiment, but the differences across folds are relatively small (the average standard deviation of the cluster centroid coordinates across the ten folds is as low as 0.05). These observations agree with the bar graph of Fig. 5.6(c), where cluster c1 contains large portions of utterances tagged as angry or frustrated, cluster c2 contains utterances tagged as happy or excited, and cluster c3 contains utterances tagged as sad or neutral or frustrated. Therefore, we could think of the three clusters as roughly containing emotional manifestations of `anger/frustration', `happiness/excitement' and `sadness/neutrality/frustration'. Similarly, for the 4 cluster classification task, according to both Fig. 5.6(b) and Fig. 5.6(d), we could think of the four clusters as roughly containing emotional manifestations of `happiness/excitement', `sadness/frustration', `anger/frustration', and `neutrality'.

5.3.3 Emotional Grammars

Analysis of the emotional dialogs of our database reveals that certain emotional transitions are more probable than others, indicating that the underlying emotional flow follows certain typical patterns. In the two-level HMM approach, we have assumed that the underlying emotional states form a Markov chain and applied an HMM at the conversation level to model emotion dynamics. Equivalently, we could view this modeling as a Probabilistic Regular Grammar (PRG) describing emotional transitions, using the equivalence between PRGs and HMMs [153]. We can define a PRG with initial state $S_1$ through production rules of the form:

$$S_i \rightarrow w_j S_k \quad \text{or} \quad S_i \rightarrow w_j$$

where $S_i$ represents an emotional state, $w_j$ represents an emotional observation and the arrow represents an emotional transition. This modeling assumes that, given an internal emotional state $S_i$, a person emits an audio-visual emotional expression $w_j$, e.g., tone of voice and facial expression, and may transition to another internal emotional state $S_k$. In this work, we only compute bigram emotional utterance probabilities, which give us a rough description of the emotional flow between utterances. We utilize these emotional grammars to gain insights about typical evolution patterns in emotional expression and apply this knowledge to inform emotion recognition tasks. This approach could be extended to higher-order transition models, which would capture more detailed emotional patterns. Alternatively, we also consider models that make fewer assumptions on the underlying sequence and potentially have richer representational power, such as neural networks (here, BLSTM).

In Tables 5.1 and 5.2 we present the transition probabilities for the valence, activation, 3 and 4 cluster HMMs. The transition probabilities have been approximated by counting all transitions between consecutive utterances of the database; a sketch of this counting procedure is given below.
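A minimal sketch of this bigram estimation, assuming each conversation is available as a sequence of integer emotion labels; the add-epsilon smoothing is an illustrative choice, not part of the thesis procedure.

```python
import numpy as np

def bigram_transitions(conversations, num_classes):
    """Estimate P(j | i) by counting transitions between the emotion labels of
    consecutive utterances. conversations: list of integer label sequences."""
    counts = np.zeros((num_classes, num_classes))
    for labels in conversations:
        for prev, curr in zip(labels[:-1], labels[1:]):
            counts[prev, curr] += 1
    counts += 1e-6                                   # avoid zero rows / log(0) later
    return counts / counts.sum(axis=1, keepdims=True)

# Usage sketch: trans = bigram_transitions([[0, 0, 2, 1], [2, 2, 0]], num_classes=3)
```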
For the cluster tasks, where the cluster tags of each utterance may change according to the fold, we present the average transition probabilities across all folds. To test the statistical significance of the transitions of Tables 5.1 and 5.2, we performed lag sequential analysis, as described in [154] (Chapter 7). For all classification tasks and all emotional transitions we find that the observed transitions are statistically significantly different from the ones that would be expected to occur if the emotion at time t were independent of the previous emotion (p-value < 0.001). Similar conclusions are reached when we compute table-wise statistics as described in [154], specifically Pearson and likelihood-ratio chi-square values, which support the hypothesis that the rows and columns of each transition matrix are dependent (p < 0.001). These statistical tests were performed using the GSEQ5 Toolbox [155].

The valence HMM states contain large diagonal self-transition probabilities, suggesting that interlocutors tend to preserve their valence states locally over time. In contrast, low and high activation states tend to be mostly isolated phenomena, since interlocutors tend to transition to the medium activation state and preserve that state. This indicates that emotional states of negative, neutral or positive valence tend to last longer compared to high or low activated emotional states, which seem to be transient. For the 3 cluster bigrams, frequent transitions happen between the 'anger/frustration' and the 'neutrality/sadness' clusters, as well as the 'happiness/excitement' and the 'neutrality/sadness' clusters, while transitions between the 'anger/frustration' and 'happiness/excitement' clusters are very rare. This indicates that interlocutors generally transition between a neutral state and an emotional state (of positive or negative valence) but not directly between the extreme valence states. For the 4 cluster HMM, we notice that the 'happy/excited' cluster is the one with the highest self-transition probability. Frequent transitions happen between 'anger/frustration' and 'sadness/frustration' and between 'neutrality' and most other states. These patterns suggest that interlocutors preserve their positive or negative valence while possibly changing their activation levels, i.e., transitioning between sadness, frustration and anger. Neutrality appears to be an intermediate state when transitioning between emotions of opposite valence.

We also compared the emotional transitions computed separately on improvised and scripted sessions. We notice that improvised sessions tend to contain emotions that last longer, which is reflected by larger diagonal transitions in improvised sessions compared to scripted sessions. This is expected, since improvisations are characterized by a general emotional theme which favors longer-lasting emotional states, while scripted sessions tend to contain a greater variety of emotional manifestations.

The above observations concerning emotional transitions depend on the structure of our database. Even though the database design may not cover the full range of human emotional interactions, one could argue that the conclusions presented here could prove useful for processing human-machine interactions, where the variety and complexity of emotions and emotional transitions are often limited compared to the general possibilities in interpersonal human interactions.
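To make the bigram estimation concrete, the following minimal Python sketch shows how first-order transition probabilities such as those in Tables 5.1 and 5.2 can be approximated by counting consecutive utterance labels within each conversation and normalizing row-wise. The function and variable names are illustrative only; they are not part of the toolchain used in this work (the significance tests above were run with the GSEQ5 Toolbox).

import numpy as np

def estimate_bigrams(label_sequences, labels):
    # Count transitions between consecutive utterance labels within each
    # conversation and normalize each row into transition probabilities.
    idx = {lab: i for i, lab in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for seq in label_sequences:
        for prev, curr in zip(seq[:-1], seq[1:]):
            counts[idx[prev], idx[curr]] += 1
    counts += 1e-6  # avoid division by zero for unseen states
    return counts / counts.sum(axis=1, keepdims=True)

# Toy example with valence labels from two short conversations
sessions = [['neg', 'neg', 'neu', 'pos'], ['neu', 'neu', 'neg', 'neu']]
print(estimate_bigrams(sessions, ['neg', 'neu', 'pos']))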
5.4 Feature Extraction and Fusion

5.4.1 Audio-Visual Frame-level Feature Extraction

The IEMOCAP data contain detailed MoCap facial marker coordinates. The positions of the facial markers can be seen in Figure 5.7. In total, information from 46 facial markers is used, namely their (x,y,z) coordinates. In order to obtain a lower-dimensional representation of the facial marker information, we use Principal Feature Analysis (PFA) [124], as described in Section 4.1.2.2. We select 30 features, since the PCA transformation explains more than 95% of the total variability. To these we append the first derivatives, resulting in a 60-dimensional representation. The facial features are normalized per speaker to smooth out individual facial characteristics that are unrelated to emotional expressions, as described in Section 4.1.2.1.

[Figure 5.7: Positions of the MoCap face markers and separation of the face into lower and upper facial regions.]

In addition, we extract a variety of features from the speech waveform: 12 MFCC coefficients, 27 Mel Frequency Band coefficients (MFB), pitch and energy values. We also compute their first derivatives. All the audio features are computed using the Praat software [126] and are normalized using z-standardization. The audio and visual features are extracted at the same frame rate of 25 ms, with a 50 ms window. The utterance-level audio HMMs were trained using the 27 MFBs, pitch and energy along with their first derivatives, while the visual HMMs were trained using the 30 PFA features with their first derivatives. For the audio-visual HMMs and coupled HMMs we used both these voice and face features, fused at the feature or at the model level.

5.4.2 Utterance-level Statistics of Audio-Visual Features

We use a set of 23 utterance-level statistical functionals that are computed from the low-level acoustic and visual features (see Table 5.3). Thus, we obtain 142 × 23 = 3266 utterance-level features. All functionals were calculated using the openSMILE toolkit [127].

Table 5.3: Statistical functionals used for turnwise processing.

group         functionals
extremes      position of maximum, position of minimum
regression    linear regression coefficients 1 and 2, quadratic mean of the linear regression error,
              quadratic regression coefficients 1, 2, and 3, quadratic mean of the quadratic regression error
means         arithmetic mean
percentiles   quartiles 1, 2, and 3, interquartile ranges 1-2, 2-3, and 1-3,
              1%-percentile, 99%-percentile, percentile range
others        number of non-zero values, standard deviation, skewness, kurtosis

In order to reduce the size of the resulting feature space, we conduct a cyclic Correlation-based Feature Subset Selection (CFS) using the training set of each fold. The main idea of CFS is that useful feature subsets should contain features that are highly correlated with the target class while being uncorrelated with each other [156, 157]. Note that we deliberately decided on a filter-based feature selection method, since a wrapper-based technique would have biased the resulting feature set with respect to compatibility with a specific classifier. Applying CFS to the 3266-dimensional feature space results in an automatic selection of between 66 and 224 features, depending on the classification task and the fold. For the valence classification task, on average 84 ± 1.1% of the selected features are facial features, whereas for classification of the degree of activation, only 44 ± 1.8% of the features selected via CFS are facial features.
This underscores the fact that visual features tend to be well suited for determining valence, while acoustic features reveal the degree of activation; it also agrees with the unimodal classification results that are presented in the results section. For a detailed analysis of the selected features see Table 5.4.

Table 5.4: Distribution of the features selected via CFS for the classification of valence (VAL) and activation (ACT) as well as for the discrimination of 3 and 4 clusters in emotional space (see Section 5.3.2).

feature group           VAL    ACT    3 clusters   4 clusters
pitch                   5 %    4 %    3 %          4 %
energy                  0 %    1 %    1 %          1 %
MFCC                    4 %    21 %   11 %         11 %
MFB                     7 %    30 %   18 %         19 %
lower face (Fig. 4.1)   63 %   32 %   50 %         49 %
upper face (Fig. 4.1)   21 %   12 %   17 %         16 %

5.4.3 Audio-Visual Feature Fusion

For the utterance-level HMM approaches where frame-level features are used, we apply multi-stream HMM classifiers (here denoted simply as HMMs). These assign different importance weights to the audio and visual modalities and assume synchronicity between them. When modeling the dynamics of high-level attributes such as emotional descriptors of a whole utterance, allowing asynchrony in the dynamic evolution of the underlying audio-visual cues could be beneficial. We also apply model-level fusion through the use of coupled Hidden Markov Models (c-HMMs), which allow this type of asynchrony and have been widely used in the literature [158, 159]. All models are trained using the HTK Toolkit [25]. HTK offers the functionality for defining and training a multi-stream HMM, but it does not explicitly allow for coupling of multiple single-stream HMMs. However, following the analysis presented in [159], [160], we can implement c-HMMs in HTK using a product HMM structure.

5.5 Experimental Results

5.5.1 Experimental Setup

Our experiments are organized in a cyclic leave-one-speaker-out cross validation. The feature extraction PCA transformations (for the face features) and the feature z-normalization constants are computed based on the respective training set of each fold. The mean and standard deviation of the number of test and training utterances across the folds is 498 ± 60 and 4475 ± 61, respectively. For each fold, we compute the F1-measure, which is the harmonic mean of unweighted precision and recall, as our primary performance measure. As a secondary measure we also report unweighted recall (unweighted accuracy). The presented recognition results are the subject-independent averages over the ten folds and the corresponding standard deviation.

In the next sections we present the results of the context-sensitive neural network and HMM frameworks for the various classification tasks. We trained 3-state ergodic HMMs and c-HMMs with observation probability distributions modeled by Gaussian mixture models. The stream weights and the number of mixtures per state (varying from 4 to 32 mixtures) have been experimentally optimized on a validation set. For the context-sensitive HMM approaches, the bigrams and the BLSTMs of each fold have been computed or trained on the corresponding training set. The LSTM networks consist of 128 memory blocks with one memory cell per block, while the BLSTM networks consist of two LSTM layers with 128 memory blocks per input direction. To improve generalization, we add zero-mean Gaussian noise with standard deviation 0.6 to the input statistical features during training. The BLSTM networks that are trained to process the HMM outputs (HMM+BLSTM) consist of 32 memory blocks per input direction.
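As an illustration of this evaluation protocol, the sketch below outlines a leave-one-speaker-out loop that computes, per fold, the F1-measure as the harmonic mean of unweighted (macro-averaged) precision and recall. The train_and_predict callable is a placeholder for any of the classifiers compared in this chapter; it is an assumption of the sketch, not an interface of HTK or of the neural network implementations used here.

import numpy as np
from sklearn.metrics import precision_score, recall_score

def f1_unweighted(y_true, y_pred):
    # Harmonic mean of unweighted (macro-averaged) precision and recall
    p = precision_score(y_true, y_pred, average='macro', zero_division=0)
    r = recall_score(y_true, y_pred, average='macro', zero_division=0)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def leave_one_speaker_out(features, labels, speakers, train_and_predict):
    # One fold per speaker: train on the remaining speakers, test on the held-out one
    scores = []
    for spk in sorted(set(speakers)):
        test = [i for i, s in enumerate(speakers) if s == spk]
        train = [i for i, s in enumerate(speakers) if s != spk]
        y_pred = train_and_predict(train, test, features, labels)
        scores.append(f1_unweighted([labels[i] for i in test], y_pred))
    return float(np.mean(scores)), float(np.std(scores))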
5.5.2 Context-Free vs. Context-Sensitive Classifiers

In this section, we compare the classification performance between emotion-specific HMMs or c-HMMs, which do not make use of context information, and our proposed context-sensitive frameworks: hierarchical HMMs (or c-HMMs) using sequential Viterbi Decoding with a bidirectional window of w + 1 utterances (HMM+HMM, c-HMM+HMM), BLSTMs trained with utterance-level feature functionals (BLSTM), and hybrid HMM/BLSTM classifiers (HMM+BLSTM, c-HMM+BLSTM). Table 5.5 shows the performances for discriminating three levels of valence and activation, as well as classification into three and four clusters in the valence-activation space. To test the statistical significance of the differences in average F1 performance between (c-)HMM, (c-)HMM+HMM, (c-)HMM+BLSTM and BLSTM classifiers, we conducted repeated measures ANOVA at the subject level with Bonferroni adjustment for post-hoc tests, using SPSS [161].

Concerning the valence results, we observe that facial features are much more effective in classifying valence than voice features, which agrees with our previous findings [119]. Regarding the audio-visual classifiers, BLSTM achieves the highest average F1 measure, while the emotion-specific HMMs benefit from the use of long-range context, either through higher-level Viterbi Decoding (HMM+HMM) or through the use of a higher-level BLSTM (HMM+BLSTM). Statistical significance tests reveal that the average HMM+BLSTM F1 measure is significantly higher than that of the HMM at the 0.05 level. HMM+HMM performance was not found to be significantly higher than HMM performance, but it has a p-value very close to the threshold (p=0.055). Similarly, the comparison of BLSTM and HMM gives a p-value of 0.06.

For the activation task, incorporating visual cues does not improve performance significantly, indicating that audio cues are more informative than visual cues. This agrees with previous results in the emotion literature [162], [69].

Table 5.5: Comparing context-free and context-sensitive classifiers for discriminating three levels of valence and activation, and three and four clusters in the valence-activation space, using face (f) and voice (v) features: mean and standard deviation of F1-measure and unweighted accuracy across the 10 folds (10 speakers).

valence
classifier        features   F1             Acc.
HMM               v          49.85 ± 3.18   49.99 ± 3.63
HMM               f          58.85 ± 3.86   60.98 ± 4.96
HMM               v+f        60.79 ± 2.53   62.50 ± 3.39
cHMM              v+f        60.42 ± 3.59   61.75 ± 4.66
HMM+HMM (w=2)     v+f        62.02 ± 2.25   63.16 ± 3.18
HMM+BLSTM         v+f        63.97 ± 3.03   62.78 ± 6.43
BLSTM             v+f        65.12 ± 5.13   64.67 ± 6.48

activation
HMM               v          57.54 ± 3.33   61.92 ± 4.88
HMM               f          49.04 ± 4.40   51.36 ± 4.14
HMM               v+f        57.56 ± 4.27   60.00 ± 4.45
cHMM              v+f        57.39 ± 3.25   61.29 ± 5.16
HMM+HMM (w=4)     v+f        57.71 ± 4.23   60.02 ± 4.54
HMM+BLSTM         v+f        53.41 ± 5.99   46.93 ± 5.69
BLSTM             v+f        54.90 ± 5.02   52.28 ± 5.37

3 clusters
HMM               v+f        67.33 ± 5.15   66.18 ± 6.69
cHMM              v+f        68.45 ± 3.38   67.95 ± 3.18
cHMM+HMM (w=4)    v+f        70.36 ± 3.48   69.76 ± 3.09
cHMM+BLSTM        v+f        68.09 ± 4.16   68.02 ± 4.72
BLSTM             v+f        72.35 ± 5.10   71.83 ± 5.46

4 clusters
HMM               v+f        56.54 ± 4.29   56.64 ± 5.90
cHMM              v+f        57.28 ± 3.65   57.87 ± 4.33
cHMM+HMM (w=4)    v+f        58.65 ± 3.80   58.89 ± 4.59
cHMM+BLSTM        v+f        58.21 ± 5.24   57.94 ± 5.89
BLSTM             v+f        62.80 ± 6.69   61.96 ± 7.02

Overall, we notice that taking temporal context into account does not benefit activation classification performance for HMMs (HMM+HMM), and the BLSTM and HMM/BLSTM classifiers on average perform worse than the context-free HMMs. This could be attributed to the isolated nature of the extreme activation instances, as we have observed in Section 5.3.3.
High and low activation events are isolated between repeated instances of medium activation; therefore the use of context-sensitive methods like Viterbi Decoding (VD) or BLSTM tends to underestimate their probability of occurrence and ends up misclassifying them as medium activation events (oversmoothing). In particular, the long-range context modeling of the BLSTM does not seem to capture the activation evolution well and degrades activation classification. For example, when looking at the confusion matrices of Table 5.6, we notice that when using the hybrid HMM/BLSTM or BLSTM the performance of the high and low activation classes greatly decreases compared to the simple HMM.

Table 5.6: Confusion matrices of the HMM, the hierarchical HMM, the hybrid HMM/BLSTM and the BLSTM classifiers for the activation and 3 cluster classification tasks. For the 3 cluster case the correspondence between emotions and clusters is: c1 = 'ang/fru', c2 = 'hap/exc', c3 = 'neu/sad'.

activation
HMM                               HMM+HMM
       Low     Med     High              Low     Med     High
Low    53.32   41.47    5.21      Low    53.32   41.47    5.21
Med    14.66   62.61   22.73      Med    14.40   62.96   22.64
High    0.93   30.93   68.14      High    0.93   31.13   67.94

HMM+BLSTM                         BLSTM
       Low     Med     High              Low     Med     High
Low     7.18   92.46    0.36      Low    28.42   69.27    2.31
Med     1.28   92.31    6.41      Med     6.75   78.65   14.60
High    0.00   58.66   41.34      High    0.51   49.13   50.36

3 clusters
cHMM                              cHMM+HMM
       c1      c2      c3                c1      c2      c3
c1     70.78   13.96   15.26      c1     71.82   13.24   14.94
c2     16.39   67.56   16.05      c2     15.98   69.15   14.88
c3     22.83   11.86   65.31      c3     22.83   10.55   66.61

cHMM+BLSTM                        BLSTM
       c1      c2      c3                c1      c2      c3
c1     70.12   10.70   19.18      c1     72.20    8.73   19.07
c2     18.60   65.22   16.18      c2     16.77   67.75   15.48
c3     21.11    9.82   69.07      c3     14.72    9.69   75.59

For the three cluster task, we notice that context-sensitive classifiers, such as c-HMM+HMM and BLSTM, on average perform better than the simple HMMs and c-HMMs, and that the BLSTM classifier achieves the highest average F1 measure. The average F1 measure of cHMM+HMM was found to be significantly higher than that of the c-HMM at the 0.05 level. Similarly, for the four cluster task context-sensitive classifiers tend to outperform simple HMMs and cHMMs in terms of average F1, although these differences are not statistically significant at the 0.05 level. In Table 5.6 we present the confusion matrices of the (c-)HMM, the hierarchical (c-)HMM, the hybrid (c-)HMM/BLSTM and the BLSTM classifiers for the 3 cluster and activation classification tasks, in order to give a description of the confusion between classes for a task where context is beneficial (3 clusters) versus a task where context is not beneficial (activation).

Overall, our results suggest that incorporating context is beneficial for the valence and the three and four cluster classification. The BLSTM classifier generally achieves the highest classification performance, although performance across folds has a relatively high variance. The hierarchical HMM and hybrid HMM/BLSTM classifiers perform similarly in general, and lower than the BLSTM in terms of average F1 measure, although they tend to have more consistent performance across subjects (smaller variance). Regarding the hierarchical HMM approach, we notice that a small amount of bidirectional context (e.g., w=4) can give a performance increase. We have omitted the results of the HMM+HMM architecture where Viterbi Decoding is used over the total observation sequence.
For all our classification tasks, the results are very similar to the ones obtained through sequential VD with small window sizes, which suggests that it is possible to increase recognition performance even when only a small amount of bidirectional context is used. These observations are encouraging and suggest that this algorithm could be applied in practical scenarios where an emotion recognition system might not afford to wait for the conversation to end in order to perform recognition, while it might be acceptable to wait a few utterances before making a decision.

Note that there are significant variations in the performance and rankings across different folds for all classification approaches, as indicated by the variances of Table 5.5 and the results of the statistical significance tests. These suggest that no approach is clearly superior for all speakers. Our insight is that these variations could result from speaker-dependent characteristics of emotional expression, i.e., some speakers may be more overtly expressive than others or may make different expressive use of the audio and visual modalities.

To the best of our knowledge, there are no published works that report classification results of dimensional labels using the IEMOCAP database in a way that would allow direct comparison with our results. The most relevant past works are [21] and [163]; however, for both cases our experimental setup is more generic. In [21] the authors perform speech-based classification of dimensional labels; however, the train and test sets are randomly split into 15 cross validations. Our subject-independent setting is considerably more challenging for classification. In [163] the authors perform speech-based classification of valence and activation, only for utterances with categorical labels of angry, happy, neutral and sad. Furthermore, the authors remove utterances that seem to have conflicting categorical and dimensional labels, which generally simplifies the problem as it removes potentially ambiguous emotional manifestations.

5.5.3 Context-Sensitive Neural Network Classifiers

In this section, we compare the recognition performances of various Neural Network classifiers which take into account different amounts of unidirectional and bidirectional context. The results are presented in Table 5.7. (B)LSTM architectures achieve a higher average F1 measure compared to (B)RNN architectures, which indicates the merit of learning a longer range of temporal context for emotion recognition tasks. Also, bidirectional neural networks, such as BLSTMs and BRNNs, outperform their respective unidirectional counterparts, such as LSTM and RNN, which suggests the importance of bidirectional context for these architectures. The performance differences between these context-sensitive NNs, although not statistically significant, indicate a consistent trend in performance across all classification tasks, with BLSTM being the highest performing classifier.

Table 5.7: Comparing context-sensitive Neural Network classifiers for discriminating three levels of valence and activation, and three and four clusters in the valence-activation space using face (f) and voice (v) features: mean and standard deviation of F1-measure and unweighted accuracy across the 10 folds (10 speakers).

classifier   features   F1             Acc.
valence
RNN          v+f        63.34 ± 4.58   62.92 ± 6.00
BRNN         v+f        64.10 ± 5.05   63.68 ± 6.64
LSTM         v+f        63.71 ± 4.86   63.76 ± 5.95
BLSTM        v+f        65.12 ± 5.13   64.67 ± 6.48

activation
RNN          v+f        52.78 ± 5.21   48.54 ± 5.59
BRNN         v+f        53.93 ± 4.12   49.98 ± 4.62
LSTM         v+f        53.65 ± 4.97   50.35 ± 5.83
BLSTM        v+f        54.90 ± 5.02   52.28 ± 5.37

3 clusters
RNN          v+f        69.59 ± 5.75   69.34 ± 5.95
BRNN         v+f        69.94 ± 5.65   69.76 ± 6.00
LSTM         v+f        70.34 ± 5.85   69.53 ± 6.61
BLSTM        v+f        72.35 ± 5.10   71.83 ± 5.46

4 clusters
RNN          v+f        58.30 ± 6.63   57.29 ± 7.28
BRNN         v+f        60.10 ± 5.96   59.14 ± 6.72
LSTM         v+f        61.93 ± 5.96   61.02 ± 6.15
BLSTM        v+f        62.80 ± 6.69   61.96 ± 7.02

5.6 BLSTM Context Learning Behavior

To investigate the importance of presenting training and test utterances in the right order during BLSTM network training and decoding, we repeated all BLSTM classification experiments using randomly shuffled data, i.e., we processed the utterances of a given conversation in arbitrary order so that the network is not able to make use of meaningful context information. As can be seen in Table 5.8, this downgrades recognition performance. To test the statistical significance of this result, we performed paired t-tests and found that the differences in average F1 measures are statistically significant at the 0.05 level for all classification tasks, except for the case of activation. These results suggest that the high performance of the BLSTM classifiers is to a large extent due to their ability to effectively learn an adequate amount of relevant emotional context from past and future observations. They can also be interpreted as further evidence that learning to incorporate temporal context information is important for human emotion modeling.

Table 5.8: Recognition performances (%) of BLSTM networks when training on the original sequence of utterances compared to when the utterances are randomly shuffled: mean and standard deviation of F1-measure and unweighted accuracy across the 10 folds (10 speakers).

classifier          features   F1             Acc.
valence
BLSTM               v+f        65.12 ± 5.13   64.67 ± 6.48
BLSTM (shuffled)    v+f        59.71 ± 4.51   58.98 ± 5.14
activation
BLSTM               v+f        54.90 ± 5.02   52.28 ± 5.37
BLSTM (shuffled)    v+f        52.10 ± 6.86   46.35 ± 6.78
3 clusters
BLSTM               v+f        72.35 ± 5.10   71.83 ± 5.46
BLSTM (shuffled)    v+f        67.86 ± 5.08   66.61 ± 4.95
4 clusters
BLSTM               v+f        62.80 ± 6.69   61.96 ± 7.02
BLSTM (shuffled)    v+f        59.27 ± 6.40   57.93 ± 6.88

An impression of the amount of contextual information that is used by the BLSTM network can be gained by measuring the sensitivity of the network outputs to the network inputs. When using feedforward neural networks, this can be done by calculating the Jacobian matrix J whose elements J_{ki} correspond to the derivatives of the network outputs y_k with respect to the network inputs x_i. To extend the Jacobian to recurrent neural networks, we have to specify the timesteps at which the input and output variables are measured. Thus, we calculate a four-dimensional matrix called the sequential Jacobian [164] to determine the sensitivity of the network outputs at time t to the inputs at time t':

J_{ki}^{t t'} = \frac{\partial y_k^t}{\partial x_i^{t'}}

Figure 5.8(a) shows the derivatives of the network outputs at time t = 16 with respect to the different network inputs (i.e., features) at different timesteps t' for a randomly selected session consisting of 30 utterances, when using a BLSTM network for the discrimination of five emotional clusters. Since we use BLSTM networks for utterance-level prediction, each timestep corresponds to one utterance. Note that the absolute magnitude of the derivatives is not important.
We are rather interested in the relative magnitudes of the derivatives with respect to each other, since this determines the sensitivity of the outputs with respect to the inputs at different timesteps. Of course the highest sensitivity can be detected at timestep t' = 16, which means that the current input has the most significant influence on the current output. However, also for timesteps smaller or greater than 16, derivatives different from zero can be found. This indicates that past and future utterances also affect the current prediction. As positive and negative derivatives are of equal importance, Figure 5.8(b) shows the absolute values of the derivatives in Figure 5.8(a). Finally, Figure 5.8(c) displays the corresponding derivatives summed up over all inputs and normalized to the magnitude of the derivative at t' = 16.

[Figure 5.8: Derivatives of the network outputs at time t = 16 with respect to the different network inputs at different timesteps t', for a randomly selected session consisting of 30 utterances (BLSTM network for the discrimination of five emotional clusters). Panels: (a) derivatives at time t = 16; (b) absolute values of the derivatives in (a); (c) derivatives summed up over all inputs and normalized.]

In order to systematically evaluate how many past and future inputs are relevant for the current prediction, we determined how many utterances before and after the current utterance (e.g., utterance 16 in the example given in Figure 5.8) have a sensitivity greater than or equal to 3% of the maximum sensitivity. To this end, we calculated projections of the sequential Jacobian as in Figure 5.8(c) for each timestep t in each session and each fold. Figure 5.9(a) shows the number of relevant past and future utterances dependent on the position in the sequence (i.e., dependent on the utterance number within a session) when using a BLSTM network for the discrimination of five clusters in the emotional space (the corresponding figures for the other classification tasks are very similar and are omitted). The number of past utterances for which the sensitivity lies above the 3% threshold increases approximately until the eighth utterance in a session. As more and more past utterances become available, the graph converges to a value of between seven and eight, meaning that roughly seven to eight utterances of past context are used for a prediction. For the first few emotion predictions the network uses about eight utterances of future context. The slight decrease in the number of used future utterances for higher utterance numbers (i.e., for utterances occurring later in a session) is simply due to the fact that some sessions consist of fewer than 30 utterances, which means that towards the end of a session, fewer future utterances are available on average.

[Figure 5.9: Average number of relevant past and future utterances (turns above the 3% sensitivity-threshold) dependent on the position in the sequence (turn number) when using a BLSTM network for the discrimination of five emotional clusters. Panel (a): BLSTM network trained on utterances in the correct order; panel (b): BLSTM network trained on randomly shuffled data.]

Figure 5.9(b) shows the number of relevant preceding and successive utterances for the BLSTM network trained on randomly shuffled data.
As can be seen, the amount of used context is smaller than for the BLSTM trained on correctly ordered utterances. Even though no reasonable emotional context can be learned when training on arbitrarily shuffled data, the network still uses context. One reason for this could be that the BLSTM attempts to learn other session-specific characteristics, such as speaker characteristics. Figure 5.10 shows the number of relevant past utterances when considering different classification tasks and sensitivity-thresholds from 1 to 10%. Again, we can see that networks trained on randomly shuffled data use less context (see dashed lines in Figure 5.10), while the amount of context exploited for the different classification tasks is relatively similar.

[Figure 5.10: Average number of relevant past utterances (turns above threshold) dependent on the sensitivity-threshold (1-10%), for the valence, activation, 3 cluster, 4 cluster and 5 cluster tasks; straight lines: utterances in correct order; dashed lines: randomly shuffled data.]

5.7 Conclusions

In this work we have described and analyzed context-sensitive frameworks for emotion recognition, i.e., frameworks that take into account temporal emotional context when making a decision about the emotion of an utterance. These methods, which utilize powerful and popular classifiers, such as HMMs and BLSTMs, can be viewed under the common framework of a hierarchical, multimodal approach, which models the observation flow both at the utterance level (within an emotion) and at the conversation level (between emotions). The different classifiers that can be chosen for each level reflect different modeling assumptions on the underlying sequences and account for different system requirements. Our emotion classification experiments indicate that taking into account temporal context tends to improve emotion classification performance. Overall, context-sensitive approaches outperform methods that do not consider context for the recognition of valence states and emotional clusters in the valence-activation space, in terms of average F1 measure. However, the relatively large performance variability between subjects suggests that no method is clearly superior for all subjects. Additionally, the use of context from both past and future seems beneficial, as suggested by the slightly higher performance of bidirectional neural networks (BLSTM, BRNN) compared to their unidirectional counterparts. Even the use of a small amount of context around the current observation, e.g., from the use of the sequential VD algorithm with a small window of w + 1 utterances, leads to a performance improvement, which is an encouraging result for designing context-sensitive frameworks with performance close to real-time. The only emotion classification task that does not benefit significantly from context is activation, possibly because of the isolated nature of the extreme activation events, which makes this structure difficult to model.

According to our results, neural network architectures, and specifically (B)LSTM networks trained with utterance-level feature functionals, achieve a higher average performance than HMM classification schemes. This could be attributed to their discriminative training, fewer modeling assumptions and their ability to capture long-range, bidirectional temporal patterns of the input feature streams and output activations.
BLSTM networks can learn an adequate amount of relevant emotional context around the current observation during the training stage. When such context is not present, for example when we randomly shuffle the utterances of a conversation, the performance of the BLSTM classifiers significantly decreases. However, (B)LSTM and (B)RNN classifiers seem to have difficulty handling emotional expression variability between subjects; therefore their performance may vary significantly across people. HMM classification frameworks and hybrid HMM/BLSTM frameworks on average perform lower than neural networks, but generally achieve more consistent classification results across subjects. They provide a structured approach for modeling and classifying sequences at multiple levels, they have more transparency as to what amount of context is used, and they are generally flexible. For example, HMM+HMM classifiers can be modified to use limited context so as to suit possible requirements of real-time emotion recognition systems.

Analysis of the emotional flow in the conversations of our database indicates an underlying structure in typical emotional evolution. For example, valence states seem to last longer than states of high or low activation, which are more transient. Also, some emotional transitions are more frequent than others, i.e., a transition between neutrality and an emotional state is much more likely than a transition between two emotional states of opposing valence (happy from/to angry). Simple first-order transition probabilities provide a rough description of the emotional flow and lead to modeling emotional utterances in a conversation using an HMM. This can be seen equivalently as a Probabilistic Regular Grammar (PRG) for emotional utterances. One could perform more complex modeling of a conversation and look for equivalent grammars with more representation power than a PRG. Although such conclusions depend on the design of our database and may not cover the full range of human emotional interactions, one could argue that they contain useful information about the typical structure of emotional transitions.

In this work we have focused on context in the sense of past and future observations that are informative of the current observation. However, a broader definition of context could include general understanding of the situation, the conversation topic and the interlocutor roles, and this is an interesting direction for future work.

5.8 Extensions: Multimodality and Dialog Modeling

Building upon the work presented in the previous sections, we propose a more generic description of our hierarchical context-sensitive framework that specifically addresses the issues of dialog modeling, multimodal fusion, and handling cases where a different amount of multimodal information is available per speaker [27]. Specifically, our approach enables modeling emotional dynamics both at the utterance level, i.e., within an emotion, and at the dialog level, i.e., between the emotions of a speaker or of both speakers that are expressed during a dyadic conversation. The framework is flexible as to the classification approaches that can be applied to model the affective content of multiple modalities, such as face, voice, head and hand movement, and can be extended to include more modalities if they become available.

[Figure 5.11: An overview of the proposed hierarchical framework.]
Here, we utilize Hidden Markov Model (HMM) classifiers to model utterance and dialog emotional evolution; therefore our approach could be seen as a two-level context-sensitive HMM.

As a testbed for evaluating the proposed approach, we use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, described in Section 4.1.1. Our experiments are organized in a dyad-independent manner to simulate real-life scenarios where no prior knowledge of the specific dyads is available. Our results indicate that our approach can successfully accommodate a variety of classifiers and fusion strategies, and can handle cases where a different amount of multimodal information is available from each speaker.

5.8.1 Framework Overview

As described in Section 5.2.1, our problem can be posed as a two-level one: modeling the sequence of audio-visual observations {O_t1, O_t2, ..., O_tn} belonging to an emotional utterance U_t and, on top of that, modeling the sequence of utterances {U_1, U_2, ..., U_T} belonging to an emotional conversation C. We assume that an emotional utterance can be described by a single emotional label, while an emotional conversation may contain arbitrary emotional transitions between utterances. Our proposed approach, illustrated in Fig. 5.11, can be seen as a two-level HMM, and shares similarities with the (unimodal) multi-level HMM proposed in [144]. At the lower level, the system processes multimodal cues during each of the speaker's utterances, which are modeled using emotion-specific HMMs to capture the dynamics within emotional categories. At the higher level, which represents the temporal emotional context, the emotional flow between utterances is modeled by a conversation-level HMM which transitions between emotions. More specifically, the overall system comprises:

Emotional Utterance Modeling: The system takes as input audio-visual cues from each speaker during his utterance. Here, we consider vocal cues for both interlocutors, and for one of the speakers we also consider visual cues: facial expressions, head movement and hand movement. Emotion-specific, 3-state, fully-connected HMMs are trained for each modality, using the HTK Toolbox [25].

Modeling of utterance-level score vectors: We estimate the log-likelihood s_it = log P(U_t | \lambda_i), i = 1, ..., N, of every emotion-specific HMM \lambda_i given each utterance, where N is the number of emotional categories. We collect these likelihood scores to create an N-dimensional utterance-level score vector [s_it]_{i=1}^N. The joint distribution P([s_it]_{i=1}^N | emo_i) of these scores is then modeled for each emotional category separately. Two potential models are examined, namely emotion-specific Gaussian Mixture Models (GMMs) and multinomial nominal or ordinal regression. The latter can be used when our classification task is of an ordinal nature (e.g., levels of activation). The purpose of this joint score modeling step is to also exploit useful relations that may potentially exist between scores obtained at the first level. For example, we would like an utterance that scores high for emo_1 and low for the other two classes to more strongly qualify for the class emo_1 than one that has scored higher for emo_1 but also relatively high for the other categories as well. Alternatively, this joint score modeling could be seen as a score normalization process that takes into account the scores from all utterance-level emotional models.
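A rough sketch of this score-vector step is given below, assuming a helper hmm_loglik(features, emotion) that returns the log-likelihood of a trained emotion-specific HMM (in this work the HMMs are trained with HTK; the helper is a stand-in, not an HTK call). One GMM per emotion is then fitted on the score vectors of the training utterances, following the GMM variant of the joint score modeling described above; the class set is illustrative.

import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ['neg', 'neu', 'pos']  # illustrative class set

def score_vector(utt_features, hmm_loglik):
    # N-dimensional vector of log-likelihoods, one per emotion-specific HMM
    return np.array([hmm_loglik(utt_features, emo) for emo in EMOTIONS])

def fit_score_models(train_utts, train_labels, hmm_loglik, n_mix=4):
    # Model P([s_it] | emo) with one GMM per emotional category
    models = {}
    for emo in EMOTIONS:
        vecs = np.vstack([score_vector(u, hmm_loglik)
                          for u, lab in zip(train_utts, train_labels) if lab == emo])
        models[emo] = GaussianMixture(n_components=n_mix,
                                      covariance_type='full').fit(vecs)
    return models

def classify_utterance(utt_features, models, hmm_loglik):
    # Assign the emotion whose score-vector GMM gives the highest log-likelihood
    s = score_vector(utt_features, hmm_loglik).reshape(1, -1)
    return max(EMOTIONS, key=lambda emo: models[emo].score(s))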
Speaker and Dialog emotional modeling: Speaker context can be included by modeling the evolution of a speaker's emotional state, while dialog context is included by modeling the evolution of emotional states in a dialog in a speaker-independent manner. Speaker modeling incorporates temporal context of a speaker based on the assumption that his emotional state is slowly varying, while dialog modeling typically incorporates temporal context from both speakers based on the assumption that their emotional states influence each other. In both cases, context is modeled by a higher-level HMM, where each state corresponds to a different emotion. This higher-level HMM can be either first or second order, where each state is dependent upon the previous one or two states, respectively. The transition probabilities between states are then approximated by bigrams or trigrams, estimated on the training set. So, in the case of a first-order higher-level HMM for example, P(emo_2 | emo_1) would be estimated as the ratio of the training emo_2 utterances that follow emo_1 utterances, divided by the total number of emo_1 utterances. The emission probability distribution per state is the joint score conditional distribution for the corresponding emotion, P([s_it]_{i=1}^N | emo_i). In the case of dialog context modeling, the second-order HMM allows us to relate the current state with the previous emotions of not only the current speaker but also the interlocutor. This happens because in most cases an emotional utterance of a speaker is followed by an emotional utterance from his interlocutor.

Multimodal Fusion: There is flexibility regarding the choice of a fusion strategy that can be used for the multimodal cues. Modalities could be fused at the feature level, e.g., by training audio-visual lower-level HMMs, or at the model level by collecting the log-likelihood scores for each modality, e.g., [P(U_t | \lambda_i)_f, P(U_t | \lambda_i)_v]_{i=1}^N, to model the multimodal score vector distribution (f denotes the facial and v the vocal modality). Alternatively, they could be fused at the score level by adding the score log-likelihoods with appropriate weights, e.g., w_f log P([s_it^f]_{i=1}^N | emo_i) + w_v log P([s_it^v]_{i=1}^N | emo_i).

5.8.2 Experimental Setup and Results

As described in Section 4.1.1, the IEMOCAP database contains detailed face and head movement information via MoCap for one of the two speakers in each conversation, while the speech signal is available for both speakers via shotgun microphones. Therefore, here we extract facial, speech and head movement features for the speaker wearing the markers, and speech features for the speaker not wearing markers. Our facial features are 30 normalized PFA features plus their first derivatives, as described in Sections 4.1.2.1 and 4.1.2.2. The head features consist of the head translation in the (x,y,z) directions as well as the head angles (yaw, pitch and roll), plus their first derivatives, as described in Section 4.1.3. Speech features are z-normalized 12 MFCC coefficients, pitch, and energy, together with their first derivatives, extracted using Praat [126]. Audio and visual features are extracted at 25 Hz, with a 50 ms window.

Our experiments are organized in a five-fold leave-one-pair-out cross validation. The presented recognition results (unweighted F1 measures) are the pair-independent averages over the five folds. The number of test utterances per fold is 1954 ± 194 on average when we have audio cues and 988 ± 97 when we have audio-visual cues.
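Before turning to the results, the sketch below illustrates one way the first-order, dialog-level decoding described in Section 5.8.1 could be realized: a Viterbi pass over the utterances of a conversation, with bigram transition log-probabilities between emotions and, as state emissions, the log-likelihoods of the per-emotion score-vector GMMs. The variable names and the GMM interface follow the earlier sketches and are assumptions for illustration, not the exact implementation used in this work.

import numpy as np

def decode_dialog(score_vectors, score_models, emotions, log_trans, log_prior):
    # score_vectors: (T, N) array of utterance-level score vectors
    # score_models[emo]: fitted GaussianMixture for that emotion (see earlier sketch)
    # log_trans: (N, N) bigram transition log-probabilities; log_prior: (N,)
    T = score_vectors.shape[0]
    N = len(emotions)
    log_emit = np.array([[score_models[emotions[j]].score(score_vectors[t].reshape(1, -1))
                          for j in range(N)] for t in range(T)])
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_prior + log_emit[0]
    for t in range(1, T):
        for j in range(N):
            cand = delta[t - 1] + log_trans[:, j]
            back[t, j] = int(np.argmax(cand))
            delta[t, j] = cand[back[t, j]] + log_emit[t, j]
    # Backtrace the most likely emotion sequence for the dialog
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [emotions[j] for j in path[::-1]]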
5.8.2.1 Unimodal Classification and Joint Score Modeling

Here, we present the HMM unimodal recognition results for discriminating three levels of valence and activation, for the facial, vocal, head and hand movement modalities. For the valence task, we have discriminatively re-trained the emotion-specific HMMs using the Minimum Phone Error criterion [25], which led to an improvement in the average F1 measure of around 1-2% absolute. Due to class imbalance for activation, where most of the instances are of medium activation, discriminative training did not improve performance and has not been used. We also present the classification results when we perform joint score distribution modeling on top of the lower-level HMM classification. We examine two such modeling approaches: through a GMM and through logistic regression, either nominal for the valence task or ordinal for the activation task. The GMMs or regression models have been trained using the likelihood scores in the train set, to which we added low random Gaussian noise to improve generalization. The results are presented in Table 5.9.

Table 5.9: Classification performances (F1 measures) for three levels of valence and activation using face (f), voice (v), head (h) and hand (ha) features: mean and standard deviation of F1-measure across the 5 folds (5 dyadic sessions).

Lower-level HMM classification
classifier            valence         activation
HMM (v)               54.03 ± 3.09    55.39 ± 1.38
  with markers        54.91 ± 2.82    57.53 ± 2.10
  without markers     52.45 ± 4.05    52.75 ± 2.11
HMM (f)               60.26 ± 3.71    47.71 ± 4.49
HMM (h)               43.53 ± 2.59    51.51 ± 2.05
HMM (ha)              42.34 ± 3.14    44.69 ± 1.01

HMM and score vector modeling using GMM
classifier            valence         activation
HMM+GMM (v)           54.95 ± 1.86    56.61 ± 2.88
  with markers        56.46 ± 2.38    58.88 ± 4.26
  without markers     52.99 ± 2.41    54.09 ± 1.91
HMM+GMM (f)           59.62 ± 3.71    48.57 ± 4.44
HMM+GMM (h)           45.48 ± 1.75    52.74 ± 1.83
HMM+GMM (ha)          43.09 ± 2.20    48.79 ± 3.31

HMM and score vector modeling using Multinomial Logistic Regression (MLR)
classifier            valence (nominal)   activation (ordinal)
HMM+MLR (v)           54.52 ± 2.24        57.74 ± 1.11
  with markers        55.35 ± 2.95        59.45 ± 2.57
  without markers     53.17 ± 2.48        55.79 ± 1.13
HMM+MLR (f)           58.91 ± 2.57        49.47 ± 5.38
HMM+MLR (h)           44.41 ± 1.89        52.52 ± 1.32
HMM+MLR (ha)          43.57 ± 2.35        48.18 ± 2.93

We notice that facial cues seem to be more informative for valence, while vocal cues are more informative for activation. The head and hand movement cues generally carry less emotional information, although they seem informative for activation level discrimination. Recognition performance based on voice tends to be higher when the speaker is wearing the markers. Analysis of the microphone signals suggests that this is possibly an artifact of the database due to the placement of the microphones. Joint score distribution modeling generally gives an improvement over the low-level HMMs; the GMM and logistic regression models perform comparably, with ordinal regression being the best performing for the ordinal activation task, but nominal regression performing lower than the GMM for the valence task. For the rest of our experiments, due to lack of space, we only present the results of the GMM method, which gives a consistent improvement for both classification tasks.

5.8.2.2 Multimodal Fusion at Multiple Levels

In Table 5.10, we present the multimodal fusion results using the face and voice (fv), or face, voice and head (fvh) modalities (corresponding only to the cases where the speakers are wearing markers).
We examine fusion at multiple levels. At the feature level, we train multistream HMMs where the stream weights for the audio and facial modalities are optimized on the train set. At the model level, we train GMMs to model multimodal score vectors, and at the score level we perform a weighted average of the score-vector GMM log-likelihoods, where the weights for each modality have been optimized on the train set. The results using the hand modality are omitted since they did not show a significant performance increase. Based on the intuition that face and voice cues are more correlated than head cues, we fuse the head movement modality at the same or a later stage than the other two. For the valence task, where both facial and vocal cues have adequate discriminative power, fusing them at the feature level leads to good performance, and including head cues at the score level gives a further small performance increase. For the activation task, where the vocal cues alone perform considerably better than the facial cues, it is preferable to combine the three modalities at the score level, after adjusting the modality weights on the train set. Finally, as expected, multimodal classifiers perform considerably better than unimodal ones.

Table 5.10: Classification performances (F1 measures) for three levels of valence and activation by fusing face (f), voice (v) and head (h) modalities at various levels: mean and standard deviation of F1-measure across the 5 folds.

Fusion of face and voice
classifier                  fusion approach          valence         activation
HMM(fv)+GMM                 feature                  62.75 ± 4.43    57.43 ± 3.76
HMM+fuse(fv)+GMM(f)         model                    61.08 ± 4.40    57.94 ± 3.71
HMM+GMM+fuse(fv)            score                    62.22 ± 2.73    59.27 ± 3.92

Fusion of face, voice and head
classifier                  fusion approach          valence         activation
HMM(fv)+fuse(h)+GMM         fv: feature, h: model    61.92 ± 4.11    57.16 ± 3.79
HMM(fv)+GMM+fuse(h)         fv: feature, h: score    63.26 ± 4.05    58.83 ± 3.66
HMM+fuse(fvh)+GMM           fvh: model               60.47 ± 4.37    58.64 ± 2.74
HMM+fuse(fv)+GMM+fuse(h)    fv: model, h: score      61.01 ± 3.80    59.30 ± 2.92
HMM+GMM+fuse(fvh)           fvh: score               61.84 ± 3.26    61.15 ± 2.95

5.8.2.3 Speaker and Dialog Modeling

Here, we present the results after context modeling. We examine two issues: firstly, whether emotional context information from a speaker can benefit classification of his current emotion (speaker modeling), and secondly, whether including context from his interlocutor further increases performance (dialog modeling). For speaker modeling we use a first-order HMM, denoted as HMM_sp^1st, while for dialog modeling we use a second-order HMM, denoted as HMM_d^2nd. Higher-order modeling for speaker context did not provide any additional benefits. In all cases, transition probabilities were estimated on the training set. The results for the valence classification task are presented in Table 5.11. We do not present results for activation, where using temporal emotional context does not significantly increase performance (see also Section 5.5.2).

Here, we make a distinction between emotion classification of the speaker wearing the markers, where audio-visual cues are available, and the speaker without markers, where only audio cues are available. The temporal information that is beneficial for each case may vary, as can be seen in Table 5.11. The difference in available modalities for each speaker makes emotion classification more reliable for the speaker wearing the markers, compared to the speaker without markers.
Therefore, although temporal emotional context from the same speaker generally seems beneficial, context from the other speaker through dialog modeling is beneficial only if the other speaker has more available modalities, and therefore more reliable emotional estimates. This motivated us to try mixed modeling, that is, speaker modeling for the speaker wearing markers and dialog modeling for the speaker without markers, which resulted in the highest classification performance, as can be seen in Table 5.11.

Table 5.11: Classification performances (F1 measures) for three levels of valence for speaker and dialog modeling. We use the best multimodal fusion approach from the previous section.

No higher-level modeling
speaker          classifier              modeling      F1 (mean & std. dev)
with markers     HMM(fv)+GMM+fuse(h)     --            63.26 ± 4.05
no markers       HMM+GMM (v)             --            52.75 ± 2.11
in total                                               58.92 ± 2.65

Speaker modeling, 1st-order HMM for each speaker separately
with markers     HMM(fv)+HMM,fuse(h)     HMM_sp^1st    66.09 ± 3.39
no markers       HMM+HMM (v)             HMM_sp^1st    54.88 ± 3.52
in total                                               61.30 ± 2.93

Dialog modeling, 2nd-order HMM for the total dialog
with markers     HMM(fv)+HMM,fuse(h)     HMM_d^2nd     62.01 ± 3.14
no markers       HMM+HMM (v)             HMM_d^2nd     57.14 ± 2.80
in total                                               59.98 ± 2.61

Mixed modeling, according to whether the speaker wears markers or not
with markers     HMM(fv)+HMM,fuse(h)     HMM_sp^1st    66.09 ± 3.39
no markers       HMM+HMM (v)             HMM_d^2nd     57.14 ± 2.80
in total                                               62.31 ± 2.18

5.8.3 Conclusions

In this section, we have proposed and tested a hierarchical, multimodal framework that captures emotional dynamics within and between emotions in affective dialogs and incorporates multiple modalities at various levels, depending upon the classification task and the modality. According to our results, multimodal classifiers outperform unimodal ones, especially for the case of activation, where facial, vocal and head movement cues all carry relevant emotional information. Valence classification benefits significantly from incorporating temporal context from the same speaker. Considering context from the other speaker is only helpful when we have a reliable (multimodal) emotional estimate of that speaker. Our framework is flexible enough to handle varying characteristics of each emotional task, or dialogs where a different amount of multimodal information is available per speaker. We have examined HMM, GMM and MLR classifiers for utterance and score-vector modeling; however, other generative or discriminative approaches could be applied according to the problem at hand.

Chapter 6: Tracking Continuous Emotional Trends Through Time

6.1 Continuous Tracking of Emotions and Emotional Trends

Human expressive communication is characterized by the continuous interplay of multimodal information, such as facial, vocal and bodily gestures, which may convey the participant's affect. The affective state of each participant can be seen as a continuous variable that evolves with variable intensity and clarity over the course of an interaction. It can be described by continuous dimensional attributes: activation, valence and dominance. This can be seen as a more generic way to classify emotions, especially for emotional manifestations that may not have a clear categorical description.
In this chapter, we address the problem of continuous tracking of activation, valence and dominance, when they are considered to be continuously valued, using a statistical generative framework that jointly models the underlying emotional states and the observed audio-visual cues. Our goal is to obtain a continuous description of each participant's underlying emotional state through the course of an improvised dyadic interaction. Our experimental setup is generic; participants express a wide variety of emotions that are not pre-defined but are elicited through their interaction, and have varying roles throughout the performance (speaker, listener, neither). This approach has the potential to shed light on the temporal dynamics of emotions through an interaction and highlight regions where abrupt emotional change happens. These could be viewed as regions of emotional saliency. Moreover, the development of methodologies to continuously track emotional states could enable technologies such as naturalistic Human-Computer Interfaces (HCI) that can continuously process a variety of multimodal information from the user(s) as they unfold, monitor the users' internal state and respond appropriately when needed.

We apply a Gaussian Mixture Model (GMM) based methodology, originally introduced in [43], to compute an optimal statistical mapping between an underlying emotional state and an observed set of audio-visual features, both evolving through time. We formulate the emotion tracking problem at various time resolutions, to investigate the effect of the tracking detail on the final performance. For our experiments, we use the USC CreativeIT database, described in detail in Chapter 3, which contains detailed full-body Motion Capture (MoCap) information in the context of expressive theatrical improvisations. We extract a variety of psychology-inspired body language features describing each participant's body language and relative interaction behaviors. Those features and their relation to the underlying emotional states have been described in Section 4.2. Here, we focus on continuous emotional state estimation using the extracted body language features and speech information of the participants.

Our experimental results indicate that we are better at tracking changes in emotional attributes than the absolute values themselves, following a similar trend as the human annotations, as discussed in Section 3.5, which describes the data annotation results. Furthermore, the proposed GMM-based tracking method outperforms the other examined methods in terms of correlation-based performance metrics (estimating trends of attributes). For activation trends, the tracking performance is close to human agreement, while for dominance we achieve encouraging results. The work presented in this chapter has been published in [44, 104].

6.2 Related Work

The use of dimensional representations of emotions has been adopted by many researchers, but typically the dimensional values are quantized into discrete levels. However, a continuous representation may allow a more generic and flexible treatment of emotions. Examples of work that avoid discretizing the emotional dimensions include [34, 35], where regression approaches, such as Support Vector Regression (SVR), were used to estimate continuous dimensional attributes from speech cues of presegmented utterances.
Most of the existing literature, including works that focus on recognition of emotions as part of an emotion sequence [165, 166], presegments the time dimension into units for recognition, e.g., consecutive words or utterances. Few works have avoided segmenting the temporal dimension and have addressed the problem of continuously tracking emotions across time. For example, in [36] the authors present continuous recognition of the emotional content of movies using a Hidden Markov Model (HMM) which classifies dimensional attributes into discrete levels.

A relatively small amount of the literature treats both time and emotion variables as continuous. In [41] the authors describe a multimodal system to continuously track the valence and activation of a speaker, using SVR and Long Short-Term Memory (LSTM) regression, with LSTM being the best performing approach. Similarly, single-modality systems were proposed in [39, 40] using SVR and LSTM neural networks for regression to continuously estimate valence and activation values from emotional speech. An unsupervised method for mapping the emotional content of movies in the valence-activation space was proposed in [37, 38] using low-level audio and video cues. In our work, we propose a supervised, GMM-based methodology to continuously track an underlying emotional state using body language and speech information.

6.3 Overview of our Emotion Tracking Experiments

Figure 6.1 presents a summary of our work. As illustrated in the left of Fig. 6.1, our study relies on video, audio and MoCap data collected from two actors engaged in emotional dyadic improvisations. The center part of Fig. 6.1 describes the data processing, specifically the extraction of detailed body language and speech information from both participants, as well as the data annotation. Data annotation was performed by multiple human evaluators who were asked to continuously rate the perceived valence, activation and dominance levels of each participant during each interaction (as described in Section 3.4). The result is multiple emotional curves, which are averaged to provide the ground truth for further experiments. After these steps, we have available for each participant various body language features x_body extracted throughout the interaction (as described in Section 4.2), speech features x_speech extracted from regions where that person is speaking, and the corresponding emotional curves y. The joint distribution P(x, y) is modeled using a Gaussian Mixture Model (GMM), where x can be a visual or audio-visual feature vector and y is one of the three emotional attributes. The conditional distribution P(y|x) is also a GMM. The GMM-based tracking approach consists of computing a mapping from the observed features to the underlying emotional curve by maximizing the conditional probability of the emotion given the features, i.e., \hat{y} = \arg\max_y P(y|x). In the right part of Fig. 6.1 we present an example of the resulting emotional curve estimate.

[Figure 6.1: An overview of our work on emotion tracking. From left to right we depict the data collection setting (described in Sec. 3.3), the audio-visual feature extraction (described in Sec. 4.2) and data annotation process (described in Sec. 3.4), as well as the GMM-based statistical mapping approach that we follow for estimating the emotional curves.]
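As a concrete preview of this mapping, before the full formulation in Section 6.3.1, the sketch below fits a joint GMM on stacked feature/attribute vectors and computes a frame-wise conditional-mean (MMSE) estimate of the attribute, i.e., the estimate without dynamic (delta) features; the trajectory-level estimate with deltas follows in the next section. The function names and the choice of sklearn/scipy are assumptions made for illustration.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, y, n_mix=8):
    # Fit a GMM on joint vectors [x_t, y_t]
    Z = np.hstack([X, y.reshape(-1, 1)])
    return GaussianMixture(n_components=n_mix, covariance_type='full').fit(Z)

def conditional_mean_estimate(gmm, X):
    # Frame-wise E[y_t | x_t] under the joint GMM (no delta features)
    d = X.shape[1]
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d, :d]
    y_hat = np.zeros(len(X))
    for t, x in enumerate(X):
        # Component posteriors given x_t, then a mixture of conditional means
        resp = np.array([w * multivariate_normal.pdf(x, m, S, allow_singular=True)
                         for w, m, S in zip(gmm.weights_, mu_x, S_xx)])
        resp /= resp.sum()
        cond = np.array([mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
                         for m in range(gmm.n_components)])
        y_hat[t] = resp @ cond
    return y_hat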
6.3.1 Statistical Mapping Framework

Let x_t denote the vector of body language and speech observations at time t of an interaction recording and y_t be the underlying emotional attribute, namely activation, valence or dominance. One way to predict y_t given x_t would be by maximizing the corresponding conditional probability:

\hat{y}_t = \arg\max_{y_t} P(y_t \mid x_t; \lambda^{(y,x)})   (6.1)

assuming a specific model \lambda^{(y,x)} for two concurrent instantiations of x and y. However, given the continuous nature of the involved variables, it would be beneficial to incorporate dynamic information in this estimation. This can be achieved by also jointly modeling the first and second temporal derivatives of y_t and x_t, denoted here as \Delta y_t, \Delta^2 y_t and \Delta x_t, \Delta^2 x_t respectively. By replacing y_t with Y_t = [y_t, \Delta y_t, \Delta^2 y_t]^T and x_t with X_t = [x_t^T, \Delta x_t^T, \Delta^2 x_t^T]^T, the optimal estimate \hat{y} = [y_1, \ldots, y_t, \ldots, y_T] of the emotional flow for the course of the interaction can be found as:

\hat{y} = \arg\max_{y} P(Y \mid X; \lambda^{(Y,X)}),   (6.2)

where X = [X_1^T, X_2^T, \ldots, X_t^T, \ldots, X_T^T]^T is the sequence of the dynamic information-augmented features and Y = [Y_1^T, Y_2^T, \ldots, Y_t^T, \ldots, Y_T^T]^T the corresponding emotional attribute and its derivatives for the entire interaction. Following the paradigm that was originally introduced for voice conversion [42], we consider the model \lambda^{(Y,X)} of the joint probability of (Y_t, X_t) to be a Gaussian Mixture Model (GMM):

P(Y_t, X_t \mid \lambda^{(Y,X)}) = \sum_{m=1}^{M} a_m \, \mathcal{N}\!\left([Y_t^T, X_t^T]^T; \mu_m^{(Y,X)}, \Sigma_m^{(Y,X)}\right)   (6.3)

with a_m, \mu_m^{(Y,X)} and \Sigma_m^{(Y,X)} being each component's weight, mean and covariance respectively:

\mu_m^{(Y,X)} = \begin{bmatrix} \mu_m^{(Y)} \\ \mu_m^{(X)} \end{bmatrix}, \quad
\Sigma_m^{(Y,X)} = \begin{bmatrix} \Sigma_m^{(YY)} & \Sigma_m^{(YX)} \\ \Sigma_m^{(XY)} & \Sigma_m^{(XX)} \end{bmatrix}.   (6.4)

The conditional probability in (6.2) can be written as [42]:

P(Y \mid X; \lambda^{(Y,X)}) = \sum_{\text{all } m} P(m \mid X; \lambda^{(Y,X)}) \, P(Y \mid X, m; \lambda^{(Y,X)})
 = \prod_{t=1}^{T} \sum_{m=1}^{M} P(m \mid X_t; \lambda^{(Y,X)}) \, P(Y_t \mid X_t, m; \lambda^{(Y,X)})   (6.5)

where m = [m_1, \ldots, m_t, \ldots, m_T] is a sequence of mixture components and:

P(m \mid X_t; \lambda^{(Y,X)}) = \frac{a_m \, \mathcal{N}(X_t; \mu_m^{(X)}, \Sigma_m^{(XX)})}{\sum_{i=1}^{M} a_i \, \mathcal{N}(X_t; \mu_i^{(X)}, \Sigma_i^{(XX)})}   (6.6)

P(Y_t \mid X_t, m; \lambda^{(Y,X)}) = \mathcal{N}(Y_t; E_{m,t}^{(Y)}, D_m^{(Y)}),   (6.7)

where:

E_{m,t}^{(Y)} = \mu_m^{(Y)} + \Sigma_m^{(YX)} \Sigma_m^{(XX)\,-1} (X_t - \mu_m^{(X)}),   (6.8)

D_m^{(Y)} = \Sigma_m^{(YY)} - \Sigma_m^{(YX)} \Sigma_m^{(XX)\,-1} \Sigma_m^{(XY)}.   (6.9)

Estimation of the underlying emotional flow \hat{y} for the entire utterance can finally be achieved based on (6.2) via Expectation Maximization, as described in detail in [42, 43]. The initial estimate is just the Minimum Mean Squared Error (MMSE) estimate based on the conditional probability distribution (6.5) without using dynamic information. Due to the use of dynamic information in the estimation, the final estimate at each time instant ends up being affected by the entire sequence of observations. It has been shown that in the case of a single Gaussian model the incorporation of derivatives in an analogous scenario corresponds to fixed-lag Kalman smoothing [167]. The lag depends on the window length 2L+1 over which the derivatives are approximated (second derivatives are computed by applying (6.10) to the first derivatives):

\Delta y_t = \frac{\sum_{\tau=-L}^{L} \tau \,(y_{t+\tau} - y_{t-\tau})}{2 \sum_{\tau=-L}^{L} \tau^2}.   (6.10)

This scheme has been successfully applied for voice conversion [42], lip movement-speech synchronization [168] and acoustic-to-articulatory speech inversion [43]. Speech inversion refers to the problem of recovering the underlying articulation during speech production from just the observed speech acoustics.
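As a concrete illustration, the sketch below computes the per-frame MMSE estimate of Eqs. (6.6) and (6.8), i.e., the initialization of the trajectory estimate before the dynamic-feature refinement, together with the regression-style derivative of Eq. (6.10). This is only a sketch under stated assumptions: the thesis implementation used HTK and MATLAB, while the use of scikit-learn's GaussianMixture and all variable names here are illustrative.

```python
# Minimal sketch (not the thesis implementation): per-frame MMSE estimate
# E[y_t | x_t] from a joint GMM over z = [y, x], plus the delta of Eq. (6.10).
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def delta(y, L=2):
    """Eq. (6.10) in its equivalent form sum_tau tau*y_{t+tau} / sum_tau tau^2."""
    y = np.asarray(y, dtype=float)
    pad = np.pad(y, L, mode="edge")                  # replicate endpoints
    taus = np.arange(-L, L + 1)
    num = sum(tau * pad[L + tau: L + tau + len(y)] for tau in taus)
    return num / np.sum(taus ** 2)

def fit_joint_gmm(y_train, X_train, n_mix=4):
    """Joint full-covariance GMM over [y, x], as in Eq. (6.3)."""
    Z = np.column_stack([y_train, X_train])
    return GaussianMixture(n_components=n_mix, covariance_type="full").fit(Z)

def mmse_track(gmm, X):
    """sum_m P(m | x_t) * E_{m,t}^{(Y)}  (Eqs. (6.6) and (6.8), scalar y)."""
    mu_y, mu_x = gmm.means_[:, 0], gmm.means_[:, 1:]
    covs, w = gmm.covariances_, gmm.weights_
    M, est = gmm.n_components, np.zeros(len(X))
    for t, xt in enumerate(X):
        lik = np.array([w[m] * multivariate_normal.pdf(xt, mu_x[m], covs[m][1:, 1:])
                        for m in range(M)])
        post = lik / lik.sum()                       # P(m | x_t), Eq. (6.6)
        cond = np.array([mu_y[m] + covs[m][0, 1:] @
                         np.linalg.solve(covs[m][1:, 1:], xt - mu_x[m])
                         for m in range(M)])         # E_{m,t}^{(Y)}, Eq. (6.8)
        est[t] = post @ cond
    return est
```

In the full scheme, y and x are first augmented with these deltas and the resulting trajectory is then refined with the EM iterations of [42, 43].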
In a similar way, herein, we are trying to recover the underlying emotional state, as represented by activation, valence and dominance, from the observed body language and speech observations.

6.3.2 Database and Features

For our experiments we use the CreativeIT database. The data collection and the continuous emotional annotation process are described in detail in Chapter 3. The body language feature set used for our experiments, the feature selection process, and statistical analyses of the features are described in detail in Section 4.2. Regarding the speech features, we extract 12 Mel Frequency Cepstral Coefficients (MFCCs) along with pitch and energy, using overlapping windows of length 30 msec and a framerate of 16.67 msec (same as the MoCap framerate). Such features are standard for speech emotion recognition [169]. In contrast to body language features, which are extracted throughout the recording session, the acoustic features are extracted only when the actors are speaking. For this purpose, the microphone signal obtained from each actor is first manually transcribed into regions where that actor is speaking and being silent.

6.4 Tracking Emotion Trends at Multiple Resolutions

6.4.1 GMM-based tracking at frame and window level

Our GMM-based tracking approach follows the mathematical framework described in Section 6.3.1. Additionally, it takes into account that body language features are available throughout the interaction, while speech features are available only when the actor is speaking. Therefore, when audio-visual features are considered we compute two mappings: a visual mapping trained only with body language features and an audio-visual mapping trained with both body language and speech features. The audio-visual features are fused at the feature level for training the audio-visual GMM. During testing, we apply the GMM mapping on overlapping windows. When only visual features are used, we compute the visual mapping on each window, irrespective of whether the window contains speech or not. When audio-visual features are used, we compute an audio-visual mapping for the windows where speech is present; otherwise we compute a visual mapping. Therefore, we again scan the total recording using visual information and, if available, speech information. As a result, the results of the visual and audio-visual experiments are comparable as they are computed on the same recordings, and the audio-visual results provide information about whether speech improves emotion tracking on top of the visual information. Empirically, we confirmed that including dynamic features produces a smoother emotional trajectory estimate, since it considers a window of the emotional state and the feature vector centered at the frame of interest. In our implementation, the underlying emotional trajectory y_t, t = 1, ..., T is estimated over consecutive overlapping windows of length 300 frames, with 150 frames overlap. Then curves obtained from neighboring windows are merged using the add-overlap algorithm, and are smoothed using a low-pass filter.

This approach computes detailed frame-by-frame emotional trajectory estimates. However, emotional states are slowly varying, therefore this degree of accuracy may not be necessary. Modeling body and speech features at such detail may lead to modeling of noise or gestures unrelated to emotion rather than emotionally informative audio-visual manifestations.
This motivates the use of window-level tracking, where features and feature functionals are extracted over larger windows in an attempt to capture more meaningful emotional and gestural dynamics. In this case, the mapping function takes as input the functionals computed over a window and outputs the average emotional attribute value of that window. Specifically, we average the ground truth curves over overlapping windows of 3 sec length and 2 sec overlap. We also apply such windows to the audio-visual features, over which we extract a variety of statistical functionals, specifically: mean, standard deviation, median, minimum, maximum, range, skewness, kurtosis, the lower and upper quantiles (corresponding to the 25th and 75th percentiles) and the interquantile range. Therefore, we extract a potentially richer feature description by including statistical functionals over features. The feature vector dimensionality is reduced by PCA. We train full covariance GMMs using 4 and 2 mixtures for frame and window-level tracking respectively (the method is not sensitive to the number of Gaussian mixtures). The use of a full covariance matrix is important in order to capture relations between the various body language and speech features, and empirically leads to better performance. The joint feature-emotion GMM models were trained using the HTK Toolbox [25], while the subsequent EM equations for computing the statistical GMM-based mapping were implemented in MATLAB, based on [43]. Our implementation of the GMM-based continuous mapping for emotion tracking has been freely released online at [170].

6.4.2 Using LSTM neural networks for regression

Long Short Term Memory (LSTM) neural networks were introduced in [28] as a variant of Recurrent Neural Networks (RNN). While RNNs are able to model a certain amount of history through their cyclic connections, it has been shown that longer range history is inaccessible to RNNs since the backpropagated error either blows up or decays over time (the vanishing gradient problem). LSTM networks overcome the vanishing gradient problem by storing information in their hidden layers over an arbitrarily long amount of time [28]. LSTM networks have been applied in a variety of pattern recognition applications, including phoneme classification [149], audio-visual emotion classification [119], and regression for tracking continuous emotions [41]. Modeling history seems to be beneficial for the problem of emotion tracking, since emotions tend to be slowly varying over time, and LSTM regression was shown to outperform Support Vector Regression (SVR) for continuously tracking valence and activation over time [41]. Here, we apply LSTM networks to both the frame and the window-level regression problems. LSTM networks for regression are trained using the RNNlib Toolbox [151], without using derivative features. Including derivatives was deemed redundant since temporal information is already captured through the network. The LSTM networks consist of one hidden layer with 128 memory blocks (we also experimented with 64 and 256 memory block configurations, which performed similarly). To improve generalization, low Gaussian noise was added to the training features. The produced curves are smoothed using a low-pass filter.
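For reference, a minimal sequence-regression sketch of the kind of model used here is shown below. The thesis used the RNNlib toolbox, so this PyTorch code is only an illustrative stand-in: the single hidden layer of 128 units and the added input noise follow the text, while the optimizer, learning rate and tensor shapes are assumptions.

```python
# Illustrative PyTorch sketch of LSTM sequence regression for one emotional
# attribute; treat it as a stand-in for the RNNlib setup, not a port of it.
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, n_features, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)    # one attribute value per step

    def forward(self, x):                        # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)          # (batch, time)

model = EmotionLSTM(n_features=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(feats, curve, noise_std=0.1):
    # Gaussian noise on the inputs, as in the text, to improve generalization.
    noisy = feats + noise_std * torch.randn_like(feats)
    opt.zero_grad()
    loss = loss_fn(model(noisy), curve)
    loss.backward()
    opt.step()
    return loss.item()
```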
6.4.3 Baseline based on simple functions of informative features

A relevant question is what the tracking performance would be if we estimated an attribute, e.g., activation, as a simple function of informative features, e.g., velocity of the body or hands, intensity of voice, leaning angle towards the interlocutor, etc. Indeed, such approaches are common in the behavioral sciences, where for instance speech intensity and pitch are sometimes used as indicators of vocal activation [171]. Along these lines, assuming that an interlocutor's emotional attributes and his audiovisual features are normalized to be roughly in the same range, we could compute an estimate of his activation by taking a functional (e.g., mean) of the most activation-informative features. If a feature is negatively correlated with activation, then we multiply it with -1 beforehand. This method does not require training a model; however, it assumes that we have available a set of informative features for each attribute, which can be chosen through feature selection, e.g., by using the approaches of Section 4.2.2, or through prior knowledge. This simple baseline could be useful for cases where we have little or no annotated data. In our implementation, we select the K most informative body features for each attribute, based on the F_value criterion, and the L most informative speech features based on correlation with the attribute (results based on the mRMR_C criterion are similar and are omitted for lack of space). All the features and the emotional attributes are first normalized to have zero mean and unit standard deviation across the database, and features that are negatively correlated with the emotional attribute are multiplied with -1. Then we compute the mean, median and maximum of these features as different attribute estimates (please refer to the Appendix for a list of the most informative body language features per attribute). For the window-level tracking, we follow the same approach using normalized statistical functionals of body and speech features extracted over windows, so as to directly compare with the methods of Sections 6.4.1 and 6.4.2. Again, we select the K most correlated functionals of body features and the L most correlated functionals of speech features.

6.5 Experiments, Results and Discussion

Our experiments are organized in an eight-fold leave-one-dyad-out cross validation. Actors belong to only one dyad, therefore this cross validation ensures that test set actors are not seen during training. Each dyad was recorded on one of eight recording days, and since the selected number of recordings per day varies, this results in 5-12 actor recordings selected for testing at each fold, while the rest are used for training. We focus primarily on tracking the underlying emotional trends, and therefore we compute the correlations between the ground truth and the estimated emotional trajectories as our primary performance metric. The body language feature sets are selected through the correlation-based criterion mRMR_C or the Fisher criterion F_value (the MI-based criterion mRMR_I gave slightly lower performance and is omitted). Those criteria are described in detail in Section 4.2.2. We systematically examine the effect of the number of body language features on the performance of each tracking approach, by selecting the top 5, 10, 15, 30 features for the mRMR_C criterion, and the top 10, 15, 40 features for the F_value criterion (which are later further reduced after removing highly correlated features).
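Given such a selected feature set, the simple baseline of Section 6.4.3 reduces to a few lines. The sketch below (Python, with illustrative variable names) assumes the sign of each feature's correlation with the attribute and the global normalization statistics are already known; it is not the exact thesis script.

```python
# Sketch of the simple baseline: z-normalize the K selected features with the
# global (database-level) statistics, flip negatively correlated features, and
# use their per-frame mean as the estimate of the emotional attribute.
import numpy as np

def simple_baseline(features, signs, mu, sigma):
    """features: (T, K) selected features; signs: (K,) entries of +/-1 taken
    from the sign of each feature's correlation with the attribute;
    mu, sigma: (K,) global means and standard deviations."""
    z = (features - mu) / sigma
    return (z * signs).mean(axis=1)      # one attribute estimate per frame
```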
The performance of the GMM-based and LSTM frame-level tracking as a function of the number of selected body features is shown in Fig. 6.2. This approach represents a principled way to select the final number of features based on visual frame-level tracking performance, although it could cause some amount of overfitting. Note, however, that the selected number of features is not necessarily optimal for the audio-visual and window-level tracking, since some of the body language features that are left out may be important when used in combination with speech, or may have informative statistical functionals. To underscore this point, we present an example of valence tracking in Tables 6.1 and 6.3, where selecting 11 features is optimal for the frame-level visual experiments but not for the audio-visual and window-level experiments, where larger feature sets, e.g., of 24 features, perform better.

[Figure 6.2 appears here with three panels: (a) Activation, (b) Valence, (c) Dominance. Each panel plots performance (median correlation) against the number of features for GMM-based and LSTM tracking under the mRMR_C (CORR) and F_value selection criteria.]
Figure 6.2: Frame-level tracking using body language features: Performance of the various tracking approaches and feature selection algorithms (in terms of median correlation with ground truth) as a function of the number of body language features used.

For GMM-based and LSTM frame-level tracking, we select the number of body language features that leads to the best performance. We include speech information by adding the 14 speech features to our body language feature set (feature-level fusion). We also add the first and second feature derivatives. For window-level tracking, we perform statistical functional computation on the respective optimal frame-level feature set and then Principal Component Analysis (PCA), keeping the first 50 components, which explain about 88-95% of the total variability. To prevent oversmoothing, we only add first derivatives, resulting in 100-dimensional feature vectors for both the visual and the audio-visual case. Both the features and the emotional curves are z-normalized using the global means and standard deviations of the dataset. Regarding the simple baseline described in Section 6.4.3, for frame or window-level tracking we select the number of body language and speech features that empirically give the best performance, and we combine them using their mean (which tends to perform better than median and maximum). Our observation is that the performance of this simple approach saturates sooner than the other algorithms, typically around 10 or 15 features.

6.5.1 Frame-level tracking using audio-visual information

In Table 6.1, we present the tracking performance of the visual and audio-visual feature methods for the GMM-based mapping and the LSTM regression approaches. The number of selected body language features is presented in parentheses. For the simple baseline method, the selected number of K body and L speech features is presented in parentheses as (K+L). For each case, we present the median of the correlations between each estimated curve and the ground truth, as a metric of the overall performance.
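Concretely, this metric can be computed as in the short sketch below (Python, illustrative names): one Pearson correlation per test recording between the estimated and ground-truth curves, summarized by the median over the test recordings.

```python
# Sketch of the correlation-based evaluation used throughout this section.
import numpy as np
from scipy.stats import pearsonr

def median_correlation(estimated_curves, ground_truth_curves):
    """Both arguments: lists of 1-D arrays, one pair per test recording."""
    corrs = [pearsonr(est, gt)[0]
             for est, gt in zip(estimated_curves, ground_truth_curves)]
    return np.median(corrs)
```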
In the last row of Table 6.1, we also report the median inter-annotator correlations computed at the frame level, as a metric of inter-annotator agreement for this task. For all methods, activation tracking is the best performing task, followed by dominance, while neither of the approaches seems to adequately capture valence trends. Considering speech features increases activation performance (speech features generally convey activation information [172]) and slightly boosts dominance tracking performance, but offers no significant increase for valence. Both feature selection criteria perform comparably. For valence, the F_value criterion resulted in selecting a relatively small body language feature set of 11 features, therefore we also tried a larger feature set of 24 features to see if extra features would increase performance at later stages. Indeed, the extra features and their statistical functionals seem to slightly boost performance at window-level tracking (see the results of Section 6.5.2, Table 6.3); however, valence tracking generally remains problematic. This suggests that valence is not adequately reflected in our features, or that body language generally conveys less information about valence, compared to activation and dominance. Valence may be better reflected through other modalities; for instance, facial expressions are found to discriminate valence states well [173, 166]. Note that when annotators rated each actor's valence they had access to a variety of cues besides body language and speech, including facial expressions and lexical content, a fact that could explain their good agreement scores.

Table 6.1: Continuous tracking at the frame level of activation, valence and dominance using body language and speech cues. We present the median correlation value between the computed emotional curve and the ground truth. Parentheses indicate the number of selected body features (K), or body and speech features (K+L).

Body language features: median correlations with ground truth
  method, feature selection       activation       valence                     dominance
  GMM-based mapping, F_value      0.3910 (26) ⋆†   0.1127 (11) / 0.1005 (24)   0.2102 (15) ⋆
  GMM-based mapping, mRMR_C       0.4121 (25) ⋆†   0.0699 (30)                 0.2212 (30)
  LSTM regression, F_value        0.2905 (14)      0.0934 (28)                 0.1712 (13)
  LSTM regression, mRMR_C         0.2994 (20)      0.0524 (25)                 0.1859 (30)
  simple baseline, mean           0.2634 (10)      0.0650 (10)                 0.1629 (15)

Body language + speech features: median correlations with ground truth
  method, feature selection       activation       valence                     dominance
  GMM-based mapping, F_value      0.4629 ⋆†        0.1178 / 0.1220 ⋆†          0.2495 ⋆†
  GMM-based mapping, mRMR_C       0.4692 ⋆†        0.0756                      0.2582 ⋆†
  LSTM regression, F_value        0.2908           0.0619                      0.1610
  LSTM regression, mRMR_C         0.3874           0.0842                      0.1596
  simple baseline, mean           0.3000 (10+5)    0.0815 (10+5)               0.1829 (5+5)

Median inter-annotator correlation (agreement)
  activation 0.5945    valence 0.6171    dominance 0.6028

Between the tracking approaches, the GMM-based mapping achieves consistently higher correlation for activation and dominance. We performed the non-parametric Wilcoxon signed-rank test to examine whether the median of the paired differences between algorithms is significantly different from zero. Specifically, we compared the GMM and LSTM approaches given the same feature selection method, the GMM approach with the simple baseline, and the LSTM approach with the simple baseline (p = 0.05). Statistically significant differences are denoted in Table 6.1 with the symbol ⋆ for the GMM vs. LSTM comparison, † for the GMM vs. simple baseline comparison, and ‡ for the LSTM vs. simple baseline comparison (symbols are placed next to the method that performs better in the comparison).
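The significance check itself can be sketched as follows (Python; the pairing by recording and the 0.05 threshold follow the text, while the helper name is ours):

```python
# Sketch of the paired Wilcoxon signed-rank comparison between two methods,
# applied to their per-recording correlations with the ground truth.
import numpy as np
from scipy.stats import wilcoxon

def is_significantly_better(corrs_a, corrs_b, alpha=0.05):
    """True if method A's correlations are significantly higher than B's."""
    _, p = wilcoxon(corrs_a, corrs_b)
    better = np.median(np.asarray(corrs_a) - np.asarray(corrs_b)) > 0
    return (p < alpha) and better
```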
For example, for the frame-level tracking of activation using body language features, the symbols ⋆ and † next to GMM tracking (F_value) indicate that the algorithm performs significantly better than both LSTM (F_value) and the simple baseline. Overall, the GMM-based mapping significantly outperforms both the LSTM method and the simple baseline for most activation and dominance tasks. However, LSTM tracking hardly outperforms the simple baseline, which works reasonably well for the activation and dominance tasks.

In order to examine how these methods approximate the actual values of the underlying emotional curves, we also compute the Root Mean Square Error (RMSE) between the estimated curve and the ground truth, which is defined as:

RMSE(\hat{y}_{est}, y_{true}) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(\hat{y}_{est}(i) - y_{true}(i)\right)^2}

All methods lead to median RMSE values between 0.8 and 1.2, with the GMM-based mapping usually having a slightly lower RMSE. Those values are considerably higher than the median RMSE values computed between the annotation curves of multiple evaluators, which are 0.37, 0.24 and 0.31 for activation, valence and dominance, respectively.

In Table 6.2, we also present results based on speech features only. Audio-only GMM-based tracking works reasonably for activation and partially for dominance, which confirms our previous observations regarding the importance of speech for activation trend tracking. Note, however, that these results are computed only on speech regions, therefore they are not directly comparable with the results of Table 6.1.

Table 6.2: Continuous tracking at the frame level of activation, valence and dominance using speech cues only. We present the median correlation value between the computed emotional curve and the ground truth, computed only on speech regions.

Speech features only: median correlations with ground truth
  method                    activation    valence      dominance
  GMM-based mapping         0.3866        0.0501       0.1102
  LSTM regression           0.2237        0.0609       0.0066
  simple baseline (mean)    0.1823 (5)    0.0529 (5)   0.0093 (5)

The behavior of these methods is illustrated in Figures 6.3 and 6.4. In Figures 6.3(a)-(c), we present the multiple annotations (dashed blue lines) along with their mean (red line), which is our ground truth, for two activation and one dominance example. Figures 6.4(a)-(c) show the estimated curves for GMM-based tracking, LSTM and the simple baseline respectively, for the curve of Fig. 6.3(a). For this example, GMM-based mapping produces a curve that is smoother and has higher correlation with the ground truth than the other two methods. Figures 6.4(d)-(f) show the estimated curves for Fig. 6.3(b), where the GMM-based performance is moderate but the method seems to track the most prominent activation trends. Finally, Figs. 6.4(g)-(i) show examples of dominance tracking for the curve of Fig. 6.3(c), where all methods perform reasonably well, although the three output curves look quite different. In general, we notice that the GMM method produces smooth and flat curves, while the other two methods produce noisier curves of larger amplitudes.

[Figure 6.3 appears here; each panel plots rating against time (frames):]
(a) Activation example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.44.
(b) Activation example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.68.
(c) Dominance example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.57.
Figure 6.3: Examples of activation and dominance annotations and the corresponding ground truth.

[Figure 6.4 appears here; each panel plots the ground truth and the estimates obtained with body (green) and speech+body (black) features against time (frames):]
(a) Tracking activation of Fig. 6.3(a) using GMM-based mapping. Correlations with ground truth: 0.74 and 0.77.
(b) Tracking activation of Fig. 6.3(a) using LSTM regression. Correlations with ground truth: 0.60 and 0.52.
(c) Tracking activation of Fig. 6.3(a) using the simple baseline (mean). Correlations with ground truth: 0.26 and 0.30.
(d) Tracking activation of Fig. 6.3(b) using GMM mapping. Correlations with ground truth: 0.25 and 0.43.
(e) Tracking activation of Fig. 6.3(b) using LSTM regression. Correlations with ground truth: 0.25 and 0.29.
(f) Tracking activation of Fig. 6.3(b) using the simple baseline. Correlations with ground truth: 0.32 and 0.32.
(g) Tracking dominance of Fig. 6.3(c) using GMM mapping. Correlations with ground truth: 0.59 and 0.63.
(h) Tracking dominance of Fig. 6.3(c) using LSTM regression. Correlations with ground truth: 0.51 and 0.50.
(i) Tracking dominance of Fig. 6.3(c) using the simple baseline. Correlations with ground truth: 0.68 and 0.48.
Figure 6.4: Results of GMM-based mapping, LSTM regression and the simple baseline (mean), for the examples of Figs. 6.3(a)-(c), for frame-level tracking.

6.5.2 Window-level tracking using audio-visual information

In Table 6.3 we present the performance of the low-resolution tracking at the window level. The median annotation correlations are re-computed at the window level and are reported in the last row of Table 6.3. For GMM-based and LSTM tracking we utilize the empirically selected feature sets of Section 6.5.1, after statistical feature extraction and PCA. For the simple baseline, we present the better performing statistical functionals of K body and L speech features.
In general, we notice a significant increase over the previous results, which can be attributed to the fact that we model less noise and track pronounced trends in the underlying emotional curves. Also, we use a richer feature set, consisting of statistical functionals of the frame-level features. The GMM-based mapping results follow similar trends as before; activation is the best performing attribute, followed by dominance. Valence performance is still low, although when we use the Fisher criterion F_value with the larger feature set our performance increases. Adding speech features considerably increases activation and dominance performance. Activation tracking reaches a median correlation of around 0.6, which is similar to the median correlations between human annotators for this task. The LSTM regression and simple baseline results follow similar trends, although median correlations are generally lower.

The statistical significance of these results is examined using the Wilcoxon signed-rank test for paired differences, following the same notation as in Section 6.5.1. In general, GMM-based tracking significantly outperforms the other two approaches for activation and dominance trend tracking, while LSTM and the simple baseline have comparable performance, with LSTM being slightly better.

Again, when looking at the resulting curves we observe smooth and flat curves for the GMM-based method and noisier curves with bigger amplitude for the LSTM and simple baseline methods. Figures 6.5(a)-(c) illustrate examples of rated activation, valence and dominance respectively. In Figures 6.6(a)-(c) we present the window-level tracking of activation curve 6.5(a), where all methods perform well, while the GMM-based curve achieves the highest correlation with the ground truth. In Figures 6.6(d)-(f) we present less successful tracking results for the valence curve in Fig. 6.5(b). The GMM-based mapping captures few of the valence peaks, while the other two methods seem to mostly capture noise. Finally, in Figs. 6.6(g)-(i) we present tracking of the dominance curve 6.5(c), where GMM-based tracking performs better than LSTM, which in turn outperforms the simple baseline.

Table 6.3: Continuous tracking at the window level of activation, valence and dominance using body language and speech cues. We present the median correlation value between the computed emotional curve and the ground truth.

Body language features: median correlations with ground truth
  method, feature selection       activation      valence            dominance
  GMM-based mapping, F_value      0.4943 ⋆†       0.1296 / 0.2061    0.3268 †
  GMM-based mapping, mRMR_C       0.5169 ⋆†       0.0866             0.3219 ⋆†
  LSTM regression, F_value        0.4455          0.1348             0.2268
  LSTM regression, mRMR_C         0.4529          0.1480             0.2835
  simple baseline, mean           0.3682 (10)     0.0626 (15)        0.0953 (15)

Body language + speech features: median correlations with ground truth
  method, feature selection       activation      valence            dominance
  GMM-based mapping, F_value      0.5979 ⋆†       0.1831 / 0.2247    0.3696 ⋆†
  GMM-based mapping, mRMR_C       0.5837 ⋆†       0.0563             0.3368 ⋆†
  LSTM regression, F_value        0.4882          0.0976             0.2122
  LSTM regression, mRMR_C         0.4934          0.0878             0.2549
  simple baseline, mean           0.4447 (10+5)   0.1261 (15+5)      0.1837 (5+5)

Median inter-annotator correlation (agreement)
  activation 0.6199    valence 0.6317    dominance 0.6200

[Figure 6.5 appears here; each panel plots rating against time (windows):]
(a) Activation example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.54.
(b) Valence example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.46.
(c) Dominance example: annotations (blue) and their mean (red). Mean evaluator correlation: 0.51.
Figure 6.5: Examples of activation, valence and dominance annotations and the corresponding ground truth.

[Figure 6.6 appears here; each panel plots the ground truth and the estimates obtained with body (green) and speech+body (black) features against time (windows):]
(a) Tracking activation of Fig. 6.5(a) using GMM-based mapping. Correlations with ground truth: 0.57 and 0.74.
(b) Tracking activation of Fig. 6.5(a) using LSTM regression. Correlations with ground truth: 0.49 and 0.58.
(c) Tracking activation of Fig. 6.5(a) using the simple baseline (mean). Correlations with ground truth: 0.11 and 0.41.
(d) Tracking valence of Fig. 6.5(b) using GMM mapping. Correlations with ground truth: 0.24 and 0.21.
(e) Tracking valence of Fig. 6.5(b) using LSTM regression. Correlations with ground truth: 0.31 and 0.17.
(f) Tracking valence of Fig. 6.5(b) using the simple baseline. Correlations with ground truth: 0.02 and 0.03.
(g) Tracking dominance of Fig. 6.5(c) using GMM mapping. Correlations with ground truth: 0.66 and 0.71.
(h) Tracking dominance of Fig. 6.5(c) using LSTM regression. Correlations with ground truth: 0.29 and 0.46.
(i) Tracking dominance of Fig. 6.5(c) using the simple baseline. Correlations with ground truth: 0.11 and 0.12.
Figure 6.6: Results of GMM-based mapping, LSTM regression and the simple baseline (mean), for the examples of Figs. 6.5(a)-(c), for window-level tracking.

6.6 Conclusion and Future Work

We address the problem of tracking continuous emotional attributes of participants throughout affective dyadic improvisations, where participants may be listening, speaking or doing neither. To this end, we have extracted a variety of features describing a person's body language, as well as speech information. We propose a statistical mapping approach to automatically track emotional trends based on body language and speech. Our approach outperforms other examined methods, such as LSTM regression [41], and produces smooth emotional curve estimates. Also, the simple baseline represents an interesting, unsupervised alternative that is worth further investigation. Our results show promising performance for tracking trends of activation and dominance, and also suggest that body language conveys rich activation and dominance related information. For activation trend tracking our correlation-based performance is comparable to human performance.

However, valence trend tracking remains problematic, which might indicate that our features are not adequately reflective of valence. Existing literature indicates that body posture is a better indicator of activation, although the importance of the valence dimension should not be dismissed [129]. Possibly higher-level body features are required to discern valence; we have not incorporated audio-visual cues at the session level, such as the amount and length of pauses, the percentage of time that an actor performs an action, turn-taking patterns, etc. Such higher-level cues may be informative of valence and dominance, and their investigation is a promising future research direction. Also note that we do not consider facial expressions, which are known to be reflective of valence [173, 166]. Other open questions pertain to our performance metrics; while correlation metrics and RMS errors describe different aspects of tracking performance and are currently used for evaluating systems that produce continuous estimates [41, 36], we may need to find more accurate measures to describe the performance of such systems. Additionally, normalizing for subject-dependent emotional variability in expressive body language is an interesting research direction that could potentially bring significant improvement. A further goal is to extend this work towards examining the produced emotional curves to detect regions of emotional saliency, and to study the actual events that occur in such regions. Such vocal, bodily or interaction-based events could give us insights into what constitutes the emotional content of an interaction.

Chapter 7: Healthcare Applications: Analysis of Emotional Facial Expressions of Children with Autism Spectrum Disorders

7.1 Introduction

Autism and Autism Spectrum Disorders (ASD) are a range of neural development disorders that are characterized by difficulties in social interaction and communication, reduced social reciprocity, as well as repetitive and stereotyped behaviors [74]. ASD affects a large and increasing number of children in the US; according to the Centers for Disease Control and Prevention (CDC), 1 in 88 children in the US was diagnosed with ASD in 2012, a number that was 1 in 110 until recently (CDC 2010). This has motivated technology research to work towards providing computational methods and tools to improve the lives of autism practitioners and patients, and potentially further the understanding of this complex disorder. Technological research directions include the development of naturalistic Human-Computer Interfaces (HCI) and conversational agents to facilitate education and communication, but also to encourage, elicit and capture the behavior of individuals with ASD.
Additionally, computational techniques could be used to automatically track a patient's progress during interventions, and to better understand the communication and social patterns of patients. The work described in this chapter focuses on the latter direction, namely the quantification of social and emotional expressive patterns. We focus on the analysis of affective facial expressions of children with High Functioning Autism (HFA). Facial expressions provide a window into internal emotional state and are key for successful communication and social integration. Individuals with HFA have average intelligence and language skills; however, they often struggle in social settings because of difficulty in interpreting [80] and producing [81, 82] facial expressions. Their expressions are often perceived as awkward or atypical by typically developing (TD) observers [83], an impression that is hard to describe quantitatively. Although this perception of awkwardness is used as a clinically relevant measure, it does not shed light on the specific facial gestures that may have elicited that perception. This work aims to computationally quantify this impression of awkwardness or `atypicality', in order to quantitatively understand the differences between typical and `atypical' facial expressions. We analyze facial expression data from children, both typically developing and with HFA, and use Motion Capture (MoCap) technology and the application of statistical methods such as Functional Data Analysis (FDA, [84]), in order to capture, mathematically quantify and visualize atypical characteristics of facial gestures. This work is part of the emerging Behavioral Signal Processing (BSP) domain that explores the role of engineering in furthering the understanding of human behavior [14].

Starting from these qualitative notions of atypicality, our goal is to derive quantitative descriptions of the characteristics of facial expressions using appropriate statistical analyses. Through these, we can discover differences between TD and HFA populations that may contribute to a perception of atypicality. The availability of detailed MoCap information enables quantifying overall aspects of facial gestures, such as synchrony and smoothness of motion, that may affect the final expression quality. Dynamic aspects of facial expressions are of equal interest, and the use of FDA techniques such as functional PCA (fPCA) provides a mathematical framework to estimate important patterns of temporal variability and explore how such variability is employed by the two populations. Finally, given that children with HFA may display a wide variety of behaviors [85], it is important to understand child-specific expressive characteristics. The use of multidimensional scaling (MDS) addresses this point by providing a principled way to visualize differences in facial expression behavior across children. Our work proposes the use of a variety of statistical approaches to uncover and interpret characteristics of behavioral data, and demonstrates their potential to bring new insights into the autism domain. According to our results, subjects with HFA are characterized on average by lower synchrony of movement between facial regions and more roughness of head and facial region motion. They also display a larger range of facial region motion that could be indicative of exaggerated expressions.
We also perform expression-specific analysis of smile expressions, where we find that children with HFA display a larger variability in their smile evolution, and may display idiosyncratic facial gestures unrelated to the expression. Overall, children with HFA consistently display a wider variability of facial behaviors compared to their TD counterparts, which corroborates existing psychological research [85]. Those results shed light on the nature of expression atypicality, and certain findings, e.g., asynchrony, may suggest an underlying impairment in the facial expression production mechanism that is worth further investigation. Our results have been published in [13].

7.2 Related Work

Since early psychological works [174, 175], autism spectrum disorders (ASD) have been linked to the production of atypical facial expressions and prosody. Autism researchers have reported that the facial expressions of subjects with ASD are often perceived as different and awkward [81, 82, 83]. Researchers have also reported atypicality in the synchronization of expressive cues, e.g., verbal language and body gestures [176]. Inspired by these observations of asynchrony, we examine synchronization properties of minute facial gestures, which are often hard to describe by visual inspection. Recent computational work aims to bring new understanding of this psychological condition and develop technological tools to help ASD individuals and psychology practitioners. Work in [75] describes eye tracking glasses to be worn by the practitioner to track gaze patterns of children with ASD during therapy sessions, while [76] introduces an expressive virtual agent that is designed to interact with children with ASD. Computational analyses have mostly focused on atypical prosody, where certain prosodic properties of subjects with ASD are shown to correlate with the severity of the autism diagnosis [77, 78, 79]. In contrast, computational analysis of atypical facial expressions is a relatively unexplored topic.

Typical facial expression is a long studied subject in psychology, where the introduction of the Facial Action Coding System (FACS) has been influential in the categorization of human facial gestures [177]. Researchers have used facial parametrization to quantify minute expressive differences, e.g., between posed and spontaneous smiles [178], while many engineering works have focused on automatic recognition of the facial Action Units (AU) that constitute an expression [179, 180, 181]. Facial expression has also been a widely used modality for emotion recognition (for example [57, 182], as well as our own work described in Chapter 5). Our analysis relies heavily on FDA methods, which were introduced in [84] as a collection of statistical methods for exploring patterns in time series data. A fundamental difference between FDA and other statistical methods is the representation of time series data as functions rather than multivariate vectors, which exploits their dynamic nature. FDA methods include powerful extensions of mathematical tools, such as functional PCA. Such FDA methods have been successfully applied for quantifying prosodic variability in speech accents [183], and for the analysis of tendon injuries using human gait MoCap data [184].
This work lies at the intersection of the above areas; aiming to approach the subject of facial expression atypicality from an engineering perspective, we adopt fine-grained facial parametrization and description approaches, while utilizing rich statistical analysis tools such as FDA.

7.3 Database

We analyze facial expression data from 37 children (21 with HFA, 16 TD) aged 9-14, collected by our psychology collaborator Prof. R. B. Grossman. The children perform facial mimicry tasks; they are instructed to watch short videos of facial expressions and then mimic the expressions that they see. The videos are contained in the Mind Reading CD, a common psychology resource [185]. The expression videos last a few seconds, contain no sound and cover a variety of emotions, including happiness (smile), anger (frown) and surprise expressions, as well as emotional transitions, e.g., surprise followed by happiness (mouth opening and smile). In order to increase the diversity of expressions in the data, we decided on two predefined sets of expressions with 18 expressions each, covering similar expressions. Each child was instructed to mimic the 18 expressions of one set. During the data collection, children wore 32 facial MoCap markers, as illustrated in Figure 7.1. They were recorded by 6 MoCap cameras, at a framerate of 100 fps. Our analysis focuses on the detailed facial data from the MoCap markers.

Figure 7.1: Placement of facial markers and definition of facial distances (left) and facial regions (right).

7.4 Data Preparation and Transformation into Functional Data

In order to factor out head motion and focus on facial expression motion, we used four stability markers, which are depicted in Fig. 7.1 as the larger markers on the forehead and ears. The positions of the remaining 28 markers are computed with respect to the stability markers and used for further facial expression processing. The stability markers are used to provide head motion information. Facial data were further rotated to align with the (x,y,z) axes as depicted in Figure 7.1, and were centered at the origin of the coordinate system. We developed data visualization tools, visually inspected the MoCap sequences and corrected any artifacts that may have been caused by MoCap marker occlusions or errors during motion capture. Cleaning of the MoCap sequence is a time-consuming but important process to ensure that the facial motion patterns that we analyze are not caused by data collection artifacts. Different subjects have different facial structure. We perform face normalization to smooth out subject-specific face structure variability and focus on expression-related variability. We apply the normalization approach described in Section 4.1.2.1, where the main idea is to transform each speaker-specific facial structure to the mean facial structure across all subjects in our database. Specifically, each subject's mean marker positions are shifted to match the global mean positions computed across all subjects. Finally, marker trajectories were interpolated to fill in gaps shorter than 1 sec that result from temporarily missing or occluded markers. We use cubic Hermite spline interpolation, which we empirically found to produce visually smooth results. Marker data are then transformed into functional data.
This process consists of approximating each marker coordinate time series, e.g., x_1, ..., x_T, by a function, say \hat{x}(t) = \sum_{k=1}^{K} c_k \phi_k(t), where \phi_k, k = 1, ..., K are the basis functions and c_1, c_2, ..., c_K are the coefficients of the expansion. This transformation of our data into functional data performs smoothing of the original time series, and enables computing smooth approximations of high order derivatives of the marker trajectories. Such higher order derivatives are useful for computing expression properties, such as roughness of motion, as described in Section 7.5. Additionally, the use of functional data and FDA techniques enables us to apply statistical methods such as fPCA that are useful for analyzing the temporal variability of particular expressions, as described in Section 7.6 for smile expressions. We use B-splines as basis functions, which are commonly used in FDA because of their flexibility in modeling non-periodic series [84, 183]. Fitting is done by minimizing:

F = \sum_{i} \left[ x_i - \hat{x}(t_i) \right]^2 + \lambda \int \left[ D^2 \hat{x}(t) \right]^2 dt

where D^2 denotes the second derivative and the parameter \lambda controls the amount of smoothing (second term) relative to the goodness of fit of the function \hat{x} (first term). We choose \lambda = 1 empirically, according to the Generalized Cross Validation (GCV) criterion [186]. The FDA analysis throughout this chapter is performed using the FDA toolbox [186].

7.5 Analysis of Global Characteristics of Affective Expressions

We group the expressions into two groups containing expressions produced by subjects with TD and HFA, and perform statistical analysis of expressive differences. We examine properties inspired from psychology, such as synchrony of facial motion [176], or properties that intuitively seem to affect the quality of the final expression, e.g., facial motion smoothness and range of marker motion. The TD and HFA groups contain roughly 16 × 18 and 21 × 18 expressions respectively, although certain samples are removed because of missing or noisy markers. When grouping together various facial expressions, we want to smooth out expression-related variability and focus on subject-related variability. Therefore, all metrics described below are normalized by mean shifting, such that the mean of each metric per expression (and across multiple subjects) is the same across expressions.

Table 7.1: Results of Statistical Tests of Global Facial Characteristics (t-test, difference of means)

  comparison                                   result
Left-Right Face Synchrony
  left-right mouth corner correlations         Lower correlations for HFA, p=0.02
  left-right cheek correlations                Lower correlations for HFA, p=0.07
  left-right eyebrow correlations              Lower correlations for HFA, p=0.01
Upper-Lower Face Synchrony
  right eyebrow & mouth opening correlations   Lower correlations for HFA, p=0.05
  left eyebrow & mouth opening correlations    Lower correlations for HFA, p=0.03
Facial Motion Roughness (i = 2)
  mouth roughness                              Higher roughness for HFA, p=0.02
  right cheek roughness                        Higher roughness for HFA, p=0.01
  left cheek roughness                         no difference
  right eyebrow roughness                      Higher roughness for HFA, p=0.07
  left eyebrow roughness                       no difference
Head Motion Roughness (i = 2)
  head roughness                               Higher roughness for HFA, p ≈ 0
Facial Motion Range
  upper mouth motion range                     Higher range for HFA, p ≈ 0
  lower mouth motion range                     Higher range for HFA, p ≈ 0
  right cheek motion range                     Higher range for HFA, p ≈ 0
  left cheek motion range                      no difference
  right eyebrow motion range                   no difference
  left eyebrow motion range                    no difference

We examined synchrony of movement across left-right and upper-lower face regions.
For left-right comparisons, we measured facial distances associated with muscle movements, specifically mouth corner, cheek raising, and eyebrow raising. These distances are depicted as D1, D2 and D3 respectively for the right face and D1', D2', D3' for the left face in Fig. 7.1. To measure their motion synchrony, we computed Pearson's correlation between D1-D1', D2-D2' and D3-D3'. For upper-lower comparisons, we measured the mouth opening distance D4, as shown in Figure 7.1, and the eyebrow raising distances D3 and D3', and we computed Pearson's correlation between D4-D3 and D4-D3'. We examine the statistical significance of group differences in correlation using a difference of means t-test. The results are presented in Table 7.1, and indicate lower facial movement synchrony for subjects with HFA.

Although statistical tests reveal global differences, subject-specific characteristics are also of interest, given the large variability of expressions and behaviors that is typically reported for subjects with ASD [85, 187]. Subject characteristics are visualized using multidimensional scaling (MDS), which is a collection of methods for visualizing the proximity of multidimensional data points [188]. MDS takes as input a distance (dissimilarity) matrix, where dissimilar data points are more distant, and provides methods for transforming data points into lower dimensions, typically two dimensions for easy visualization, while optimally preserving their dissimilarity. Here, a data point is a subject associated with a multidimensional feature of average correlations. A subject performs 18 expressions, and for each we compute the 3 left-right facial correlations mentioned above. By averaging over expressions we compute a 3D average correlation feature per subject. Dissimilarity between subjects is computed by taking Euclidean distances of their respective features. MDS uses this dissimilarity matrix to compute the distances between subjects in 2D space. Figure 7.2(a) shows the MDS visualization for TD subjects, represented by a blue `T', and subjects with HFA, represented by a red `A', when the average left-right symmetry correlations are used as features (we applied non-classical multidimensional scaling using the metric stress criterion, and confirmed that the original dissimilarities are adequately preserved in 2D [188]). Intuitively, Figure 7.2(a) illustrates similarities of subjects with respect to left-right facial synchrony behavior. We notice that subjects with HFA generally display larger behavioral variability, although there is one TD subject that appears to be an outlier. This visualization could help clinicians understand subject-specific characteristics with respect to particular facial behaviors.

(a) Left-right face synchrony. (b) Face region motion roughness.
Figure 7.2: MDS visualization of similarities across subjects for left-right synchrony and facial region motion roughness metrics.

Smoothness of motion was investigated by computing higher order derivatives of facial and head motion. We examine 5 facial regions, i.e., left and right eyebrows, left and right cheeks, and mouth (Figure 7.1). Each region centroid is computed by averaging the markers in that region, and for each centroid motion we estimate the absolute derivatives of order i, i = 1, 2, 3, averaged over the expression. We call this a roughness measure of order i. Similar computations are performed for head motion, where the head centroid is the average of the 4 stability markers.
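A sketch of this roughness measure is given below; note that the thesis computes derivatives from the smoothed functional (B-spline) representation, whereas this illustrative Python version simply uses finite differences on the centroid trajectory, and the function and variable names are ours.

```python
# Sketch of the order-i roughness measure of a facial region (or of the head):
# average the region's markers into a centroid, differentiate i times over
# time, and average the magnitude of that derivative over the expression.
import numpy as np

def roughness(region_markers, order=2, fps=100.0):
    """region_markers: (T, n_markers, 3) positions relative to the stability
    markers; returns the mean magnitude of the order-th derivative."""
    centroid = region_markers.mean(axis=1)             # (T, 3) centroid path
    deriv = centroid
    for _ in range(order):
        deriv = np.gradient(deriv, 1.0 / fps, axis=0)  # successive derivatives
    return np.linalg.norm(deriv, axis=1).mean()
```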
The results in Table 7.1 indicate more roughness of head and lower/right facial region motion for subjects with HFA. For lack of space we present only acceleration results (i = 2), but other order derivatives follow similar patterns. We perform MDS analysis using as features the average acceleration roughness measures per subject from the mouth, right cheek and right eyebrow regions. We select those regions since, according to Table 7.1, they have significantly higher roughness measures. According to the resulting MDS visualization of Figure 7.2(b), subjects with HFA are more likely to be outliers and display larger variability. Finally, we examine the range that facial regions traverse during an expression, i.e., the range of motion for the eyebrow, cheek, and upper and lower mouth region centroids. This range is significantly higher for the HFA group for lower and right face regions, as seen in Table 7.1. This could indicate more exaggerated facial (especially mouth) gestures from HFA subjects.

7.6 Quantifying Expression-Specific Atypicality through fPCA

While the analysis of the previous section examines global expression properties, in this section we perform expression-specific analysis focusing on dynamic expression evolution. As an example, we choose an expression of happiness, consisting of two consecutive smiles. This expression belongs to one of the two expression sets mentioned in Section 7.3, so it is only performed by 19 children (7 TD, 12 HFA) out of the 37 children in our database. We choose to study smiles for a variety of reasons. Firstly, they are very common in a variety of social interactions, and are important for indicating friendliness and social participation. Moreover, the smile expression that we choose contains a transition between two smiles, therefore it has increased complexity and is a good candidate for revealing typical and atypical variability patterns. Finally, mouth region expressions seem to differ between TD and HFA groups, as shown in Section 7.5, and are good candidates for analysis, while the onset and apex of those expressions can be easily detected in our marker data by looking at mouth distances, for example distance D1 from Figure 7.1.

(a) Smile evolution examples. (b) After landmark registration.
Figure 7.3: Examples of the expression of two consecutive smiles. (a) Plots of distance D1 during a typical (blue, TD) and atypical (red, HFA) expression evolution, before landmark registration. (b) Plots of distance D1 after landmark registration (subjects with HFA in red lines, TD in blue lines).

In Fig. 7.3(a) we depict the mouth corner distance D1 during the two-smiles expression from a TD subject (blue solid line) and a subject with HFA (red dashed line). The expression of the TD subject depicts a typical evolution where the 2 distance minima (black circles) correspond to the apices of the two smiles and are surrounded by 3 maxima (black squares), representing the beginning, middle (between smiles) and end of the expression respectively. The expression of the subject with HFA follows a more atypical evolution and contains seemingly unrelated motion, for example the oscillatory motion in the middle. For better comparison, expressions are aligned such that the smile apices of different subjects coincide. We apply a method called landmark registration [84, 183], which uses a set of predefined landmarks, e.g., comparable events during an expression, and computes warping functions such that the landmark points of different expression realizations coincide.
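As a rough illustration of the idea (the thesis uses the registration routines of the FDA toolbox), a piecewise-linear warp between a realization's landmark times and a set of common target landmark times can be sketched as follows; the function and variable names are ours, not the toolbox's.

```python
# Sketch of landmark registration by piecewise-linear time warping: build a
# warp from the common (target) landmark times to this realization's landmark
# times, then resample the curve so that landmarks coincide across subjects.
import numpy as np

def register_to_landmarks(t, x, landmarks, target_landmarks):
    """t: common time grid; x: curve values on t; landmarks/target_landmarks:
    increasing landmark times including the start and end of the expression."""
    warp = np.interp(t, target_landmarks, landmarks)   # target time -> own time
    return np.interp(warp, t, x)                       # registered curve on t
```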
Here we define 5 landmarks: the 2 minima (smile apices) and the 3 maxima described above. Landmarks are automatically detected by searching for local maxima/minima and are manually corrected if needed. In Figure 7.3(b) we show the subjects' expressions after landmark registration, where the two smile events are aligned and clearly visible.

Figure 7.4: Analysis of the expression of two consecutive smiles, after performing fPCA. (a) Eigenfunction 1 (42% var.), (b) Eigenfunction 2 (28% var.), (c) Eigenfunction 3 (12% var.): plots of the first 3 fPCA harmonics, covering 42%, 28% and 12% of total variability respectively. (d) fPCA 1 vs fPCA 2 scores, (e) fPCA 1 vs fPCA 3 scores, (f) fPCA 2 vs fPCA 3 scores: scatterplots of the corresponding fPCA scores.

After alignment, we compute the principal components of expression variability using fPCA. fPCA is an extension of ordinary PCA that operates on a set of functional input curves $x_i(t)$, $i = 1, \ldots, N$. Here $N = 19$, i.e., we have 19 realizations of the same expression, one from each child. fPCA iteratively computes eigenfunctions $\xi_j(t)$ such that the data variance along the eigenfunction is maximized at each step $j$. Specifically, the quantity $\mathrm{Var}\big[\int \xi_j(t)\,x_i(t)\,dt\big]$ is maximized, subject to normalization and orthogonality constraints, i.e., $\int \xi_j^2(t)\,dt = 1$ and $\int \xi_j(t)\,\xi_k(t)\,dt = 0$ for all $k \neq j$. The resulting set of eigenfunctions represents an orthonormal basis system in which the input curves (expressions) are decomposed into principal components of variability. The PCA score of input curve $x_i(t)$ along $\xi_j(t)$ is defined as $c_{ij} = \int \xi_j(t)\,x_i(t)\,dt$, where we have assumed mean-subtracted curves for simplicity.

Figures 7.4(a)-(c) present the first 3 eigenfunctions (harmonics), cumulatively covering 82% of total variability. Following the visualization proposed in [84], the black line is the mean curve, and the solid red and dashed blue lines illustrate the effect of $\xi_j(t)$ by respectively adding or subtracting $\mathrm{std}(c_j)\,\xi_j(t)$ to the mean curve. The quantity $\mathrm{std}(c_j)$ is the standard deviation of $c_j$, computed over all PCA scores $c_{ij}$. Although these first three principal components of variability are estimated in a data-driven way, they seem visually interpretable. They appear to respectively account for variability of: the overall expression amplitude, as seen in Fig. 7.4(a); the smile width, as defined by the curve dip between the initial and ending points, as seen in Fig. 7.4(b); and the mouth closing and second smile apex in the second half of the expression, as seen in Fig. 7.4(c).

Figures 7.4(d)-(f) present scatterplots of the first 3 PCA scores of the different subjects' expressions. Subjects with HFA display a wider variability of PCA score distribution, which translates to wider variance in the smile expression evolution, and potentially contributes to an impression of atypicality. Note that fPCA is unaware of the diagnosis label when it decomposes each expression into eigenfunctions. This decomposition naturally reveals differences between the TD and HFA groups in the way the smile expression evolves. We performed fPCA analysis of various expressions, and made similar observations of greater variability in the evolution of expressions by subjects with HFA, mostly for complex expressions containing transitions between facial gestures, e.g., a mouth opening followed by a smile, or consecutive smiles of increasing width. Mimicry of such expressions might be challenging for children with HFA, and may thus reveal differences between the expressions of the two groups.
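For readers who wish to experiment with this decomposition, the sketch below approximates fPCA by ordinary PCA of the registered curves sampled on a common grid: the eigenvectors stand in for the eigenfunctions $\xi_j(t)$ and the projections approximate the scores $c_{ij}$. This is a simplified stand-in, under stated assumptions, for the basis-expansion fPCA of [84], not the thesis pipeline; array shapes are illustrative.

```python
# Minimal sketch: fPCA approximated by ordinary PCA of discretized curves.
import numpy as np

def functional_pca(X, dt, n_components=3):
    """X: (n_curves, n_samples) registered curves on a grid with spacing dt.
    Returns (eigenfunctions, scores, explained_variance_ratio)."""
    Xc = X - X.mean(axis=0)                       # mean-subtracted curves
    # Discretized covariance function; dt acts as the quadrature weight.
    cov = (Xc.T @ Xc) * dt / (X.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]
    top_evals, top_evecs = evals[order], evecs[:, order]
    xis = top_evecs / np.sqrt(dt)                 # so that sum(xi**2) * dt = 1
    scores = Xc @ xis * dt                        # c_ij ~ sum_t xi_j(t) x_i(t) dt
    var_ratio = top_evals / evals.sum()
    return xis.T, scores, var_ratio
```

Scatterplots of the first few score columns, labeled by diagnosis group, reproduce the kind of visualization shown in Figures 7.4(d)-(f).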
Note that since we analyze posed expressions from children, we would expect some degree of unnaturalness from the subjects. However, Figures 7.3(a) and 7.4(d)-(f) suggest a different nature and wider variance of the expressive choices produced by subjects with HFA, which are sometimes unrelated to a particular expression. For example, the oscillatory motion displayed by the subject with HFA in Figure 7.3(a) appears in other facial regions and other expressions of the same subject, and seems to be an idiosyncratic facial gesture.

7.7 Conclusions and Future Work

We have focused on quantifying atypicality in affective facial expressions through the statistical analysis of MoCap data from facial gestures, which are hard to describe quantitatively by visual inspection. For this purpose, we have demonstrated the use of various data representation, analysis and visualization methods for behavioral data. We have found statistically significant differences in the affective facial expression characteristics of TD children and children with HFA. Specifically, children with HFA display more asynchrony of motion between facial regions, more head motion roughness, and more facial motion roughness and range for the lower face regions, compared to TD children. We also found that children with HFA consistently display a wider variability in the expressive choices and behaviors that they employ, and that they may display certain idiosyncratic facial gestures. Our results shed light on the characteristics of facial expressions of children with HFA and support qualitative psychological observations regarding the atypicality of those expressions [81, 82], as well as the diversity of behaviors of individuals with ASD [85, 187]. Our analyses indicate the potential of computational approaches to bring new insights into the autism domain. In general, the described analyses could be applied to various time series data of human behavior, where the main goal is the discovery and interpretation of data patterns.

Our future work includes analysis of a wider range of expressions, including both positive and negative emotions, from a larger pool of subjects. We are currently collecting and pre-processing data from subjects belonging to older age groups, in order to expand and strengthen our conclusions. We also plan to annotate the collected facial expressions and obtain perceptual measurements of their awkwardness from TD observers (annotators), in order to further interpret our findings. Finally, the availability of facial marker data enables us to pursue analysis-by-synthesis approaches, for example performing facial expression animation of the MoCap data using virtual characters. Among our future directions is the development of expressive virtual characters that would allow animation, as well as manipulation, of typical and atypical facial expressions. Such facial expression visualizations could provide further insights to autism practitioners, or could become a useful educational tool for individuals with HFA.

Chapter 8: Conclusions and Future Work

8.1 Conclusions

This thesis focuses on the recognition and analysis of emotional human states from a multimodal perspective, using a variety of informative signals including speech, facial expressions, head motion and full body language.
Additionally, this thesis explores novel approaches and computational methods for emotion processing, such as the concept of emotional grammars and the use of temporal context to improve emotion recognition performance. We have also focused on continuous emotional representations, addressed challenges in continuous emotion annotation and processing, and proposed frameworks for continuous emotion tracking. Finally, we have applied our computational approaches in the healthcare domain for the analysis of affective facial expressions of children with Autism Spectrum Disorders (ASD), with the goal of better understanding the atypical nature of such expressions.

Our work on emotion and temporal context essentially addresses two questions: whether there is structure in typical emotional evolution, and whether such structure could be exploited for improving emotion recognition performance. Our analysis revealed patterns in the way that emotion typically evolves; for example, valence states tend to be longer lasting, while extreme activation states tend to be more transient and transition to medium activation. Furthermore, we have proposed multimodal, hierarchical frameworks for emotion recognition that effectively consider such temporal context, i.e., the past and, possibly, future emotional evolution of the speaker and his/her interlocutor. Such frameworks are inspired by speech recognition ideas, such as grammars and language models. We have experimentally demonstrated the utility of our proposed approaches for improving emotion recognition performance, for valence and for clusters in the valence and activation space. Such approaches could pave the way for better performing emotion recognition systems that consider emotional observations in the context of the setting and interaction in which they occur.

Extending our study of emotional evolution, we have also focused on continuous emotional dynamics. We have explored the idea of representing emotions as continuous random variables and reformulated the problem of emotion classification as continuous emotion tracking. We have addressed the challenges of annotation and processing of the continuous dimensional attributes of activation, valence and dominance. Our work has proposed methods for continuously estimating the emotional states of participants during dyadic interactions, at various time resolutions, based on speech and body language information. According to our experiments, our proposed framework, which is based on a generative statistical mapping, outperforms other regression-based approaches in the literature and achieves promising results for activation and dominance trend tracking. The obtained continuous estimates of emotional attributes could highlight salient regions in long interactions.

Problems in affective computing are inherently multimodal, and this thesis has addressed challenges regarding capturing, analyzing, modeling and combining various modalities, including speech, facial expressions, head motion and full body language. We have a preference for features that are intuitive and interpretable. Indeed, apart from classification performance, our goal is to shed light on how multimodal expressions, such as facial and body language cues, are modulated by the underlying emotional states. We have particularly focused on body language and emotion, which is a relatively less researched topic in the engineering literature.
We have studied a variety of body language features describing the postures and general body behavior of a person with respect to his/her interlocutor, and found that behaviors such as approach/avoidance, gestures, body posture and orientation are very emotionally informative. This has allowed us to revisit psychological observations regarding body language and emotion from a quantitative perspective. Finally, our experience in the analysis of typical emotional expressions has provided us with the computational tools to study `atypical' emotional expressions.

The last part of this thesis explores applications of computational methods in the healthcare domain. Specifically, we focus on the analysis of facial expressions of children with High Functioning Autism (HFA), which are typically reported in the autism literature to be perceived as awkward. This work aims to computationally quantify this impression of awkwardness. Our findings indicate that aspects of asynchrony, roughness of facial motion, range of facial motion, and atypicality in the dynamic evolution of facial expressions differentiate expressions produced by children with HFA from expressions of typically developing children. This work sheds light on the nature of facial expression awkwardness in autism, and demonstrates the potential of computational modeling approaches to further our understanding of human behavior.

Overall, this thesis develops algorithms that effectively capture context and temporal structure in multimodal, affective time series for the purpose of emotion recognition and analysis. Our goal is to propose methods for higher performing emotion recognition systems that would facilitate a more natural human-computer interaction. At the same time, we aim to uncover hidden structure in human behavior and offer meaningful analytics that could help us computationally understand typical and `atypical' human emotional expression. This thesis draws ideas from diverse disciplines, including speech and signal processing and machine learning, but also the behavioral sciences, theater and psychology. My effort has been to bring together such diverse ideas, propose novel computational approaches for the recognition and analysis of emotional human behavior, and contribute to exciting technologies that have the potential to change the way we interact with computers, but also to further the understanding of our own behaviors.

8.2 Current and Future Work

My current and future research interests include affective human behavior, group behavior and interaction, spoken and multimodal interfaces and dialog systems, as well as healthcare and education applications. Below are brief descriptions of ongoing and future research directions.

Extending the work on body language in dyadic interaction settings, we are currently focusing on the analysis of how people's body behaviors affect each other during dyadic interactions, and how that mutual influence depends on the emotional relation between the two participants, e.g., when they are friendly or hostile towards each other. Additionally, we would like to exploit this body language influence and work towards body language animation in dyadic settings, using the interlocutor's multimodal information to guide the animation. Our longer term goal is to create affect-sensitive full body virtual agents that can interact with each other, or with the user, in a friendly or unfriendly manner.
As part of this direction, we are also currently exploring data-driven clustering of body language gestures, e.g., hand gestures, head and body motion, in order to derive a vocabulary of gesture units that could later be used to facilitate animation. This is joint work with Zhaojun Yang, and our preliminary work is published in [9].

As can be inferred from the classification results, e.g., of Chapters 5 and 6, subject-independent emotion classification can be very challenging. To a large extent, this is due to subject-specific characteristics in the manner that emotions are expressed. Certain subjects tend to be more extroverted or more introverted when they communicate emotions, and what is considered a neutral state may differ for each speaker. For example, for a soft-spoken person, a relatively higher voice intensity may be an indication of anger, while for a person who generally speaks loudly, we may need to observe much higher speech intensity to consider it an indication of anger. What is more, many vocal characteristics that are typically used for emotion recognition, such as pitch and intensity, are also strongly affected by factors such as gender and age. Therefore, it is important to develop a feature normalization method that would allow us to smooth out person-specific variability while preserving emotion-specific variability, in order to improve emotion recognition in speaker-independent settings. As part of our joint work with Prof. C. Busso and his collaborators at UT Dallas, we have proposed an Iterative Feature Normalization (IFN) scheme for feature normalization in emotion recognition. The main idea is to normalize each speaker's features based on statistics computed only on neutral examples of that speaker's data. This has the effect of making neutral features comparable across speakers, while preserving deviations from neutral that are due to emotion. Since labeled neutral examples for a test speaker will generally not be available, we use an iterative neutrality detection scheme to find, with high precision, a number of neutral examples in the testing dataset that can be used for computing the normalization constants. Our work has mostly focused on speech features, and has been successfully applied to emotion recognition [189] and intoxicated speech detection [190]. The latter work was the winning entry in the 2011 Interspeech Speaker State (Intoxication) Challenge. We have also submitted an article describing this work [191], which is currently under revision.

Finally, I have performed research in the dialog systems area, an interesting area where speech applications and human-computer interaction naturally meet. This work was part of my summer 2012 internship at Microsoft Research, where, along with researchers J. Williams and D. Bohus, we worked on discriminative state tracking for spoken dialog systems. Specifically, we proposed an extension of the Maximum Entropy (MaxEnt) model that reformulates the state tracking problem, e.g., hypothesis scoring of a dialog system hypothesis list, as a multiclass classification problem where the number of classes, i.e., the number of list hypotheses, can be arbitrary. Our experiments suggest that discriminative approaches for state tracking lead to better performance than generative ones, and that our proposed MaxEnt formulation shows good promise. This work is published in [192].

Bibliography

[1] M. M. Bradley and P. J.
Lang, \Measuring emotion: The self-assessment manikin and the semantic dierential," Journal of Behavioral Therapy and Experimental Psychiatry, vol. 25, pp. 49{59, 1994. [2] R. Cowie, E. Douglas-Cowie, S. Savvidou, E. McMahon, M. Sawey, and M. Schr oder, \FEELTRACE: An instrument for recording perceived emotion in real time," in ISCA Workshop on Speech and Emotion, 2000, 2000, pp. 19{24. [3] C. Busso and S. Narayanan, \Recording audio-visual emotional databases from actors: a closer look," in In Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 2008, pp. 17{22. [4] J.A. Harrigan, R. Rosenthal, and K.R. Scherer, The new handbook of Methods in Nonverbal Behavior Research, Oxford Univ. Press, 2005. [5] H. Aviezer, Y. Trope, and A. Todorov, \Body cues, not facial expressions, dis- criminate between intense positive and negative emotions," Science, vol. 338, pp. 1225{1229, 2012. [6] R. Cowie, \Perceiving emotion: towards a realistic understanding of the task.," Philosophical Transactions of the Royal Society, vol. 364, pp. 3515{3525, 2009. [7] N. Sebe, I. Cohen, and T.S. Huang, Multimodal Emotion Recognition, Handbook of Pattern Recognition and Computer Vision, World Scientic, 2005. [8] Rosalind W. Picard, Aective Computing, The MIT Press, Sept. 1997. [9] Z. Yang, A. Metallinou, and S. Narayanan, \Toward body language generation in dyadic interaction settings from interlocutor multimodal cues," in Proc. of Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013. [10] M. P. Black, A. Katsamanis, B. R. Baucomb, C.-C. Lee, A. C. Lammert, A. Chris- tensen, P. G. Georgiou, and S. S. Narayanan, \Toward automating a human behavioral coding system for married couples' interactions using speech acoustic features," Speech Communication, vol. 55, pp. 1{21, 2013. 160 [11] S. Baron-Cohen, O. Golan, and E. Ashwin, \Can emotion recognition be taught to children with autism spectrum conditions?," Philosophical Transactions of the Royal Society, vol. 364, pp. 3567{3574, 2009. [12] R. W. Picard, \Future aective technology for autism and emotion communica- tion," Philosophical Transactions of the Royal Society, vol. 364, pp. 3575{3584, 2009. [13] A. Metallinou, R. Grossman, and S. Narayanan, \Quantifying atypicality in af- fective facial expressions of children with autism spectrum disorders,," in Proc. of the IEEE Intl. Conf. on Multimedia & Expo (ICME), 2013. [14] S. Narayanan and P. Georgiou, \Behavioral signal processing: Deriving human behavioral informatics from speech and language," Proceedings of IEEE, pp. 1{31, 2013. [15] R. El Kaliouby, P. Robinson, and S. Keates, \Temporal context and the recogni- tion of emotion from facial expression," in HCI International, Crete, June 2003. [16] J. M. Carroll and J. A. Russell, \Do facial expressions signal specic emotions? judging emotion from the face in context," Journal of Personality and Social Psychology, vol. 70, pp. 205{218, 1996. [17] H. R. Knudsen and L. H. Muzekari, \The eects of verbal statements of context on facial expressions of emotion," Journal of Nonverbal Behavior, vol. 7, pp. 202{212, 1983. [18] T. Masuda, P.C. Ellsworth, B. Mesquita, J. Leu, S. Tanida, and E. Van de Veer- donk, \Placing the face in context: Cultural dierences in the perception of facial emotion," Journal of Personality and Social Psychology, vol. 94, pp. 365{381, 2008. [19] K. Oatley and J. M. Jenkins, Understanding Emotions, Blackwell Publishers Ltd, 1996. [20] C. 
Conati, \Probabilistic assessment of users emotions in educational games," Applied Articial Intelligence, vol. 16, pp. 555{575, 2002. [21] C.-C. Lee, C. Busso, S. Lee, and S. Narayanan, \Modeling mutual in uence of interlocutor emotion states in dyadic spoken interactions," in Proc. of Interspeech, UK, 2009. [22] C. M. Lee and S. S. Narayanan, \Toward detecting emotions in spoken dialogs," IEEE Transactions on Speech and Audio Processing, vol. 13, pp. 293{303, 2005. 161 [23] J. Liscombe, G. Riccardi, and D. Hakkani-Tur, \Using context to improve emotion detection in spoken dialog systems," in Proc. of Interspeech, 2005. [24] M.J.F. Gales and S.J. Young, \The application of Hidden Markov Models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, pp. 195{304, 2008. [25] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, Entropic Cambridge Research Laboratory, Cambridge, England, 2006. [26] A. Metallinou, A. Katsamanis M. W ollmer, F. Eyben, B. Schuller, and S. Narayanan, \Context-sensitive learning for enhanced audiovisual emotion clas- sication," Transactions of Aective Computing (TAC), vol. 3, pp. 184{198, 2012. [27] A. Metallinou, A. Katsamanis, and S. Narayanan, \A hierarchical framework for modeling multimodality and emotional evolution in aective dialogs," in Proc. of ICASSP, 2012. [28] S. Hochreiter and J. Schmidhuber, \Long short-term memory," Neural Compu- tation, vol. 9(8), pp. 1735{1780, 1997. [29] A. Graves, S. Fernandez, M. Liwicki, H. Bunke, and J. Schmidhuber, \Uncon- strained online handwriting recognition with recurrent neural networks," Advances in Neural Information Processing Systems, vol. 20, pp. 1{8, 2008. [30] M. W ollmer, F. Eyben, B. Schuller, and G. Rigoll, \Recognition of spontaneous conversational speech using long short-term memory phoneme predictions," in Proc. of Interspeech, Makuhari, Japan, 2010. [31] M. W ollmer, A. Metallinou, N. Katsamanis, B. Schuller, and S. Narayanan, \An- alyzing the memory of blstm neural networks for enhanced emotion classication in dyadic spoken interactions," in Proc. of ICASSP, 2012. [32] J. Russell and A. Mehrabian, \Evidence for a three-factor theory of emotions," Journal of Research in Personality, vol. 11, pp. 273{294, 1977. [33] M.K. Greenwald, E.W. Cook, and P.J. Lang, \Aective judgment and psychophys- iological response: Dimensional covariation in the evaluation of pictorial stimuli," Journal of Psychophysiology, vol. 3, pp. 51{64, 1989. [34] M. Grimm, E. Mower, K. Kroschel, and S. Narayanan, \Primitives based estima- tion and evaluation of emotions in speech," Speech Communication, vol. 49, pp. 787{800, 2007. 162 [35] D. Wu, T. Parsons, E. Mower, and S. S. Narayanan, \Speech emotion estimation in 3d space," in Proc. of IEEE Intl. Conf. on Multimedia & Expo (ICME) 2010, 2010. [36] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi, \A super- vised approach to movie emotion tracking.," in Proc. of ICASSP 2011, 2011. [37] A. Hanjalic and L.-Q. Xu, \Aective video content representation and modeling," IEEE Trans. On Multimedia, vol. 7, pp. 143{154, 2005. [38] A. Hanjalic, \Extracting moods from pictures and sounds: Towards truly person- alized TV," IEEE Signal Processing Magazine, pp. 90{100, 2006. [39] M. Woellmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. 
Cowie, \Abandoning emotion classes - towards continuous emotion recognition with modelling of long-range dependencies," in Proc. of Interspeech 2008, 2008. [40] M. Wollmer, B. Schuller, F. Eyben, and G. Rigoll, \Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive arti- cial listening," IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 867{881, 2010. [41] M.A. Nicolaou, H. Gunes, and M. Pantic, \Continuous prediction of spontaneous aect from multiple cues and modalities in valence-arousal space," IEEE Trans. on Aective Computing, 2011. [42] Tomoki Toda, AlanW. Black, and Keichi Tokuda, \Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. on Audio, Speech and Language Processing, vol. 15, pp. 2222{2235, 2007. [43] T. Toda, A. W. Black, and K. Tokuda, \Statistical mapping between articula- tory movements and acoustic spectrum using a gaussian mixture model," Speech Communication, vol. 50, pp. 215{227, 2008. [44] A. Metallinou, A. Katsamanis, and S. Narayanan, \Tracking continuous emotional trends of participants during aective dyadic interactions using body language and speech information," Image and Vision Computing (IMAVIS), Special Issue on Aect Analysis in Continuous Input, vol. 31, pp. 137{152, 2013. [45] A. Mehrabian, \Communication without words," Psychology today, vol. 2, pp. 53{56, 1968. [46] R. Frick, \Communicating emotion: The role of prosodic features," Psychological Bulletin, vol. 97, pp. 412{429, 1985. 163 [47] I.R. Murray and J.L. Arnott, \Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion," Journal of Acoustic Society of America, vol. 93, pp. 1097{1108, 1993. [48] K.R. Scherer, \Vocal aect expression: A review and a model for future research," Psychological Bulletin, vol. 99, pp. 143{165, 1986. [49] P. Ekman, Darwin and Facial Expressions, New York: Academic, 1973. [50] M. Davis and H. College, Recognition of Facial Expressions, New York: Arno Press, 1975. [51] K. Scherer and P. Ekman, Approaches to Emotion., Mahwah, NJ: Lawrence Erlbaum Associates, 1984. [52] P. Ekman, Basic Emotions, Handbook of Cognition and Emotion, Sussex, UK, John Wiley & Sons, 1999. [53] P. Ekman and W. Friesen, Unmasking the Face., Englewood Clis, NJ: Prentice- Hall, 1975. [54] M. Knapp and J. Hall., Nonverbal Communication in Human Interaction, Har- court Brace College Publishers, 1972. [55] L. C. De Silva and P.C. Ng, \Bimodal emotion recognition," in Proc. of the Int. Conference on Face and Gesture Recognition, 2000. [56] Z. Zeng, J. Tu, M. Liu, T.S. Huang, B. Pianfetti, D. Roth, and S. Levinson, \Audio-visual aect recognition," IEEE Transactions on Multimedia, vol. 9, pp. 424{428, 2007. [57] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, S. Lee, U. Neu- mann, and S. Narayanan, \Analysis of emotion recognition using facial expres- sions, speech and multimodal information," in Proc. of Sixth International Con- ference on Multimodal Interfaces (ICMI), 2004. [58] Z. Zeng, Y. Hu, G.I. Roisman, Z. Wen, Y. Fu, and T.S. Huang, \Audio-visual spontaneous emotion recognition," in Proceedings of the ICMI 2006 and IJCAI 2007 international conference on Artical intelligence for human computing, 2007, pp. 72{90. [59] M. Pantic, G. Caridakis, E. Andre, J. Kim, K. Karpouzis, and S. Kollias, Emotion- Oriented Systems, chapter Multimodal emotion recognition from low-level cues, pp. 115{132, Cognitive Technologies, Springer, 2011. 
[60] I. Kanluan, M. Grimm, and K. Kroschel, \Audio-visual emotion recognition using an emotion space concept," in Proc. of EUSIPCO, 2008. 164 [61] H. Gunes and M. Piccardi, \Bi-modal emotion recognition from expressive face and body gestures," Journal of Network and Computer Applications, vol. 30, pp. 1334{1345, 2007. [62] H. Gunes and M. Piccardi, \Automatic temporal segment detection and aect recognition from face and body display," IEEE Trans. on Systems, Man, and Cybernetics - Part B, Special Issue on Human Computing, vol. 39, pp. 64{84, 2009. [63] G. Castellano, L. Kessous, and G. Caridakis, \Multimodal emotion recognition from expressive faces, body gestures and speech.," in Proc. of ACII, 2007. [64] G. Castellano, S.D Villalba, and A. Camurri, \Recognising human emotions from body movement and gesture dynamics," in Proc. of ACII, 2007, 2007. [65] N. Fragopanagos and J.G. Taylor, \Emotion recognition in human-computer in- teraction," Neural Networks, Special Issue on Emotion and Brain, vol. 18, pp. 389{405, 2005. [66] F. Nasoz, C. Lisetti, K. Alvarez, and N. Finkelstein, \Emotion recognition from physiological signals for user modeling of aect," in 9th Int. Conf. on User Model, 2003. [67] J. Kim and E. Andre, Perception and Interactive Technologies, chapter Emotion recognition using physiological and speech signal in short-term observation, pp. pp. 53{64, Springer-Verlag Berlin Heidelberg, 2006. [68] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, \Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, pp. 32 { 80, 2001. [69] A. Metallinou, S. Lee, and S. Narayanan, \Decision level combination of multi- ple modalities for recognition and analysis of emotional expression," in Proc. of ICASSP, Dallas, Texas, 2010, pp. 2462{2465. [70] N. Henley, Body politics revisited: What do we know today? In Gender, Power, and Communication in Human Relationships., Hillsdale, NJ: Lawrence Erlbaum, 1995. [71] A. Metallinou, C.-C. Lee, C. Busso, S. Carnicke, and S. Narayanan, \The USC CreativeIT database: A multimodal database of theatrical improvisation," in Workshop on Multimodal Corpora, LREC 2010, 2010. [72] S. M. Carnicke, Stanislavsky in Focus: An Acting Master for the Twenty-First Century, Routledge, UK, 2008. 165 [73] A. Metallinou and S. Narayanan, \Annotation and processing of continuous emo- tional attributes: Challenges and opportunities," in Proc. of EmoSPACE, in conjunction with the IEEE International Conf. on Automatic Face and Gesture Recognition (FG), 2013. [74] DC: American Psychiatric Association 4 ed. Washington, Ed., Diagnostic and sta- tistical manual of mental disorders: DSM-IV, American Psychiatric Association., 2000. [75] Z. Ye, Y. Li, A. Fathi, Y. Han, A. Rozga, G. D. Abowd, and J. M. Rehg., \De- tecting eye contact using wearable eye-tracking glasses," in Proc. of 2nd Intl. Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PET- MEI) in conjunction with UbiComp, 2012. [76] E. Mower, M. J. Mataric, and S. S. Narayanan., \A framework for automatic human emotion classication using emotional proles," IEEE Transactions on Audio, Speech and Language Processing, Accepted for publication. [77] J. J. Diehl and R. Paul, \Acoustic dierences in the imitation of prosodic pat- terns in children with autism spectrum disorders," Research in Autism Spectrum Disorders, vol. 1, pp. 123{134, 2012. [78] J.P.H. Van Santen, E.T. Prud'hommeaux, L.M. Black, and M. 
Mitchell, \Com- putational prosodic markers for autism," Autism, vol. 14, pp. 215{236, 2010. [79] D. Bone, M. P. Black, C.-C. Lee, M. E. Williams, P. Levitt, S. Lee, and S. Narayanan, \Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist," in Proc. of Interspeech, 2012. [80] G. Celani, M. W. Battacchi, and L. Arcidiacono, \The understanding of the emotional meaning of facial expressions in people with autism," Journal of Autism and Developmental Disorders, vol. 29, pp. 57{66, 1999. [81] R. B. Grossman, L. Edelson, and H. Tager-Flusberg(in press)., \Production of emotional facial and vocal expressions during story retelling by children and ado- lescents with high-functioning autism.," Journal of Speech Language & Hearing Research, in press. [82] N. Yirmiya, C. Kasari, M.Sigman, and P. Mundy, \Facial expressions of aect in autistic, mentally retarded and normal children," Journal of Child Psychology and Psychiatry, vol. 30, pp. 725{735, 1989. [83] R.B. Grossman, N. Pitre, A. Schmid, and K. Hasty, \First impressions: Facial expressions and prosody signal asd status to nave observers," in Intl. Meeting for Autism Research, 2012. 166 [84] J.O. Ramsay and B.W. Silverman, Functional data analysis, 2nd ed., New York, Springer, 2005. [85] L. D. Shriberg, R. Paul, J. L. McSweeny, A. M. Klin, D. J.Cohen, and F. R. Volkmar, \Speech and prosody characteristics of adolescents and adults with high-functioning autism and asperger syndrome.," Journal of Speech Language & Hearing Research, vol. 44, pp. 1097{1115, 2001. [86] R. Plutchik, \The Nature of Emotions," American Scientist, vol. 89, no. 4, pp. 344{350, 2001. [87] P. Ekman and W. Friesen, \The repertoire of nonverbal behavior: Categories, origins, usage, and coding," Semiotica, vol. 1, pp. 49{98, 1969. [88] O.H. Mowrer, Learning theory and behavior, New York: Wiley, 1960. [89] A. Ortony and T. J. Turner, \What's basic about basic emotions?," Psychological Review, vol. 97, pp. 315{331, 1990. [90] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. S. Narayanan, \Emotion recognition using a hierarchical binary decision tree approach," Journal of Speech Communi- cation (Accepted for Publication). [91] M. W ollmer, F. Eyben, B. Schuller, E. Douglas-Cowie, and R. Cowie, \Data- driven clustering in emotional space for aect recognition using discriminatively trained LSTM networks," in Proc. of Interspeech, UK, 2009, pp. 1595{1598. [92] Z. Zeng, Z. Zhang, B. Pianfetti, J. Tu, and T. S. Huang, \Audio-visual aect recognition in activation-evaluation space," in Proc. of ICME, 2005. [93] C. Busso, M. Bulut, C-C Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, \IEMOCAP: interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, pp. 335{359, 2008. [94] D. Mendonca and W. Wallace, \A cognitive model of improvisation in emergency management.," IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, vol. 37, no. 4, pp. 547{561, 2007. [95] K. Johnstone, Impro: Improvisation and the Theatre., Routledge / Theatre Arts, New York., 1981. [96] K. Perlin and A.Goldberg, \Improv: A system for scripting interactive actors in virtual worlds.," in Proceedings of the 23rd Annual Conference on Computer Graphics., 1996. 167 [97] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach, \Emotional speech: Towards a new generation of databases.," Speech Communication, vol. 40, pp. 33{6, April 2003. [98] F. Enos and J. 
Hirschberg, \A framework for eliciting emotional speech: Capi- talizing on the actor's process," in LREC Workshop on Corpora for Research on Emotion and Aect, Genova, Italy., 2006. [99] T. Banziger and K. R. Scherer, \Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus.," in Int'l Conference on Aective Computing and Intelligent Interaction (ACII), 2007. [100] L. Anolli, F. Mantovani, M. Mortillaro, A. Vescovo, A. Agliati, L. Confalonieri, O.Realdon, V. Zurloni, and A. Sacchi, \A multimodal database as a background for emotional synthesis, recognition and training in e-learning systems.," in ACII 2005, Beijing, 2005. [101] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder, \The SEMAINE database: annotated multimodal records of emotionally coloured conversations between a person and a limited agent.," IEEE Trans. of Aective Computing, Special issue of Resources for Aective Computing., vol. 6, 2011. [102] E. Douglas-Cowie, N. Campbell, and R.P. Cowie, \Emotional speech: Towards a new generation of databases," Speech Communication, vol. 40, pp. 3360, 2003. [103] M. Grimm, K. Kroschel, and S. Narayanan, \The Vera am Mittag german audio- visual emotional speech database," in In Proc. of the IEEE Intl. Conf. on Multi- media and Expo (ICME), 2008. [104] A. Metallinou, A. Katsamanis, Y. Wang, and S. Narayanan, \Tracking changes in continuous emotion states using body language and prosodic cues," in Proc. of ICASSP 2010, 2011. [105] L. Devillers, R. Cowie, J. C. Martin, E. Douglas-Cowie, S. Abrilian, and M. McRorie, \Real life emotions in french and english tv video clips: an inte- grated annotation protocol combining continuous and discrete approaches," in Proc. of LREC, 2006. [106] R. Cowie and M. Sawey, \GTrace - General trace program from Queen's, Belfast," http://www.dfki.de/ schroed/feeltrace/, 2011. [107] Yi-Hsuan Yang and Homer H. Chen, \Ranking-based emotion recognition for music organization and retrieval," IEEE Trans. on Audio, Speech and Language Processing, vol. 19, pp. 762 { 774, 2011. 168 [108] K. Audhkhasi and S. Narayanan, \A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels.," IEEE Trans. on Pattern Analysis and Machine Intelligence., 2012. [109] H. Meng, A. Kleinsmith, and N. Bianchi-Berthouze, \Multi-score learning for aect recognition: the case of body postures," in Proc. of ACII 2011, 2011. [110] M. Nikolaou, V. Pavlovic, and M. Pantic, \Dynamic probabilistic cca for analysis of aective behaviour," in Proc. of the 12th European conference on Computer Vision (ECCV), 2012. [111] P.E. Shrout and J.L. Fleiss, \Intraclass correlations: uses in assessing rater relia- bility," Psychological Bulletin, vol. 86, pp. 420{428, 1979. [112] D.S. Messinger, T. Cassel, S. Acosta, Z. Ambadar, and J.F. Cohn, \Infant smiling dynamics and perceived positive emotion," Journal of Nonverbal Behavior, vol. 32, pp. 133{155, 2008. [113] A. Kittur, E. H. Chi, and B. Suh, \Crowdsourcing user studies with mechanical turk," in Proc. of the SIGCHI Conference on Human Factors in Computing Systems, 2008. [114] M. Buhrmester, T. Kwang, and S. D. Gosling, \Amazon's mechanical turk : A new source of inexpensive, yet high-quality, data?," Perspectives on Psychological Science, vol. 6, 2011. [115] C. Callison-Burch, \Fast, cheap, and creative: evaluating translation quality using amazon's mechanical turk," in Proc. 
of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009. [116] M. Pantic and L.J.M. Rothkrantz, \Toward an aect-sensitive multimodal human- computer interaction," Proceedings of the IEEE, vol. 91, pp. 1370{1390, 2003. [117] L.S. Chen and T.S. Huang, \Emotional expressions in audiovisual human com- puter interaction," in IEEE ICME, 2000. [118] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, \Analysis of emotion recognition using facial expressions, speech and multimodal information," in ICMI. 2004, pp. 205{211, ACM Press. [119] M. W ollmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan, \Context- sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling," in Proc. of Interspeech, Japan, 2010. 169 [120] A. Metallinou, C.Busso, S. Lee, and S. Narayanan, \Visual emotion recogni- tion using compact facial representations and viseme information," in Proc. of ICASSP, Dallas, Texas, 2010, pp. 2474{2477. [121] T. B anziger and K. R. Scherer, \Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus," in Proc. of the 2nd Intl. Conf. on Aective Computing and Intelligent Interaction, 2007. [122] F. Enos and J. Hirschberg, \A framework for eliciting emotional speech: Capi- talizing on the actors process," in 1st Intl. Workshop on Emotion: Corpora for Research on Emotion and Aect (International conference on Language Resources and Evaluation), 2006. [123] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classication, Springer-Verlag New York, Inc., 2007. [124] I. Cohen, Q. T. Xiang, S. Zhou, X. Sean, Z. Thomas, and T. S. Huang, \Feature selection using principal feature analysis," 2002. [125] C. Busso, S. Lee, and S.S. Narayanan, \Using neutral speech models for emotional speech analysis," in Interspeech 2007, 2007. [126] P. Boersma, \Praat, a system for doing phonetics by computer.," Glot Interna- tional, vol. 5, no. 9/10, pp. 341{345, 2001. [127] F. Eyben, M. W ollmer, and B. Schuller, \openSMILE - the Munich versatile and fast open-source audio feature extractor," in Proc. of ACM Multimedia, Firenze, Italy, 2010. [128] D. Glowinski, N. Dael, A. Camurri, G. Volpe, M. Mortillaro, and K. Scherer, \To- wards a minimal representation of aective gestures," IEEE Trans. on Aective Computing, vol. 2, pp. 106{118, 2011. [129] A. Kleinsmith, N. Bianchi-Berthouze, and A. Steed, \Automatic recognition of non-acted aective postures," IEEE Trans. on Systems, Man and Cybernetics, Part B, vol. 41, pp. 1027{1038, 2011. [130] G.Varni, G.Volpe, and A.Camurri, \A system for real-time multimodal analysis of nonverbal aective social interaction in user-centric media," IEEE Trans. on Multimedia, vol. 12, pp. 576{590, 2011. [131] V. Rozgic, B. Xiao, A. Katsamanis, B. Baucom, P. Georgiou, and S. Narayanan, \Estimation of ordinal approach-avoidance labels in dyadic interactions: ordinal logistic regression approach.," in In Proc. of ICASSP, 2011. 170 [132] D. Bernhardt and P. Robinson, \Detecting aect from non-stylised body motions," in Proc. of ACII 2007, 2007. [133] A. Kleinsmith and N. Bianchi-Berthouze, \Recognizing aective dimensions from body posture," in Proceedings of the 2nd Intl Conf on Aective Computing and Intelligent Interaction (ACII), 2007. [134] J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P.W. McOwan, and A. Paiva, \Automatic analysis of aective postures and body motion to detect engagement with a game companion," in Proc. 
of ACM/IEEE Intl Conf. on Human-Robot Interaction, 2011, 2011. [135] A. Kleinsmith and N. Bianchi-Berthouze, \Aective body expression perception and recognition: A survey.," IEEE Transactions on Aective Computing, pp. 1{20, 2012. [136] H. Peng, F. Long, and C. Ding, \Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. of Pattern Analysis and Machine Intelligence, vol. 27, pp. 1226{1238, 2005. [137] B. de Gelder and J. Vroomen, \The perception of emotions by ear and by eye," Cognition and Emotion, pp. 289{311, May 2000. [138] A. K. Dey and G. D. Abowd, \Towards a better understanding of context and context-awareness," in Proc. of the 1st Intl. Symposium on Handheld and Ubiq- uitous Computing (HUC), 1999. [139] I. Cearreta, J.M. Lopez, and N. Garay-Vitoria, \Modelling multimodal context- aware aective interaction," in Proc. of the Doctoral Consortium of the 2nd Intl. Conf. on ACII, 2007. [140] G. McIntyre, \Towards aective sensing," in Proc. HCII, 2007. [141] A. Graves, S. Fernandez, and J. Schmidhuber, \Bidirectional LSTM networks for improved phoneme classication and recognition," in Proc. of ICANN, Poland, 2005, vol. 18, pp. 602{610. [142] S. Fine, Y. Singer, and N. Tishby, \The Hierarchical Hidden Markov Model: Analysis and applications," Machine Learning, vol. 32, pp. 41{62, 1998. [143] A. McCabe and J. Trevathan, \Handwritten signature verifcation using comple- mentary statistical models," Journal Of Computers, vol. 4, pp. 670{680, 2009. [144] I. Cohen, A. Garg, and T. S. Huang, \Emotion recognition from facial expressions using multilevel HMM," Neural Information Processing Systems, 2000. 171 [145] J. A. Nelder and R. Mead, \A simplex method for function minimization," Com- puter Journal, vol. 7, pp. 308{313, 1965. [146] M. Pilu, \Video stabilization as a variational problem and numerical solution with the Viterbi method," in Proc. of CVPR, 2004. [147] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, \Gradient ow in recurrent nets: the diculty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. IEEE Press, 2001. [148] M. Schuster and K. K. Paliwal, \Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673{2681, November 1997. [149] A. Graves and J. Schmidhuber, \Framewise phoneme classication with bidirec- tional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5-6, pp. 602{610, June 2005. [150] M. W ollmer, B. Schuller, F. Eyben, and G. Rigoll, \Combining long short-term memory and dynamic bayesian networks for incremental emotion-sensitive arti- cial listening," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, 2010. [151] A. Graves, \RNNLib toolbox," http://sourceforge.net/projects/rnnl/. [152] R. Cowie, E. Douglas-cowie, B. Apolloni, J. Taylor, A. Romano, and W Fellenz, \What a neural net needs to know about emotion words," in N. Mastorakis (ed.): Computational Intelligence and Applications. Word Scientic Engineering Society. 1999, pp. 109{114, Society Press. [153] C.D. Manning and H. Schutze, Foundations of Statistical Natural Language Pro- cessing, Chapter 11, The MIT Press, 1999. [154] R. Bakeman and J. M. Gottman, Observing Interaction: An Introduction to Sequential Analysis, 2nd Edition, Cambridge University Press, 1997. [155] R. Bakeman and V. 
Quera, Analyzing Interaction: Sequential Analysis with SDIS and GSEQ., Cambridge University Press, 1995. [156] M. A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. thesis, University of Waikato, 1999. [157] Ian H. Witten and Eibe Frank, Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann, San Francisco, 2nd edition edition, 2005. 172 [158] A. V. Nean, L. Liang, X. Pi, L. Xiaoxiang, C. Mao, and K. Murphy, \A coupled HMM for audio-visual speech recognition," in Proc. of ICASSP, 2002, pp. 2013{ 2016. [159] M. Brand, N. Oliver, and A. Pentland, \Coupled Hidden Markov Models for complex action recognition," Proc. of CVPR, 1997. [160] G. Gravier, G. Potamianos, and C. Neti, \Asynchrony modeling for audio-visual speech recognition," in Proc. of the 2nd Intl. Conf. on Human Language Technol- ogy Research, San Francisco, CA, USA, 2002, HLT '02, pp. 1{6, Morgan Kaufmann Publishers Inc. [161] SPSS Inc., \SPSS base 10.0 for windows user's guide," SPSS Inc., Chicago IL., 1999. [162] J. A. Russell, J.-A. Bachorowski, and J.-M. Fernndez-Dols, \Facial and vocal expressions of emotion," Annual Review of Psychology, vol. 54, pp. 329{349, February 2003. [163] H. Perez Espinosa, C. A. Reyes Garca, and L. Villaseor Pineda, \Acoustic feature selection and classication of emotions in speech using a 3d continuous emotion model," in Ninth IEEE International Conference on Automatic Face and Gesture Recognition (FG 2011), 2011. [164] A. Graves, Supervised sequence labelling with recurrent neural networks, Ph.D. thesis, Technische Universit at M unchen, 2008. [165] H. Meng and N. Bianchi-Berthouze, \Naturalistic aective expression classica- tion by a multi-stage approach based on hidden Markov models," in Proc. of ACII, 2011, 2011. [166] A. Metallinou, M. Woellmer, A. Katsamanis, F. Eyben, B. Schuller, and S. Narayanan, \Context-sensitive learning for enhanced audiovisual emotion clas- sication," IEEE Trans. of Aective Computing, vol. to appear, 2012. [167] Y. Minami, A. McDermott, E.and Nakamura, and S. Katagiri, \A theoretical analysis of speech recognition based on feature trajectory models," in Proc. of Interspeech, 2004, 2004. [168] Lijuan Zhuang, Xiaodanand Wang, Frank K. Soong, and Mark Hasegawa-Johnson, \A minimum converted trajectory error (MCTE) approach to high quality speech- to-lips conversion," in Proc. of Interspeech, 2010, 2010. [169] D. Ververidis and C. Kotropoulos, \Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, pp. 1162{1181, 2006. 173 [170] A. Metallinou, \Emotion Tracking Code (GMM mapping)," http://sail.usc. edu/sail_tools.php. [171] P.N. Juslin and K.R. Scherer, Chapter 3: Vocal Expression of Aect, The new handbook of Methods in Nonverbal Behavior Research, chapter Chapter 3: Vocal Expression of Aect, pp. 65{135, Oxford Univ. Press, 2005, 2005. [172] F. Eyben, M. Woellmer, A. Graves, B. Schuller, E. Douglas-Cowie, and R. Cowie, \On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues," J. Multimodal User Interfaces, vol. 3, pp. 7{19, 2010. [173] I. Kanluan, M. Grimm, and K. Kroschel, \Audio-visual emotion recognition using an emotion space concept," in Proc. of EUSIPCO, 2008, 2008. [174] H. Asperger, Autistic psychopathy in childhood, Autism and Aspergers Syndrome. Cambridge: Cambridge University Press., 1944. [175] L. Kanner, \Autistic disturbances of aective contact," Acta Paedopsychiatr, vol. 35, pp. 100{136, 1968. [176] A. deMarchena and I.M. 
Eigsti, \Conversational gestures in autism spectrum disorders: Asynchrony but not decreased frequency," Autism Research, vol. 3, pp. 311{322, 2010. [177] P. Ekman, W.V. Friesen, and J.C. Hager, The facial action coding system (2nd ed.), Salt Lake City: Research Nexus eBook., 2002. [178] K. L. Schmidt, Z. Ambadar, J. F. Cohn, and L. I. Reed, \Movement dierences between deliberate and spontaneous facial expressions: Zygomaticus major action in smiling," Journal of Nonverbal Behavior, vol. 1, pp. 37{52, 2006. [179] Y. L. Tian, T. Kanade, and J. F. Cohn, \Recognizing action units for facial expres- sion analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 97{116, 2001. [180] M. S. Bartlett, G. C. Littlewort, M. G. Frank, C. Lainscsek, I. R. Fasel, and J. R. Movellan, \Automatic recognition of facial actions in spontaneous expressions," Journal of Multimedia, vol. 1, pp. 22{35, 2006. [181] M. Pantic and I. Patras, \Dynamics of facial expression: Recognition of facial actions and their temporal segments from face prole image sequences," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 36, pp. 433{449, 2006. [182] Z. Zeng, M. Pantic, G.I. Roisman, and T.S Huang, \A survey of aect recognition methods: Audio, visual, and spontaneous expressions," Trans. of Pattern Analysis and Machine Intelligence, vol. 31, pp. 39{58, 2009. 174 [183] M. Gubian, L. Boves, and F. Cangemi, \Joint analysis of f0 and speech rate with functional data analysis," in Proc. of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2011. [184] O.A. Donoghue, A.J. Harrison, N. Coey, and K. Hayes, \Functional data analysis of running kinematics in chronic achilles tendon injury.," Medicine & Science in Sports and Exercise, vol. 40, pp. 1323{1335, 2008. [185] S. Baron-Cohen and J. Kingsley, Mind Reading: The Interactive Guide to Emo- tions, London, UK, 2003. [186] J.O. Ramsay, G. Hooker, and S. Graves, Functional Data Analysis with R and Matlab, vol. 66, Blackwell Publishing Inc, 2010. [187] W. Fay and A.L. Schuler, Emerging language in autistic children, Baltimore: University Park Press., 1980. [188] Trevor F. Cox and Michael A. Cox, Multidimensional Scaling. 2nd ed., Chapman & Hall/CRC, 2000. [189] C. Busso, A. Metallinou, and S. Narayanan, \Iterative feature normalization for emotional speech detection," in Proc. of ICASSP, 2011. [190] D. Bone, M. P. Black, M. Li, A. Metallinou, S. Lee, and S. Narayanan, \Intoxi- cated speech detection by fusion of speaker normalized hierarchical features and gmm supervectors," in Proc. of Interspeech, 2011. [191] C. Busso, S. Mariooryad, A. Metallinou, and S. Narayanan, \Iterative feature normalization scheme for automatic emotion detection from speech," submitted to Transactions of Aective Computing (TAC), under review. [192] A. Metallinou, D. Bohus, and J. Williams, \Discriminative state tracking for spoken dialog systems," in Proc. of the annual meeting of the Association for Computational Linguistics (ACL), long paper, 2013. 175 List of Publications .1 Journal Publications A. Metallinou, A. Katsamanis and S. Narayanan, \Tracking continuous emotional trends of participants during aective dyadic interactions using body language and speech information", Image and Vision Computing (IMAVIS), Special Issue on Aect Analysis in Continuous Input, Vol.31, Issue 2, pp. 137-152, Feb. 2013 A. Metallinou, M. Woellmer, A. Katsamanis, F. Eyben, B. Schuller and S. 
Narayanan, \Context-Sensitive Learning for Enhanced Audiovisual Emotion Classication", IEEE Transactions of Aective Computing (TAC), Vol. 3 , No. 2, pp. 184-198, April-June 2012 C. Busso, S. Mariooryad, A. Metallinou and S. Narayanan, \Iterative Feature Nor- malization Scheme for Automatic Emotion Detection from Speech", IEEE Trans- actions of Aective Computing (TAC), under review 176 .2 Conference Publications A. Metallinou, D. Bohus and J. Williams, \Discriminative State Tracking for Spoken Dialog Systems", the annual meeting of the Association for Computational Liguistics (ACL), long paper, Bulgaria, 2013 A. Metallinou, R. B. Grossman and S. Narayanan, \Quantifying Atypicality In Af- fective Facial Expressions Of Children With Autism Spectrum Disorders", ICME, San Jose, USA, 2013 Z. Yang, A. Metallinou and S. Narayanan, \Toward Body Language Generation in Dyadic Interaction Settings from Interlocutor Multimodal Cues", ICASSP, Van- couver, Canada, 2013 A. Metallinou and S. Narayanan, \Annotation and Processing of Continuous Emo- tional Attributes: Challenges and Opportunities", EmoSPACE, in conjunction with the IEEE International Conf. on Automatic Face and Gesture Recognition (FG), Shanghai, China 2013 K. Audhkhasi, A. Metallinou, M. Li and S. Narayanan, \Speaker Personality Clas- sication Using Systems Based on Acoustic-Lexical Cues and an Optimal Tree- Structured Bayesian Network", InterSpeech, Portland, US 2012 A. Metallinou, A. Katsamanis and S. Narayanan, \A Hierarchical Framework for Modeling Multimodality and Emotional Evolution in Aective Dialogs", ICASSP, Kyoto, Japan 2012 177 M. Woellmer, A. Metallinou, A. Katsamanis, B. Schuller and S. Narayanan, \An- alyzing the Memory of BLSTM Neural Networks for Enhanced Emotion Classi- cation in Dyadic Spoken Interactions", ICASSP, Kyoto, Japan, 2012 M. Li, A. Metallinou, D. Bone and S. Narayanan, \Speaker States Recognition us- ing Latent Factor Analysis Based Eigenchannel Factor Vector Modeling", ICASSP, Kyoto, Japan, 2012 D. Bone, M. P. Black, M. Li, A. Metallinou, S. Lee and S. Narayanan, \Intoxi- cated Speech Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors", InterSpeech, Florence, Italy, 2011, 1st award at the 2011 Interspeech Speaker State (Intoxication) Challenge A. Metallinou, A. Katsamanis, Y. Wang and S. Narayanan, \Tracking Changes in Continuous Emotion States using Body Language and Prosodic Cues", ICASSP, Prague, Czech Republic, 2011 C. Busso, A. Metallinou and S. Narayanan, \Iterative Feature Normalization for Emotional Speech Detection", ICASSP, Prague, Czech Republic, 2011 M. Woellmer, A. Metallinou, F. Eyben, B. Schuller and S. Narayanan, \Context- Sensitive Multimodal Emotion Recognition from Speech and Facial Expression us- ing Bidirectional LSTM Modeling", InterSpeech, Makuhari, Japan, 2010 178 A. Metallinou, C.-C. Lee, C. Busso, S. Carnicke and S. Narayanan, \The USC CreativeIT Database: A Multimodal Database of Theatrical Improvisation", Mul- timodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC), Malta, 2010 A. Metallinou, S. Lee and S. Narayanan, \Decision Level Combination of Multi- ple Modalities for Recognition and Analysis of Emotional Expression", ICASSP, Dallas, Texas, US, 2010 A. Metallinou, C. Busso, S. Lee and S. Narayanan, \Visual Emotion Recognition using Compact Facial Representations and Viseme Information", ICASSP, Dallas, Texas, US, 2010 E. Mower, A. Metallinou, C.-C. Lee, A. Kazemzadeh, C. Busso, S. Lee, S. 
Narayanan, \Interpreting Ambiguous Emotional Expressions", ACII Special Session: Recog- nition of Non-Prototypical Emotion from Speech, Amsterdam, The Netherlands, 2009 D. Dimitriadis, A. Metallinou , I. Konstantinou, G. Goumas, P. Maragos, N. Koziris, \GRIDNEWS: A Distributed Automatic Greek Broadcast Transcription System", ICASSP, Taipei, Taiwan, 2009. A. Metallinou, S. Lee and S. Narayanan, \Audio-Visual Emotion Recognition using Gaussian Mixture Models for Face and Voice", IEEE International Symposium on Multimedia (ISM2008), Berkeley, California US, 2008 179