MULTIMODAL ANALYSIS OF EXPRESSIVE HUMAN COMMUNICATION: SPEECH AND GESTURE INTERPLAY

by Carlos Busso

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2008

Copyright 2008 Carlos Busso

Dedication

This dissertation is dedicated to my wife, my children and my parents.

Acknowledgements

I would like to thank my adviser, Prof. Shrikanth Narayanan, for providing an opportunity to join his research group to pursue my goal. He served not only as my mentor, but also as a role model of leadership and scholarly expertise. He encouraged and challenged me throughout my academic program. I have learned invaluable lessons from him. I truly admire his enthusiastic dedication to science, which has strongly influenced my research. I will never forget his support to attend international conferences, in which I had the chance not only to present my work, but also to meet distinguished scholars. I owe him a debt of gratitude and appreciation.

I would also like to thank Prof. C.-C. Jay Kuo and Prof. Ulrich Neumann for their participation in my guidance committee. Their comments and suggestions helped me to focus my research in the right directions. Likewise, I want to express my gratitude to Prof. Panayiotis Georgiou and Prof. Sungbok Lee for their valuable advice and friendly help.

This dissertation could not have been written without my loving and devoted wife Andrea (in fact, her talented hands drew many of the figures presented in this dissertation). I am thankful for her unconditional support during this journey. I am greatly indebted for her encouragement and understanding. She has played a crucial part in my achievements. To my daughter Natalia, who has delighted my life with her smiles and love, and to my soon-to-be-born son Carlos (and the ones to come), I give my sincere thanks. Likewise, I want to thank my parents, my brother Marcelo and my sister Angélica for their support.

Thanks also go to my colleagues at the Speech Analysis and Interpretation Laboratory (SAIL) for their friendship and unselfish help. In particular, I want to thank Abe Kazemzadeh, Emily Mower, Matt Black, and Joseph Tepperman for reviewing and proofreading in detail several of my publications. I also want to thank my officemates Murtaza Bulut, Serdar Yildirim, Angeliki Metallinou, Chi-Chun (Jeremy) Lee, and Andreas Tsiartas, with whom I spent wonderful moments that I will never forget. I would like to thank some of my closest friends at USC: Jorge Silva and Andrea Peña, Fabian Rojas and Carolina Arancibia, Cristian Plisco and Claudia Flores, Pietro Scaglioni and Daisy Mora, Javier Jo and Nuria Peña, Felipe Arrate and Paulina Beeche, Nelson Rodriguez and Melissa Price. They have been very important in my life.

Finally, I thank the Graduate School of USC for the support that I received during my academic program (Provost's Fellowship during 2003-2005, and Fellowship in Digital Scholarship in 2007-2008). Thanks also go to the various organizations that funded this research: the National Science Foundation (NSF), the Department of the Army, and a MURI award from the Office of Naval Research (ONR).

Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Open challenges
  1.3 Proposed approach
  1.4 Dissertation contributions
  1.5 Dissertation outline
Chapter 2: Multimodal databases
  2.1 USC Facial Motion Capture Database (FMCD)
  2.2 Interactive emotional dyadic motion capture database (IEMOCAP)
    2.2.1 Introduction
    2.2.2 Brief review of audio-visual databases
    2.2.3 The design of the database
    2.2.4 Recording of the corpus
    2.2.5 Post processing
    2.2.6 Discussion
Chapter 3: Analysis of speech and gestures
  3.1 Interrelation between speech and facial gestures in emotional utterances
    3.1.1 Introduction
    3.1.2 Related work
    3.1.3 Methodology
    3.1.4 Results of the audio-visual mapping
    3.1.5 Results of phoneme-level analysis
    3.1.6 Discussion
  3.2 Interplay between linguistic and affective goals in facial expression
    3.2.1 Introduction
    3.2.2 Audio-visual database and facial features
    3.2.3 Emotional modulation
    3.2.4 Discussion
  3.3 Joint analysis of the interplay between linguistic and affective goals
    3.3.1 Introduction
    3.3.2 Proposed method
    3.3.3 Experimental results
    3.3.4 Discussion
  3.4 Conclusions
Chapter 4: Recognition of non-linguistic cues
  4.1 Using neutral speech models for emotional speech recognition
    4.1.1 Introduction
    4.1.2 Proposed approach
    4.1.3 Neutral model for spectral speech features
    4.1.4 Neutral model for prosodic speech features
    4.1.5 Remarks from this section
  4.2 Analysis of emotion recognition using facial expressions, speech and multimodal information
    4.2.1 Introduction
    4.2.2 Emotion recognition systems
    4.2.3 Methodology
    4.2.4 Results
    4.2.5 Discussion
  4.3 Real-time monitoring of participants' interaction in a meeting using audio-visual sensors
    4.3.1 Introduction
    4.3.2 Related work
    4.3.3 The Smart Room
    4.3.4 Multimodal integration
    4.3.5 Performance evaluation
    4.3.6 Participant interaction
  4.4 Conclusions
Chapter 5: Synthesis of human-like gestures
  5.1 Introduction
  5.2 Related work
    5.2.1 Emotion analysis
    5.2.2 Head motion synthesis
  5.3 Audio-visual database
  5.4 Head motion characteristics in expressive speech
  5.5 Rigid head motion synthesis
    5.5.1 Learning relations between prosodic features and head motion
    5.5.2 Generating realistic head motion sequences
    5.5.3 Configuration of HMMs
    5.5.4 Objective evaluation
  5.6 Facial animation synthesis
  5.7 Evaluation of emotional perception from animated sequences
  5.8 Conclusions
Chapter 6: Conclusions and future work
  6.1 Future work
Bibliography
Appendix

List of Figures

1.1 Big picture
2.1 Audio-visual database collection. The left figure shows the facial marker layout, and the right figure shows the motion capture system.
2.2 Marker layout. In the recording, fifty-three markers were attached to the face of the subjects. They also wore wristbands (two markers) and a headband (two markers). An extra marker was also attached on each hand.
2.3 Two of the actors who participated in the recording, showing the markers on the face and headband.
2.4 VICON motion capture system with 8 cameras. The subject with the markers sat in the middle of the room, with the cameras directed to him/her. The subject without the markers sat outside the field of view of the VICON cameras, facing the subject with markers.
2.5 Histogram with the number of words per turn (in percentage) in the scripted and spontaneous sessions.
2.6 ANVIL annotation tool used for emotion evaluation. The elements were manually created for the turns. The emotional content of the turns can be evaluated based on categorical descriptors (e.g., happiness, sadness) or primitive attributes (e.g., activation, valence).
2.7 ANVIL emotion category menu presented to the evaluators to label each turn.
The evaluators could select more than one emotion and add their own comments.
2.8 Distribution of the data for each emotional category. The figure only contains the sentences in which the category with the highest vote was unique (Neu = neutral state, Hap = happiness, Sad = sadness, Ang = anger, Sur = surprise, Fea = fear, Dis = disgust, Fru = frustration, Exc = excited and Oth = other).
2.9 (a) ANVIL attribute-based menu presented to the evaluators to label each turn. (b) Self-assessment manikins. The rows illustrate valence (top), activation (middle), and dominance (bottom).
2.10 Distribution of the emotional content of the corpus in terms of (a) valence, (b) activation, and (c) dominance. The results are separately displayed for scripted (black) and spontaneous (gray) sessions.
2.11 Percentage of the markers that were lost during the recording. The figure only shows the markers that have more than 1% missing values. Dark colors indicate a higher percentage.
3.1 Face parameterization. (a) the facial marker subdivision (upper, middle and lower face regions), (b) the head motion features, and (c) the lip features.
3.2 Linear estimation framework to quantify the level of coupling between facial gestures and acoustic features. An MMSE estimator is used to map the acoustic features into the facial feature space. A filtered version of this estimated signal, F̂_Facial, is used to measure Pearson's correlation.
3.3 Facial activeness during speech. The figures show that during speech, the lower face region is the most active area of the face. They also show inter-emotional differences. During happiness and anger, the activeness of the face is higher than during the neutral state. Conversely, during sadness the activeness of the face decreases.
3.4 Palette used in the plots. Darker shadings imply higher activeness (Figure 3.3) or higher correlation (Figures 3.5 and 3.6).
3.5 Correlation results for sentence-level mapping. (a-d) Prosodic features, (e-h) vocal tract features. The figure shows high levels of correlation between the original and estimated facial features, especially when MFCCs are used to estimate the mapping. The figures also suggest that the link between acoustic and facial features is influenced by the emotional content of the utterance.
3.6 Correlation results for global-level mapping. (a-d) Prosodic features, (e-h) vocal tract features. The figures show that the correlation levels significantly decrease compared with the results at sentence level. MFCC-based estimated facial features also present higher correlation levels than when prosodic features are used. The lower face region presents the highest correlation levels in the face.
3.7 Correlation levels as a function of the number of eigenvectors (P) used to approximate T̃ (Equation 3.8). The slopes of these curves indicate that the correlation levels decrease slowly as P decreases, supporting the hypothesis of an emotion-dependent structure in the audio-visual mappings.
3.8 Dynamic time warping to align neutral and emotional facial features.
(a) optimum alignment path, (b) original lip features, (c) normalized and warped lip features.
3.9 Graphical representation of correlation levels between neutral and warped versions of emotional facial features. Dark areas represent high correlation values, which imply a low degree of freedom to convey non-verbal information.
3.10 Graphical representation of the Euclidean distance between the normalized neutral (F̂_t^(neu)) and emotional (F̂_t^(emo)) facial features. Dark areas represent large distances, which imply that the areas are not driven by articulatory processes.
3.11 Mean and standard deviation of the likelihood scores in terms of broad phonetic classes, evaluated with the corpus presented in Section 2.1. The neutral models were trained with MFB features.
3.12 (a) Data-driven approach to cluster the facial region [126]. (b) Facial subdivision: (F1) forehead, (F2) left eye, (F3) right eye, (F4) left cheek, (F5) right cheek, (F6) nasolabial, and (F7) chin.
3.13 KLD between the distributions of the neutral and emotional likelihood score values. Vowels present stronger emotional modulation in the spectral acoustic domain.
3.14 KLD between the distributions of the neutral and emotional values of the pitch (a) and energy (b).
3.15 KLD between the distributions of the neutral and emotional values of the displacement coefficient for the facial areas: (F1) forehead, (F2) left eye, (F3) right eye, (F4) left cheek, (F5) right cheek, (F6) nasolabial, (F7) chin.
4.1 Proposed two-step neutral model approach to discriminate neutral versus emotional speech. In the first step, the input speech is contrasted with robust neutral reference models. In the second step, the fitness measures are used for binary emotional classification (details are given in Section 4.1.2).
4.2 Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the EMA corpus. The neutral models were trained with MFB features.
4.3 Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the CCD corpus. The neutral models were trained with MFB features.
4.4 Likelihood score histograms for the broad phonetic classes B, F, N and T. The EMA corpus is used (MFB-based neutral models).
4.5 Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the EMA corpus. The neutral models were trained with MFCC features.
4.6 Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the CCD corpus. The neutral models were trained with MFCC features.
4.7 Most emotionally prominent features according to the average KLD ratio between features derived from emotional and neutral speech. The figures show the sentence-level (top) and voiced-level (bottom) features.
The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.8 Average KLD ratio between pitch features derived from emotional and neutral speech from the EMA corpus. The label Emo corresponds to the average results across all emotional categories. In order to keep the y-axis fixed, some of the bars were clipped. The first 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.9 Average KLD ratio between pitch features derived from emotional and neutral speech from the EPSAT corpus. The label Emo corresponds to the average results across all emotional categories. Only the emotional categories hot anger, happiness and sadness are displayed. In order to keep the y-axis fixed, some of the bars were clipped. The first 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.10 Average KLD ratio between pitch features derived from emotional and neutral speech from the GES corpus. The label Emo corresponds to the average results across all emotional categories. Only the emotional categories anger, happiness and sadness are displayed. In order to keep the y-axis fixed, some of the bars were clipped. The first 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.11 Most emotionally discriminative pitch features according to the log-likelihood ratio scores in logistic regression analysis, when only one feature is entered at a time into the models. The figure displays the average results across emotional databases and categories. The figures show the sentence-level (top) and voiced-level (bottom) features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.12 Most frequently selected features in logistic regression models using forward feature selection. The figures show the sentence-level (top) and voiced-level (bottom) features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.
4.13 Performance of the proposed neutral model approach in terms of the number of mixtures in the GMM. The results are not sensitive to this variable. For the rest of the experiments k = 2 was selected.
4.14 Conventional classification scheme for automatic emotion recognition. Speech features are directly used as input to the classifier, instead of the fitness measures estimated from the neutral reference models (Fig. 4.1).
4.15 Five areas of the face considered in this study.
4.16 First two components of the low eye area vector.
4.17 System based on facial expression.
4.18 Feature-level and decision-level fusion.
4.19 Smart Room. The left figure shows the smart room. The right figure shows the microphone array and the omnidirectional camera.
4.20 Omnidirectional image from the 360° camera and its panoramic transform.
4.21 Detection of participants' faces with the 360° camera.
4.22 The system is distributed, running over TCP, with information exchange as depicted above.
4.23 Speaker localization system. See Sections 4.3.3.3, 4.3.3.4 and 4.3.4 for details.
4.24 High-level group interaction measures estimated from automatic (a-c) and manual (d-f) speaker segmentation.
4.25 High-level interaction flow inferences.
4.26 Dynamic behavior of speakers' activeness over time.
5.1 Head poses using Euler angles.
5.2 2D projection of Voronoi regions using 32-size vector quantization.
5.3 Head motion synthesis framework.
5.4 Example of a synthesized head motion sequence. The figure shows the noisy 3D signal Ẑ (Equation 5.4), with the key points marked as circles, and the interpolated 3D signal X̂, used as the head motion sequence.
5.5 Kullback-Leibler distance rate (KLDR) of HMMs for eight head-motion clusters. Light colors mean that the HMMs are different, and dark colors mean that the HMMs are similar. The figure reveals the differences between the emotion-dependent HMMs.
5.6 Overview of the data-driven expressive facial animation synthesis system. The system is composed of three parts: recording, modeling, and synthesis.
5.7 Synthesized eye-gaze signals. Here, the solid line (blue) represents synthesized gaze signals, and the dotted line (red) represents captured signal samples.
5.8 Synthesized sequences for happy (top) and angry (bottom) sentences.
5.9 Dynamic time warping. Optimum path (left panel) and warped head motion signal (right panel).
5.10 Self-Assessment Manikins [73]. The rows illustrate: top, valence [1-positive, 5-negative]; middle, activation [1-excited, 5-calm]; and bottom, dominance [1-weak, 5-strong].
5.11 Subjective evaluation of emotions conveyed in the valence domain [1-positive, 5-negative]. Each quadrant has the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).
5.12 Subjective evaluation of emotions conveyed in the activation domain [1-excited, 5-calm]. Each quadrant has the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).
5.13 Subjective evaluation of emotions conveyed in the dominance domain [1-weak, 5-strong].
Each quadrant has the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).

List of Tables

2.1 Scenarios used for eliciting unscripted/unrehearsed interactions in the database collection. The target emotions for each subject are given in parentheses (Fru = frustration, Sad = sadness, Hap = happiness, Ang = anger, Neu = neutral).
2.2 Segmentation statistics of the IEMOCAP database speech. Comparative details for popular spontaneous spoken dialog corpora are also shown.
2.3 Confusion matrix between emotion categories estimated from human evaluations (Neu = neutral state, Hap = happiness, Sad = sadness, Ang = anger, Sur = surprise, Fea = fear, Dis = disgust, Fru = frustration, Exc = excited and Oth = other).
2.4 Fleiss' kappa statistic to measure inter-evaluator agreement. The results are presented for all the turns and for the turns in which the evaluators reached agreement.
2.5 Inter-evaluator agreement of the attribute-based evaluation measured with the Cronbach alpha coefficient.
2.6 Example of the annotations for a portion of a spontaneous session (third scenario in Table 2.1). The example includes the turn segmentation (in seconds), the transcription, the categorical emotional assessments (three subjects) and the attribute emotional assessments (valence, activation, dominance, two subjects).
2.7 Comparison of the recognition rate in percentage between the evaluation by self and others for the spontaneous/unscripted scenarios (categorical evaluation). The results are presented for six of the actors (e.g., F03 = female actress in session 3).
3.1 Average activeness of facial features during emotional speech (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger).
3.2 Statistical significance of inter-emotion activation differences (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger).
3.3 Summary correlation for sentence-level mapping (N = neutral, S = sadness, H = happiness, A = anger).
3.4 Statistical significance of inter-emotion differences in correlation of the sentence-level mapping (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger).
3.5 Summary correlation for global-level mapping (N = neutral, S = sadness, H = happiness, A = anger).
3.6 Statistical significance of inter-emotion differences in correlation of the global-level mapping (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger).
3.7 Fraction of eigenvectors used to span 90% or more of the variance of the parameter T (N = neutral, S = sadness, H = happiness, A = anger, G = global).
3.8 Pearson's correlation for the audio-visual mapping at phoneme level (N = neutral, S = sadness, H = happiness, A = anger).
3.9 Statistical significance of differences between broad phonetic classes (Al = all, Vw = vowels, Co = consonants, Vo = voiced, Uv = unvoiced, Na = nasal, Pl = plosive, Fr = fricative).
3.10 Frame-by-frame analysis between emotional and neutral facial features.
3.11 Broad phonetic classes.
3.12 Likelihood scores for MFB (Sad = sadness, Ang = anger, Hap = happiness, Neu = neutral).
3.13 Average values of pitch and energy during broad phonetic classes (Sad = sadness, Ang = anger, Hap = happiness, Neu = neutral).
3.14 Ratio between the average displacement coefficient observed in emotional and neutral utterances (S = sadness, A = anger, H = happiness, N = neutral).
4.1 Broad phone classes.
4.2 Discriminant analysis of likelihood scores for the CCD database (Neg = negative, Neu = neutral).
4.3 Discriminant analysis of likelihood scores for the EMA database (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger, Emo = emotional).
4.4 Summary of the databases (neu = neutral, ang = anger, hap = happiness, sad = sadness, bor = boredom, dis = disgust, fea = fear, anx = anxiety, pan = panic, anh = hot anger, anc = cold anger, des = despair, ela = elation, int = interest, sha = shame, pri = pride, con = contempt, sur = surprise).
4.5 Sentence- and voiced-level features extracted from the F0.
4.6 Additional sentence-level F0 features derived from the statistics of the voiced region patterns.
4.7 Details of the logistic regression analysis using FFS (sentence-level features).
4.8 Details of the logistic regression analysis using FFS (voiced-level features).
4.9 Correlation of the selected pitch features.
4.10 Performance of the proposed neutral model approach. The performance of a conventional LDC classifier (without neutral models) for the same task is also presented.
4.11 Performance of the proposed neutral model approach for each emotional database.
4.12 Precision rate of the proposed neutral model approach for each emotional category.
4.13 Validating the robustness of the neutral model approach against mismatch between training and testing conditions. Sentence-level features (E = English, G = German, S = Spanish).
4.14 Validating the robustness of the neutral model approach against mismatch between training and testing conditions. Voiced-level features (E = English, G = German, S = Spanish).
4.15 Confusion matrix of the emotion recognition system based on audio.
4.16 Performance of the facial expression classifiers.
4.17 Decision-level integration of the five facial-block emotion classifiers.
4.18 Confusion matrix of the combined facial expression classifier.
4.19 Confusion matrix of the feature-level integration bimodal classifier.
4.20 Decision-level integration bimodal classifier with different fusing criteria.
4.21 Confusion matrix of the decision-level bimodal classifier with the product-combining rule.
4.22 All of the above results are obtained in real time, and include the whole length of the meeting, with no time given for initial convergence. A: speaker ID as obtained purely from the speech signal using a GMM; B: localization obtained by the two visual information channels and the microphone array; C: speaker identification and localization based on all information channels, assuming perfect knowledge of L, the seating arrangement of the participants; D: as C, but the speaker-location mapping, L, is continuously estimated from the data; E: speaker-location mapping, L.
4.23 Comparison between the hand-based addressee annotations and the turn-taking transition matrix.
5.1 Statistics of rigid head motion.
5.2 Results for different configurations using generic HMMs.
5.3 Canonical correlation analysis between original and synthesized head motion sequences.
5.4 Subjective agreement evaluation, variance about the mean.
5.5 Naturalness assessment of rigid head motion sequences [1-robot-like, 5-human-like].

Abstract

The verbal and non-verbal channels of human communication are internally and intricately connected. As a result, gestures and speech present high levels of correlation and coordination. This relationship is greatly affected by the linguistic and emotional content of the message being communicated. The interplay is observed across the different communication channels, such as various aspects of speech, facial expressions, and movements of the hands, head and body. For example, facial expressions and prosodic speech tend to have a stronger emotional modulation when the vocal tract is physically constrained by the articulation to convey other linguistic communicative goals. As a result of the analysis, applications in recognition and synthesis of expressive communication are presented.

From an emotion recognition perspective, we propose to build acoustically neutral models, which are used to measure the degree of similarity between the input speech and neutral speech. A fitness measure is then used as a feature for classification, achieving better performance than conventional classification schemes in terms of accuracy and robustness. In addition to detecting users' emotions, we analyze how to use such ideas for meta-analysis of user behavior, such as automatically monitoring and tracking the behaviors, strategies and engagement of the participants in multiperson interactions. We describe a case study of an intelligent meeting environment equipped with audio-visual sensors. We accurately estimate in real time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion.

Finally, we show examples of how to synthesize expressive behavior by exploiting the interrelation between speech and gestures.
We propose to synthesize natural head motion sequences from acoustic prosodic features by sampling from trained Hidden Markov Models (HMMs). Our comparison experiments show that the synthesized head motions are perceived to be as natural as the captured head motion sequences.

Chapter 1: Introduction

1.1 Motivation

In normal human-human interaction, gestures and speech are intricately coordinated to express and emphasize ideas, and to provide suitable feedback. The tone and intensity of speech, facial expressions, rigid head motion and hand movements are combined in a non-trivial manner as they unfold in natural human communication. The fact that all these communicative channels interact and cooperate to convey a desired message suggests that gestures and speech are controlled by the same internal control system [29, 133, 173, 174]. These communicative channels are not only strongly connected, but also systematically synchronized at different scales (phonemes, words, phrases, sentences) [29]. Therefore, a joint analysis of these modalities is needed to fully understand expressive human communication.

Human beings communicate in a multimodal manner. For example, the emotional state of the speaker, which is one of the most notable non-linguistic cues, is manifested through modulation of various communicative channels, including facial expressions [69], head motion (Section 5.4), eyebrow movement [64] and speech [44, 156, 187]. Likewise, during speech production the vocal tract is shaped and the face is deformed to reach the articulatory target, affecting regions quite far from the oral aperture [174]. The fact that many of these channels are actively or passively involved during the production of speech (verbal) and facial expressions (non-verbal) indicates that linguistic and affective goals co-occur during human interaction. Since conflicts may appear between these communicative goals in their realization, some kind of central control system needs to buffer, prioritize and execute them in a coherent manner.

Human beings also perceive communicative messages by simultaneously processing information acquired by the ears and eyes. For instance, it has been shown that the visual channel improves the intelligibility of speech [26]. In fact, when these two channels receive contradictory information, something different from the message conveyed by the individual modalities may be perceived, a phenomenon known as the McGurk effect [132].

If we understood how to model the spatial-temporal modulation of these communicative goals in gestures and speech, many exciting and challenging applications could be designed. For example, if computers could infer non-explicit messages from users, such as their affective state, they could give specific and appropriate help in ways that are more in tune with the user's needs and preferences. Also, applications such as realistic facial animation could be significantly improved by learning and synthesizing human-like gestures that mimic the manner in which human beings interact, to effectively engage the users. Following this direction, multimodal approaches have been proposed to improve the design of systems such as automatic speech recognition (ASR) [151], human-like conversational agents [29], emotion recognition [90, 192], and realistic facial animation [12, 79].

As a result of these observations, it is clear that a joint study of gestures and speech will provide a more complete picture of how humans communicate and interact.
Toward understanding expressive human communication, this dissertation focuses on the analysis, recognition and synthesis of human communicative messages under a multimodal framework.

1.2 Open challenges

Although some work has been done to study expressive human communication, many challenging areas remain open. It is well known that the gestures and speech displayed under an emotional state differ from those displayed under a neutral state. However, it is not clear how to model this spatial-temporal emotional modulation. If audio-visual models do not consider how the coupling between gestures and speech changes in the presence of emotion, they will not accurately reflect the manner in which humans communicate.

Another interesting problem to study is the interdependencies between the various human communicative channels in conveying verbal and non-verbal messages. It is not clear how the communicative goals are buffered, prioritized and executed, especially when there are conflicts between them. The study of this question is important in understanding which gestures are more dominant in communicating the emotional state of the speaker.

Another open question is how human beings communicate with other people. How are gestures used to respond to the feedback given by the interlocutor? Which gestures are used as active listeners? How are the verbal and non-verbal messages conveyed by the interlocutor perceived? Which strategies are used during small group discussions?

After learning how to model expressive human communication, the next question is how to use those models to design and enhance applications that will help and engage users. These are some of the open challenges addressed in this dissertation. By studying these problems from an engineering perspective, we expect to shed light on the underlying relationship between gestures and speech, which will provide knowledge to model expressive human communication.

1.3 Proposed approach

Figure 1.1 shows a high-level description of the proposed study on expressive human communication. As discussed in Section 1.1, both the affective state and the communicative goals influence the manner in which human beings communicate. Therefore, the proposed analysis includes how linguistic and emotional goals are communicated during human interaction. The proposed approach distinguishes interactions at the level of individuals, two persons (dyads) and small groups. The study at the individual level provides the basis to model expressive human communication. Here, the focus is placed on studying how gestures and speech are used together to convey communicative goals, and how the relationship between gestures and speech changes in the presence of emotions. Since the gestures and speech of the speakers are influenced by the feedback provided by other interlocutors, a dyadic analysis of the various communicative channels is necessary to understand human interaction. Here, the analysis is centered on how individuals perceive and respond to the feedback provided by the other interlocutor. At the small group level, the goal is to study the strategies used by the participants to interact with each other, and how these strategies define the flow of the interaction. These three levels provide complementary information toward understanding expressive human communication.

Figure 1.1: Big picture (overview diagram relating the analysis, recognition and synthesis components across individual, dyad and small-group interactions).

This dissertation is divided into three main components: analysis, recognition and synthesis. In the analysis section, the purpose is to understand how the communicative channels are combined to express communicative goals. The correlation between gestures and speech is studied to quantify the degree and the structure of the audio-visual coupling. The interplay between affective and linguistic goals is also analyzed to see the degree of freedom that various communicative channels have to express the verbal and non-verbal messages. Likewise, a joint analysis of facial expression and speech is presented, which indicates that when one modality is constrained by articulatory processes, other communicative channels are used to convey the emotional goals of the speaker. These results in turn inform the development of models for synthesis and recognition.
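The coupling analysis rests on estimating facial features from acoustic features and measuring how closely the estimate tracks the measured markers (see Figure 3.2). The following is a minimal sketch of that idea, assuming a least-squares affine mapping from MFCC frames to facial marker features and an average Pearson correlation as the coupling score; the array names, shapes and synthetic data are placeholders rather than the exact parameterization used in Chapter 3.

```python
import numpy as np

def fit_linear_mapping(acoustic, facial):
    """Least-squares affine map T from acoustic frames to facial frames.

    acoustic: (n_frames, n_acoustic) array; facial: (n_frames, n_facial) array.
    A bias column is appended so the map is affine.
    """
    A = np.hstack([acoustic, np.ones((acoustic.shape[0], 1))])
    T, *_ = np.linalg.lstsq(A, facial, rcond=None)
    return T

def estimate_facial(acoustic, T):
    A = np.hstack([acoustic, np.ones((acoustic.shape[0], 1))])
    return A @ T

def mean_pearson_correlation(facial, facial_hat):
    """Average Pearson correlation across facial feature dimensions."""
    r = [np.corrcoef(facial[:, k], facial_hat[:, k])[0, 1]
         for k in range(facial.shape[1])]
    return float(np.mean(r))

# Hypothetical sentence: 300 frames, 13 MFCCs, 10 facial marker features.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((300, 13))
markers = mfcc @ rng.standard_normal((13, 10)) + 0.1 * rng.standard_normal((300, 10))

T = fit_linear_mapping(mfcc, markers)
print(mean_pearson_correlation(markers, estimate_facial(mfcc, T)))
```

Fitting one mapping per sentence corresponds to the sentence-level mapping reported in Figure 3.5, while fitting a single mapping over many sentences corresponds to the global-level mapping of Figure 3.6.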
In the recognition section, the knowledge learned from the analysis is used to infer non-linguistic human messages. At the individual level, we propose a two-step approach to discriminate emotional from neutral speech. In the first step, neutral reference models are built to measure the degree of similarity between the input speech and reference neutral speech. The output of this block is a fitness measure of the input speech with respect to the neutral model. In the second step, these measures are used as features to infer whether the input speech is emotional or neutral. When compared to the conventional approach, in which the speech features are directly used as the input to the classifier, the proposed approach not only achieves better performance, but also presents better generalization. We also study emotion recognition systems based on speech, facial gestures and both modalities together. The results reveal that acoustic and facial features contain complementary and redundant information. As a result, a multimodal approach increases not only the performance, but also the robustness, compared with the unimodal systems. We apply these ideas to study meeting environments. At the small group level, the flow of the interaction is studied. Using an intelligent environment equipped with audio-visual sensors, the participants' behavior and strategies are monitored using automatically annotated high-level features, such as the number of turns and the transitions of turns between participants.
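As a rough illustration of this two-step scheme, the sketch below fits a Gaussian mixture model on neutral speech features only, uses the per-utterance average log-likelihood under that model as the fitness measure, and feeds this single measure to a second-step classifier. It is a minimal sketch assuming scikit-learn and entirely hypothetical feature arrays; the actual system described in Chapter 4 builds neutral references over MFB and prosodic features and per broad phonetic class.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical frame-level features (e.g., Mel filter-bank energies), one array per utterance.
neutral_train = [rng.normal(0.0, 1.0, size=(200, 26)) for _ in range(50)]
labeled_utts = ([rng.normal(0.0, 1.0, size=(180, 26)) for _ in range(10)] +   # neutral
                [rng.normal(0.8, 1.5, size=(180, 26)) for _ in range(10)])    # "emotional"
labels = np.array([0] * 10 + [1] * 10)

# Step 1: neutral reference model; the fitness measure of an utterance is its
# average frame log-likelihood under this model.
neutral_gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
neutral_gmm.fit(np.vstack(neutral_train))

def fitness(utterance):
    return neutral_gmm.score(utterance)   # mean log-likelihood per frame

# Step 2: binary classification using only the fitness measure as the feature.
X = np.array([[fitness(u)] for u in labeled_utts])
clf = LogisticRegression().fit(X, labels)
print(clf.score(X, labels))
```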
In the synthesis section, models derived from the analysis are used to generate realistic, human-like facial animation using a data-driven approach. The focus is placed on rigid head motion sequences learned from prosodic features. The framework exploits the coupling and synchronization between head motion and acoustic features. Furthermore, the study includes the effect on emotional perception of the facial animation when an intentional mismatch between the emotion of the head motion sequence and the rest of the animation is synthesized. This analysis reveals the important role of head motion in how the facial animation is perceived.

1.4 Dissertation contributions

The proposed work is novel, with important theoretical and practical value. A summary of the main results is listed below.

Analysis

- Facial and acoustic features are strongly interrelated, showing levels of correlation higher than r = 0.8 when the mapping is computed at the sentence level using spectral envelope speech features. The correlation levels present significant inter-emotional differences, which suggests that emotional content affects the relationship between facial gestures and speech.
- Principal component analysis (PCA) shows that the audiovisual mapping parameters are grouped in a smaller subspace, which suggests that there is an emotion-dependent structure that is preserved from sentence to sentence. This internal structure seems to be easy to model when prosodic features are used to estimate the audiovisual mapping.
- The correlation levels within a sentence vary according to the broad phonetic properties present in the sentence. Consonants, especially unvoiced and fricative sounds, present the lowest correlation levels.
- Facial gestures are linked at different resolutions. While the orofacial area is locally connected with the speech, other facial gestures such as eyebrow motion are linked only at the sentence level.
- Facial activeness is mainly driven by articulatory processes. However, clear spatial-temporal patterns are observed during emotional speech, which indicates that emotional goals enhance and modulate facial expressions.
- The upper face region has more degrees of freedom to convey non-verbal information than the lower face region, which is highly constrained by the underlying articulatory processes (affective/linguistic interplay).
- The results indicate that gross statistics of the F0 contour (pitch), such as mean, maximum, minimum and range, are more emotionally prominent than the features describing the pitch shape, which are hypothesized to be closely related to the linguistic content.
- Some broad phonetic classes present stronger emotional differences in the spectral speech features. While front vowels present distinctive emotional modulation, the likelihood scores for nasal sounds are similar across emotional categories, suggesting that during their articulation there are not enough degrees of freedom to convey emotional modulation.
- When one modality is constrained by the articulatory speech process, other channels with more degrees of freedom are used to convey the emotions. Facial expression and prosodic speech tend to have a stronger emotional modulation when the vocal tract is physically constrained by the articulation to convey other linguistic communicative goals.

Recognition

- A novel, robust binary emotion recognition system based on contrasting expressive speech with reference neutral models is proposed using spectral and prosodic speech features.
- Raw Mel filter bank (MFB) output was found to perform better than conventional MFCCs, with both broad-band and telephone-band speech.
- Analyzing the pitch statistics at the sentence level seems to be more accurate and robust than analyzing the pitch statistics for each voiced segment.
- The recognition accuracy of the system is over 77% (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of accuracy and robustness.
- The emotion recognition system based on facial expression gave better performance than the system based on acoustic information alone for the emotions considered. When acoustic and facial features are fused, the performance and the robustness of the emotion recognition system improve measurably.
- In small group meetings, it is possible to accurately estimate in real time not only the flow of the interaction, but also how dominant and engaged each participant was during the discussion (see the sketch after these recognition items).
- Multimodal approaches result in significantly improved performance in spatial localization, identification and speech activity detection of the participants.
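The group-level measures mentioned above reduce to simple statistics over a speaker segmentation. The sketch below, using a hypothetical segmentation format, computes turn counts, turn-taking transition counts and a speaking-time share that can serve as a crude activity/dominance indicator; it is an illustration of the kind of high-level features involved, not the Chapter 4 implementation.

```python
from collections import Counter, defaultdict

# Hypothetical speaker segmentation: (speaker_id, start_sec, end_sec), time-ordered.
segments = [("A", 0.0, 4.2), ("B", 4.2, 6.0), ("A", 6.0, 11.5),
            ("C", 11.5, 13.0), ("B", 13.0, 20.0)]

speaking_time = defaultdict(float)
turn_count = Counter()
transitions = Counter()

for i, (spk, start, end) in enumerate(segments):
    speaking_time[spk] += end - start
    turn_count[spk] += 1
    if i > 0 and segments[i - 1][0] != spk:
        transitions[(segments[i - 1][0], spk)] += 1   # who followed whom

total = sum(speaking_time.values())
activity_share = {spk: t / total for spk, t in speaking_time.items()}

print("turns:", dict(turn_count))
print("turn-taking transitions:", dict(transitions))
print("speaking-time share:", activity_share)
```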
Synthesis

- Rigid head motion conveys important non-verbal information in human communication, and hence it needs to be appropriately modeled and included in realistic facial animations to effectively mimic human behaviors.
- Head motion patterns with neutral speech significantly differ from head motion patterns with emotional speech in motion activation, range and velocity. Thus, head motion provides discriminating information about emotional categories.
- By building Hidden Markov Models for each emotional category, the method naturally models the specific temporal dynamics of emotional head motion sequences (a minimal sketch of this sampling idea follows this list).
- On average, the synthesized head motion sequences were perceived as even more natural than the original head motion sequences.
- Head motion modifies the emotional perception of the facial animation, especially in the valence and activation domains. Appropriate head motion not only significantly improves the naturalness of the animation, but can also be used to enhance the emotional content of the animation to effectively engage the users.
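As a minimal sketch of the HMM-based synthesis idea, the code below uses the third-party hmmlearn package to fit one Gaussian HMM per emotional category on joint prosody/head-pose frames, samples a new pose sequence from one model, and smooths the sampled key points into a trajectory. The feature layout, model sizes, synthetic training data and moving-average smoothing are illustrative assumptions; Section 5.5 describes the actual configuration and interpolation used in the thesis.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(2)

# Hypothetical training data per emotion: frames of [f0, energy, yaw, head_pitch, roll].
train = {
    "happy": rng.standard_normal((1000, 5)) + 0.5,
    "sad": rng.standard_normal((1000, 5)) - 0.5,
}

models = {}
for emotion, X in train.items():
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50, random_state=0)
    m.fit(X)                      # one emotion-dependent HMM per category
    models[emotion] = m

# Sample a noisy key-point sequence from the "happy" model and keep the Euler-angle columns.
sampled, _ = models["happy"].sample(120, random_state=0)
key_points = sampled[:, 2:5]

# Crude moving-average smoothing, standing in for the interpolation step of the thesis.
kernel = np.ones(5) / 5.0
head_motion = np.column_stack(
    [np.convolve(key_points[:, k], kernel, mode="same") for k in range(3)]
)
print(head_motion.shape)
```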
Another important contribution of this research is the multimodal databases that were collected (Chapter 2). These corpora are in themselves valuable resources for the broader community to pursue a variety of interdisciplinary scholarly questions related to interpersonal human communication.

The results presented in this dissertation have been submitted for publication in different journals and international conferences. A full list of the publications is presented in the Appendix.

1.5 Dissertation outline

As mentioned in Section 1.3, this dissertation is divided into three main sections: analysis, recognition and synthesis. Chapter 2 describes the databases used in this study. Chapter 3 presents the analysis of facial gestures and speech during expressive human communication. The analysis includes the level of coupling between facial and acoustic features, and the interplay between these communicative channels to express affective and linguistic goals. In Chapter 4, non-linguistic cues such as affective states and speaker engagement are inferred by processing multimodal streams of data. In Chapter 5, the close relationship between facial gestures and speech is exploited to synthesize realistic human-like facial animation. Specifically, rigid head motion sequences are generated from acoustic prosodic features using a time series framework. Finally, Chapter 6 gives the final remarks and discusses the future directions of this research.

Chapter 2: Multimodal databases

This chapter describes the multimodal emotional databases used in this study. Both corpora were recorded with a motion capture system, providing detailed facial information. Section 2.1 describes the USC Facial Motion Capture Database (FMCD), which was recorded from one subject, who was asked to read a set of sentences portraying four different emotional states: anger, happiness, sadness and the neutral state. Section 2.2 provides the details of the Interactive Emotional Dyadic Motion Capture database (IEMOCAP). This database was recorded from ten actors in dyadic sessions with markers on the face, head, and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions (happiness, anger, sadness, frustration and the neutral state). The corpus contains approximately twelve hours of data. The detailed motion capture information, the interactive setting to elicit authentic emotions, and the size of the database make this corpus a valuable addition to the existing databases in the community for the study and modeling of multimodal and expressive human communication.

Figure 2.1: Audio-visual database collection. The left figure shows the facial marker layout, and the right figure shows the motion capture system.

2.1 USC Facial Motion Capture Database (FMCD)

The audiovisual database used in this work was recorded from an actress who was asked to read a custom-made, phoneme-balanced corpus four times, expressing different emotions (happiness, sadness, anger and the neutral state). A detailed description of her facial expression and rigid head motion was acquired by using 102 markers attached to her face (left of Figure 2.1). A VICON motion capture system with three cameras was used to capture the 3D position of each marker (right of Figure 2.1). The sampling frequency was set to 120 Hz. The recording was made in a quiet room using a close-talking SHURE microphone at a sampling rate of 48 kHz. The markers' motions and the aligned audio were simultaneously captured by the system. In total, 612 sentences were used in this work. Note that the actress did not receive any special instruction about how to express the target emotions, and was asked to be natural.

Even though acted facial expressions have some differences from genuine facial expressions [65], databases based on actors have been widely used in the analysis of emotions. The main advantage of this setting is that a balanced corpus can be designed in advance to include a wide range of phonetic and emotional variability. In addition, the proposed recording setting allows us to use markers that provide detailed facial information, which could be very difficult to obtain in a more realistic production scenario. Such data are useful for the types of analyses presented in this thesis (further discussion is given in Section 2.2.2).
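Because the marker data are sampled at 120 Hz and the audio at 48 kHz, analyses that pair acoustic frames with facial features need a consistent mapping between the two time bases. The helper below is a hypothetical illustration of that bookkeeping, under the assumption that both streams start at the same instant; it is not part of the corpus tools.

```python
MOCAP_RATE_HZ = 120        # marker frames per second
AUDIO_RATE_HZ = 48_000     # audio samples per second
SAMPLES_PER_FRAME = AUDIO_RATE_HZ // MOCAP_RATE_HZ   # 400 audio samples per marker frame

def audio_window_for_frame(frame_index: int) -> tuple[int, int]:
    """Return the [start, end) audio-sample indices covered by one marker frame."""
    start = frame_index * SAMPLES_PER_FRAME
    return start, start + SAMPLES_PER_FRAME

# Example: the audio samples aligned with marker frame 300 (t = 2.5 s).
print(audio_window_for_frame(300))   # (120000, 120400)
```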
2.2 Interactive emotional dyadic motion capture database (IEMOCAP)

2.2.1 Introduction

One of the major limitations in the study of emotion expression is the lack of databases with genuine interaction that comprise integrated information from most of these channels. Douglas-Cowie et al. analyzed some of the existing emotional databases [60], and concluded that in most of the corpora the subjects were asked to simulate ("act") specific emotions. While desirable from the viewpoint of providing controlled elicitation, these simplifications in data collection nevertheless discarded important information observed in real-life scenarios [61]. As a result, the performance of emotion recognition significantly decreases when automatic recognition models developed with such databases are used in real-life applications [11], where a blend of emotions is observed [58, 61] (i.e., combinations of the "basic emotions" [67]). Another limitation of existing corpora is that the recorded materials often consist of isolated utterances or dialogs with few turns [60]. This setting neglects important effects of contextualization, which play a crucial role in how we perceive [30] and express emotions [61]. Likewise, most of the existing databases contain only the acoustic speech channel. Therefore, these corpora cannot be used to study the information that is conveyed through the other communication channels. Other limitations of current emotional databases are the limited number of subjects and the small size of the databases [61]. Similar observations were also presented in the review by Ververidis and Kotropoulos [177].

Considering these limitations, a new audio-visual database was designed, which notably includes direct and detailed motion capture information that facilitates access to detailed gesture information not afforded by the state of the art in video processing. In this database, which will be referred to from here on as the interactive emotional dyadic motion capture database (IEMOCAP), ten actors were recorded in dyadic sessions (5 sessions with 2 subjects each). They were asked to perform three selected scripts with clear emotional content. In addition to the scripts, the subjects were also asked to improvise dialogs in hypothetical scenarios designed to elicit specific emotions (happiness, anger, sadness, frustration and the neutral state). One participant of the pair was motion captured at a time during each interaction. Fifty-three facial markers were attached to the subject being motion captured, who also wore wristbands and a headband with markers to capture hand and head motion, respectively (see Figure 2.2). Using this setting, the emotions were elicited within a proper context, improving the authenticity of the captured emotional data. Furthermore, gathering data from ten different subjects increases the plausibility of effectively analyzing trends observed in this database at a more general level. In total, the database contains approximately twelve hours of data. This corpus, which took approximately 20 months to collect (from the design to the post-processing stages), is hoped to add to the resources that can help advance research on how to model expressive human communication.

Figure 2.2: Marker layout. In the recording, fifty-three markers were attached to the face of the subjects. They also wore wristbands (two markers) and a headband (two markers). An extra marker was also attached on each hand.

The rest of this section is organized as follows. Section 2.2.2 presents a review of audio-visual databases that have been used to study emotions. Section 2.2.3 describes the design of the corpus presented in this section. Section 2.2.4 explains the recording procedures of the database. Section 2.2.5 presents the various post-processing steps, such as reconstruction of the marker data, segmentation and emotional evaluation. Section 2.2.6 discusses how the IEMOCAP database overcomes some of the main limitations of the current state-of-the-art emotional databases.

2.2.2 Brief review of audio-visual databases

One of the crucial improvements needed to achieve major progress in the study of emotion expression is the collection of new databases that overcome the limitations of current emotional corpora. Douglas-Cowie et al.
discussed the state-of-the-art emotional databases [60], focusing on four main areas: scope (number of speakers, emotional classes, language, etc.), naturalness (acted versus spontaneous), context (in-isolation versus in-context) and descriptors (linguistic and emotional description). They highlight the importance of having suitable databases with natural emotions recorded during an interaction rather than monologues. Other requirements for a good database are multiple speakers, multimodal information capture and adequate descriptors of the emotion contained in the corpus.

Given the multiple variables considered in the study of emotions, it is expected that a collection of databases rather than a single corpus will be needed to address many of the open questions in this multidisciplinary area. Unfortunately, there are currently few emotional databases that satisfy these core requirements. Some of the most successful efforts to collect new emotional databases to date have been based on broadcast television programs. Some examples are the Belfast natural database [60, 61], the VAM database [82, 83] and the EmoTV1 database [1]. Likewise, movie excerpts with expressive content have also been proposed for emotional corpora, especially for extreme emotions (e.g., the SAFE corpus [36]). Nevertheless, one important limitation of these approaches is the copyright and privacy problems that prevent the wide distribution of the corpora [46, 60]. Also, the position of the microphones and cameras, the lexical and emotional content, and the visual and acoustic backgrounds cannot be controlled, which challenges the processing of the data [46]. Other attempts to collect natural databases were based on recordings in situ (Genova Airport Lost Luggage database [157]), recording spoken dialogs from real call centers (the CEMO [180] and CCD [115] corpora), asking the subjects to recall emotional experiences [3], inducing emotion with a Wizard of Oz approach in problem-solving settings using a human-machine interface (e.g., SmartKom database [160]), using games specially designed to emotionally engage the users (e.g., the EmoTaboo corpus [190]), and inducing emotion through carefully designed human-machine interaction (i.e., SAL [27, 46]). In the Humaine project portal, further descriptions of some of the existing emotional databases are presented [92].

Recording professional actors under controlled conditions can overcome many of the limitations of the aforementioned recording techniques. We have claimed in our previous work that good quality acted databases can be recorded when suitable acting methodologies are used to elicit emotional realizations from experienced actors, engaged in dialogs rather than monologues [23]. The Geneva Multimodal Emotion Portrayal (GEMEP) corpus [8] is a good example. Enos and Hirschberg argued that emotion arises as a consequence of the difference between what is expected and what is finally achieved [70]. They suggested that acted databases could produce more realistic emotions if this goal-oriented approach is suitably incorporated in the recording.

In order to make a unified analysis of verbal and nonverbal behavior of the subjects possible, the database should include the visual channel capturing gestures and facial expression in conjunction with the aural channel.
Although there are automatic platforms to track salient features in the face using images (e.g., [39]), the level of detailed facial information provided by motion capture data is not presently achievable using the state of the art in video processing. This is especially notable in the cheek area, in which there are no salient feature points. To the best of our knowledge, few motion capture databases exist for the study of emotional expression. Kapur et al. presented an emotional motion capture database, but they targeted only body postures (no facial expressions) [99]. The USC Facial Motion Capture Database (FMCD) is another example (Section 2.1). This database was recorded from a single actress with markers attached to her face, who was asked to read semantically-neutral sentences expressing specific emotions. The two main limitations of this corpus are that the emotions were elicited in isolated sentences, and that only one speaker was recorded. The IEMOCAP database described in this chapter was designed to overcome some of these basic limitations.

The requirements considered in the design of the IEMOCAP database are listed below.

- The database must contain genuine realizations of emotions. Instead of monologues and isolated sentences, the database should contain natural dialogues, in which the emotions are suitably and naturally elicited.
- Many experienced actors should be recorded.
- The recording of the database should be as controlled as possible in terms of emotional and linguistic content.
- In addition to the audio channel for capturing verbal behavior, the database should have detailed visual information to capture the nonverbal information, all in a synchronized manner.
- The emotional labels should be assigned based on human subjective evaluations.

Notice that there are inherent tradeoffs between some of these requirements (e.g., naturalness versus control of expression content). The next sections describe how these requirements were addressed in the IEMOCAP database collection.

2.2.3 The design of the database

The IEMOCAP database was designed as a tool for our research in expressive human communication. We are particularly interested in studying the emotional categories happiness, anger, sadness and the neutral state. These categories are among the most common emotional descriptors found in the literature [148]. For this database, we decided also to include frustration, since it is also an important emotion from an application point of view. Therefore, the content of the corpus was designed to cover those five emotions. As will be discussed in Section 2.2.5, during the emotional evaluation, the emotional categories were expanded to include disgust, fear, excitement and surprise. The purpose of doing so was to have a better description of the emotions found in the corpus, notably in the spontaneous/unscripted elicitation scenarios. The most important consideration in the design of this database was to have a large emotional corpus with many subjects, who were able to express genuine emotions. To achieve these goals, the content of the corpus and the subjects were carefully selected.

2.2.3.1 Material selection

Instead of providing reading material, in which the emotions are not guaranteed to be genuinely expressed during the recording [60], two different approaches were selected: the use of plays (scripted sessions), and improvisation based on hypothetical scenarios (spontaneous sessions).
The first approach is based on a set of scripts that the subjects were asked to memorize and rehearse. The use of plays provides a way of constraining the semantic and emotional content of the corpus. Three scripts were selected after reading more than one hundred 10-minute plays. A theater professional supervised the selection, given the requirement that the plays should convey the target emotions (happiness, anger, sadness, frustration and neutral state). In addition, these plays were selected so that they each consisted of a female and a male role. This requirement was imposed to balance the data in terms of gender. Since these emotions are expressed within a suitable context, they are more likely to be conveyed in a genuine manner, in comparison to recordings of simple isolated sentences.

In the second approach, the subjects were asked to improvise based on hypothetical scenarios that were designed to elicit specific emotions (see Table 2.1). The topics for the spontaneous scenarios were selected following the guidelines provided by Scherer et al. [159]. As reported in their book, the authors polled individuals who were asked to remember situations in the past that elicited certain emotions in them. The hypothetical scenarios were based on some common situations (e.g., loss of a friend, separation). In this setting, the subjects were free to use their own words to express themselves. By granting the actors a considerable amount of liberty in the expression of their emotions, we expected that the results would provide genuine realizations of emotions. A comparison of the advantages and disadvantages of these two elicitation approaches is given in our previous work [24].

Table 2.1: Scenarios used for eliciting unscripted/unrehearsed interactions in the database collection. The target emotions for each subject are given in parentheses (Fru = Frustration, Sad = Sadness, Hap = Happiness, Ang = Anger, Neu = Neutral).

Scenario 1
  Subject 1 (with markers), target Fru: The subject is at the Department of Motor Vehicles (DMV) and he/she is being sent back after standing in line for an hour for not having the right form of ID.
  Subject 2 (without markers), target Ang: The subject works at the DMV. He/she rejects the application.

Scenario 2
  Subject 1, target Sad: The subject, a new parent, was called to enroll in the army in a foreign country. He/she has to separate from his/her spouse for more than 1 year.
  Subject 2, target Sad: The subject is his/her spouse and is extremely sad about the separation.

Scenario 3
  Subject 1, target Hap: The subject is telling his/her friend that he/she is getting married.
  Subject 2, target Hap: The subject is very happy and wants to know all the details of the proposal. He/she also wants to know the date of the wedding.

Scenario 4
  Subject 1, target Fru: The subject is unemployed and has spent the last 3 years looking for work in his/her area. He/she is losing hope.
  Subject 2, target Neu: The subject is trying to encourage his/her friend.

Scenario 5
  Subject 1, target Ang: The subject is furious, because the airline lost his/her baggage and he/she will receive only $50 (for a new bag that cost over $150 and has lots of important things).
  Subject 2, target Neu: The subject works for the airline. He/she tries to calm the customer.

Scenario 6
  Subject 1, target Sad: The subject is sad because a close friend died. He had cancer that was detected a year before his death.
  Subject 2, target Neu: The subject is trying to support his friend in this difficult moment.

Scenario 7
  Subject 1, target Hap: The subject has been accepted at USC. He/she is telling this to his/her best friend.
  Subject 2, target Hap: The subject is very happy and wants to know the details (major, scholarship). He/she is also happy because he/she will stay in LA so they will be together.
Scenario 8
  Subject 1, target Neu: The subject is trying to change the mood of the customer and solve the problem.
  Subject 2, target Ang: After 30 minutes talking with a machine, he/she is transferred to an operator. He/she expresses his/her frustration, but, finally, he/she changes his/her attitude.

2.2.3.2 Actors selection

As suggested in [60], skilled actors engaged in their role during interpersonal drama may provide a more natural representation of the emotions. Therefore, this database relied on seven professional actors and three senior students from the Drama Department at the University of Southern California. Five female and five male actors were selected, after reviewing their audition sessions. They were asked to rehearse the scripts under the supervision of an experienced professional (functioning as a director) who made sure the scripts were memorized and the intended emotions were genuinely expressed, avoiding exaggeration or caricature of the emotions. The subjects were recorded in five dyadic recording sessions, each of which lasted approximately six hours, including suitable rest periods. Since the selected scripts have a female and a male role, an actor and an actress were recorded in each of the five sessions (see Fig. 2.3).

Figure 2.3: Two of the actors who participated in the recording, showing the markers on the face and headband.

2.2.4 Recording of the corpus

For each of the sessions, fifty-three markers (diameter 4 mm) were attached to the face of one of the subjects in the dyad to capture detailed facial expression information, while keeping the markers far from each other to increase the accuracy of the trajectory reconstruction step. Most of the facial markers were placed according to the feature points defined in the MPEG-4 standard [143, 170]. Figures 2.2 and 2.3 show the layout of the markers. The subject wore a headband with two markers on it (diameter 2.5 cm). These markers, which are static with respect to the facial movements, are used to compensate for head rotation. In addition, the subject wore wristbands with two markers each (diameter 1 cm). An extra marker on each hand was also added. Since only three markers are used on each hand, it is not possible to capture detailed hand gestures (e.g., of the fingers). Nevertheless, the information provided by these markers gives a rough estimate of the hands' movements. In total, 61 markers were used in the recording (Fig. 2.2). Notice that the markers are very small and do not interfere with natural speech. In fact, the subjects reported that they felt comfortable wearing the markers, which did not prevent them from speaking naturally. After the scripts and the spontaneous scenarios were recorded, the markers were attached to the other subject, and the sessions were recorded again after a suitable rest. Notice that the original idea was to have markers on both speakers at the same time. However, the current approach was preferred to avoid interference between two separate setups. The VICON cameras are sensitive to any reflective material in their field of view. Therefore, it is technically difficult to locate the additional equipment (computer, microphones, cameras) in the room without affecting the motion capture recording. Furthermore, with this setting, all the cameras were directed at one subject, increasing the resolution and quality of the recordings. The database was recorded using the facilities of the John C. Hench Division of Animation & Digital Arts (Robert Zemeckis Center) at USC.
The trajectories of the markers were recorded using a VICON motion capture system with eight cameras that were placed approximately one meter from the subject with markers, as can be seen in Figure 2.4. The sample rate of the motion capture system was 120 frames per second. To avoid having gestures outside the volume defined by the common field of view of the VICON cameras, the subjects were asked to be seated during the recording. However, they were instructed to gesture as naturally as possible, while avoiding occluding their face with their hands. The subject without the markers was seated out of the field of view of the VICON cameras to avoid possible interference. As a result of this physical constraint, the actors were separated by approximately three meters. Since the participants were within the social distance as defined by Hall [85], we expect that the influence of proxemics did not affect their natural interaction.

Figure 2.4: VICON motion capture system with 8 cameras. The subject with the markers sat in the middle of the room, with the cameras directed at him/her. The subject without the markers sat outside the field of view of the VICON cameras, facing the subject with markers.

At the beginning of each recording session, the actors were asked to display a neutral pose of the face for approximately two seconds. This information can be used to define a neutral pose of the markers. The audio was simultaneously recorded using two high quality shotgun microphones (Schoeps CMIT 5U) directed at each of the participants. The sample rate was set to 48 kHz. In addition, two high-resolution digital cameras (Sony DCR-TRV340) were used to record a semi-frontal view of the participants (see Fig. 2.6). These videos were used for emotion evaluation, as will be discussed in Section 2.2.5.

The recordings were synchronized by using a clapboard with reflective markers attached to its ends. Using the clapboard, the various modalities can be accurately synchronized with the sounds collected by the microphones, and the images recorded by the VICON and digital cameras. The cameras and microphones were placed in such a way that the actors could face each other, a necessary condition for natural interaction. Also, the faces were within the line of sight - not talking to the back of a camera. In fact, the actors reported that the side conditions of the recording did not affect their natural interaction.

Table 2.2: Segmentation statistics of the IEMOCAP database speech. Comparative details for popular spontaneous spoken dialog corpora are also shown.

                         IEMOCAP                             Other spontaneous corpora
                         All turns   Scripted   Spontaneous  Switchboard-I   Fisher
  Turn duration [sec]    4.5         4.6        4.3          4.5             3.3
  Words per turn         11.4        11.4       11.3         12.3            9.9

2.2.5 Post processing

2.2.5.1 Segmentation and transcription of the data

After the sessions were recorded, the dialogs were manually segmented at the dialog turn level (speaker turn), defined as continuous segments in which one of the actors was actively speaking (see Figure 2.6, in which two turns are emotionally evaluated). Short turns showing active listening such as "mmhh" were not segmented. Multi-sentence utterances were split as single turns. For the scripted portion of the data (see Section 2.2.3.1), the texts were segmented into sentences in advance and used as reference to split the dialogs. This segmentation was used only as guidance, since we did not require having the same segmentation in the scripts across sessions.
In total, the corpus contained 10,039 turns (scripted sessions: 5,255 turns; spontaneous sessions: 4,784 turns) with an average duration of 4.5 seconds. The average number of words per turn was 11.4. The histograms of words per turn for the scripted and spontaneous sessions are given in Figure 2.5. These values are similar to the turn statistics observed in well-known spontaneous corpora such as the Switchboard-1 Telephone Speech Corpus (Release 2) and Fisher English Training Speech Part 1 (see Table 2.2).

Figure 2.5: Histograms of the number of words per turn (in percentage) in the (a) scripted and (b) spontaneous sessions.

The professional transcription of the audio dialogs (i.e., what the actors said) was obtained from Ubiqus [172] (see Table 2.6 for an example). Then, forced alignment was used to estimate the word and phoneme boundaries. Conventional acoustic speech models were trained with over 360 hours of neutral speech, using the Sphinx-III speech recognition system (version 3.0.6) [91]. Although we have not rigorously evaluated the alignment results, our preliminary screening suggests that the boundaries are accurate, especially in segments with no speech overlaps. Knowing the lexical content of the utterances can facilitate further investigations into the interplay between gestures and speech in terms of linguistic units (Chapter 3).

2.2.5.2 Emotional annotation of the data

In most of the previous emotional corpus collections, the subjects are asked to express a given emotion, which is later used as the emotional label. A drawback of this approach is that it is not guaranteed that the recorded utterances reflect the target emotions. Additionally, a given display can elicit different emotional percepts. To avoid these problems, the emotional labels in this corpus were assigned based on agreements derived from subjective emotional evaluations. For that purpose, human evaluators were used to assess the emotional content of the database. The evaluators were USC students who were fluent English speakers.

Different methodologies and annotation schemes have been proposed to capture the emotional content of databases (e.g., the Feeltrace tool [47] and the Context Annotation Scheme (MECAS) [58]). For this database, two of the most popular assessment schemes were used: discrete categorical annotations (i.e., labels such as happiness, anger, and sadness), and continuous attribute-based annotations (i.e., activation, valence, and dominance). These two approaches provide complementary information about the emotional manifestations observed in the corpus.

The "annotation of video and spoken language" tool ANVIL [102] was used to facilitate the evaluation of the emotional content of the corpus (see Fig. 2.6). Notice that some emotions are better perceived from audio (e.g., sadness) while others from video (e.g., anger) [49]. Also, the context in which the utterance is expressed plays an important role in recognizing the emotions [30]. Therefore, the evaluators were asked to sequentially assess the turns, after watching the videos.

Figure 2.6: ANVIL annotation tool used for emotion evaluation. The elements were manually created for the turns. The emotional content of the turns can be evaluated based on categorical descriptors (e.g., happiness, sadness) or primitive attributes (e.g., activation, valence).
Thus, the acoustic and visual channels, and the previous turns in the dialog, were available for the emotional assessment.

One assumption made in this evaluation is that, within a turn, there is no transition in the emotional content (e.g., from frustration to anger). This simplification is reasonable, since the average duration of the turns is only 4.5 seconds, so the emotional content is expected to remain constant. Notice that the evaluators were allowed to tag more than one emotional category per turn, to account for mixtures of emotions (e.g., frustration and anger), which are commonly observed in human interaction [58].

Categorical emotional descriptors

Six human evaluators were asked to assess the emotional content of the database in terms of emotional categories. The evaluation sessions were organized so that three different evaluators assessed each utterance. The underlying reason was to minimize evaluation time for the preliminary analysis of the database. The evaluation was divided into approximately 45-minute sessions. The evaluators were instructed to have a suitable rest between sessions.

As mentioned in Section 2.2.3, the database was designed to target anger, sadness, happiness, frustration and neutral state. However, some of the sentences were not adequately described with only these emotion labels. Since the interactions were intended to be as natural as possible, the experimenters expected to observe utterances full of excitement, fear and a broad range of other mixed emotions that are commonly seen during natural human interactions. As described by Devillers et al., emotional manifestations depend not only on the context, but also on the person [58]. They also indicated that ambiguous (non-basic) emotions are frequently observed in real-life scenarios. Therefore, describing emotion is an inherently complex problem. As a possible way to simplify the fundamental problem in emotion categorization, an expanded set of categories was used for emotion evaluation. On the one hand, if the number of emotion categories is too extensive, the agreement between evaluators will be low. On the other hand, if the list of emotions is limited, the emotional description of the utterances will be poor and likely less accurate. To balance this tradeoff, the final emotional categories selected for annotation were anger, sadness, happiness, disgust, fear and surprise (known as basic emotions [67]), plus frustration, excited and the neutral state. Figure 2.7 shows the ANVIL emotion category menu used to label each turn. Although it was preferred that the evaluators chose only a single selection, they were allowed to select more than one emotional label to account for blended emotions [58]. If none of the available emotion categories were adequate, they were instructed to select "other" and write their own comments.

Figure 2.7: ANVIL emotion category menu presented to the evaluators to label each turn. The evaluators could select more than one emotion and add their own comments.

For the sake of simplicity, majority voting was used for emotion class assignment, if the emotion category with the highest number of votes was unique (notice that the evaluators were allowed to tag more than one emotion category). Under this criterion, the evaluators reached agreement in 74.6% of the turns (scripted sessions: 66.9%; spontaneous sessions: 83.1%). Notice that other approaches to reconciling the subjective assessments are possible (e.g., an entropy-based method [167], and multiple labels [58]).
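As an illustration of this majority-vote criterion, a minimal sketch is given below. It is written in Python with hypothetical label names and turn structure; it is not the actual annotation pipeline. A turn receives the category with the highest number of votes only when that category is unique; otherwise the turn is marked as not having reached agreement.

```python
from collections import Counter

def majority_label(per_evaluator_tags):
    """per_evaluator_tags: one tag list per evaluator, e.g.
    [["frustration", "anger"], ["anger"], ["anger"]] (hypothetical turn).
    Returns the winning category, or None when the highest vote count is tied."""
    votes = Counter(tag for tags in per_evaluator_tags for tag in tags)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None          # no unique winner: the turn does not reach agreement
    return ranked[0][0]

# Example: three evaluators, multiple tags allowed per evaluator.
print(majority_label([["frustration", "anger"], ["anger"], ["anger"]]))   # -> anger
print(majority_label([["happiness"], ["excited"], ["neutral state"]]))    # -> None
```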
Figure 2.8: Distribution of the data for each emotional category. The figure only contains the sentences in which the category with the highest vote was unique (Neu = neutral state, Hap = happiness, Sad = sadness, Ang = anger, Sur = surprise, Fea = fear, Dis = disgust, Fru = frustration, Exc = excited and Oth = other). (a) Scripted sessions: Neu 28%, Hap 7%, Sad 15%, Ang 7%, Fru 24%, Exc 17%, Sur 2%, Fea <1%, Dis <1%, Oth <1%. (b) Spontaneous sessions: Neu 17%, Hap 9%, Sad 14%, Ang 23%, Fru 25%, Exc 11%, Sur 1%, Fea <1%, Dis <1%, Oth <1%.

Figure 2.8 shows the distribution of the emotional content in the data for the turns that reached agreement. This figure reveals that the IEMOCAP database exhibits a balanced distribution of the target emotions (happiness, anger, sadness, frustration and neutral state). As expected, the corpus contains few examples of other emotional categories such as fear and disgust.

Using the assigned emotional labels as ground truth, the confusion matrix between emotional categories in the human evaluation was estimated. The results are presented in Table 2.3. On average, the classification rate of the emotional categories was 72%. The table shows that some emotions such as neutral state, anger and disgust are confused with frustration. Also, there is an overlap between happiness and excitement.

Table 2.3: Confusion matrix between emotion categories estimated from human evaluations (Neu = neutral state, Hap = happiness, Sad = sadness, Ang = anger, Sur = surprise, Fea = fear, Dis = disgust, Fru = frustration, Exc = excited and Oth = other).

                 Neu   Hap   Sad   Ang   Sur   Fea   Dis   Fru   Exc   Oth
  Neutral state  0.74  0.02  0.03  0.01  0.00  0.00  0.00  0.13  0.05  0.01
  Happiness      0.09  0.70  0.01  0.00  0.00  0.00  0.00  0.01  0.18  0.01
  Sadness        0.08  0.01  0.77  0.02  0.00  0.01  0.00  0.08  0.01  0.02
  Anger          0.01  0.00  0.01  0.76  0.01  0.00  0.01  0.17  0.00  0.03
  Surprise       0.01  0.04  0.01  0.03  0.65  0.03  0.01  0.12  0.09  0.01
  Fear           0.03  0.00  0.05  0.02  0.02  0.67  0.02  0.05  0.15  0.00
  Disgust        0.00  0.00  0.00  0.00  0.00  0.00  0.67  0.17  0.17  0.00
  Frustration    0.07  0.00  0.04  0.11  0.01  0.01  0.01  0.74  0.01  0.02
  Excited        0.04  0.16  0.00  0.00  0.02  0.00  0.00  0.02  0.75  0.00

To analyze the inter-evaluator agreement, the Fleiss' kappa statistic was computed [74] (see Table 2.4). The result for the entire database is κ = 0.27. The value of the Fleiss' kappa statistic for the turns in which the evaluators reached agreement according to the criterion mentioned before is κ = 0.40. Since the emotional content of the database mainly spans the target emotions (see Fig. 2.8), the kappa statistic was recalculated after clustering the emotional categories as follows. First, happiness and excited were merged, since they are close in the activation and valence domain. Then, the emotional categories fear, disgust and surprise were relabeled as other (only for this evaluation). Finally, the labels of the remaining categories were not modified. With this new labeling, the Fleiss' kappa statistics for the entire database and for the turns that reached agreement are κ = 0.35 and κ = 0.48, respectively. These levels of agreement, which are considered fair/moderate, are expected since people have different perceptions and interpretations of the emotions. These values are consistent with the agreement levels reported in previous work for similar tasks [58, 82, 167]. Furthermore, everyday emotions are complex, which may cause poor inter-evaluator agreement [61].
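For reference, Fleiss' kappa can be computed from a turn-by-category count matrix as sketched below. This is a minimal NumPy implementation of the standard formula, not the exact script used for this analysis; in particular, building the count matrix from this corpus requires a convention for evaluators who tagged more than one category (e.g., keeping only their primary tag), which is an assumption of the sketch.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_turns, n_categories) array; counts[i, j] is the number of the n
    evaluators of turn i who assigned category j (n constant across turns)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                               # evaluators per turn
    p_j = counts.sum(axis=0) / counts.sum()                 # overall category proportions
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))   # per-turn agreement
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()               # observed vs. chance agreement
    return (P_bar - P_e) / (1.0 - P_e)

# Toy example: 3 evaluators, 4 turns, 3 categories (hypothetical counts).
toy = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(round(fleiss_kappa(toy), 3))
```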
Table 2.4 also provides the individual results of the Fleiss' kappa statistic for the scripted and spontaneous sessions. The results reveal that for the spontaneous sessions the levels of inter-evaluator agreement are higher than in the scripted sessions. While the spontaneous sessions were designed to target five specific emotions (happiness, anger, sadness, frustration and neutral state), the scripted sessions include progressive changes from one emotional state to another, as dictated by the narrative content of the play. Within a session, the scripted dialog approach typically elicited a wider range of ambiguous emotion manifestations. As a result, the variability of the subjective evaluations increases, yielding a lower level of inter-evaluator agreement. Further analyses comparing scripted and spontaneous elicitation approaches are given in [24].

Table 2.4: Fleiss' kappa statistic to measure inter-evaluator agreement. The results are presented for all the turns and for the turns in which the evaluators reached agreement.

                          Original labels                   Recalculated labels
  Session                 All turns   Reached agreement     All turns   Reached agreement
  Entire database         0.27        0.40                  0.35        0.48
  Scripted sessions       0.20        0.36                  0.26        0.42
  Spontaneous sessions    0.34        0.43                  0.44        0.52

Continuous emotional descriptors

An alternative approach to describe the emotional content of an utterance is to use primitive attributes such as valence, activation (or arousal), and dominance. This approach, which has recently gained popularity in the research community, provides a more general description of the affective states of the subjects in a continuous space. This approach is also useful for analyzing emotion expression variability. The reader is referred to [44], for example, for further details about how to describe the emotional content of an utterance using such an approach.

The self-assessment manikins (SAMs) were used to evaluate the corpus in terms of the attributes valence [1-negative, 5-positive], activation [1-calm, 5-excited], and dominance [1-weak, 5-strong] [73, 82] (Fig. 2.9). This scheme consists of 5 figures per dimension that describe progressive changes along the attribute axis. The evaluators are asked to select the manikin that best describes the stimulus, which is mapped into an integer between 1 and 5 (from left to right). The SAM system has been previously used to assess emotional speech, showing low standard deviation and high inter-evaluator agreement [81]. Also, using a text-free assessment method bypasses the difficulty arising from each evaluator's individual understanding of linguistic emotion labels. Furthermore, the evaluation is simple, fast, and intuitive.

Figure 2.9: (a) ANVIL attribute-based menu presented to the evaluators to label each turn. (b) Self-assessment manikins. The rows illustrate valence (top), activation (middle), and dominance (bottom).

Table 2.5: Inter-evaluator agreement of the attribute-based evaluation measured with the Cronbach alpha coefficient.

  Session                 Valence   Activation   Dominance
  Entire database         0.809     0.607        0.608
  Scripted sessions       0.783     0.602        0.663
  Spontaneous sessions    0.820     0.612        0.526

Two different evaluators were asked to assess the emotional content of the corpus using the SAM system. At this point, approximately 85.5% of the data have been evaluated. After the scores were assigned by the raters, speaker-dependent z-normalization was used to compensate for inter-evaluator variation.
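The attribute post-processing can be sketched as follows. This is a minimal NumPy illustration under the assumption of per-rater normalization of the 1-5 SAM scores and the standard Cronbach's alpha formula with the raters treated as items; it is not the exact analysis code used for Table 2.5.

```python
import numpy as np

def z_normalize(scores):
    """z-normalize one rater's scores for one attribute (e.g., valence on 1-5)."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std()

def cronbach_alpha(ratings):
    """ratings: (n_turns, n_raters) matrix of one attribute.
    Each rater is treated as an 'item' in the classical alpha formula."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    rater_var = ratings.var(axis=0, ddof=1)        # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of the summed scores
    return (k / (k - 1.0)) * (1.0 - rater_var.sum() / total_var)

# Hypothetical valence scores from two raters over five turns.
valence = np.array([[4, 5], [2, 2], [3, 4], [5, 5], [1, 2]])
print(round(cronbach_alpha(valence), 3))
```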
Figure 2.10 shows the distribution of the emotional content of the IEMOCAP database in terms of valence, activation and dominance. The histograms are similar to the results observed in other spontaneous emotional corpora [82].

The Cronbach alpha coefficients were computed to test the reliability of the evaluations between the two raters [48]. The results are presented in Table 2.5. This table shows that the agreement for valence was higher than for the other attributes.

Figure 2.10: Distribution of the emotional content of the corpus in terms of (a) valence, (b) activation, and (c) dominance. The results are separately displayed for scripted (black) and spontaneous (gray) sessions.

Categorical labels do not provide information about the intensity level of the emotions. In fact, emotional displays that are labeled with the same emotional category can present patterns that are significantly different. Therefore, having both types of emotional descriptions provides complementary insights about how people display emotions and how these cues can be automatically recognized or synthesized for better human-machine interfaces. Table 2.6 gives an example of the annotations of this corpus.

2.2.5.3 Self-emotional evaluation of the corpus

In addition to the emotional assessments with naïve evaluators, we asked six of the actors who participated in the data collection to self-evaluate the emotional content of their sessions using the categorical (i.e., sadness, happiness) and attribute (i.e., activation, valence) approaches. This self-emotional evaluation was performed only on the spontaneous/unscripted scenarios (see Section 2.2.3.1). Table 2.7 compares the self-evaluation ("self") results with the assessment obtained from the rest of the evaluators ("others"). For this table, the emotional labels obtained from majority voting are assumed as ground truth. The turns in which the evaluators did not reach agreement were not considered.

Table 2.6: Example of the annotations for a portion of a spontaneous session (third scenario in Table 2.1). The example includes the turn segmentation (in seconds), the transcription, the categorical emotional assessments (three subjects) and the attribute emotional assessment (valence, activation, dominance; two subjects).

  [05.0 - 07.8] F00: Oh my God. Guess what, guess what, guess what, guess what, guess what, guess what?   [exc][exc][exc]       [5,5,4][5,5,4]
  [07.8 - 08.7] M00: What?                                                                                 [hap][sur][exc]       [4,4,3][4,2,1]
  [08.9 - 10.8] F01: Well, guess. Guess, guess, guess, guess.                                              [exc][exc][exc]       [5,5,4][5,5,4]
  [11.1 - 14.0] M01: Um, you--                                                                             [hap][neu][neu]       [3,3,2][3,3,3]
  [14.2 - 16.0] F02: Don't look at my left hand.                                                           [exc][hap][hap;exc]   [4,3,2][5,4,3]
  [17.0 - 19.5] M02: No. Let me see.                                                                       [hap][sur][sur]       [4,4,4][4,4,4]
  [20.7 - 22.9] M03: Oh, no way.                                                                           [hap][sur][exc]       [4,4,4][5,4,3]
  [23.0 - 28.0] F03: He proposed. He proposed. Well, and I said yes, of course. [LAUGHTER]                 [exc][hap][hap;exc]   [5,4,3][5,5,3]
  [26.2 - 30.8] M04: That is great. You look radiant. I should've guess.                                   [hap][hap][exc]       [4,4,3][5,3,3]
  [30.9 - 32.0] F04: I'm so excited.                                                                       [exc][exc][exc]       [5,4,3][5,5,3]
  [32.0 - 34.5] M05: Well, Tell me about him. What happened                                                [hap][exc][exc]       [4,4,3][4,4,4]

Table 2.7: Comparison of the recognition rate (in percentage) between the evaluations by self and others for the spontaneous/unscripted scenarios (categorical evaluation). The results are presented for six of the actors (e.g., F03 = female actress in session 3).
            F01    F02    F03    M01    M03    M05    Average
  Self      0.79   0.58   0.44   0.74   0.57   0.54   0.60
  Others    0.76   0.80   0.79   0.81   0.80   0.77   0.79

The results are presented in terms of classification percentage. Surprisingly, the results show significant differences between the emotional perceptions of the naïve evaluators and the actors. Although the emotional labels were estimated only from the agreement between naïve evaluators (and therefore the recognition rates are expected to be higher), this table suggests that there are significant differences between both assessments.

In our recent work, we studied in further detail the differences between the evaluations from the naïve raters and the self-assessments in terms of inter-evaluator agreement [22]. We analyzed cross-evaluation results across the actors and the naïve evaluators by estimating the differences in reliability measures when each of the raters was excluded from the evaluation. The results also revealed a mismatch between the expression and perception of the emotions. For example, the actors were found to be more selective in assigning the emotional labels to their turns. In fact, the kappa statistic decreased when the self-evaluations were included in the estimation. Notice that the actors are familiar with how they commonly convey different emotions. Unlike the naïve evaluators, they were also aware of the underlying protocols to record the database. Further analysis of both self and others evaluations is needed to shed light on the underlying differences between how we express and perceive emotions.

2.2.5.4 Reconstruction of marker data

The trajectories of the markers were reconstructed using the VICON iQ 2.5 software [179]. The reconstruction process is semi-automatic, since a template with the markers' positions has to be manually assigned to the markers. Also, the reconstruction needs to be supervised to correct the data when the software is not able to track the markers. Cubic interpolation was used to fill gaps when the number of consecutive missing frames for each marker was less than 30 frames (0.25 second).

Unfortunately, some of the markers were lost during the recording, mainly because of sudden movements of the subjects, and the location of the cameras. Since natural interaction was encouraged, the recording was not stopped when the actors performed sudden movements. The cameras were located approximately one meter from the subject to successfully capture hand gestures in the recording. If only facial expressions had been recorded, the cameras could have been placed closer to the subjects' faces to increase the resolution. Figure 2.11 shows the markers for which the percentage of missing frames was higher than 1% of the corpus. The markers with higher percentages are associated with the eyelids and the hands. The reason is that when the subjects had their eyes open, the eyelids' markers were sometimes occluded. Since the purpose of these markers was to infer eye blinks, the missing markers also provide useful information to infer when the subjects' eyes blinked. The main problem with the hands' markers was the self-occlusion between the hands.

Figure 2.11: Percentage of the markers that were lost during the recording. The figure only shows the markers that have more than 1% of missing values. Dark colors indicate higher percentage.

After the motion data were captured, all the facial markers were translated to place a nose marker at the local coordinate center of each frame, removing any translation effect.
After that, the frames were multiplied by a rotational matrix, which compensates for rotational effects. The technique is based on Singular Value Decomposition (SVD) and was originally proposed by Arun et al. [5]. The main advantage of this approach is that the 3D geometry of every marker is used to estimate the best alignment between each frame and a reference frame. It is robust against marker noise and its performance surpasses that of methods that use a few "static" markers to compensate for head motion. In this technique, the rotational matrix was constructed for each frame as follows. A neutral facial pose for each subject was chosen as a reference frame, which was used to create a 53 x 3 matrix, M_ref, in which each row of M_ref contains the 3D position of one marker. For the frame t, a similar matrix M_t was created by following the same marker order as the reference. After that, the SVD, UDV^T, of the matrix M_ref^T M_t was calculated. Finally, the product VU^T gave the rotational matrix, R_t, for the frame t [5]:

M_ref^T M_t = U D V^T    (2.1)
R_t = V U^T              (2.2)

The markers from the headband were used to ensure good accuracy in the head motion estimation. After compensating for the translation and rotation effects, the remaining motion between frames corresponds to local displacements of the markers, which largely define the subject's facial expressions.
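A minimal sketch of this rotation compensation (Eqs. 2.1-2.2) in Python/NumPy is shown below. It is an illustration of the Arun et al. procedure as described above, not the actual post-processing code, and the marker matrices are assumed to be already translated so that the nose marker sits at the origin of each frame.

```python
import numpy as np

def head_rotation(M_ref, M_t):
    """M_ref, M_t: (53, 3) marker matrices (neutral reference and frame t), already
    translated so the nose marker is at the origin. Returns R_t (Eq. 2.2)."""
    U, D, Vt = np.linalg.svd(M_ref.T @ M_t)   # Eq. 2.1: M_ref^T M_t = U D V^T
    return Vt.T @ U.T                         # Eq. 2.2: R_t = V U^T

def compensate_frame(M_ref, M_t):
    """Rotate frame t back toward the reference pose; the residual motion is the
    local, expression-related displacement of the markers."""
    return M_t @ head_rotation(M_ref, M_t)
```

A practical implementation may additionally check the determinant of R_t to guard against reflections, a standard caveat of this SVD-based alignment. The Euler angles of R_t are also what later serve as the rigid head motion features in Section 3.1.3.1.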
As suggested by Cowie et al., we are planning to evaluate the naturalness of the corpus by conducting subjective assessments [46]. Context: One of the problems in many of the existing emotional databases is that they contain only isolated sentences or short dialogs [60]. These settings remove the dis- course context, which is known to be an important component [30]. In this corpus, the average duration of the dialogues is approximately 5 minutes in order to contextualize the signs and ow of emotions. Since the material was suitably designed from a dialog perspective, the emotions were elicited with adequate context. The emotional evalua- tions were also performed after watching the sentences in context so that the evaluators could judge the emotional content based on the sequential development of the dialogs. Descriptors: The emotional categories considered in the corpus provide a reasonable approximation of the emotions content observed in the database. These emotions are the 39 most common categories found in previous databases. Also, adding the primitive based annotation (valence, activation, and dominance) improves the emotional description of the collected corpus by capturing supplementary aspects of the emotional manifestation (i.e., intensity and variability). Lastly, with the detailed linguistic transcriptions of the audio part of the database, the emotional content can be analyzed in terms of various linguistic levels, in conjunction with the nonverbal cues. In sum, the IEMOCAP was carefully designed to satisfy the key requirements pre- sented in Section 2.2.2. As a result, this database addresses some of the core limitations of the existing emotional databases. This database can play an important role in un- derstanding and modeling the relation between dierent communicative channels used during expressive human communication, and contribute to the development of better human-machine interfaces. 40 Chapter 3: Analysis of speech and gestures In this chapter, the intrinsic relationship between gestures and speech during emotional utterances is analyzed. Section 3.1 analyzes the in uence of articulation and emotions on the interrelation between facial gestures and speech. A multilinear regression framework is used to estimate facial features from acoustics parameters. The levels of coupling between these communication channels are quantied by using Pearson's correlation between the recorded and speech-based estimated facial features. Section 3.2 analyzes and quanties the interplay between linguistic and aective goals in facial expressions. Communicative goals are simultaneously expressed through gestures and speech to convey messages enriched with valuable verbal and non-verbal clues. Therefore the communicative goals need to be buered, prioritized and executed in coherent manner. This analysis includes how constrained are facial areas to express non-verbal messages such as emotions, during active speaking. The aective/linguistic interplay is observed not only in single modalities, but also between communicative channels. We hypothesize that when one modality is con- strained by the articulatory speech process, other channels with more degrees of freedom 41 are used to convey the emotions. Section 3.3 explores these ideas by comparing the emo- tional modulation observed across modalities. 
The results presented here support this hypothesis, since it is observed that facial expressions and prosodic speech tend to have a stronger emotional modulation when the vocal tract is physically constrained by the articulation to convey other linguistic communicative goals. Finally, Section 3.4 gives the conclusions and future directions of this research. The results presented in this chapter have important implications for the recognition (Chapter 4) and synthesis (Chapter 5) of non-linguistic expressive messages.

3.1 Interrelation between speech and facial gestures in emotional utterances

3.1.1 Introduction

In addition to speech, non-verbal communication plays an important role in day-to-day interpersonal human interaction. People simultaneously use their hands, change their facial expressions and control the tone and energy of their speech to consciously or unconsciously express or emphasize specific messages. The fact that all these communicative channels interact and cooperate to convey a desired message suggests that gestures and speech are controlled by the same internal control system [29, 133, 173, 174].

Human communication, manifested through a combination of verbal and nonverbal channels used in normal interaction, is a result of different communicative components that mutually modulate gestures and speech in a non-trivial manner. Notable among them are the linguistic, emotional and idiosyncratic aspects of human communication. The linguistic aspect defines the verbal content of what is expressed. In addition, the underlying articulation also affects the appearance of the face. Each phoneme is produced by the activation of a number of muscles that simultaneously shape the vocal tract and the face, in regions beyond the oral aperture [174]. Likewise, the emotional state of the speaker is directly expressed through both gestures and speech. Communicative channels such as facial expressions [65, 69], head motion (Section 5.4), pitch [44, 156] and the short-time spectral envelope [187] all present specific patterns under emotional states. Similarly, the idiosyncratic aspects also influence the patterns of human speech and gestures, and are dependent on culture and social environment. These also include personal styles, such as the rate of the speech and the intensity and manner of expressing emotions [64, 65].

Speech and gestures are directly connected and manipulated by one or more of these components of human communication. For example, the shape of the orofacial area is highly connected with the linguistic content [174, 186], hand gestures are related to idiosyncrasy [133] and facial expressions are most of the time triggered by emotions [65, 69]. Although each communicative channel has been separately analyzed in the literature [31, 44, 65, 133, 156, 174, 185], relatively few studies have addressed the interaction between gestures and speech as a function of the multiple aspects of human communication. Understanding the linguistic, emotional and idiosyncratic effects on the gesture-speech relationships is a crucial step toward improving a number of interesting applications such as emotion recognition (Chapter 4), human-computer interaction [28, 181], human-like conversational agents [29] and realistic facial animation [12, 79, 84]. It will also provide useful insights into human speech production and perception [119].
Toward understanding how to model expressive human communication, the present section focuses on the linguistic and emotional aspects of human communication and their influences on the relationships between gestures and speech. We investigate the interrelation between facial gestures, such as lip, eyebrow and head motion, and acoustic features that represent the vocal tract shaping and the prosody of speech. We analyze which gestures are more related with speech and how this relation changes in the presence of four different expressive emotional states: neutral, sadness, happiness and anger. The analysis presented here is based on a database recorded from a single actress with markers attached to her face, which provide detailed information about her facial expressions (FMCD database, Section 2.1). We analyze how active the facial gestures are as a function of different emotions. The results indicate that the rate of facial gesture displacements for emotional utterances significantly differs from the one observed in neutral speech. For example, the activeness, as quantified by the Euclidean distance between the facial features and their sentence-mean vector, for happiness and anger is more than 30% higher than for neutral speech. To quantify the relationship between facial gestures and speech, we compute Pearson's correlation between the original facial features and the speech-based estimated sequences generated with multilinear regression. This framework is implemented both at the sentence level, in which the mapping parameters are computed for each sentence assuming that the facial features are known, and at the global level, in which the mapping parameters are estimated using the entire database. At the sentence level, the audio-visual mapping presents significant inter-emotional differences, showing the influence of emotions on the interrelation between gestures and speech. When the mapping is estimated at the global level, the correlation levels, not surprisingly, decrease. However, the results indicate that there is a clear emotion-dependent structure that can be learned using more sophisticated techniques than a linear mapping. Furthermore, the results reveal that the correlation levels also decrease for consonants, especially unvoiced and fricative sounds, and silence regions, when compared to the corresponding broad speech classes of voiced and sonorant sounds.

The main contribution of this section is the analysis of the interplay between gestures and speech as a function of emotion and articulation. To the best of our knowledge, this is the first attempt to quantify the linguistic and affective influence on the relationship between facial gestures and speech. Based on the detailed framework presented here to analyze gestures and speech at different levels (global, sentence and phoneme level), the following important findings are presented:

- The existence of a strong relationship between facial gestures and speech, which greatly depends on the linguistic content of the sentence
- The existence of a low-dimensional emotion-dependent structure in the gesture/speech mapping that is preserved across sentences
- An emotion-dependent structure in the mapping that seems to be easier to model when prosodic features are used to estimate the facial features
- The multiresolution nature of the relationship between facial gestures and speech, both in (feature) space and time

These results can guide the design of better models to capture the complex relationship between facial gestures and speech.
Specifically, human-machine interfaces for applications such as games and educational and entertainment systems represent some of the wide range of applications that can be greatly enhanced by properly modeling and including human-like capabilities.

The rest of this section is organized as follows. Section 3.1.2 describes related work on the co-analysis of gestures and speech. Section 3.1.3 introduces the features and the mathematical framework utilized in this section. Section 3.1.4 describes and discusses our results about the relationship between the various modalities. We analyze how active the facial features are during emotional speech and how much the audio-visual mapping is affected by articulation and emotions. Section 3.1.5 presents the audio-visual mapping at the phoneme level. We explore whether different broad phonetic classes have stronger or weaker relationships when the mapping is estimated at the sentence level. Section 3.1.6 discusses the implications of the results presented in this section in the areas of facial animation and multimodal emotion recognition.

3.1.2 Related Work

Based on the high levels of correlation found between acoustic features and various human gestures, researchers have suggested that an internal control simultaneously triggers the production of both speech and gestures, sharing the same semantic meaning in different communication channels [29, 133, 173, 174]. Cassell et al. mentioned that they are not only strongly connected, but also systematically synchronized at different scales (phonemes-words-phrases-sentences) [29]. This theory, referred to as the excitatory hypothesis [173], implies that a joint analysis of these modalities is needed to fully understand human communication.

The relation between gestures and speech has been studied from different perspectives. One line of research has focused on analyzing this relation in terms of conversational functions, also known as regulators and conversational signals [64]. Gestures and speech co-occur to fulfill functions during conversation, such as acknowledging agreement, changing turns and asking for clarification [28]. Valbonesi et al. studied the co-occurrence of events during speech and hand gestures. They showed that most of the acoustic events, defined as maxima and minima in the pitch and the RMS energy, occur during hand gesture strokes [173]. Graf et al. concluded that head and eyebrow motions are consistently correlated with the prosodic structure of the speech [79]. Granström et al. conducted perceptual experiments showing that head and eyebrow motion help to stress prominence in speech [80]. They have also shown that smiling is the strongest indicator of positive feedback. Such findings have been used in designing human-like virtual agents. Cassell et al. proposed a rule-based system to generate facial expressions, hand gestures, head nods, eyebrow motion and spoken intonation, which were properly synchronized according to rules [29]. In [28], they extended their work to create a human-like agent, called REA, that was able to respond to discourse functions using gestures.

Another line of research has analyzed the relation between speech and gestures, especially facial expressions, as a result of articulatory processes. Vatikiotis-Bateson and Yehia showed that facial expressions are directly connected with articulatory production [174]. They argued that the production of speech shapes the vocal tract and deforms the face, affecting regions quite far from the oral aperture.
In [186], they continued their analysis, presenting results about the relation between facial expressions, vocal-tract configuration and speech. They concluded that most of the facial motions can be predicted from acoustic features (Line Spectral Pairs (LSP)) using linear estimation. In [185], they extended those results to non-linear mappings, showing higher accuracy. Jiang et al. also studied the relation between facial, articulatory and acoustic features, using multilinear regression analysis [95]. The focus of their work was to quantify the audio-visual relationship for the consonant-vowel (CV) syllables C/a/, C/i/ and C/u/. They concluded that the mapping was syllable-dependent, since they found better results for C/a/ than for C/i/ and C/u/. They also found differences in the correlation levels between the four speakers considered, suggesting that the mapping was speaker-dependent. Following a similar approach, Barker and Berthommier studied the mapping for isolated words with a fixed vowel-consonant structure [9]. They concluded that the correlation levels between facial (jaw and lips) and acoustic (LSP) features are higher during vowels than during consonants. One common aspect in all these works, however, is that the data used provide only sparse information about the facial area, with relatively few markers on the subject's face (about 20).

With advances in data acquisition capabilities, probabilistic frameworks such as Hidden Markov Models (HMMs) and Dynamic Bayesian Networks (DBNs) have been successfully used to model both speech and facial expressions. Recently, a variety of these probabilistic frameworks have been proposed to jointly model facial expressions and speech for applications such as audio-visual emotion recognition [191], audio-visual speech recognition [151], user modeling [40, 106], and facial animation [120]. We believe that these statistical learning frameworks are attractive schemes to capture the temporal relationship between gestures and speech in the presence of emotions.

Another important aspect that influences the relation between speech and gestures is the emotion conveyed by the speakers. Since each communication channel can be greatly modulated under different emotions, it is expected that their relation will also be emotion-dependent. Many previous studies have shown that speech is colored by emotional effects [44, 156, 187]. Emotions influence not only supra-segmental characteristics of the speech (prosody and energy), but also short-time spectral envelope features such as Mel-Frequency Cepstrum Coefficients (MFCCs) [187]. The face is also highly affected by the affective state of the speaker. For instance, the group led by Ekman has extensively analyzed the relation between facial expressions and emotions. After studying apex poses of expressive faces, they concluded that specific facial patterns are displayed under certain families of emotions [65, 69]. The effect of emotions on the orofacial area has been especially analyzed for realistic facial animation. Nordstrand et al. studied the effect of emotions on the shape of the lips for vowels. They concluded that there are significant differences between the patterns presented in neutral and emotional speech [139]. Similar inferences were also reported in [25]. As a direct consequence of these results, emotion-dependent models to synchronize lip movements with speech have been developed for human-like facial animations [12].
Other facial gestures such as head motion (Section 5.4) and eyebrow motion [64] also show strong differences when the affective state of the speaker changes. Although these communication channels have been separately studied under different emotions, relatively few efforts have focused on the influence of emotion on the relation between these communication channels. This section explores the influence of emotions on the relation between facial gestures and speech.

3.1.3 Methodology

The approach followed in this study to analyze the relation between speech and facial expression is to estimate the Pearson's correlation between the original facial expression signal, F_Facial, and a predicted signal, F̂_Facial, estimated from speech acoustic features using a linear mapping. This approach is similar to the method presented by Yehia et al. in [186]. Notice that it is also possible to estimate the acoustic features based on facial expression, as presented in [9, 95, 186], which could be very useful for applications such as audio-visual speech recognition. However, for simplicity of analysis we implemented only the unidirectional speech-to-gesture approach.

Voiced speech production is usually modeled as a quasi-periodic source signal that excites the vocal-tract transfer function; unvoiced sounds are modeled with a noise source excitation. The excitation models both the air exhaled from the lungs through the trachea and pharynx, and the vocal cords, which adjust their tension to create the oscillatory signal. The vocal-tract transfer function models the pharyngeal, oral and nasal cavities, which carries much of the phonetic information of the speech. Based on this broad description, two different sets of acoustic features can be defined: prosodic features, which capture the tonal and rhythmic aspects of the speech contained in the source, and vocal-tract acoustic features, which model the time-varying vocal-tract transfer function [52]. In this section, pitch and energy were used as prosodic features, and MFCCs were used as vocal-tract features. We chose MFCCs rather than other short-time spectral envelope features because our preliminary results showed that MFCCs have higher correlation with facial gestures. Furthermore, unlike LSP features, which rely on a parametric all-pole model, MFCCs can handle zeros in the nasal sound spectrum, since they directly model the signal spectrum. Therefore, the use of MFCCs may improve the mapping during nasal phonemes, as discussed in Section 3.1.5.

To analyze facial gestures, the face is divided into Voronoi cells centered on the facial markers. Each marker provides local information about the subject's facial movements. Furthermore, rigid head motion was included as part of the facial features. The shape of the lips and eyebrows, which are among the most distinguishable parts of the face, was also considered in this study. These features were parameterized using specific facial markers, as explained in Section 3.1.3.1. The complete set of markers, head motion, eyebrow and lip features is referred to from here on as the facial features. In this section, the FMCD database described in Section 2.1 is used for the analysis.

3.1.3.1 Feature extraction

The pitch (F0), the RMS energy and 13 MFCC coefficients were extracted using the Praat speech processing software [14]. The analysis window was set to 25 milliseconds with an overlap of 8.3 milliseconds, producing 60 frames per second.
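As a concrete reference point, the extraction step can be approximated in a few lines of Python. This is only a minimal sketch, not the original pipeline: the thesis used Praat [14] (and Matlab for the later analysis), whereas the sketch relies on the librosa library, and it omits the smoothing, interpolation, derivative and PCA post-processing described next. All function names and defaults other than the 25 ms window and the 60 frames-per-second rate are illustrative.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path, sr=16000, fps=60, win_ms=25):
    """Approximate pitch (F0), RMS energy and 13 MFCCs at ~60 frames/s."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(round(sr / fps))                # ~16.7 ms shift -> 60 frames/s
    win = int(round(sr * win_ms / 1000.0))    # 25 ms analysis window

    # Prosodic stream: F0 and RMS energy (unvoiced F0 frames left as 0 here).
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)
    energy = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)[0]

    # Vocal-tract stream: 13 MFCCs (the first coefficient is discarded later).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)

    n = min(len(f0), len(energy), mfcc.shape[1])   # align frame counts
    return np.vstack([f0[:n], energy[:n]]).T, mfcc[:, :n].T
```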
The smoothing and interpolation options of the Praat software were applied to remove spurious spikes in the pitch estimates and to avoid zeros in unvoiced/silence regions, respectively. In addition, the first and second derivatives of the pitch and energy were added to the prosodic feature vector to incorporate their temporal dynamics. As shown in [9, 94], these dynamic features improve the correlation levels of the audio-visual feature mapping. The first coefficient of the MFCCs was removed, since it provides information about the RMS energy rather than the vocal-tract configuration. The velocity and acceleration coefficients were also included as features. The dimension of this feature vector was reduced from 36 to 12 using Principal Component Analysis (PCA). This number was chosen to retain 95% of the variance of the MFCC features. This post-processed feature vector is what will be referred to from here on as the MFCCs.

After compensating for the translation and rotation effects using the technique described in Section 2.2.5.4, the remaining motion between frames corresponds to local displacements of the markers, which define the subject's facial expressions. To synchronize the frames of the acoustic features with the frames of the facial features, the marker data was downsampled from 120 to 60 frames per second.

Figure 3.1: Face parameterization. (a) the figure shows the facial marker subdivision (upper, middle and lower face regions), (b) the figure shows the head motion features, and (c) the figure shows the lip features.

In the analyses, each of the facial markers, except the reference nose marker, was used as a facial feature. The markers were grouped into three main areas: upper, middle and lower face regions (Figure 3.1 (a)). The upper face region includes the markers above the eyes in the forehead and brow area. As Ekman et al. observed, this facial area is strongly shaped by the emotion conveyed in the facial expression [69]. The lower face region groups the markers below the upper lip, including the mouth and jaw. As discussed in Section 3.1.4.2, this area is modulated not only by the emotional content, but also by the articulatory processes. Finally, the middle face region contains the markers in the facial area between the upper and lower face regions (cheeks). This subdivision will be used to summarize the aggregated results in the tables presented in Sections 3.1.4 and 3.1.5.

In addition to the aggregated facial region features, parametric features describing the head, eyebrow and lip motion were also analyzed. Low-dimensional features were selected that capture articulatory information (especially for the lips) and that are affected by emotional modulation. The matrix R_t (Equation 2.2) defines the three Euler angles of the rigid head motion, which are added as visual features (Figure 3.1 (b)). Furthermore, the eyebrows were parameterized with a two-dimensional feature vector, computed by subtracting the position of two chosen markers in the right eyebrow from a neutral pose. After that, the vector was normalized to the range 0 to 1. Notice that right and left eyebrow motions are assumed symmetrical. Although Cavé et al. suggested that there can be some differences between the magnitudes of the two eyebrows' motions [31], this symmetry assumption is in general reasonable (especially for the subject analyzed). The lip features describe the width and the height of the opening area of the mouth.
The lip features were computed by measuring the Euclidean distance between the markers shown in Figure 3.1 (c). This two-dimensional feature vector relates to three articulatory parameters that describe the shape of the lips: upper lip height (ULH), lower lip height (LLH) and lip width (LW).

3.1.3.2 Audio-visual mapping framework

In this study, our goal is to analyze the temporal relation between facial gestures and speech. We want to measure which areas of the face are shaped or modified by the articulatory and prosodic aspects of the speech. For this purpose, Pearson's correlation was chosen as the measure to discern the relationship between the acoustic and facial features. Correlation provides a solid mathematical tool to infer and quantify how connected or disconnected different streams of data are. Its results are easy to interpret, and no probability density functions need to be estimated, as in mutual information calculations between feature streams.

In general, the speech and the facial features span different spaces with dissimilar dimensions and scales. Therefore, preprocessing steps need to be implemented before computing Pearson's correlation. The complete framework proposed in this section is depicted in Figure 3.2.

Figure 3.2: Linear estimation framework to quantify the level of coupling between facial gestures and acoustic features. AMMSE is used to map the acoustic features into the facial feature space. A filtered version of this estimated signal, F̂_Facial, is used to measure the Pearson's correlation.

The first step after extracting the features is to map the acoustic features into the facial feature space. In this section, we used the Affine Minimum Mean Square Error estimator (AMMSE), which is defined by a transformation matrix, T, and a translation vector, M,

\hat{F} = T F_{Speech} + M    (3.1)

T = K_{FS} K_{S}^{-1}    (3.2)

M = -K_{FS} K_{S}^{-1} U_{S} + U_{F}    (3.3)

where K_{FS} is the cross-covariance matrix between the facial and the acoustic features, K_{S} and U_{S} are the covariance matrix and mean vector of the speech features, and U_{F} is the mean vector of the facial features. This is the optimum linear estimator in the mean-square sense. We chose AMMSE because it is simple and has fewer overfitting problems than non-linear techniques. In addition, linear estimation has shown better generalization than non-linear mappings in related studies [175]. Notice that in some applications, such as facial animation, other mapping techniques such as Artificial Neural Networks (ANNs) or HMMs could give better results than AMMSE [9, 120, 175, 185].

After the acoustic features are mapped onto the facial feature space, a moving average (MA) window is applied to smooth the estimated signal, producing F̂_Facial. The final step is to compute the Pearson's correlation between the estimated and real facial features, in each dimension of the facial feature vector. The code was implemented for off-line processing in Matlab. Since the models are linear, the computational requirements are not as high as those of more sophisticated time series modeling frameworks such as HMMs and DBNs.

Two implementations of this framework are presented. In the first implementation, the target facial features are known and the parameters of the mapping (T_u, M_u) are estimated for each sentence u. This implementation is referred to as sentence-level mapping.
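Concretely, Equations 3.1-3.3 and the correlation step of Figure 3.2 amount to a few lines of linear algebra. The sketch below is a NumPy illustration of the sentence-level variant rather than the original Matlab implementation; the moving-average window length and all function names are illustrative choices, not values from the text.

```python
import numpy as np

def ammse_fit(F_speech, F_facial):
    """Affine MMSE estimator (Eqs. 3.1-3.3).
    F_speech: (frames, d_s) acoustic features; F_facial: (frames, d_f)."""
    U_s, U_f = F_speech.mean(axis=0), F_facial.mean(axis=0)
    Xs, Xf = F_speech - U_s, F_facial - U_f
    K_s = Xs.T @ Xs / len(Xs)           # covariance of the speech features
    K_fs = Xf.T @ Xs / len(Xs)          # cross-covariance facial/speech
    T = K_fs @ np.linalg.pinv(K_s)      # Eq. 3.2
    M = U_f - T @ U_s                   # Eq. 3.3
    return T, M

def correlate(F_speech, F_facial, T, M, ma_win=5):
    """Map speech into the facial space, smooth with an MA filter and
    return the per-dimension Pearson correlation (Figure 3.2)."""
    F_hat = F_speech @ T.T + M          # Eq. 3.1, applied frame by frame
    kernel = np.ones(ma_win) / ma_win
    F_hat = np.apply_along_axis(np.convolve, 0, F_hat, kernel, mode="same")
    return np.array([np.corrcoef(F_hat[:, d], F_facial[:, d])[0, 1]
                     for d in range(F_facial.shape[1])])
```

In this sketch, ammse_fit is applied per utterance for the sentence-level mapping described above; pooling the frames of all sentences within an emotional category before fitting would yield the global-level parameters (T_emo, M_emo) introduced next.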
The second implementation of this framework estimates the mapping parameters from the entire database. In this case, the same parameters (T_emo, M_emo) for each emotional category are used to estimate every sentence. This implementation is referred to as global-level mapping. Both implementations provide useful information for different applications, as discussed in the next section.

3.1.4 Results of the Audio-Visual Mapping

In this section, the facial and acoustic features are jointly analyzed in terms of the emotional categories considered in this work. Before studying the relation between facial expressions and speech, Section 3.1.4.1 discusses the activeness of the facial features during speech. Knowing the motion rate of facial gestures during emotional speech is important to correctly interpret the results presented in this section. Following the methodology described in Section 3.1.3.2, Sections 3.1.4.2 and 3.1.4.3 analyze the correlation between the original and estimated facial features for sentence-level and global-level mapping, respectively. Section 3.1.4.4 discusses the structure of the audio-visual mappings by studying the eigenspaces of the mapping parameters (Equation 3.2).

3.1.4.1 Facial activeness during speech

During speech, some facial areas are naturally more active than others. This section explores the motion rate of facial gestures in terms of the articulatory and emotional aspects of human communication. For each sentence, the displacement coefficient \Delta_u, described in Equation 3.4, was calculated to measure the activeness of the facial features. This coefficient is computed as the average Euclidean distance between the facial features and their mean vector, at the sentence level:

\Delta_u = \frac{1}{N_u} \sum_{i=1}^{N_u} D_{eq}(\vec{X}_i^u, \vec{\mu}_u)    (3.4)

where N_u is the number of frames in sentence u, \vec{\mu}_u is the mean vector, and D_{eq} is the Euclidean distance, defined as:

D_{eq}(\vec{X}, \vec{Y}) = \sqrt{\sum_{d=1}^{D} (x_d - y_d)^2}    (3.5)

where D is the dimension of the facial features. The average displacement coefficient, \bar{\Delta}, is obtained by computing the mean of the displacement coefficients across the N utterances of each emotional class:

\bar{\Delta} = \frac{1}{N} \sum_{u=1}^{N} \Delta_u    (3.6)

Notice that the acoustic features are not used in this analysis, since \Delta_u is completely defined by the facial features. This coefficient provides a global-level description of the activeness of facial areas and gestures.

The results for the displacement coefficient are presented in Tables 3.1 and 3.2 and in Figure 3.3. Table 3.1 presents the average displacement coefficient, \bar{\Delta}, for the facial features related to head, eyebrow and lip motion, in terms of emotional categories. In addition, Table 3.1 summarizes the average activeness of the markers in the three facial areas described in Section 3.1.3.1: upper, middle and lower face regions. To infer whether the emotional differences in the results presented in Table 3.1 are significant, one-way Analysis of Variance (ANOVA) tests were performed. For head (F[3,622], p = 0.000), eyebrow (F[3,622], p = 0.000) and lip motion (F[3,619], p = 0.000), Table 3.2 provides the p-values for the multiple comparison tests. The same statistical test was individually applied to each marker. Instead of reporting the p-value for each facial marker, Table 3.2 gives the percentage of the markers in each facial region for which the differences were found significant, using a 95% confidence interval (p ≤ 0.05).
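A minimal NumPy/SciPy sketch of Equations 3.4-3.6 and of the significance test just described is given below. It is illustrative only: the function and variable names are hypothetical, and scipy.stats.f_oneway stands in for the ANOVA and multiple-comparison machinery used in the thesis.

```python
import numpy as np
from scipy.stats import f_oneway

def displacement_coefficient(F_facial):
    """Eq. 3.4: mean Euclidean distance of each frame from the
    sentence-level mean vector of the facial features."""
    mu = F_facial.mean(axis=0)
    return float(np.mean(np.linalg.norm(F_facial - mu, axis=1)))

def average_displacement(sentences):
    """Eq. 3.6: average of the per-sentence coefficients for one
    emotional class (sentences: list of (frames, D) arrays)."""
    return float(np.mean([displacement_coefficient(s) for s in sentences]))

def anova_over_emotions(coeffs_by_emotion):
    """One-way ANOVA over the per-sentence coefficients of each emotion,
    e.g. {'neu': [...], 'sad': [...], 'hap': [...], 'ang': [...]}."""
    return f_oneway(*coeffs_by_emotion.values())
```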
Figure 3.3 shows a visual representation of facial area activeness per emotional category. This figure was created by computing the average displacement coefficient for each facial marker. The coefficients were then normalized across emotions to the range between 0 and 1. After that, gray-scale colors were assigned to each marker according to the palette shown in Figure 3.4. Finally, the Voronoi cells centered on the markers were colored according to the gray-scale value assigned to the markers.

Figure 3.3 and Table 3.1 show that during speech the lower face region, specifically the jaw, is the most active area of the face. This result confirms the important role of articulation in the dynamic motion of facial expressions. However, when the values of the displacement coefficients associated with neutral speech are compared with those of emotional speech for each of the facial features, significant differences are found (Table 3.2). Note that the lexical content and syntactic structure of the utterances used in each emotional category were identical, since the same sentences were used for each emotion. Therefore, the inter-emotional differences shown in Figure 3.3 and Table 3.1 can be attributed primarily to the different emotional modulation of facial gestures.

Table 3.1: Average activeness of facial features during emotional speech (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger)

Facial Area          Neu    Sad    Hap    Ang
Head Motion [deg]    2.30   4.52   5.05   5.00
Eyebrow              0.05   0.07   0.12   0.12
Lips                 4.69   3.68   6.24   6.94
Upper region         0.72   0.85   1.51   1.37
Middle region        0.92   0.90   1.43   1.52
Lower region         3.24   2.49   4.20   4.47

Table 3.2: Statistical significance of inter-emotion activation differences (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger)

Facial Area      Neu-Sad   Neu-Hap   Neu-Ang   Sad-Hap   Sad-Ang   Hap-Ang
Head Motion      0.000     0.000     0.000     0.176     0.202     1.000
Eyebrow          0.000     0.000     0.000     0.000     0.000     1.000
Lips             0.000     0.000     0.000     0.001     0.000     0.000
Upper region     34.48%    100.00%   100.00%   100.00%   100.00%   58.62%
Middle region    25.64%    100.00%   100.00%   100.00%   100.00%   48.72%
Lower region     100.00%   100.00%   100.00%   100.00%   100.00%   41.38%

In agreement with previous works [25, 69, 139], the results presented here show that the lower face region conveys important emotional clues. Table 3.1 and Figure 3.3 indicate that the inter-emotional differences in the average displacement coefficient of the markers are significant, with the exception of the pair happiness-anger, for which only 41.38% of the markers showed significant differences (see Table 3.2). Notice that for happy and angry sentences the lower face region is more than 30% more active than in neutral speech.

Figure 3.3: Facial activeness during speech ((a) Neutral, (b) Sad, (c) Happy, (d) Angry). The figures show that during speech the lower face region is the most active area of the face. They also show inter-emotional differences: during happiness and anger, the activeness of the face is higher than during the neutral state; conversely, during sadness the activeness of the face decreases.

In the upper and lower face regions, the results in Tables 3.1 and 3.2 show that two emotional clusters are clearly grouped: happiness-anger and sadness-neutral. This result is also observed for the displacement coefficient of the head and eyebrow motion. As an aside, it is interesting to note that similar grouping trends were observed in the analysis of acoustic speech features presented by Yildirim et al. [187], suggesting that these emotional categories share similarities across modalities.
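For reference, a rendering in the spirit of Figures 3.3 and 3.4 can be produced with SciPy's Voronoi tessellation and Matplotlib. The snippet below is only a sketch of the plotting idea under the description above (2-D marker coordinates, per-marker values already normalized to [0, 1]); it does not reproduce the actual figures, and all names are illustrative.

```python
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

def shade_voronoi_cells(marker_xy, values, ax=None):
    """Shade the Voronoi cell of each facial marker: darker = higher value
    (activeness here, correlation in later figures)."""
    vor = Voronoi(marker_xy)          # marker_xy: (n_markers, 2) positions
    ax = ax if ax is not None else plt.gca()
    for i, v in enumerate(values):    # values: per-marker scalars in [0, 1]
        region = vor.regions[vor.point_region[i]]
        if region and -1 not in region:           # skip unbounded border cells
            poly = vor.vertices[region]
            ax.fill(poly[:, 0], poly[:, 1], color=plt.cm.gray(1.0 - v))
    ax.set_aspect("equal")
    ax.set_axis_off()
    return ax
```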
The average activeness of sad and neutral sentences is very similar in the upper and middle face regions (see Figure 3.3), suggesting that those areas of the face are not modified during sad sentences. Notice that this result does not disagree with previous works, which have indicated that facial poses for sadness present significant differences in those areas compared to neutral poses (the inner corners of the eyebrows and the cheeks are raised [65, 69]), since the displacement coefficient measures the average variance of facial gestures rather than the pose of the face itself (the mean vector is removed in Equation 3.4). Conversely, in happy and angry sentences, the upper and middle face regions are over 60% more active than in neutral sentences, showing the differences in emotional modulation in those areas.

Figure 3.4: Palette used in the plots. Darker shadings imply higher activeness (Figure 3.3) or higher correlation (Figures 3.5 and 3.6).

In general, similar trends were observed in the displacement coefficients for head, eyebrow, and lip motion. However, some important results are worth highlighting. The activeness of head motion for sad and neutral sentences is significantly different. Furthermore, the difference between the average displacement coefficients of angry and happy utterances for the lip features is also significant. These results suggest that this coefficient could be used to discriminate between these pairs of emotional categories.

3.1.4.2 Sentence-Level mapping

In the previous section, the facial gestures were analyzed without considering acoustic features. Since gestures and speech are strongly interconnected, a deeper analysis of facial expressions needs to consider acoustic features. In this section, the audio-visual mapping framework presented in Section 3.1.3.2 is used to shed light on the underlying relations between acoustic and facial features. The parameters of the linear transformation are calculated at the sentence level, assuming that the acoustic and facial features are known. Although this is clearly not useful for an application such as animation, in which the facial features are unknown, the results presented here are important for a better understanding of the coupling between speech and facial gestures. Also, areas such as multimodal emotion recognition can benefit from understanding the relation between the modalities. In this section we are specifically interested in studying whether the correlation between the original and estimated facial features is affected by the emotional and linguistic content (lexical, syntactic) of the sentences.

Figure 3.5: Correlation results for Sentence-Level mapping. Panels (a)-(d) show prosodic features and panels (e)-(h) show vocal-tract features, each for neutral, sad, happy and angry speech. The figure shows high levels of correlation between the original and estimated facial features, especially when MFCCs are used to estimate the mapping. The figures also suggest that the link between acoustic and facial features is influenced by the emotional content in the utterance.
Table 3.3: Summary Correlation for Sentence-Level mapping (N = neutral, S = sadness, H = happiness, A = anger)

                     Prosodic                     MFCC
Facial Area      N      S      H      A       N      S      H      A
Head Motion      0.60   0.62   0.57   0.57    0.86   0.89   0.88   0.87
Eyebrow          0.52   0.54   0.53   0.53    0.76   0.80   0.85   0.85
Lips             0.57   0.60   0.57   0.57    0.88   0.89   0.90   0.89
Upper region     0.53   0.56   0.55   0.53    0.80   0.83   0.85   0.84
Middle region    0.51   0.56   0.54   0.53    0.82   0.84   0.86   0.85
Lower region     0.53   0.57   0.55   0.54    0.87   0.87   0.88   0.87

Table 3.4: Statistical significance of inter-emotion differences in correlation for Sentence-level mapping (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger)

Facial Area      Neu-Sad   Neu-Hap   Neu-Ang   Sad-Hap   Sad-Ang   Hap-Ang
Prosodic features
Head Motion      0.444     0.332     0.061     0.005     0.000     1.000
Eyebrow          0.916     1.000     1.000     1.000     1.000     1.000
Lips             0.894     1.000     1.000     0.890     0.880     1.000
Upper region     51.72%    27.59%    10.34%    0.00%     55.17%    0.00%
Middle region    100.00%   35.90%    30.77%    17.95%    53.85%    2.56%
Lower region     89.66%    34.48%    3.45%     6.90%     37.93%    13.79%
MFCC features
Head Motion      0.012     0.098     1.000     1.000     0.221     0.743
Eyebrow          0.000     0.000     0.000     0.000     0.000     1.000
Lips             1.000     0.180     1.000     0.603     1.000     1.000
Upper region     96.55%    100.00%   100.00%   51.72%    6.90%     6.90%
Middle region    38.46%    53.85%    56.41%    30.77%    23.08%    5.13%
Lower region     0.00%     13.79%    10.34%    6.90%     6.90%     0.00%

Table 3.3 and Figure 3.5 show the results of the correlation at the sentence level. Table 3.3 presents the average correlation between the facial features and the signals separately estimated with prosodic and MFCC features, in terms of emotional categories. Figure 3.5 shows a graphical representation of these results. This figure was created following the same steps described in Section 3.1.4.1 for Figure 3.3; here, the correlation of each marker, without normalization, was used to assign the gray-scale color from Figure 3.4 to the Voronoi cells.

Table 3.3 and Figure 3.5 show high levels of correlation between the original and estimated facial features, which agree with the hypothesis that the production of facial
One explanation is that facial features with relatively low motion activity tend to have smaller correlation levels, as suggested in [186]. As mentioned in Section 3.1.4.1, the lower face region has higher motion activity than the upper and middle face region, and consequently, it presents higher correlation. Up to this point, all the results presented in this subsection are observed across emotional categories. Another interesting question is whether those correlation levels between acoustic and facial features are emotion-type dependent. To answer this ques- tion, Table 3.4 provides statistical signicance measures of inter-emotion dierences between the results presented in Table 3.3. Similar to Table 3.2, Table 3.4 shows the 62 statistical results for multiple comparison tests across emotional categories, using a 95% condence interval. For head, eyebrow and lip motion, the p-values are given. For the aggregated facial regions, the percentage of the markers with signicant dierences at = 0:05 within each facial region is given. Table 3.4 shows that there are signicant inter-emotional dierences in the corre- lation levels between the original and estimated facial features presented in Table 3.3. This result suggests that the link between acoustic and facial features is in uenced by the emotional content in the utterance. As discussed in Section 3.1.4.1, the facial active- ness changes under emotional speech. Likewise, acoustic features such as pitch, energy and spectral envelope vary as function of emotions [44, 156]. These observations suggest that the audio-visual coupling presented here is also aected by this jointly emotional modulation. When MFCC features are used to estimate the mapping parameters, Tables 3.3 and 3.4 show that the upper face region presents signicant inter-emotional dierence in the correlation levels between the original and the estimated facial features. In this region, the results for neutral speech are statistically dierent from the results of any other emotional category (see Figure 3.5). However, in the middle and lower face region there are fewer markers with statistically signicant inter-emotional dierences. It is interesting to notice that facial areas that are less connected with the articulatory process, such as eyebrow and forehead (see Figure 3.3), present higher levels of inter- emotional dierences. In those cases, the correlation levels for emotional utterances are higher than those with neutral speech. When prosodic features are used to estimate the mapping parameters, the re- sults show that the lower and especially the middle face region present stronger inter- emotional dierences in the correlation levels, compared with the upper face region. 63 The results also show that the correlation levels for emotional utterances are slightly higher than the ones for neutral speech. In the case of lip features, the results indicate that there is no evidence to reject the null hypothesis that the correlation levels are similar across emotional categories. This result is observed when either MFCCs or prosodic features are used to estimate the mapping parameters. The fact that the emotional aspects do not signicantly aect the coupling between speech and lip motion suggests that this link is mainly controlled by the articulation. This observation agrees with the results presented in Section 3.1.4.1, which show that the relative dierence in the displacement coecient for the lips between neutral and emotional speech is lower than in other facial areas. 
However, for other facial gestures the results indicate that the emotional content does affect the relationship between facial gestures and speech.

3.1.4.3 Global-level Mapping

In the previous section, the mapping parameters between acoustic and facial features were computed at the sentence level. This section extends those results to the case in which a single set of generic parameters (T_emo, M_emo) is computed across sentences (Equation 3.1). Since the level of correlation depends on the affective state of the speaker, as shown in Section 3.1.4.2, separate mapping parameters were calculated for each emotional category. Notice that speech is considerably easier and cheaper to collect than any of the facial gestures considered here. Therefore, it is very convenient for applications such as facial animation to have reliable procedures to estimate facial features from speech.

Table 3.5 and Figure 3.6 show the average levels of correlation observed with global-level mapping. These results show that the correlation significantly decreases compared to the results with sentence-level mapping presented in the previous section (see Table 3.3 and Figure 3.5). This result is observed when either MFCCs or prosodic features are used to estimate the generic parameters T_emo and M_emo. Vatikiotis-Bateson and Yehia have also reported similar results [175].

Table 3.5: Summary Correlation for Global-level Mapping (N = neutral, S = sadness, H = happiness, A = anger)

                     Prosodic                     MFCC
Facial Area      N      S      H      A       N      S      H      A
Head Motion      0.14   0.13   0.06   0.14    0.16   0.14   0.15   0.12
Eyebrow          0.18   0.05   0.01  -0.02    0.34   0.21   0.25   0.06
Lips             0.25   0.16   0.22   0.21    0.54   0.46   0.55   0.46
Upper region     0.15   0.07   0.10   0.02    0.28   0.15   0.32   0.13
Middle region    0.16   0.11   0.12   0.10    0.46   0.33   0.34   0.34
Lower region     0.21   0.18   0.20   0.17    0.58   0.49   0.51   0.51

Table 3.6: Statistical significance of inter-emotion differences in correlation for Global-level mapping (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger)

Facial Area      Neu-Sad   Neu-Hap   Neu-Ang   Sad-Hap   Sad-Ang   Hap-Ang
Prosodic features
Head Motion      1.000     0.030     1.000     0.078     1.000     0.071
Eyebrow          0.000     0.000     0.000     1.000     0.071     1.000
Lips             0.005     0.905     0.751     0.500     0.751     1.000
Upper region     58.62%    48.28%    86.21%    44.83%    20.69%    55.17%
Middle region    58.97%    43.59%    66.67%    33.33%    12.82%    41.03%
Lower region     37.93%    48.28%    44.83%    37.93%    0.00%     27.59%
MFCC features
Head Motion      1.000     1.000     0.543     1.000     1.000     1.000
Eyebrow          0.000     0.002     0.000     0.987     0.000     0.000
Lips             0.001     1.000     0.003     0.001     1.000     0.003
Upper region     96.55%    34.48%    96.55%    100.00%   13.79%    100.00%
Middle region    94.87%    84.62%    89.74%    56.41%    35.90%    56.41%
Lower region     96.55%    75.86%    75.86%    34.48%    17.24%    48.28%

Similar to the sentence-level mapping section, Table 3.5 shows that the MFCC-based estimated facial features present higher levels of correlation than the prosody-based estimated facial features. In fact, in some cases of the latter the correlation levels are close to zero (e.g., the eyebrow features for emotional utterances).

Figure 3.6: Correlation results for Global-level mapping. Panels (a)-(d) show prosodic features and panels (e)-(h) show vocal-tract features, each for neutral, sad, happy and angry speech. The figures show that the correlation levels significantly decrease compared with the results at the sentence level. MFCC-based estimated facial features also present higher correlation levels than when prosodic features are used. The lower face region presents the highest correlation levels in the face.
This is probably because the link between some facial gestures and prosodic features, especially for emotional utterances, varies from sentence to sentence and is not preserved after estimating the mapping parameters over the entire dataset.

The lower face region presents the highest correlation when either of the acoustic feature sets is used to estimate the facial features. As shown in Table 3.5 and Figure 3.6, the levels of correlation decrease in the upper and middle face regions. These differences are also observed between lip and eyebrow features, with the correlation levels for lip features higher than for eyebrow features across emotional categories.

Similar to Table 3.4, Table 3.6 presents details of the statistical significance of inter-emotion differences in the correlation levels presented in Table 3.5. The results indicate that there are strong emotional dependencies in those results. In general, it is interesting to notice that the correlation levels for neutral speech are higher than for any other emotional category. This result is in opposition to the results observed with sentence-level mapping, in which the correlation levels for neutral speech are equal to or lower than those of the other emotional categories. This result suggests that the coupling between facial gestures and emotional speech has a more complex structure than for neutral speech, which is not preserved when a single set of parameters is used to estimate the linear mapping.

The fact that the levels of correlation decrease when global-level mapping is used indicates that the coupling between facial gestures and speech may change from sentence to sentence, depending on the underlying linguistic structure, as noted by Vatikiotis-Bateson and Yehia [175]. Since the parameters are averaged across sentences, the articulatory effects on the facial gestures are no longer reflected in the mapping. The high levels of correlation observed when MFCC features were used to estimate the parameters at the sentence level support this hypothesis. MFCC features model the configuration of the vocal tract, which shapes the appearance of the face [186]. Therefore, when the linguistic structure is blurred, the correlation levels significantly decrease. This hypothesis is addressed further in Section 3.1.4.4.

3.1.4.4 Analysis of Mapping Parameters

In Equation 3.1, the linear coupling between facial and acoustic features is mainly expressed through the parameter T, which is computed from the cross-covariance between facial and acoustic features, K_FS (Equation 3.2). The structure of this parameter provides further insights into the relation between facial gestures and speech. This section presents a detailed analysis of the parameter T when sentence-level mapping is used to estimate Equation 3.1.

The approach used to study the structure of T is based on PCA. In this technique, a reduced orthonormal subspace is selected such that it spans most of the variance of the multidimensional data. This subspace is formed from the subset of eigenvectors of the covariance matrix of the data associated with the highest eigenvalues. If the data is relatively clustered, only a few eigenvectors will be needed to approximate the data. Therefore, by studying the eigenvectors of the covariance matrix of the parameter T, useful inferences can be drawn about the complexity of the assumed linear mapping between facial and acoustic features.

The parameter T is an n × m matrix whose dimensions depend on the acoustic and facial features used to compute the mapping (Equation 3.2).
This matrix T is reshaped into an nm × 1 vector, \vec{t}. The covariance matrix of this vector, K_{\vec{t}}, is approximated as in Equation 3.7, in which \vec{t}_u is the mean-removed vector associated with sentence u:

K_{\vec{t}} = [\vec{t}_1 \ldots \vec{t}_N][\vec{t}_1 \ldots \vec{t}_N]^T    (3.7)

After computing the eigenvalues (\lambda_j) and eigenvectors (\vec{e}_j) of K_{\vec{t}}, each parameter \vec{t}_u can be expressed without error as a linear combination of the eigenvectors (Equation 3.8 with P = nm). Assuming that the eigenvalues are sorted in descending order, \vec{t}_u can also be approximated with only P eigenvectors (P < nm),

\hat{\vec{t}}_u(P) = \bar{t} + \sum_{j=1}^{P} \langle \vec{t}_u, \vec{e}_j \rangle \vec{e}_j    (3.8)

where \bar{t} is the vector mean of \vec{t}.

Table 3.7 gives the fraction of eigenvectors (P) needed to span 90% or more of the variance of the parameters \vec{t} for the head, eyebrow and lip features. The same procedure was also applied to each marker; the average results within each facial area are presented in Table 3.7. The results are presented for each emotional category, in which only the parameters \vec{t}_u of the corresponding emotional sentences were used to estimate K_{\vec{t}} (Equation 3.7). In addition, the parameters \vec{t}_u were concatenated across emotional categories to estimate an emotion-independent covariance matrix. This procedure provides information for inferring whether the structure of T depends on the emotional content of the utterance. These results are presented in Table 3.7 under the letter G (global). Notice that the dimension of \vec{t} varies when different facial and acoustic features are used to estimate the mapping. Therefore, the fraction of eigenvectors needed to cover 90% of the variance is easier to compare across facial features than the raw dimension P of this reduced subspace.

Table 3.7: Fraction of eigenvectors used to span 90% or more of the variance of the parameter T (N = neutral, S = sadness, H = happiness, A = anger, G = global)

                          Prosodic                         MFCC
Facial Area    N     S     H     A     G        N     S     H     A     G
Head           0.22  0.33  0.28  0.28  0.28     0.36  0.47  0.44  0.53  0.56
Eyebrow        0.33  0.33  0.25  0.25  0.33     0.50  0.46  0.38  0.38  0.46
Lips           0.33  0.42  0.33  0.33  0.42     0.46  0.54  0.50  0.50  0.54
Upper          0.26  0.33  0.27  0.30  0.31     0.44  0.45  0.45  0.46  0.51
Middle         0.30  0.29  0.29  0.30  0.32     0.43  0.40  0.42  0.39  0.48
Lower          0.28  0.29  0.26  0.27  0.30     0.41  0.43  0.36  0.39  0.44

Table 3.7 indicates that when prosodic features are used to estimate the mapping parameters, between 22% and 42% of the eigenvectors of K_{\vec{t}} are needed to cover 90% or more of the variance of T. When the mapping is based on MFCC features, between 36% and 54% of the eigenvectors are required to span this reduced subspace. These results suggest that the parameters \vec{t} are clustered in a reduced subspace, showing a defined structure. Since the reduced subspaces for MFCC-based parameters have larger dimensions than those for prosodic features, it can be inferred that their structures are more difficult to model. Therefore, the mapping between prosodic features and facial gestures may be easier to generalize across sentences than the mapping between MFCCs and facial features.

As can be observed in Table 3.7, the percentage of eigenvectors of the emotion-independent covariance matrix needed to span 90% or more of the variance of \vec{t} is generally higher than the percentage of eigenvectors of the emotion-dependent covariance matrices needed to cover the same reduced subspace. These results provide further evidence that the structures of the mapping parameters depend on the emotional content of the sentences.
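The eigen-analysis of Equations 3.7 and 3.8 can be sketched in a few lines of NumPy. The block below is illustrative only: the stacking of the per-sentence matrices T_u, the descending ordering of the eigenvalues and the 90% variance threshold follow the text, while the function names and interfaces are hypothetical.

```python
import numpy as np

def eigen_structure(T_list, var_threshold=0.90):
    """T_list: per-sentence mapping matrices T_u, each of size n x m.
    Returns the fraction of eigenvectors needed to reach the variance
    threshold, plus the mean vector and eigenvectors used in Eq. 3.8."""
    t = np.stack([T_u.ravel() for T_u in T_list])      # rows are vec(T_u)
    t_bar = t.mean(axis=0)
    tc = t - t_bar                                     # mean-removed t_u
    K_t = tc.T @ tc                                    # Eq. 3.7 (unnormalized)
    eigval, eigvec = np.linalg.eigh(K_t)               # ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]     # sort descending
    cum = np.cumsum(eigval) / np.sum(eigval)
    P = int(np.searchsorted(cum, var_threshold)) + 1   # eigenvectors for 90%
    return P / len(eigval), t_bar, eigvec

def approximate_T(T_u, t_bar, eigvec, P):
    """Eq. 3.8: reconstruct T_u from its projections onto the first P
    eigenvectors of K_t."""
    t_u = T_u.ravel() - t_bar
    coeffs = eigvec[:, :P].T @ t_u                     # <t_u, e_j>
    return (t_bar + eigvec[:, :P] @ coeffs).reshape(T_u.shape)
```

Re-running the correlation measurement of Section 3.1.3.2 with the reconstructed T_u for decreasing values of P yields curves of the kind shown in Figure 3.7.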
The results presented in Table 3.7 do not directly provide information about the correlation levels that will be observed when a reduced set of eigenvectors of K_{\vec{t}} is used to approximate T. To analyze this question, the sentence-level mapping framework presented in Section 3.1.3.2 was implemented to measure the correlation levels between facial and acoustic features when different numbers of eigenvectors (P) are used to approximate T (Equation 3.8). Figure 3.7 presents the results for the facial gestures considered in this section; for the upper, middle and lower face regions, the average results of the markers within each area are presented. The slopes of these curves indicate that the correlation levels slowly decrease as the number of eigenvectors used in the approximation of T decreases. Figure 3.7 also shows that when MFCC features are used to estimate the facial features, the slopes of the curves tend to be higher than those for prosodic features. These results support the hypothesis that there is a well-defined structure in the parameter T, and that this structure seems to be simpler when prosodic features are used to estimate the facial gestures.

Figure 3.7: Correlation levels as a function of the number of eigenvectors (P) used to approximate \vec{t} (Equation 3.8). The panels plot correlation against the number of eigenvectors for head motion, eyebrows, lips, and the upper, middle and lower face regions, with prosodic features in the left column and MFCC features in the right column, for each emotional category (neu, sad, hap, ang). The slopes of these curves indicate that the correlation levels slowly decrease as P decreases, supporting the hypothesis of an emotion-dependent structure in the audio-visual mappings.

When MFCCs are used to estimate the mapping, the correlation level for head motion is more affected than that of the other facial gestures when low values of P are used in Equation 3.8 (see Figure 3.7). This result indicates that the coupling between head motion and speech varies from sentence to sentence, as suggested in [185]. The same result is observed for eyebrow motion. Conversely, the correlation level for lip motion is higher than 0.4 even when only one eigenvector is used to approximate T. This result indicates not only that there is a strong link between what is said and the appearance of the orofacial area, but also that this map does not significantly change across sentences. This is not surprising given the tight coupling between the lips and the segmental speech properties, so the articulatory effects dominate under all conditions. These observations suggest that the relation between speech and some facial gestures is easier to generalize than for others. Notice that the projections of \vec{t}_u onto the eigenvectors in Equation 3.8 are unknown when the facial features are not available.
Therefore, if this approach is implemented for predicting facial gestures from speech, the terms < ~ t u ;~ e j > will have to be estimated, which may be a non-trivial problem. In sum, the results presented in this section suggest that even though the coupling between facial gestures and speech varies from sentence to sentence, there is an emotion- dependent structure in the mapping parameters across sentences that may be learned using more sophisticated non-linear estimation techniques such as HMMs or DBNs, as proposed in Chapter 5. 3.1.5 Results of Phoneme Level Analysis In the previous sections, the correlation levels were estimated for each utterance using either sentence-level, or global-level mapping parameters. This information provides general measures of the coupling between facial gestures and speech. This section ana- lyzes whether these interrelations vary as function of broad phonetic categories. Instead of computing the Pearson's correlation over entire sentences, the speech is segmented into the constituent phones, over which the correlation levels are estimated. Notice that this approach diers from the framework followed in references [9, 95]. In those works, the authors estimated the mapping parameters for short time intervals corre- sponding to the target phonemes or syllables, which were spoken as nonsense words 72 (e.g. /ma/, /cu/). Instead, we are interested in analyzing the phonetic dependency on the correlation levels when the mappings are estimated for the entire sentences. This procedure will indicate whether certain phonemes require specic attention for accurate facial feature estimation. The broad phonetic categories considered in this section are voiced/unvoiced, nasal, fricative and plosive phonemes. We also distinguish between vowels and consonant sounds. These phonetic categories share source characteristics and vocal-tract cong- urations that may in uence the relation between gestures and speech. In addition, we also consider the correlation levels during periods of acoustic silence, which are sepa- rately analyzed. Note that silence periods can be linguistically meaningful (such as in phrase or sentence boundaries) and can be accompanied by specic gestures. For each sentence, a single set of parameters (T ,M) was estimated using the sentence- level mapping framework presented in Section 3.1.3.2. Forced alignment, implemented with the HTK toolkit [189], was used to align the transcript with the speech. First, a small window centered on the target phoneme was selected in the original and estimated facial feature stream. The window was expanded in both directions as a function of the phoneme duration, to compensate for possible error in the automatic forced alignment procedure and the phase dierences between gestures and speech [87, 173]. Finally, Pearson's correlation between the selected segments was computed. This procedure was repeated for each phoneme. The results were used to estimate average values for each broad phonetic category. Table 3.8 gives the average correlation levels for the broad phonetic classes, for each emotional category, when prosodic and MFCC features were used to estimate the mapping between gestures and speech. To analyze whether the results presented in Table 3.8 are statistically signicant, two-way ANOVA tests were implemented. The dependent factors are the emotional and phonetic categories. 
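The per-phoneme correlation procedure described above can be sketched as follows. The sketch assumes that forced alignment (done with the HTK toolkit in this work) has already produced phone labels with start and end times; the symmetric window-expansion factor is an illustrative parameter, since its exact value is not specified here, and all function names are hypothetical.

```python
import numpy as np

def phoneme_level_correlation(F_facial, F_hat, phones, fps=60, expand=0.5):
    """phones: list of (label, start_sec, end_sec) from forced alignment.
    For each phone, expand the window by a fraction of its duration on both
    sides (to absorb alignment errors and audio-visual phase differences),
    then compute the per-dimension Pearson correlation within that window."""
    results = {}
    n_frames, n_dims = F_facial.shape
    for label, t0, t1 in phones:
        pad = expand * (t1 - t0)
        a = max(0, int((t0 - pad) * fps))
        b = min(n_frames, int(np.ceil((t1 + pad) * fps)))
        if b - a < 3:
            continue                      # too short to correlate reliably
        r = [np.corrcoef(F_hat[a:b, d], F_facial[a:b, d])[0, 1]
             for d in range(n_dims)]
        results.setdefault(label, []).append(np.nanmean(r))
    # Average per phone label; broad-class values (vowel, nasal, plosive, ...)
    # can then be pooled from these per-phone averages.
    return {ph: float(np.mean(v)) for ph, v in results.items()}
```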
The results show that 73 Table 3.8: Pearson's correlation for the audio-visual mapping at phoneme levels (N =neutral, S=sadness, H =happiness, A=anger) Prosodic MFCC N S H A N S H A Average all phonemes Head Motion 0.41 0.35 0.34 0.36 0.51 0.44 0.46 0.47 Eyebrow 0.39 0.37 0.34 0.35 0.47 0.45 0.50 0.50 Lips 0.54 0.52 0.49 0.51 0.74 0.70 0.71 0.75 Upper region 0.39 0.36 0.35 0.35 0.51 0.47 0.51 0.49 Middle region 0.44 0.40 0.39 0.40 0.62 0.56 0.57 0.59 Lower region 0.51 0.48 0.45 0.48 0.72 0.68 0.66 0.70 Vowels Head Motion 0.42 0.37 0.37 0.39 0.52 0.47 0.46 0.49 Eyebrow 0.38 0.39 0.35 0.37 0.47 0.46 0.51 0.53 Lips 0.53 0.51 0.49 0.55 0.74 0.72 0.73 0.75 Upper region 0.40 0.38 0.37 0.38 0.52 0.48 0.52 0.51 Middle region 0.46 0.43 0.41 0.42 0.63 0.57 0.59 0.61 Lower region 0.53 0.50 0.48 0.49 0.72 0.69 0.69 0.71 Consonant Head Motion 0.39 0.33 0.31 0.33 0.50 0.43 0.45 0.46 Eyebrow 0.40 0.37 0.32 0.34 0.47 0.45 0.48 0.48 Lips 0.54 0.54 0.48 0.50 0.75 0.69 0.70 0.75 Upper region 0.38 0.34 0.34 0.32 0.50 0.47 0.50 0.48 Middle region 0.43 0.39 0.38 0.38 0.62 0.55 0.56 0.59 Lower region 0.50 0.46 0.44 0.47 0.71 0.67 0.65 0.69 Voiced phonemes Head Motion 0.42 0.36 0.36 0.37 0.51 0.45 0.47 0.48 Eyebrow 0.39 0.38 0.36 0.36 0.47 0.46 0.51 0.50 Lips 0.54 0.52 0.49 0.53 0.75 0.72 0.72 0.76 Upper region 0.39 0.37 0.36 0.36 0.52 0.48 0.53 0.50 Middle region 0.44 0.41 0.40 0.41 0.63 0.57 0.58 0.60 Lower region 0.51 0.49 0.46 0.49 0.72 0.69 0.68 0.71 Unvoiced phonemes Head Motion 0.38 0.31 0.30 0.32 0.50 0.41 0.44 0.44 Eyebrow 0.39 0.36 0.27 0.33 0.46 0.44 0.44 0.48 Lips 0.55 0.53 0.45 0.46 0.72 0.64 0.65 0.73 Upper region 0.37 0.33 0.32 0.31 0.49 0.45 0.45 0.46 Middle region 0.43 0.37 0.36 0.37 0.61 0.53 0.51 0.56 Lower region 0.50 0.44 0.42 0.45 0.71 0.65 0.60 0.67 Nasal phonemes Head Motion 0.41 0.36 0.34 0.34 0.49 0.43 0.47 0.45 Eyebrow 0.38 0.38 0.34 0.31 0.46 0.46 0.51 0.49 Lips 0.52 0.58 0.46 0.53 0.77 0.79 0.73 0.79 Upper region 0.36 0.36 0.31 0.34 0.52 0.50 0.53 0.50 Middle region 0.40 0.42 0.35 0.37 0.64 0.57 0.58 0.60 Lower region 0.47 0.50 0.40 0.46 0.73 0.69 0.68 0.71 Plosive phonemes Head Motion 0.35 0.27 0.31 0.30 0.46 0.45 0.45 0.45 Eyebrow 0.41 0.34 0.28 0.32 0.48 0.46 0.51 0.47 Lips 0.59 0.53 0.50 0.56 0.75 0.71 0.75 0.78 Upper region 0.38 0.32 0.31 0.29 0.51 0.46 0.50 0.47 Middle region 0.46 0.39 0.41 0.42 0.64 0.56 0.57 0.56 Lower region 0.54 0.49 0.47 0.55 0.74 0.67 0.67 0.69 Fricative phonemes Head Motion 0.36 0.32 0.27 0.34 0.51 0.40 0.42 0.44 Eyebrow 0.38 0.35 0.25 0.32 0.44 0.43 0.41 0.45 Lips 0.55 0.52 0.44 0.47 0.73 0.63 0.65 0.74 Upper region 0.34 0.33 0.30 0.30 0.48 0.45 0.44 0.45 Middle region 0.43 0.37 0.32 0.36 0.61 0.53 0.49 0.58 Lower region 0.50 0.45 0.40 0.45 0.70 0.66 0.58 0.68 Silence Head Motion 0.22 0.19 0.18 0.21 0.29 0.21 0.19 0.25 Eyebrow 0.19 0.15 0.10 0.15 0.28 0.19 0.19 0.20 Lips 0.17 0.22 0.08 0.16 0.35 0.33 0.24 0.42 Upper region 0.17 0.15 0.12 0.15 0.28 0.19 0.18 0.18 Middle region 0.13 0.14 0.09 0.14 0.28 0.22 0.19 0.25 Lower region 0.16 0.16 0.12 0.16 0.35 0.31 0.26 0.35 74 Table 3.9: Statistical signicant in dierences between broad phonetic classes (Al=All, Vw=Vowels,Co=Consonant, Vo=Voiced, Uv=Unvoiced, Na=Nasal, Pl=Plosive, Fr=Fricative) Facial Area Al-Vw Al-Co Al-Vo Al-Uv Al-Na Al-Pl Al-Fr Vw-Co Vo-Uv Prosodic features Head Motion 0.16 0.01 1.00 0.00 1.00 0.00 0.01 0.00 0.00 Eyebrow 1.00 1.00 1.00 0.06 1.00 1.00 0.40 1.00 0.00 Lips 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Upper 51.7% 24.1% 0.0% 93.1% 0.0% 51.7% 100% 93.1% 100% Middle 53.9% 0.0% 0.0% 
56.4% 2.6% 20.5% 53.9% 84.6% 79.5% Lower 31.0% 3.5% 0.0% 24.1% 20.7% 51.7% 27.6% 72.4% 72.4% MFCC features Head Motion 1.00 1.00 1.00 0.22 1.00 1.00 0.73 0.72 0.02 Eyebrow 1.00 1.00 1.00 0.05 1.00 1.00 0.00 0.29 0.00 Lips 1.00 1.00 1.00 0.00 0.06 1.00 0.06 1.00 0.00 Upper 0.0% 0.0% 0.0% 100% 0.0% 0.0% 100% 24.1% 100% Middle 0.0% 0.0% 0.0% 84.6% 2.6% 0.0% 61.5% 15.4% 94.9% Lower 0.0% 0.0% 0.0% 79.3% 24.1% 0.0% 44.8% 27.6% 96.6% thep-values for the tests for each facial feature considered in this section were less than 0.001, which indicate that there are signicant dierences in the emotional and phonetic levels. Thep-values for the interaction terms were higher than 0.05, which suggest that the phonetic and emotional eects are additive. To identify the phonetic classes that present statistical dierences, a multiple com- parison test with Bonferroni adjustment was applied to the data. A summary of the results is presented in Table 3.9. Similar to Tables 3.2, 3.4 and 3.6, Table 3.9 gives the p-values for the head, eyebrow and lip motion test. It also gives the percentage of markers belonging to the upper, middle and lower face regions that present signicant dierence (p< 0:05). In general, Table 3.9 indicates that the average correlation levels for vowels is signif- icantly higher than the levels for consonants. This result agrees with previous ndings [9]. Similarly, voiced regions present higher correlation than unvoiced regions. Fricative phonemes present lower correlation levels than the average phoneme. Also, in contrast to the results presented by Yehia et al, our results suggest that the correlation levels for 75 nasal phonemes do not have statistical dierences compared to the average phonemes [185]. The dierent acoustic features used to estimate the mapping may explain this result. They used LSP coecients, which, unlike MFCCs, do not adequately model the zeros in the nasal spectra. Like nasal phonemes, plosive phonemes presents similar correlation levels as the average phoneme. Table 3.8 reveals that the lower face region and the lip features present the highest correlation levels. When compared to the case when the correlation was computed over the entire sentence (Table 3.3), these facial features show the lowest reduction in the correlation levels. These results indicate that lips features are locally related to the segmental acoustic events, and that they have a dierent time resolution than eyebrow and head motion features. This is not surprising given the key role of the lips in the articulatory process. An interesting result in Table 3.8 is that the correlation levels during acoustic silence is less than 50% lower than any other broad phonetic category. The main reason of this result is the behavior of acoustic features during silence. The RMS energy decreases in the entire frequency spectrum, aecting the MFCC coecients. Also, in this section the pitch (F0) is interpolated during unvoiced and silence regions (see Section 3.1.3.1), which clearly aects the accuracy of the mapping. Notice that acoustic silence does not mean that there is no communication activity. In fact, while the verbal channel is passive, the face may continue to gesture which explains why the correlation levels are not zero. This result suggests that silence regions should be treated in a special way. In sum, the results presented in this section suggest that special consideration need to be taken for consonants, especially unvoiced and fricative sounds as well as silence segments. 
In applications such as facial animation, the use of interpolation or pre- learned visemes during those segments may give better perceptive quality than just the use of the estimated facial features. Furthermore, these results support the observations 76 presented in [29] regarding the use of dierent resolutions to integrate facial features for realistic avatars. 3.1.6 Discussion While description of multimodal synthesis or recognition algorithms were not the goals of this section, the results presented here aim to provide insights on how to jointly model facial gestures and speech for expressive interfaces. The results of the present section indicate that linguistic and emotional aspects need to be jointly considered in the models. This applies not only to the orofacial area (e.g. with the use of visemes), but also in facial region such as forehead and even head motion. This section discusses and suggests new directions in the areas of facial animation and emotion recognition. It also presents observations with theoretical value. Although the correlation levels decrease when the mapping parameters are estimated at global level from the entire database, the results presented here indicate that the rela- tionship between facial gestures and speech has a structure that may be automatically learned with a more sophisticated mapping. This structure varies from emotion to emotion, indicating that specic emotional models are needed for expressive facial ani- mations. As discussed in Section 3.1.4.1, when prosodic features are used to estimate the mapping parameters, the correlation levels are lower than when vocal-tract features are used. However, the internal structure of the mapping seems to be easier to generalize. Therefore, prosodic features may be better to use than vocal tract features to estimate facial features given the interplay between speech articulation and facial actions. As an example of these observations, Chapter 5 proposes an HMM-based framework that can successfully capture the temporal relationship between head motion gesture and acous- tic prosody features. The results showed that the generated sequences were perceived as natural as the original sequences, indicating that the relationship between head motion 77 and prosody was successfully modeled. These works suggest that similar time series frameworks may be used to learn the structure of the audio-visual mapping for other facial gestures such as eyebrow motion. As discussed in Section 3.1.4.4, some facial gestures showed more complex structure than others. Also, errors in some facial gestures are more perceptively detrimental for facial animation than others (e.g. lip motion). Therefore, the correlation levels between acoustic features may not be sucient to generate some facial gestures, as suggested by Barker et al.[9]. In those cases, extra constraints, specications or pre-learned rules should be generated. The results in Section 3.1.5 reveal that facial gestures and speech are connected at dierent resolutions. While the orofacial area is locally tightly connected with the spoken message, the forehead, eyebrow and head motion are coupled only at the sentence level. Therefore, a coarse-to-ne decomposition of acoustic and facial features may be benecial to model the coupling between dierent facial areas and speech. This is an important observation, because believable models need to properly include the right timing and coupling resolution between dierent communication channels. 
Currently, this is one area that we are actively working to expand the results presented in this section. Likewise, the results indicate that the correlation levels in the audio-visual mapping during consonant, unvoiced and especially silence regions tend to decrease. Therefore, special consideration is required during those regions. The results presented in this section indicate that linguistic and emotional aspects of communication jointly modulate human speech and gestures to communicate and ex- press desired messages. Importantly, many of the same physical channels are involved actively or passively during the production of both speech (verbal) and the various face gestures (non-verbal). Since articulatory and aective goals appear to co-occur during normal human interaction and share the same channels, some form of internal central 78 control needs to buer, prioritize and execute them in coherent manner. We hypothesize that linguistic and aective goals interplay interchangeably as primary and secondary controls. During speech, for instance, articulatory processes targeting linguistic goals might have priority over expressive goals, which as a consequence are restricted to mod- ulating the communicative channels under the given articulatory constraints to convey the desired emotions. During the silence period, in contrast, articulatory processes are passive and aective goals are dominant in the non-verbal communication channels. In our previous work, emotional modulation in acoustic and articulatory parameters were analyzed [119, 187]. The results showed evidence for this interplay between aective and linguistic goals in speech, where low vowels with less restrictive tongue position observed greater emotional coloring than high vowels such as /i/. This hypothesis is also supported by the results presented in Section 3.1.4.2. The results indicate that when MFCCs are used to estimate the mappings, the eyebrows and the upper face re- gion, which are less constrained by the linguistic goals, present higher inter-emotional dierences. In fact, the eyebrow pose may be enough to perceive the emotional message [64]. Section 3.2 discusses these ideas in further detail. Another interesting research area is multimodal emotion recognition. In addition to the poses of facial expressions, the results presented in Section 3.1.4.1 suggest that the activeness of facial gestures also provides discriminative information about emotions. For example, the average displacement coecients for lips and head motion features may be used to discriminate between pairs of emotion that are usually confused in other modalities, such as happy-anger and sadness-neutral. It is interesting to note that in previous works, head motion has been usually compensated and removed to analyze facial expressions. In those works, the subjects were instructed to avoid moving their head. Since one of the channels is intentionally blocked, the resulting data may contain non-natural patterns in other facial gestures. The results presented here suggest that 79 head motion sequences not only should be encouraged for natural interaction, but also can be used to discriminate between emotions, as proposed in [39, 98]. Although the use of facial markers is suitable for the kind of analysis presented here, facial expressions need to be automatically extracted for realistic applications. 
This challenging task can be done with automatic platforms such as the CMU/Pitt Automated Facial Image Analysis (AFA) system [39], which has been tested with head motion sequences that include pitch and yaw rotations as large as 40° and 70°, respectively. While the orofacial area is clearly the target area for audio-visual speech recognition, it is not clear which areas need to be considered for multimodal emotion recognition. The analysis presented here, especially in Section 3.1.4.1, sheds light on the important facial areas used to display emotions. Figure 3.3 and Table 3.1 show that the displacement coefficient in the middle and upper regions of the face for happiness and anger is approximately 70% higher than during neutral speech. Conversely, the facial activeness in the lower face region increases only 30% for the same emotions. These results suggest that the forehead area seems to have more degrees of freedom to display non-verbal information such as emotion. Therefore, this area needs to be especially considered for emotion recognition systems. These results agree with perceptual experiments, which have shown that the upper face region is perceptually the most important region to detect visual prominence [112, 168].

3.2 Interplay between linguistic and affective goals in facial expression

Motivated by the results presented in Section 3.1, this section analyzes the interplay between affective and articulatory goals in facial expression during expressive speech.

3.2.1 Introduction

During human interaction, gestures and speech are simultaneously used to express not only verbal information, but also important communicative clues that enrich, complement and clarify the conversation. Notable among these non-linguistic clues is the emotional state of the speaker, which is manifested through modulation of various communicative channels. The fact that many of these channels are actively or passively involved during the production of speech (verbal) and facial expressions (non-verbal) indicates that the linguistic and affective goals co-occur during human interaction. Since conflicts may appear between these communicative goals in their realization, some kind of central control system needs to buffer, prioritize and execute them in a coherent manner.

Many studies have shown that acoustic parameters such as the speech rate, speech duration, the fundamental frequency and the RMS energy change during emotional utterances [44, 156, 187]. Articulatory parameters such as the tongue tip, jaw and lips also present more peripheral articulation during emotional speech compared to neutral speech [117, 119]. Similar results were reported by Nordstrand et al. [139] and Caldognetto et al. [25]. In fact, these characteristic patterns during emotional speech have been used in facial animation to generate viseme models for expressive virtual agents [12]. In spite of all this spatial-temporal emotional modulation, the linguistic goals are successfully fulfilled, which suggests that the communicative goals are prioritized according to their roles.

As mentioned in Section 3.1.6, we believe that affective and linguistic goals interplay interchangeably as primary and secondary controls. For instance, during speech, linguistic goals are prioritized over affective goals, which are restricted to enhancing and complementing the communicative channels under the constraints imposed by the articulatory processes. However, during acoustic silence, articulatory processes are passive, so affective goals may dominate human gestures.
Toward validating this hypothesis, this section focuses on investigating the interplay between linguistic and affective goals in facial expressions. Although apex poses of expressive faces have been studied before [69], it is not clear how these communicative goals affect the face during emotional utterances. The goal of this study is to quantify both the degree of freedom of facial areas to express the underlying emotion, and the articulatory constraints imposed on the face during active speech.

This study uses a subset of the emotional database presented in Section 2.1. After aligning neutral and emotional sentences with similar semantic content with the use of Dynamic Time Warping (DTW), the facial features were compared frame-by-frame. The results show that the upper face region has more freedom to convey non-verbal information regardless of the verbal message, which contrasts with the lower face region, which is constrained by the articulatory processes. These results have important implications in areas such as emotion recognition, facial animation and the understanding of speech production and perception.

3.2.2 Audio-visual database and facial features

A subset of 404 sentences from the FMCD database (Section 2.1) is used in this study. Four target emotions are considered (sadness, happiness, anger and the neutral state). The facial features considered here are the head, eyebrow and lip motion, which are parameterized using the same approach presented in Section 3.1.3.1. Also, each facial marker, except the reference nose marker, is used as a facial feature. They are also grouped into three facial areas: upper, middle and lower face region (right of Figure 3.1).

3.2.3 Emotional modulation

3.2.3.1 Temporal emotional modulation

The temporal modulation in the speech of this database was analyzed in Yildirim et al. [187]. The results showed that the mean and variance values of the utterance durations for sadness, anger and happiness were higher than for the neutral state. Likewise, the speaking rate had higher average values during emotional speech. With regard to vowel durations, the results showed that the mean values for anger and happiness were significantly higher than for sadness and the neutral state.

To further analyze this temporal modulation at the sentence level, Dynamic Time Warping (DTW) was used to estimate the temporal alignment between neutral and emotional speech for the same sentences. This technique uses dynamic programming algorithms to find the lowest-cost alignment path between two signals. The slope of this path provides valuable information about the overall emotional temporal modulation. The medians of the slopes of the alignment paths between neutral and emotional speech are 1.14, 1.09 and 1.09 for sad, happy and angry speech, respectively. The results show that emotional utterances are more than 9% longer than the neutral utterances. Interestingly, sad sentences are longer than happy and angry sentences. Since the average phoneme duration for sadness is shorter than in happiness and anger, this result indicates that the inter-word silence-to-speech ratio was highly modulated during sad speech.

3.2.3.2 Spatial emotional modulation

During emotional speech, areas in the face present different levels of movement activity, as discussed in Section 3.1.4.1. The results in Table 3.1 and Figure 3.3 reveal that the lower face region has the highest activeness levels.
In fact, in neutral speech this area is four times more active than the upper face area. Since this area is directly connected with the production of the speech, this result suggests that the articulatory processes play a crucial role in the movement of the face. This result supports the hypothesis that linguistic goals have priority during active speech.

Figure 3.8: Dynamic time warping to align neutral and emotional facial features. (a) Optimum alignment path, (b) original lip features, (c) normalized and warped lip features.

The results also indicate that the activeness levels change during emotional speech (Table 3.2). The average displacement coefficient (Equation 3.6) for happiness and anger is more than 30% higher than in the neutral case. Also, the activeness in the lower face region for sadness decreases 20% compared with the activeness in the neutral state. These results reveal that emotional modulation affects the activeness in the face, which agrees with previous work [12, 25, 117, 119, 139]. Notice that the upper face region presents the highest relative increments for happiness and anger compared to the neutral case (120%). This result suggests that valuable non-verbal information is conveyed in this area.

The displacement coefficient provides only an overall measure of the emotional influence in the face during affective speech. To study this spatial-emotional modulation in more detail, the neutral and emotional facial features for the same sentences were compared frame-by-frame. As discussed in Section 3.2.3.1, there are temporal differences between neutral and emotional utterances that need to be taken into account before the analysis. The alignment paths estimated with DTW were used to match the neutral and emotional frames. Figure 3.8 shows an example with the results between neutral and anger for the lip features extracted during the sentence "That dress looks like it comes from Asia". Notice that repetitions of the same sentence will generate differences not only in the gestures, but also in the speech. However, since the emotional content is the most important variable that is changed, the differences can be attributed mainly to emotional modulation.

Figure 3.9: Graphical representation of correlation levels between neutral and warped versions of emotional facial features: (a) Neutral-Sad, (b) Neutral-Happy, (c) Neutral-Angry. Dark areas represent high correlation values, which imply a low degree of freedom to convey non-verbal information.

After aligning the utterances, Pearson's correlation was calculated between the neutral and warped versions of the emotional facial features. The goal of this experiment is to quantify how free the facial areas are to convey emotional information. Since the linguistic content is the same, high correlation levels will be associated with facial areas with a low degree of freedom to convey non-verbal messages, and vice versa. The median results of the correlation levels are presented in Table 3.10 and Figure 3.9.
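The frame matching just described can be sketched as follows. This is a minimal, self-contained illustration rather than the exact procedure used in this chapter: a plain DTW is computed on one-dimensional synthetic "speech" tracks, the resulting path is then applied to a facial-marker trajectory, and Pearson's correlation is taken over the matched frames. All signal names and lengths are placeholders.

```python
import numpy as np
from scipy.stats import pearsonr

def dtw_path(x, y):
    """Dynamic-programming DTW between two 1-D sequences.
    Returns the optimal alignment path as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                       # backtrack the lowest-cost path
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Placeholder data: a neutral and an "emotional" rendition of the same sentence,
# where the emotional one is slower (longer), plus a facial-marker track for each.
rng = np.random.default_rng(0)
t_neu = np.linspace(0, 1, 100)
t_emo = np.linspace(0, 1, 120)                           # ~20% longer utterance
speech_neu = np.sin(2 * np.pi * 3 * t_neu)
speech_emo = np.sin(2 * np.pi * 3 * t_emo) + 0.1 * rng.normal(size=120)
marker_neu = np.cos(2 * np.pi * 3 * t_neu)
marker_emo = np.cos(2 * np.pi * 3 * t_emo) + 0.3 * rng.normal(size=120)

path = dtw_path(speech_neu, speech_emo)                  # align on the speech streams
idx_neu, idx_emo = map(np.array, zip(*path))
r, _ = pearsonr(marker_neu[idx_neu], marker_emo[idx_emo])  # frame-by-frame comparison
print(f"Pearson correlation after warping: {r:.2f}")
```

The same matched-frame indices can also be used for the Euclidean-distance comparison discussed below, once the facial features have been normalized.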
Figure 3.9 shows a graphical representation of the results, following the same procedure used for Figure 3.3.

Table 3.10: Frame-by-frame analysis between emotional and neutral facial features

                      Pearson's correlation           Euclidean distance
  Facial Area      Neu-Sad  Neu-Hap  Neu-Ang     Neu-Sad  Neu-Hap  Neu-Ang
  Head Motion        0.24     0.25     0.17        4.28     3.83     3.44
  Eyebrow            0.25     0.15     0.07        0.69     2.56     1.31
  Lip                0.54     0.50     0.53        0.38     1.61     0.82
  Upper region       0.27     0.24     0.15        1.08     2.49     2.02
  Middle region      0.46     0.38     0.37        0.63     2.12     1.27
  Lower region       0.58     0.52     0.53        0.46     0.95     0.71

The results clearly indicate that the lower facial region presents the highest correlation levels. Thus, this area is not freely able to convey emotion, due to the underlying articulatory constraints. In contrast, the correlation for the upper face region is very low, which indicates that this area can communicate non-verbal information regardless of the linguistic content. The same results are observed for head and eyebrow motion, which can be freely modulated to express emotional goals.

In addition to Pearson's correlation, the Euclidean distance between the neutral (F_t^(neu)) and the warped version of the emotional (F̃_t^(emo)) facial features was computed. Since the facial features considered here have different movement ranges, they need to be normalized before they can be directly compared. Firstly, the mean vector of the neutral facial features, μ̃^(neu), was removed from both F_t^(neu) and F̃_t^(emo). Then, the amplitude of F_t^(neu) was scaled by α^(neu), such that its range was 1. Finally, the amplitude of F̃_t^(emo) was also scaled by the same factor α^(neu). Figure 3.8-(c) shows an example of the results after this normalization (see y-axis).

F̂_t^(neu) = (F_t^(neu) − μ̃^(neu)) · α^(neu)    (3.9)
F̂_t^(emo) = (F̃_t^(emo) − μ̃^(neu)) · α^(neu)    (3.10)

Figure 3.10: Graphical representation of the Euclidean distance between the normalized neutral (F̂_t^(neu)) and emotional (F̂_t^(emo)) facial features: (a) Neutral-Sad, (b) Neutral-Happy, (c) Neutral-Angry. Dark areas represent large distances, which imply that the areas are not driven by articulatory processes.

Table 3.10 shows the median Euclidean distance between the normalized neutral and emotional facial features, in terms of emotional categories. Similar to Figures 3.3 and 3.9, Figure 3.10 shows a graphical representation of the results. Contrary to the Pearson's correlation results, high values indicate that facial features are more independent of the articulation, and vice versa. These results also show that the upper face region presents the highest differences between the neutral and emotional expressions, supporting the fact that emotional goals control this area. Table 3.10 also quantifies the levels of emotional modulation in the face. For this subject, happiness, followed by anger, seems to have the highest indices of spatial-facial emotional modulation.

3.2.4 Discussions

This section analyzed the spatial-temporal emotional modulation in the face during active speech. It also presented evidence about the interplay between linguistic and affective goals in facial expression. The results have important implications in areas such as emotion perception, emotion recognition, speech production and facial animation.

The results regarding the activeness of the face showed that facial motion is mainly driven by the articulatory processes. The results also showed that the activeness levels are affected by emotional modulation.
Compared to the neutral case, the activeness levels increase for happiness and anger, and decrease for sadness. The frame-by-frame comparisons between the neutral and warped emotional facial features indicate that the upper face region presents more freedom than the lower face region, which is highly constrained by the underlying articulatory processes, to convey non-verbal information such as the emotional content. These results explain why the upper face region is perceptually the most important facial region to detect visual prominence [112, 168]. They also explain why the upper and middle face regions are sufficient to accurately recognize human emotions, as shown in Section 4.2.

3.3 Joint analysis of the interplay between linguistic and affective goals

In the previous section, the interplay between affective and linguistic goals observed in the face was analyzed. This section studies the emotional modulation observed across facial gestures and speech.

3.3.1 Introduction

Under Brunswik's lens model, adapted by Scherer to study emotions [156], different communicative modalities are used to encode the affective state, such as speech [44], facial expression [69], head motion (Section 5.4) and even body posture [42]. The nature of this encoding is non-trivial, since the same modalities are simultaneously used to convey other communicative goals. However, listeners are particularly good at decoding each aspect of the message, even when the cues are only slightly expressed.

The emotional modulation in the acoustic and articulatory domains of speech has been analyzed at the phoneme level [119, 187]. The results showed that low vowels, such as /a/, with a less restricted tongue position presented stronger emotional modulation than high vowels, such as /i/. As shown in Section 4.1.3, some broad phonetic classes, such as front vowels, have stronger emotional variability in the spectral domain than other phonetic classes, such as nasal sounds. These observations suggest an interaction in the acoustic domain between linguistic and affective goals. Likewise, the interplay between articulation and emotions in facial expressions has been analyzed. As described before (Section 3.2), some facial areas are commonly used to convey emotions during expressive utterances. The results showed that the upper face region, such as the forehead, has more degrees of freedom to communicate non-verbal cues than the orofacial area, which is constrained by the articulatory process. Although the encoding of emotions has been individually studied for these modalities, a joint analysis of the emotional fingerprints in different modalities is needed to discover the underlying emotional encoding process. Only by simultaneously studying the emotional fingerprint in different modalities will we understand the interplay between different channels observed during expressive utterances.

This section studies the emotional fingerprint in facial expressions and speech during expressive utterances. In particular, our hypothesis is that when some communicative channels are used to fulfill linguistic goals, other modalities with less restrictive articulatory constraints are used to convey affective goals. These ideas resemble the water-filling principle in communication, in which the power (bits) is first assigned to the channels with the lowest noise (variance) [43]. Here, the emotional bits are assigned to the modalities that are not used to convey other communicative goals.
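For readers unfamiliar with the analogy, the classic water-filling allocation referenced above can be sketched in a few lines. This is only the communication-theory illustration from [43], not part of the analysis in this section; the noise values and total power below are arbitrary.

```python
import numpy as np

def water_filling(noise, total_power, iters=100):
    """Allocate total_power across channels as p_i = max(0, level - noise_i),
    with the water level found by bisection so the allocations sum to total_power.
    Channels with lower noise receive more power; very noisy channels get none."""
    lo, hi = 0.0, float(np.max(noise)) + total_power
    for _ in range(iters):
        level = 0.5 * (lo + hi)
        if np.maximum(level - noise, 0.0).sum() > total_power:
            hi = level
        else:
            lo = level
    return np.maximum(level - noise, 0.0)

# Most power goes to the cleanest channel: allocation is roughly [2, 1, 0].
print(water_filling(np.array([1.0, 2.0, 4.0]), total_power=3.0))
```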
Toward exploring this hypothesis, the emotional interplay observed in the speech spectral envelope and prosodic features, as well as in facial expressions, is jointly analyzed. In the proposed approach, the phonetic boundaries are used as a basic segmental unit across modalities. Therefore, the instantaneous behavior that is simultaneously conveyed in different communicative channels can be compared. Then, a measure of the emotional content is estimated based on the differences between the features extracted from neutral and emotional utterances. Therefore, the emotional modulation in each modality can be quantified, to analyze how the emotions are displayed. The database used in the analysis was recorded from a single subject, using audio and facial motion capture (marker based), which provides detailed facial information (the FMCD database, Section 2.1). In addition to neutral speech, the database is comprised of sad, angry and happy utterances. The results reported here support our hypothesis, since it is observed that the pitch and the facial areas tend to have a stronger emotional modulation for the phonemes that are physically constrained by articulation, such as nasal sounds.

The proposed analysis is novel and provides useful insights to help design better emotional models, capable of capturing the complex behavior observed in human interaction. The section is organized as follows. Section 3.3.2 describes the proposed approach to jointly study the emotional fingerprints in speech and facial expressions. Section 3.3.3 provides the results for the emotional modulation observed in speech and facial expressions. Finally, Section 3.3.4 concludes the section with the discussion and future directions of this work.

3.3.2 Proposed method

3.3.2.1 Motivation

As described in Section 4.1.3, acoustically neutral models were used to measure the degree of similarity between the input speech and the reference neutral models. The aim of this approach was to segment emotional speech, since this speech would likely be mismatched with the neutral acoustic models. In this approach, the emotionally-neutral TIMIT corpus was used to build standard Hidden Markov Models (HMMs), trained with Mel Frequency Bank (MFB) outputs, for broad phonetic classes that share a similar articulation configuration (see Table 3.11). After recognition, the likelihood scores provided by the Viterbi algorithm were used as a fitness measure. The details are given in Section 4.1.3.

Table 3.11: Broad phonetic classes

       Description          Phonemes
  B    Mid/back vowels      ax ah axh ax-h uw uh ao aa ux
  D    Diphthongs           ey ay oy aw ow
  F    Front vowels         iy ih eh ae ix
  L    Liquids and glides   l el r y w er axr
  N    Nasals               m n en ng em nx eng
  T    Stops                b d dx g p t k pcl tcl kcl qcl bcl dcl gcl q epi
  C    Fricatives           ch j jh dh z zh v f th s sh hh hv
  S    Silence              sil h# #h pau

An interesting result from this work was that some phonetic classes have more degrees of freedom to express emotions. These results are reproduced in Figure 3.11 for the speech taken from the database presented in Section 2.1 (the original Figure 4.2 was created with another corpus). This figure shows the mean and standard deviation of the likelihood scores for the emotional categories, in terms of the broad phonetic classes. As can be observed, the emotional fingerprint in these spectral features is stronger in phoneme classes such as front vowels.
In contrast, the likelihood scores for the nasal sounds are similar across emotional categories, suggesting that for these phonemes this vocal-tract feature does not have enough degrees of freedom to convey the affective goals.

Figure 3.11: Mean and standard deviation of the likelihood scores in terms of broad phonetic classes, evaluated with the corpus presented in Section 2.1. The neutral models were trained with MFB features.

Motivated by these findings, this section explores the patterns shown in other modalities to discover whether other communicative channels compensate for the articulatory restrictions shown in these spectral features. In this work, we also grouped the phonemes according to the broad phonetic classes described in Table 3.11 (based on manner of articulation).

3.3.2.2 Methodology

The proposed approach consists of projecting the phonetic segmental boundaries onto other communicative channels and analyzing the results in terms of these acoustic units. Specifically, the modalities that we examined were facial expressions and prosodic features such as pitch and energy.

Notice that the modalities considered here have different time scales and are known to be asynchronous. For example, it has been reported that in the orofacial area there may be a phase difference of hundreds of milliseconds, because of the co-articulation process and articulator inertia [87]. However, in this work we are interested in analyzing the instantaneous behavior simultaneously displayed during the duration of the phonemes. Therefore, it is theoretically consistent to project the acoustic boundaries onto other modalities, in the context of the proposed method.

For the analysis, the results for the emotional utterances are compared with the results obtained with neutral utterances to quantify the relative changes produced by the emotional fingerprints in different modalities.

3.3.2.3 Audio-visual database and features

The FMCD database presented in Section 2.1 was used for the analysis. The detailed facial information and the size of the corpus (more than 600 sentences) are particularly useful for the kind of analysis presented here. The markers were modified to compensate for the translation and rotation effects of rigid head movement by using the technique described in Section 2.2.5.4. For the analysis, each marker, except the reference nose marker, was used as a facial feature. To display the results in the figures and tables, the markers were aggregated into facial areas. This subdivision is based on the data-driven QR factorization algorithm presented by Lucero et al., in which independent markers selected by the technique are used as a basis for the facial kinematics [126]. Here, the QR factorization algorithm was applied to our data. For clustering, the dependent markers were associated with the independent marker with the highest weight in the linear combination. Figure 3.12-a shows the results of the grouping. This data-driven facial division was modified by including symmetry and muscle activity considerations. Figure 3.12-b shows the seven facial subdivisions used in this section: forehead (F1), left eye (F2), right eye (F3), left cheek (left intraorbital triangle) (F4), right cheek (right intraorbital triangle) (F5), nasolabial (F6), and chin (F7).

The acoustic signal was downsampled to 16 kHz. The fundamental frequency and the RMS energy were then estimated with the Praat speech processing software [14].
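Praat was used in the study; purely as an illustration, a roughly comparable frame-level F0 and RMS-energy extraction could be obtained with librosa's pYIN implementation, as sketched below. The file name, pitch range and frame settings are placeholders, not the settings used for the FMCD corpus.

```python
import numpy as np
import librosa

# Illustrative stand-in for the Praat-based extraction; 'utterance.wav' is a placeholder.
y, sr = librosa.load("utterance.wav", sr=16000)
f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr,
                                  frame_length=1024, hop_length=160)   # 10 ms hop
rms = librosa.feature.rms(y=y, frame_length=1024, hop_length=160)[0]

f0_voiced = f0[voiced_flag]        # the F0 contour is defined only on voiced frames
print("mean F0 (Hz):", np.nanmean(f0_voiced), "mean RMS:", rms.mean())
```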
The phoneme transcription was obtained with forced alignment, implemented with the HTK toolkit [189].

Figure 3.12: (a) Data-driven approach to cluster the facial region [126]. (b) Facial subdivision: (F1) Forehead, (F2) Left eye, (F3) Right eye, (F4) Left cheek, (F5) Right cheek, (F6) Nasolabial, and (F7) Chin.

3.3.3 Experimental results

This section presents the analysis of the emotional fingerprints observed in speech and facial expressions. Sections 3.3.3.1 and 3.3.3.2 describe the results for acoustic speech features, and Section 3.3.3.3 gives the analysis for facial expressions.

3.3.3.1 Spectral speech features

As mentioned in Section 3.3.2.1, an alternative approach to quantify the emotional modulation of spectral features is to measure how well the emotional input speech matches the reference spectral-based neutral models. Table 3.12 shows the average likelihood scores for the four emotional categories in terms of the phonetic classes (Table 3.11). These values were obtained after recognizing these broad phonetic classes with the neutral models built with the MFB outputs (trained with the TIMIT database). In addition to the average likelihood scores, Table 3.12 shows the ratio of the average values for the emotional categories (sadness, anger and happiness) to the ones for the neutral set. These values are particularly useful to quantify the relative changes between the neutral and emotional speech.

Table 3.12: Likelihood scores for MFB (Sad = sadness, Ang = anger, Hap = happiness, Neu = neutral)

          Average likelihood scores             Ratio emotion/neutral
        Sad      Ang      Hap      Neu       Sad/Neu  Ang/Neu  Hap/Neu
  B   -338.7   -375.8   -377.7   -346.5       0.98     1.08     1.09
  D   -352.9   -387.8   -388.1   -356.1       0.99     1.09     1.09
  F   -340.1   -397.5   -392.4   -350.6       0.97     1.13     1.12
  L   -337.4   -371.6   -374.1   -346.9       0.97     1.07     1.08
  N   -309.1   -328.9   -332.2   -318.4       0.97     1.03     1.04
  T   -343.6   -381.9   -381.1   -351.0       0.98     1.09     1.09
  C   -330.9   -367.9   -365.0   -341.2       0.97     1.08     1.07

Table 3.12 quantifies the results observed in Figure 3.11. For emotions with a high level of arousal, such as happiness and anger, the magnitude of the average likelihood scores for vowels (F, B) increases by approximately 10% compared with the values observed in the neutral case. These results agree with previous work, which has shown that the tongue tip, jaw and lips present more peripheral articulation for vowels during expressive speech when compared to neutral speech [118, 119, 139]. In contrast, the magnitude of the likelihood scores for the nasal sounds (N) in those emotions increases only 3% compared with the neutral case. This result suggests that physical constraints in the articulatory domain restrict the degrees of freedom to simultaneously express other communicative goals, such as emotions.

Another approach to measure the differences of the likelihood scores between the emotional and neutral cases is to compare not only their averages, but also their distributions. For this purpose, the probability mass functions (PMFs) were approximated with the histograms of the likelihood scores. The width of the bins was chosen such that the average number of samples per bin was higher than 100. Then, the Kullback-Leibler Divergence (KLD) was used to measure the distances between the three emotional distributions (sadness, anger and happiness) and the neutral distribution. Figure 3.13 shows the results, which complement the aforementioned observations. In the figure, longer bars mean larger KLD, which implies stronger differences between the distributions.
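The histogram-based KLD estimate described above can be sketched as follows. This is an illustrative implementation under the assumption of a shared bin grid for both score sets; the placeholder samples and the bin width are arbitrary (in the study the bin width was chosen so that the average count per bin exceeded 100).

```python
import numpy as np

def kld_from_samples(emo_scores, neu_scores, bin_width):
    """Approximate KLD(P_emo || P_neu) by histogramming both score sets on a
    shared grid of bins of width bin_width and comparing the resulting PMFs."""
    lo = min(emo_scores.min(), neu_scores.min())
    hi = max(emo_scores.max(), neu_scores.max())
    bins = np.arange(lo, hi + bin_width, bin_width)
    p, _ = np.histogram(emo_scores, bins=bins)
    q, _ = np.histogram(neu_scores, bins=bins)
    eps = 1e-12                                   # avoid log(0) for empty bins
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Placeholder likelihood scores for one phonetic class (synthetic values).
rng = np.random.default_rng(2)
neu = rng.normal(-350, 20, 5000)
ang = rng.normal(-380, 25, 5000)
print(kld_from_samples(ang, neu, bin_width=5.0))
```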
Figure 3.13: KLD between the distributions of the neutral and emotional likelihood scores. Vowels present stronger emotional modulation in the spectral acoustic domain.

3.3.3.2 Prosodic speech features

In this section, we are interested in analyzing the behavior of the pitch and energy within the acoustic boundaries projected from the broad phonetic classes, estimated with forced alignment (Section 3.3.2.3). These features predominantly describe the source of speech rather than the vocal-tract configuration. Notice that pitch values are observed only for voiced phonemes. Therefore, the results reported here for the pitch correspond only to voiced segments.

Table 3.13 shows the average values of the pitch and energy in terms of the emotions, for the phonetic classes. Similar to Figure 3.13, Figure 3.14 shows the KLD between the emotional and neutral distributions of the pitch and energy. For the pitch, the results reveal a strong emotional modulation for stop and fricative phonemes, especially for happiness (see also Table 3.13). Notice that in the experiments reported here, those classes do not present strong modulation in the spectral domain (see Fig. 3.13). For the energy, Figure 3.14 shows that the KLD values for vowels (classes B, D and F) are higher than for other classes. In summary, while the spectral and energy features tend to have higher emotional modulation for vowels, the pitch, which is relatively independent of the vocal-tract configuration, is used to convey emotions in voiced stop and fricative phonemes.

Table 3.13: Average values of pitch and energy during broad phonetic classes (Sad = sadness, Ang = anger, Hap = happiness, Neu = neutral)

                  Average value                   Ratio emotion/neutral
            Sad      Ang      Hap      Neu     Sad/Neu  Ang/Neu  Hap/Neu
  Pitch
    B      208.2    232.3    247.2    194.2     1.07     1.20     1.27
    D      203.7    227.2    248.6    192.0     1.06     1.18     1.30
    F      209.3    234.2    240.2    196.1     1.07     1.19     1.22
    L      202.8    215.5    235.0    194.7     1.04     1.11     1.21
    N      194.8    213.3    236.9    184.8     1.05     1.15     1.28
    T      217.1    234.7    276.4    197.4     1.10     1.19     1.40
    C      209.5    233.7    254.2    196.1     1.17     1.19     1.30
  Energy
    B      71.81    82.38    82.38    75.32     0.95     1.09     1.09
    D      72.99    83.12    83.46    76.79     0.95     1.08     1.09
    F      71.72    81.41    81.86    74.79     0.96     1.09     1.09
    L      71.00    79.71    80.31    74.21     0.96     1.07     1.08
    N      68.65    76.79    77.58    70.60     0.97     1.09     1.10
    T      62.21    70.70    72.35    64.18     0.97     1.10     1.13
    C      63.22    71.05    72.23    65.17     0.97     1.09     1.11

Table 3.14: Ratio between the average displacement coefficients observed in emotional and neutral utterances (S = sadness, A = anger, H = happiness, N = neutral)

                (B) Mid/back vowels  (D) Diphthongs       (F) Front vowels     (L) Liquids/glides
                 S/N   A/N   H/N      S/N   A/N   H/N      S/N   A/N   H/N      S/N   A/N   H/N
  Forehead      1.36  2.40  2.29     1.39  2.41  2.13     1.31  2.38  2.07     1.31  2.21  2.15
  Left eye      1.29  2.37  2.04     1.30  2.36  1.95     1.29  2.37  1.91     1.23  2.15  1.83
  Right eye     1.33  2.66  2.59     1.38  2.75  2.51     1.33  2.70  2.43     1.30  2.44  2.43
  Left cheek    1.05  1.76  1.36     1.00  1.68  1.42     1.03  1.82  1.50     1.04  1.79  1.39
  Right cheek   1.05  1.75  1.55     1.01  1.69  1.62     0.99  1.75  1.63     1.02  1.71  1.54
  Nasolabial    0.88  1.68  1.40     0.87  1.66  1.41     0.87  1.68  1.34     0.83  1.63  1.36
  Chin          0.89  1.55  1.49     0.84  1.41  1.25     0.85  1.39  1.38     0.88  1.54  1.50
  Average       1.12  2.03  1.82     1.11  1.99  1.76     1.09  2.01  1.75     1.09  1.92  1.74

                (N) Nasals           (T) Stops            (C) Fricatives
                 S/N   A/N   H/N      S/N   A/N   H/N      S/N   A/N   H/N
  Forehead      1.31  2.46  2.31     1.30  2.33  2.25     1.35  2.24  2.06
  Left eye      1.27  2.36  1.98     1.23  2.26  2.03     1.37  2.31  2.00
  Right eye     1.31  2.66  2.63     1.27  2.56  2.54     1.31  2.42  2.31
  Left cheek    1.10  1.87  1.52     1.11  1.83  1.53     1.04  1.73  1.38
  Right cheek   1.03  1.75  1.67     1.08  1.77  1.66     1.03  1.70  1.45
  Nasolabial    0.88  1.71  1.47     0.90  1.69  1.43     0.82  1.54  1.20
  Chin          0.91  1.60  1.70     0.89  1.56  1.59     0.86  1.52  1.31
  Average       1.12  2.06  1.90     1.11  2.00  1.86     1.11  1.92  1.67

Figure 3.14: KLD between the distributions of the neutral and emotional values of the pitch (a) and energy (b).

3.3.3.3 Facial expression

In this analysis, we are also interested in studying the emotional modulation in facial expressions that is simultaneously displayed during the phonetic segmental boundaries (also estimated with forced alignment). To quantify the facial activeness, the displacement coefficient is calculated for each facial marker (Equation 3.4). After estimating the displacement coefficients for the markers, the average value for each facial area was calculated for each phoneme segment. This procedure was applied for each emotional category. For the sake of simplicity, only the ratios between the emotional and neutral results are reported in Table 3.14. This table shows that the strongest emotional modulation for happiness and anger is observed in nasal sounds. A matched pairs test [134] between the ratios reported in Table 3.14 was performed to assess whether this result is statistically significant. This statistical test removes the differences observed between facial areas. Therefore, the significance of the emotional effect during the phonetic classes can be measured. The test revealed that the ratios for class N are significantly higher than the ratios of any other class (df = 20, p < 0.05), with the exception of class B.

Similar to Figures 3.13 and 3.14, Figure 3.15 gives the KLD between the distributions of the emotional and neutral displacement coefficients. This figure also shows that the nasal sounds are among the phonetic classes with higher emotional modulation. Figure 3.15 also shows that the emotional modulation in the upper region of the face (areas F1 to F3) is higher than in the orofacial area (areas F6 and F7). This result agrees with our previous work, which showed that the forehead area is less constrained by the articulation process, and therefore has more freedom to convey other communicative goals, such as emotions (Section 3.2).

Notice that emotional differences are observed in the results presented here. In the acoustic domain, the emotional modulation for happiness tends to be higher than the modulation for anger and sadness (see Figures 3.13 and 3.14).
However, in the facial expression domain, anger is the emotion with the strongest modulation, especially in the lower face regions (see Figure 3.15). These results suggest that different modalities are used to emphasize the affective goals for happiness and anger.

3.3.4 Discussion

The analysis presented in this section provides evidence about the emotional encoding process, in which different modalities are used to convey the affective goals. The results support the idea that emotional bits are assigned to the modalities that are not constrained by other communicative goals. In particular, it was observed that facial expressions and pitch tend to have stronger emotional modulation when the articulatory configuration does not have enough freedom to express emotions, given the physical constraints in the speech production system. This emotional assignment compensates for the temporal limitations observed in some modalities, and plays a crucial role in the interplay between communicative goals.

Figure 3.15: KLD between the distributions of the neutral and emotional values of the displacement coefficient for the facial areas: (F1) Forehead, (F2) Left eye, (F3) Right eye, (F4) Left cheek, (F5) Right cheek, (F6) Nasolabial, (F7) Chin.

An interesting question is whether the phonetic classes used here, which are based on the manner of articulation, are the best phonetic description to link the acoustic and visual modalities. An alternative approach is to subdivide the phonemes according to the place of articulation (e.g., bilabial, dental, palatal). Since this classification is connected with the mapping between visemes and phonemes, it may be useful to compare a similar analysis with this phonetic subdivision.

While the results presented here are interesting from a theoretical point of view, many practical applications such as believable human-like animation and cognitive interfaces can be enhanced by understanding how the emotions are jointly encoded in different communicative channels. The analysis presented in this section provides useful insights to design emotional models that capture the underlying relationships and interplays between modalities.

3.4 Conclusions

This chapter analyzed the influence of emotional content and articulatory processes on the relationship between facial gestures and speech. The results show that articulation and emotions jointly modulate the interaction between these communicative modalities. The articulatory process is strongly correlated with the appearance of the face. This result is observed not only in the lower face region but also in the upper and middle face regions. Likewise, we observe significant inter-emotion differences in the correlation levels of the audio-visual mapping, which suggest that the emotional content also affects the relationship. Under emotional speech, the activeness of facial gestures for anger and happiness increases by more than 30% compared to neutral speech. This pattern directly affects the interrelation between facial gestures and speech.

The results presented here suggest that even though the relationship between facial gestures and speech can change from sentence to sentence, there is an emotion-dependent structure that may be learned using more sophisticated techniques than the commonly used multilinear regression.
Our preliminary results, which will be presented in Chapter 5, suggest that time series models such as HMMs are promising and hence seem to be suitable for learning the kinds of audio-visual mappings analyzed here.

We are currently analyzing the timing and resolution of the interrelation between facial and acoustic features. As shown in this chapter, facial gestures and speech are coupled at different resolutions. Also, the timing between gestures and speech is intrinsically asynchronous. Even in the orofacial area there may be a phase difference of hundreds of milliseconds, because of the co-articulation process and articulator inertia [87]. An open question is how to learn and model this asynchronous behavior not only near the lips, but also in the entire face. Our goal is to appropriately integrate the facial gestures to generate believable human-like facial animations.

This chapter also analyzed the spatial-temporal emotional modulation in the face during active speech. The results indicate that there is an interplay between linguistic and affective goals in facial expression. For instance, facial areas such as the forehead and cheeks are less constrained by the articulatory process, and therefore have more degrees of freedom to express non-linguistic clues such as emotions. These results suggest that facial emotion recognition systems during active speech should focus primarily on these areas, because they are not constrained by the linguistic content. Also, for human-like facial animations, these facial areas should be properly modeled and rendered to convey more realistic emotional representations. In the lower face region, the results reveal that linguistic and affective goals co-occur during active speech. Further analyses need to be conducted to evaluate what happens when conflicts between these communicative channels occur. In these cases, areas with more degrees of freedom to convey non-verbal information, such as head and eyebrow motion, may be used to simultaneously achieve these communicative goals.

In this analysis, the facial gestures during active speech were analyzed. An interesting question is how the patterns change during acoustic silence, in which the affective goals can be expressed without articulatory constraints. Another interesting question is how to model this spatial-temporal emotional modulation.

This chapter also analyzed the interplay between linguistic and affective goals that is simultaneously observed in facial expression and speech. Interestingly, the results suggested that when one modality is constrained by the articulatory processes, other communicative channels are used to express the emotional content.

The main limitation of this work is that we analyzed the gestures and speech of a single actress. The database presented in Section 2.2, in which ten subjects were recorded, can be used to validate and expand the experiments presented here. This data will also be appropriate to study inter-personal variabilities in facial gestures. Our next step will be to control the emotional content and the personal styles of the facial animations. This will be possible if we understand how the emotional, idiosyncratic and linguistic aspects of human communication modulate each communicative modality.

Chapter 4: Recognition of non-linguistic cues

In this chapter, we study approaches to automatically recognize paralinguistic cues such as emotions, strategies and behaviors displayed by the speakers at different levels of granularity.
Using speech features, Section 4.1 proposes a framework for binary emotion recognition systems (emotional versus neutral speech). Instead of using a conventional approach, a two-step scheme is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral models. In the second step, a fitness measure is used to assess whether the input speech is similar to the reference models (in the case of neutral speech) or different from them (in the case of emotional speech).

Since emotion is jointly conveyed through different communicative channels, Section 4.2 presents results on emotion recognition using features extracted from speech and facial expressions. The limitations of unimodal systems are analyzed, and the advantages of considering both streams of features in the system are measured in terms of robustness and accuracy.

At a small-group level, Section 4.3 analyzes participants' interactions in an intelligent environment equipped with non-invasive audio-visual sensors. The goal is to automatically monitor and track the behavior, strategies and engagement of the participants in multiperson meetings. High-level features are calculated from active speaker segmentations, automatically annotated by our smart room system, to infer the interaction dynamics between the participants. These high-level features, which cannot be inferred from any of the individual modalities by themselves, can be useful for summarization, classification, retrieval and (after action) analysis of meetings. Finally, Section 4.4 presents the final remarks of this chapter.

4.1 Using Neutral Speech Models for Emotional Speech Recognition

4.1.1 Introduction

Detecting and utilizing non-lexical or paralinguistic cues from a user is one of the major challenges in the development of usable human-machine interfaces (HMI). Notable among these cues are the universal categorical emotional states (e.g., angry, happy, sad), prevalent in day-to-day scenarios. Knowing such emotional states can help adjust system responses so that the user of such a system can be more engaged and have a more effective interaction with the system.

For the aforementioned purpose, identifying a user's emotion from the speech signal is quite desirable, since recording the stream of data and extracting features from this modality is comparatively easier and simpler than in other modalities such as facial expression and body posture. Most of the current efforts to address this problem have been limited to dealing with emotional databases spanning a subset of emotional categories. These studies have shown accuracies between 50% and 85%, depending on the task (e.g., number of emotion labels, number of speakers, size of the database) [144]. A comprehensive review of the current approaches is given in [47].

However, such emotion categorization performance is largely specific to the individual databases examined (and usually off-line), and it is not plausible to easily generalize the results to different databases or on-line recognition tasks. This is due to the inherent speaker-dependency in emotion expression, acoustic confusions among emotional categories, and differences in acoustic environments across recording sessions. The feature selection and the models are trained for specific databases, with the risk of sparseness in the feature space and over-fitting.
It is also fairly difficult, if not infeasible, to collect enough emotional speech data so that one can train robust and universal acoustic models of individual emotions, especially if one considers that there exist more than a dozen emotional categories, and their possible combinations, that we can use to differentiate affective states or attitudes [44].

As a possible way to circumvent this fundamental problem in emotion categorization based on speech acoustics, this study tests a novel idea of discriminating emotional speech against neutral (i.e., non-emotional) speech. That is, instead of training individual emotional models, we build a single neutral speech model and use it for emotion evaluation, either in the categorical approach or in the dimensional approach [44], based on the assumption that emotional speech productions are variants of their non-emotional counterparts in the (measurable) feature space. For example, it has been shown that speech rate, speech duration, fundamental frequency (F0), and RMS energy are simultaneously modulated to convey the emotional information [187]. Also in the articulatory domain, it has been shown that the tongue tip, jaw and lip kinematics during expressive speech production are different from neutral speech [118, 119]. Hence, modeling the differential properties with respect to neutral speech is hypothesized to be advantageous. In addition, because there are many more neutral speech corpora, robust neutral acoustic speech models can be built.

This section presents our attempts to examine the aforementioned idea using spectral (Section 4.1.3) and prosodic (Section 4.1.4) features. Using the most emotionally salient speech features, the performance of the proposed approach for binary emotion recognition reaches over 77% (baseline 50%) when the various emotional databases are considered together. Furthermore, when the system is trained and tested with different databases (in a different language), the recognition accuracy does not decrease compared to the case without any mismatch between the training and testing conditions. In contrast, for the same task, the performance of a conventional emotion recognition system (without the neutral models) decreases by up to 17% using the same speech features. These results indicate that the proposed neutral model approach for binary emotion discrimination (emotional versus neutral speech) outperforms conventional emotion recognition schemes in terms of accuracy and robustness.

4.1.2 Proposed approach

Figure 4.1 describes the proposed two-step approach. In the first step, neutral models are built to measure the degree of similarity between the input speech and the reference neutral speech. The output of this block is a fitness measure of the input speech. In the second step, these measures are used as features to infer whether the input speech is emotional or neutral. If the features from the expressive speech differ in any aspect from their neutral counterparts, the fitness measure will decrease. Therefore, we hypothesize that setting thresholds over these fitness measures is easier and more robust than setting thresholds over the features themselves.

Figure 4.1: Proposed two-step neutral model approach to discriminate neutral versus emotional speech. In the first step, the input speech is contrasted with robust neutral reference models. In the second step, the fitness measures are used for binary emotional classification (details are given in Section 4.1.2).
Table 4.1: Broad phone classes

       Description          Phonemes
  F    Front vowels         iy ih eh ae ix
  B    Mid/back vowels      ax ah axh ax-h uw uh ao aa ux
  D    Diphthongs           ey ay oy aw ow
  L    Liquids and glides   l el r y w er axr
  N    Nasals               m n en ng em nx eng
  T    Stops                b d dx g p t k pcl tcl kcl qcl bcl dcl gcl q epi
  C    Fricatives           ch j jh dh z zh v f th s sh hh hv
  S    Silence              sil h# #h pau

While the first step is independent of the emotional database, the speakers, and the emotional categories, the second step depends on these factors, since the classifier needs to be trained with emotional and neutral speech.

4.1.3 Neutral model for spectral speech features

While the neutral models can be trained for any speech feature that shows emotional modulation, in this section we considered the conventional spectral features: Mel Filter Bank (MFB) outputs and Mel-Frequency Cepstrum Coefficients (MFCCs) (prior work had demonstrated that spectral features carry significant emotional information [116]). Separate sets of models for both types of features were built at the phonetic-class level. The English phonetic alphabet was aggregated into seven broad phonetic classes that share a similar articulation configuration (Table 4.1).

Hidden Markov Models (HMMs) were selected to build the neutral models, since they are suitable for capturing the time-series behavior of speech, as widely demonstrated in automatic speech recognition (ASR). Here, an HMM with 3 states and 16 Gaussian mixtures was built for each broad phonetic category. The HTK toolkit was used to build these models, using standard techniques such as the forward-backward and Baum-Welch re-estimation algorithms [189]. After a high-frequency pre-emphasis of the speech signal, MFCC and MFB feature vectors were estimated. For the MFCCs, 13 coefficients were estimated with the cepstral mean normalization (CMN) option. Likewise, the outputs of 13 filter banks were used as MFB features. In both cases, the velocity and acceleration of these coefficients were included, forming a 39-dimensional feature vector.

An important issue in this approach is the selection of the fitness measure used to assess how well the input speech fits the reference models. Here, the likelihood scores provided by the Viterbi decoding algorithm are used. Since this likelihood depends on the length of the segment, the acoustic scores were normalized according to the duration of the phones (option -o N in the HVite function [189]).

One assumption made in this approach is that the emotional corpus used to test it will include a set of neutral speech. The purpose of this assumption is to compensate for different recording settings between the neutral reference and emotional databases. In particular, the speech files are scaled such that the average RMS energy of the neutral reference database and the neutral set in the emotional database are the same. This approach is similar to the normalization presented in [192], and it is described in further detail in Section 4.1.4.2.

4.1.3.1 Analysis of the likelihood scores

The popular read-speech TIMIT database was used as a reference for neutral speech [75]. This database contains 4620 sentences for the training set and 1680 for the testing set, collected from 460 speakers. The nature, size, and inter-speaker variability make this database suitable to train the proposed "neutral-speech" models. Two corpora are used as emotional databases. The first one is the EMA database [119], in which three subjects read 10 sentences five times, portraying four emotional states: sadness, anger, happiness and the neutral state.
Although this database contains articulatory information, only the acoustic signals were analyzed. Notice that the EMA data have been perceptually evaluated, and the inter-evaluator agreement has been shown to be 81.9% [82]. The second corpus was collected from a call center application [115]. It provides spontaneous speech of different speakers from a real human-machine application. The data were labeled as negative or non-negative (neutral). Only the sentences with high agreement between the raters were considered (1027 neutral, 338 negative). This database is referred to from here on as the call center database (CCD). More details of these two emotional databases can be found in [115, 119], respectively.

While the sample rate of the TIMIT database is 16 kHz, the sample rate of the CCD corpus is 8 kHz (telephone speech). To compensate for this mismatch, the TIMIT database was downsampled to 8 kHz to train the reference neutral models used to assess the CCD corpus. In contrast, for the EMA corpus the broadband TIMIT data was used for training, since its speech files were also recorded at 16 kHz.

MFB-based neutral speech models

After the MFB-based neutral models were built with the TIMIT training set, the likelihood scores of the emotional testing corpora were computed. Figures 4.2 and 4.3 show plots with the mean and standard deviation of the likelihood scores for the broad phonetic categories, obtained with the EMA and CCD databases, respectively. For reference, the likelihood scores for the TIMIT testing set were also plotted.

These figures reveal that the mean and the variance of the likelihood scores for emotional speech differ from the results observed in neutral speech, especially for emotions with a high level of arousal such as anger and happiness.

Figure 4.2: Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the EMA corpus. The neutral models were trained with MFB features.

We also observed that some broad phonetic classes present stronger differences than others. For example, front vowels present distinctive emotional modulations. In contrast, the likelihood scores for nasal sounds are similar across emotional categories (see Fig. 4.4-c), suggesting that during articulation there are not enough degrees of freedom to convey emotional modulation. These results agree with our previous work, which indicated that emotional modulation is not displayed uniformly across speech sounds [119, 187].

Figure 4.4 gives the histograms of the likelihood scores of four broad phonetic classes for the EMA database. For reference, the likelihood scores for the TIMIT testing database are also included. This figure reveals that the results for the neutral speech from the EMA corpus are closer to the results of the TIMIT references. It is also observed that the histograms for happiness and anger significantly differ from those of the references. Unfortunately, the results for sadness were similar to neutral speech, so it is expected that these classes will not be correctly separated with these chosen spectral features. Interestingly, similar confusion trends were observed in our previous work between these emotional categories with other acoustic speech features (spectral and prosodic features) [82, 187].

Figure 4.3: Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the CCD corpus. The neutral models were trained with MFB features.
Figure 4.4: Likelihood score histograms for the broad phonetic classes B, F, N and T ((a) Class B, (b) Class F, (c) Class N, (d) Class T). The EMA corpus is used (MFB-based neutral models).

One possible explanation is that neutral and sad speech mainly differ in the valence domain. However, it has been shown that with speech features the valence domain is more difficult to recognize than the arousal or activation domain [82].

MFCC-based neutral speech models

Neutral models were also built with MFCC features. Figures 4.5 and 4.6 show the results of the likelihood scores in terms of the broad phonetic categories for the EMA and CCD databases, respectively. Although emotional modulation is observed, these figures show that the differences between the likelihood scores for the emotional categories are not as strong as the differences obtained with the MFB-based models (note the values on the vertical axes). Interestingly, MFCCs are calculated from MFB outputs by applying the Discrete Cosine Transform (DCT) over the Mel log-amplitudes. This post-processing step seems to blur the acoustic differences between emotional and neutral speech.

Figure 4.5: Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the EMA corpus. The neutral models were trained with MFCC features.

Figure 4.6: Error bars of the likelihood scores in terms of broad phonetic classes, evaluated with the CCD corpus. The neutral models were trained with MFCC features.

4.1.3.2 Discriminant analysis

This section analyzes the emotional information conveyed in the likelihood scores. It also discusses whether they can be used to segment emotional speech automatically. It can be observed from Figures 4.2, 4.3, 4.5, and 4.6 that the means and the standard deviations of the likelihood scores differ from the values obtained with neutral speech. In this experiment, the averages of these measures at the sentence level for each broad phonetic class were used as features for emotion recognition. If the emotional classes are denoted by C, and the features for the phonetic class i_j ∈ {F, B, D, L, N, T, C} are denoted by Lk_{i_j}, the classification problem can be formulated as:

P(C | Obs)    (4.1)
P(C | Lk_{i_1}, Lk_{i_2}, ..., Lk_{i_N})    (4.2)

where N is the number of different phonetic classes recognized in the sentence.
4.1.3.2 Discriminant analysis

This section analyzes the emotional information conveyed in the likelihood scores. It also discusses whether they can be used to segment emotional speech automatically. It can be observed from Figures 4.2, 4.3, 4.5, and 4.6 that the means and the standard deviations of the likelihood scores differ from the values obtained with neutral speech. In this experiment, the averages of these measures at the sentence level for each broad phonetic class were used as features for emotion recognition. If the emotional classes are denoted by C, and the features for the phonetic class i_j \in {F, B, D, L, N, T, C} are denoted by Lk_{i_j}, the classification problem can be formulated as:

P(C | Obs)    (4.1)

P(C | Lk_{i_1}, Lk_{i_2}, ..., Lk_{i_N})    (4.2)

where N is the number of different phonetic classes recognized in the sentence. Assuming independence between the results of the phonetic classes, Equation 4.2 can be rewritten as:

P(C | Obs) = \prod_{j=1}^{N} P(C | Lk_{i_j})    (4.3)

In other words, only the probabilities P(C | Lk_{i_j}) from the phonetic classes detected in the sentence are combined.

A linear discriminant classifier (LDC) was used in these experiments. Since the number of samples for each emotional category is different, the prior probabilities were set to equal values. The databases were randomly split into training (80%) and testing (20%) sets. The results reported here correspond to the average performance over 100 realizations.

Table 4.2: Discriminant analysis of likelihood scores for the CCD database (Neg = negative, Neu = neutral).

                              MFB-based models      MFCC-based models
      Ground truth             Neg      Neu          Neg      Neu
      Classified as  Neg       0.42     0.16         0.38     0.15
                     Neu       0.58     0.84         0.62     0.85

Table 4.3 presents the recognition results for two experiments using the EMA database: binary emotion recognition, in which the emotional classes sadness, anger, and happiness were grouped together versus neutral speech, and categorical emotion recognition, in which the labels of the four categories were classified. For the MFB-based neutral models, the binary classifier achieved an accuracy of 78% (chance is 50%). As can be observed from the confusion matrix, the classification errors were mainly between the neutral state and sadness. The performance of the 4-label emotion recognition test was 65% (chance is 25%). In our previous work, an accuracy of 66.9% was achieved for a similar task by using many acoustic features [82]. These two experiments reveal that with this approach neutral speech can be accurately separated from emotional speech when there is a high level of arousal (i.e., happiness and anger). For the MFCC-based neutral models, the performance decreases measurably. These results agree with the analysis presented in Section 4.1.3.1.

Table 4.3: Discriminant analysis of likelihood scores for the EMA database (Neu = neutral, Sad = sadness, Hap = happiness, Ang = anger, Emo = emotional).

                                 MFB-based models                 MFCC-based models
      Ground truth           Sad    Ang    Hap    Neu         Sad    Ang    Hap    Neu
      Classified as  Emo     0.20   0.98   0.97   0.02        0.05   0.99   0.81   0.13
                     Neu     0.80   0.02   0.03   0.98        0.95   0.01   0.19   0.87
      Classified as  Sad     0.66   0.00   0.10   0.27        0.65   0.00   0.02   0.25
                     Ang     0.00   0.73   0.32   0.01        0.01   0.65   0.44   0.03
                     Hap     0.02   0.27   0.58   0.06        0.04   0.35   0.46   0.22
                     Neu     0.33   0.00   0.01   0.66        0.30   0.00   0.07   0.50

For the CCD corpus, the emotional classes considered in the experiment were negative versus neutral speech. The results for the MFB-based neutral models were 42% for the negative class and 84% for the neutral class (63% average). The performance for the MFCC-based models was slightly lower than with MFB: 38% for the negative class and 85% for the neutral class (61.5% average). As a reference, our previous work reached approximately 74% accuracy for the same task by using many different acoustic speech features, including prosodic features [115]. One reason why the performance is worse on the CCD data than on the EMA data is that most of the sentences are short, often containing only one word (the median duration is approximately 1.5 seconds). This issue affects the accuracy of the features extracted from the likelihood scores.
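A minimal sketch of this sentence-level classification step is given below: likelihood-score statistics are fed to a linear discriminant classifier with equal priors, and the accuracy is averaged over random 80%/20% splits. It uses scikit-learn as a stand-in for the classifier actually implemented here; the array names and the helper function are illustrative.

# Minimal sketch (illustrative only) of the discriminant-analysis step: per-sentence
# likelihood-score statistics X (n_sentences, n_features) and emotion labels y are
# classified with a linear discriminant classifier (LDC) using equal priors,
# averaged over random 80/20 splits.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def average_ldc_accuracy(X, y, n_classes, n_runs=100):
    accuracies = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=run)
        # equal priors, since the emotional classes are unbalanced
        ldc = LinearDiscriminantAnalysis(priors=np.full(n_classes, 1.0 / n_classes))
        ldc.fit(X_tr, y_tr)
        accuracies.append(ldc.score(X_te, y_te))
    return np.mean(accuracies)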
4.1.4 Neutral model for prosodic speech features

Speech prosody is also an important communicative channel that is influenced by, and enriched with, emotional modulation. The intonation, tone, timing, and energy of speech are all jointly influenced in a non-trivial manner to express the emotional message [47]. The standard approach in current emotion recognition systems is to compute high-level statistical information from prosodic features at the sentence level, such as the mean, range, variance, maximum, and minimum of the F0 and energy. These statistics are concatenated to create an aggregated feature vector. Then, a suitable feature selection technique such as forward or backward feature selection, sequential forward floating search, genetic algorithms, evolutionary algorithms, linear discriminant analysis, or principal component analysis [2, 162, 178] is used to extract a feature subset that provides better discrimination for the given task. As a result, the selected features are sensitive to the training and testing conditions (database, emotional descriptors, recording environment). Therefore, it is not surprising that the models do not generalize across domains, and notably in real-life scenarios. A detailed study of the emotional modulation in these features can inform the development of robust features, not only for emotion recognition, but also for other applications such as expressive speech synthesis. This section focuses on one aspect of expressive speech prosody: the F0 (pitch) contour.

The goal of this section is twofold. The first goal is to study which aspects of the pitch contour are manipulated during expressive speech (e.g., curvature, contour shape, dynamics). For this purpose, we present a novel framework based on the Kullback-Leibler Divergence (KLD) and logistic regression models to identify, quantify, and rank the most emotionally salient aspects of the F0 contour. Three different emotional databases are used for the study, spanning different speakers, emotional categories, and languages (English and German). First, the Kullback-Leibler Divergence is used to compare the distributions of different pitch statistics (e.g., mean, maximum) between emotional speech and reference neutral speech. Then, a logistic regression analysis is implemented to discriminate emotional speech from neutral speech using the pitch statistics as input. These experiments provide insights about the aspects of pitch that are modulated to convey emotional goals. The second goal is to use these emotionally salient features to build robust prosodic speech models to detect emotional speech. While the focus of the previous section was on spectral speech models, this section focuses on features derived from the F0 contour. Gaussian Mixture Models (GMMs) are trained using the most discriminative aspects of the pitch contour, following the analysis results presented in this section.

The results reveal that features that describe the global aspects (or properties) of the pitch contour, such as the mean, maximum, minimum, and range, are more emotionally salient than features that describe the pitch shape itself (e.g., slope, curvature, and inflexion). However, features such as pitch curvature provide complementary information that is useful for emotion discrimination. The classification results also indicate that the models trained with statistics derived over the entire sentence have better performance, in terms of accuracy and robustness, than models trained with features estimated over shorter speech regions (i.e., voiced segments).

4.1.4.1 Related work

Pitch features from expressive speech have been extensively analyzed during the last few years. From these studies, it is well known that the pitch contour presents distinctive patterns for certain emotional categories.
In an exhaustive review, Juslin and Laukka reported some consistent results for the pitch contour across 104 studies on vocal expression [96]. For example, they concluded that the pitch contour is higher and more variable for emotions such as anger and happiness, and lower and less variable for emotions such as sadness. Despite having a powerful descriptive value, these observations are not adequate to quantify the discriminative power and the variability of the pitch features. In this section, we highlight some of the studies that have attempted to measure the emotional information conveyed in different aspects of the pitch contour.

The results obtained by Lieberman and Michaels indicate that the fine structure of the pitch contour is an important emotional cue [123]. Using human perceptual experiments, they showed that the recognition of emotional modes such as bored and pompous decreased when the pitch contour was smoothed. Therefore, small pitch fluctuations, which are usually neglected, seem to convey emotional information.

In many languages, the F0 values tend to gradually decrease toward the end of the sentence, a phenomenon known as declination. Wang et al. compared the pitch declination conveyed in happy and neutral speech in Mandarin [182]. Using four-word sentences, they studied the pitch patterns at the word level. They concluded that the declination in happy speech is smaller than in neutral speech, and that the slope of the F0 contour is higher than in neutral speech, especially at the end of the sentence. Paeschke et al. also analyzed the pitch shape in expressive speech [142]. They proposed different pitch features that might be useful for emotion recognition, such as the steepness of the rising and falling of the pitch, and the direction of the pitch contour [142]. Likewise, they also studied the differences across emotions in the global trend of the pitch, defined as the gradient of a linear regression [141]. In all these experiments, they found statistically significant differences.

Bänziger and Scherer argued that the pitch mean and range account for most of the important emotional variation found in the pitch [7]. In our previous work, the mean, shape, and range of the pitch of expressive speech were systematically modified [18]. Then, subjective evaluations were performed to assess the emotional differences perceived in the synthesized sentences with the F0 modifications. The mean and the range were increased/decreased by different percentages and values. The pitch shape was modified by using stylization at varying semitone frequency resolutions. The results indicated that modifications of the range (followed by the mean) had the biggest impact on the emotional perception of the sentences. The results also showed that the pitch shape needs to be drastically modified to change the perception of the original emotions. Using perceptual experiments, Ladd et al. also suggested that pitch range is more salient than pitch shape. Scherer et al. explained these results by making the distinction between linguistic and paralinguistic pitch features [158]. The authors suggested that gross statistics of the pitch are less connected to the verbal context, so they can be independently manipulated to express the emotional state of the speaker (paralinguistic). The authors also argued that the pitch shape (i.e., rise and fall) is tightly associated with the grammatical (linguistic) structure of the sentence. Therefore, the pitch shape is jointly modified by linguistic and affective goals.
As an aside, a similar interplay with pitch has been observed in facial expressions, as described in Section 3.2.

Another interesting question is whether the emotional variations in the pitch contour change in terms of specific emotional categories or general activation levels. Bänziger and Scherer reported that the mean and range of the pitch contour change as a function of emotional arousal [7]. On the other hand, they did not find evidence for specific pitch shapes for different emotional categories. Thus, we argue that using pitch features is better suited for binary emotion classification than for implementing multi-class emotion recognition. These results support our idea of contrasting pitch statistics derived from emotional speech with those of the neutral counterpart.

Although the aforementioned studies have reported statistically significant emotional differences, they do not provide automatic recognition experiments to validate the discriminative power of the proposed features. The framework presented in this section allows us not only to identify the emotionally salient aspects of the F0 contour, but also to quantify and compare their discriminative power for emotion recognition purposes. The main contributions of this study are:

- A discriminative analysis of emotional speech with respect to neutral speech
- A novel methodology to analyze, quantify, and rank the most prominent and discriminative pitch features
- A novel robust binary emotion recognition system based on contrasting expressive speech with reference neutral models

4.1.4.2 Methodology

Overview

The fundamental frequency or F0 contour (pitch), which is a prosodic feature, provides the tonal and rhythmic properties of the speech. It predominantly describes the speech source rather than the vocal tract properties. Although it is also used to emphasize linguistic goals conveyed in speech, it is largely independent of the specific lexical content of what is spoken in most languages [104].

The fundamental frequency is also a supra-segmental speech feature, in which information is conveyed over longer time scales than in segmental speech correlates such as spectral envelope features. Therefore, rather than using the pitch value itself, it is commonly accepted to estimate global statistics of the pitch contour over an entire utterance or sentence (sentence level), such as the mean, maximum, and standard deviation. However, it is not clear that estimating global statistics from the pitch contour will provide local information about the emotional modulation [123]. Therefore, in addition to sentence-level analysis, we investigate alternative time units for the F0 contour analysis. Examples of time units that have been proposed to model or analyze the pitch contour include the foot level [103], the word level [182], and even the syllable level [142]. In this section, we propose to study the pitch features extracted over voiced regions (referred to from here on as voiced-level). In this approach, the frames are labeled as voiced or unvoiced according to whether their F0 value is greater than zero or equal to zero. Consecutive voiced frames are joined to form a voiced region, over which the pitch statistics are estimated. The average duration of this time unit is 167 ms (estimated from the neutral reference corpus described in Table 4.4). The lower and upper quartiles are 60 and 230 ms, respectively.
The motivation behind using the voiced region as a time unit is that the voicing process, which is influenced by the emotional modulation, directly determines voiced and unvoiced regions. Therefore, analysis at this level may shed further insight into the emotional influence on the F0 contour that is not evident from the sentence-level analyses. From a practical viewpoint, voiced regions are easier to segment than other short time units, which require forced alignment (word and syllable) or syllable stress detection (foot). In real-time applications, in which the audio is continuously recorded, this approach has the advantage that smaller buffers are required to process the audio. Also, it does not require pre-segmenting the input speech into utterances. Both sentence- and voiced-level pitch features are analyzed in this study.

For the sake of generalization, the results presented in this section are based on four different emotional databases (three for training and testing, and one for validation) recorded by different research groups and spanning different emotional categories (Table 4.4). Therefore, some degree of variability in the recording settings and the emotional elicitation is included in the analysis. Instead of studying the pitch contour in terms of emotional categories, the analysis is simplified to a binary problem in which emotional speech is contrasted with neutral speech (i.e., neutral versus emotional speech). This approach has the advantage of being independent of the emotional descriptors (emotional categories or attributes), and it is useful for many practical applications such as automatic expressive speech mining. In fact, it can be used as a first step in a more sophisticated multi-class emotion recognition system, in which a second-level classification would be used to achieve a finer emotional description of the speech.

Notice that the concept of neutral speech is not clear due to speaker variability. To circumvent this problem, we propose the use of a neutral (i.e., non-emotional) reference corpus recorded from many speakers (Table 4.4). This neutral speech reference will be used to contrast the speech features extracted from the emotional databases (Section 4.1.4.3), to normalize the energy and the pitch contour for each speaker (Section 4.1.4.2), and to build the neutral models for emotional versus non-emotional classification (Section 4.1.4.6).

Databases

In this study, five databases are considered: one non-emotional corpus used as a neutral speech reference, and four emotional databases with different properties. A summary of the databases is given in Table 4.4.

Table 4.4: Summary of the databases (neu = neutral, ang = anger, hap = happiness, sad = sadness, bor = boredom, dis = disgust, fea = fear, anx = anxiety, pan = panic, anh = hot anger, anc = cold anger, des = despair, ela = elation, int = interest, sha = shame, pri = pride, con = contempt, sur = surprise).

      Corpus   Data type   Use          Spo/act   Language   # spe.   # utte.   Emotions
      WSJ1     neu.        Reference    spont.    English    50       8104      neu
      EMA      emo.        Train/test   acted     English    3        688       neu, ang, hap, sad
      EPSAT    emo.        Train/test   acted     English    8        4738      neu, hap, sad, bor, dis, anx, pan, anh, anc, des, ela, int, sha, pri, con
      GES      emo.        Train/test   acted     German     10       535       neu, ang, hap, sad, bor, dis, fea
      SES      emo.        Validation   acted     Spanish    1        266       neu, ang, hap, sad, sur

The corpus considered in this study as the neutral (i.e., non-emotional) reference database is the Wall Street Journal-based Continuous Speech Recognition Corpus Phase II (WSJ) [145].
This corpus, which we will refer to from here on as WSJ1, comprises read and spontaneous speech based on Wall Street Journal articles. For our purposes, only the spontaneous portion of the data was considered, which was recorded by fifty journalists with varying degrees of dictation experience. In total, more than eight thousand spontaneous utterances were recorded. Notice that in Section 4.1.3.1 the read-speech TIMIT database was used as the reference [75]. Since our ultimate goal is to build a robust neutral model for contrasting and recognizing emotion in real-life applications, this spontaneous corpus was preferred.

For the analysis and the training of the models (Sections 4.1.4.3, 4.1.4.4, and 4.1.4.5), three emotional corpora were considered. These emotional databases were chosen to span different emotional categories, speakers, genders, and even languages, with the purpose of including, to some extent, the variability found in the pitch. The first database was collected at the University of Southern California (USC) using an electromagnetic articulography (EMA) system [119]. In this database, which will be referred to from here on as EMA, one male and two female subjects (two of them with formal theatrical vocal training) read 10 sentences five times, portraying four emotional states: sadness, anger, happiness, and the neutral state. Further information about this corpus is given in Section 4.1.3.1.

The second emotional corpus corresponds to the Emotional Prosody Speech and Transcripts database (EPSAT) [122]. This database was collected at the University of Pennsylvania and is comprised of recordings from eight professional actors (five female and three male) who were asked to read short, semantically neutral utterances, corresponding to dates and numbers, expressing fourteen emotional categories (Table 4.4). The emotional states were elicited by providing examples of a situation for each emotional category. This database provides a wide spectrum of emotional manifestations.

The third emotional corpus is the Database of German Emotional Speech (GES), which was collected at the Technical University of Berlin [21]. This database was recorded from ten participants, five female and five male, who were selected based on the naturalness and the emotional quality of their performance in audition sessions. The emotional categories considered in the database are anger, happiness, sadness, boredom, disgust, fear, and the neutral state. While the previous databases were recorded in English, this database was recorded in German. Although each language influences the shape of the pitch contour differently, we hypothesize that emotional pitch modulation can still be measured and quantified using English neutral pitch models. The assumption is that the fundamental frequency in English and German will share similar patterns. The results presented in Section 4.1.4.6 give some evidence for this hypothesis.

In addition, a fourth emotional database is used in Section 4.1.4.6 to evaluate the robustness of the pitch neutral models. Since the most discriminative F0 features are selected from the analysis presented in Sections 4.1.4.3 and 4.1.4.4, this database will be used to assess whether the emotional discrimination from this set of features extends to other corpora. This validation corpus corresponds to the Spanish Emotional Speech database (SES), which was collected from one professional actor with a Castilian accent at the Polytechnic University of Madrid [136].
The emotions considered in this database are anger, happiness, sadness, surprise, and the neutral state.

Speaker-dependent normalization

Normalization is a critical step in emotion recognition. The goal is to eliminate speaker and recording variability while keeping the emotional discrimination. For this analysis, a two-step approach is proposed: energy normalization (Eq. 4.4) and pitch normalization (Eq. 4.5). In the first step, the speech files are scaled such that the average RMS energy of the neutral reference database (E_{ref}) and of the neutral subset of the emotional databases (E^s_{neu}) are the same for each speaker s. This normalization is separately applied for each subject in each database. The goal of this normalization is to compensate for different recording settings among the databases.

S^s_{Energy} = \sqrt{ E_{ref} / E^s_{neu} }    (4.4)

In the second step, the pitch contour is normalized for each subject (speaker-dependent normalization). The average pitch across speakers in the neutral reference database, F0_{ref}, is estimated. Then, the average pitch value for the neutral set of the emotional databases is estimated for each speaker, F0^s_{neu}. Finally, a scaling factor (S^s_{F0}) is estimated by taking the ratio between F0_{ref} and F0^s_{neu}, as shown in Equation 4.5. Therefore, the neutral samples of each speaker in the databases will have a similar F0 mean value.

S^s_{F0} = F0_{ref} / F0^s_{neu}    (4.5)

One assumption made in this two-step approach is that neutral speech will be available for each speaker. For real-life applications, this assumption is reasonable when either the speakers are known or a few seconds of their neutral speech can be pre-recorded. In the worst-case scenario, at least gender normalization should be applied [150]. Notice that these scaling factors will not affect the emotional discrimination in the speech, since the differences in the energy and the pitch contour across emotional categories will be preserved.

Pitch features

The pitch contour was extracted with the Praat speech processing software [14], using an autocorrelation method. The analysis window was set to 40 milliseconds with an overlap of 30 milliseconds, producing 100 frames per second. The pitch was smoothed to remove any spurious spikes by using the corresponding option provided by the Praat software.

Table 4.5 describes the statistics estimated from the pitch contour and from the derivative of the pitch contour. These statistics are grouped into sentence-level and voiced-level features, as defined in Section 4.1.4.2. These are the statistics that are commonly used in related work to recognize emotions from the pitch. Note that only voiced regions with more than 4 frames (more than 40 ms) are considered, to obtain reliable statistics. Likewise, the kurtosis and skewness, in which the third and fourth moments about the mean need to be estimated, are not estimated over the voiced-level segments. As mentioned in Section 4.1.4.2, the average duration of the voiced segments is 167 ms (16.7 frames); therefore, there are not enough samples to robustly estimate these statistics.

Table 4.5: Sentence- and voiced-level features extracted from the F0.

                                  Sentence-level statistic       Voiced-level statistic
      Description                 F0          F0 derivative      F0         F0 derivative
      F0 mean                     Smean       Sdmean             Vmean      Vdmean
      F0 standard deviation       Sstd        Sdstd              Vstd       Vdstd
      F0 range                    Srange      Sdrange            Vrange     Vdrange
      F0 minimum                  Smin        Sdmin              Vmin       Vdmin
      F0 maximum                  Smax        Sdmax              Vmax       Vdmax
      F0 median                   Smedian     Sdmedian           Vmedian    Vdmedian
      F0 lower quartile           SQ25        SdQ25              VQ25       VdQ25
      F0 upper quartile           SQ75        SdQ75              VQ75       VdQ75
      F0 interquartile range      Siqr        Sdiqr              Viqr       Vdiqr
      F0 kurtosis                 Skurt       Sdkurt             **         **
      F0 skewness                 Sskew       Sdskew             **         **
      F0 slope                    **          **                 Vslope     **
      F0 curvature                **          **                 Vcurv      **
      F0 inflexion point          **          **                 Vinfle     **
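The sentence-level column of Table 4.5 can be computed directly from an F0 contour sampled at 100 frames per second (e.g., exported from Praat). The sketch below is illustrative only: it assumes unvoiced frames are marked with zero, and, as a simplification, it computes the derivative statistics over the concatenated voiced frames.

# Minimal sketch (illustrative only) of the sentence-level statistics in Table 4.5,
# computed from an F0 contour sampled at 100 frames/s. Unvoiced frames are
# assumed to carry the value 0.
import numpy as np
from scipy.stats import kurtosis, skew

def sentence_level_features(f0):
    voiced = f0[f0 > 0]                    # keep voiced frames only
    d = np.diff(voiced)                    # F0 derivative (simplified: voiced frames concatenated)
    feats = {}
    for prefix, x in (("S", voiced), ("Sd", d)):
        q25, q75 = np.percentile(x, [25, 75])
        feats.update({
            prefix + "mean":   np.mean(x),
            prefix + "std":    np.std(x),
            prefix + "range":  np.max(x) - np.min(x),
            prefix + "min":    np.min(x),
            prefix + "max":    np.max(x),
            prefix + "median": np.median(x),
            prefix + "Q25":    q25,
            prefix + "Q75":    q75,
            prefix + "iqr":    q75 - q25,
            prefix + "kurt":   kurtosis(x),
            prefix + "skew":   skew(x),
        })
    return feats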
Describing the pitch shape for emotional modulation analysis is a challenging problem, and different approaches have been proposed. The Tones and Break Indices system (ToBI) is a well-known technique to transcribe prosody (or intonation) [165]. Although progress has been made toward automatic ToBI transcription [4], an accurate and more complete prosodic transcription requires hand labeling. Furthermore, linguistic models of intonation may not be the most appropriate labels to describe the emotions [7]. Taylor has proposed an alternative pitch contour parameterization called the Tilt intonation model [169]. In this approach, the pitch contour needs to be pre-segmented into intonation events. However, there is no straightforward or readily available system to estimate these segments. Given these limitations, we follow an approach similar to the one presented by Grabe et al. [78]. The voiced regions, which are automatically segmented from the pitch values, are parameterized using polynomials. This parameterization captures the local shape of the F0 contour with few parameters, which provide a clear physical interpretation of the curves. Here, the slope (a_1), curvature (b_2), and inflexion (c_3) are estimated to capture the local shape of the pitch contour by fitting a first-, second-, and third-order polynomial to each voiced region segment (Eqs. 4.6, 4.7, and 4.8).

y = a_1 x + a_0    (4.6)

y = b_2 x^2 + b_1 x + b_0    (4.7)

y = c_3 x^3 + c_2 x^2 + c_1 x + c_0    (4.8)

Table 4.6 shows additional sentence-level statistics derived from the voiced-level features. These statistics provide insights about the local dynamics of the pitch contour. For example, while the pitch range at the sentence level (Srange) gives the extreme-value distance of the pitch contour over the entire sentence, SVmeanRange, the mean of the ranges of the voiced regions, will indicate whether the voiced regions have a flat or inflected shape. Likewise, some of these features will inform global patterns. For instance, the feature SVmeanSlope is highly correlated with the declination or global trend of the pitch contour, which previous studies have reported to convey emotional information [141, 182].

Table 4.6: Additional sentence-level F0 features derived from the statistics of the voiced region patterns.

      Global statistic derived from voiced segments       Value
      Mean of the voiced segment ranges                   SVmeanRange
      Mean of the voiced segment maximums                 SVmeanMax
      Mean of the voiced segment minimums                 SVmeanMin
      Mean of the voiced segment lower quartiles          SVmeanQ25
      Mean of the voiced segment upper quartiles          SVmeanQ75
      Mean of the voiced segment interquartile ranges     SVmeanIqr
      Mean of the voiced segment slopes                   SVmeanSlope
      Mean of the voiced segment curvatures               SVmeanCurv
      Mean of the voiced segment inflexions               SVmeanInfle
      Max. of the voiced segment slopes                   SVmaxSlope
      Max. of the voiced segment curvatures               SVmaxCurv
      Max. of the voiced segment inflexion points         SVmaxInfle
      Max. of the voiced segment means                    SVmaxMean
      Std. of the voiced segment means                    SVstdMean
      Std. of the voiced segment slopes                   SVstdSlope
      Std. of the voiced segment curvatures               SVstdCurv
      Std. of the voiced segment inflexions               SVstdInfle

In sum, 60 pitch features are analyzed (39 sentence-level features and 21 voiced-level features). From here on, the statistics presented in Tables 4.5 and 4.6 are interchangeably referred to as "features", "F0 features", or "pitch features".
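A minimal sketch of the voiced-level parameterization is given below: voiced regions are runs of consecutive frames with F0 greater than zero (at least 4 frames), and the slope, curvature, and inflexion are the leading coefficients of the polynomial fits in Equations 4.6-4.8. The function and variable names are illustrative, not part of the original implementation.

# Minimal sketch (illustrative only): segment voiced regions from an F0 contour
# and compute the shape parameters of Eqs. 4.6-4.8 with polynomial fits.
import numpy as np

def voiced_regions(f0, min_frames=4):
    regions, start = [], None
    for i, v in enumerate(np.append(f0, 0)):   # trailing 0 closes the last run
        if v > 0 and start is None:
            start = i
        elif v <= 0 and start is not None:
            if i - start >= min_frames:
                regions.append(f0[start:i])
            start = None
    return regions

def shape_features(region):
    x = np.arange(len(region))
    slope = np.polyfit(x, region, 1)[0]        # a_1 in Eq. 4.6
    curvature = np.polyfit(x, region, 2)[0]    # b_2 in Eq. 4.7
    inflexion = np.polyfit(x, region, 3)[0]    # c_3 in Eq. 4.8
    return {"Vslope": slope, "Vcurv": curvature, "Vinfle": inflexion}

The per-region dictionaries returned by shape_features, together with the per-region counterparts of the statistics in Table 4.5, can then be averaged over a sentence to obtain the derived features of Table 4.6 (e.g., SVmeanSlope, SVmaxCurv).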
4.1.4.3 Experiment 1: Comparisons using the Kullback-Leibler Divergence

This section presents our approach to identifying and quantifying the pitch features with a higher level of emotional modulation. Instead of comparing just the means, the distributions of the pitch features extracted from the emotional databases are compared with the distributions of the pitch features extracted from the neutral reference corpus using the KLD [43]. The KLD provides a measure of the distance between two distributions. Although it is not a symmetric metric, it is an appealing approach to robustly estimate the differences between the distributions of two random variables.

The first step is to estimate the distribution of the pitch features for each database, including the neutral reference corpus. Since the KLD is sensitive to the bins used to approximate the distributions, k-nearest neighbor estimation was implemented to estimate the bins [62]. To compare the KLD across features and emotional categories, k was set constant for each distribution (k = 40, empirically chosen). Notice that these feature-dependent, non-uniform bins were estimated considering all the databases, to include the entire range spanned by the features. After the bins were calculated, the distribution p_f^{(d,e)} of each pitch feature f was estimated for each database d and for each emotional category e. The same procedure was used to estimate the distribution of the pitch features in the reference neutral corpus, q_f^{ref}.

The next step is to compute the KLD between the distributions of the emotional databases and the distribution estimated from the reference database. This procedure is repeated for each database and for each emotional category.

D_f^{(d,e)}( q_f^{ref} || p_f^{(d,e)} ) = \sum_{\alpha \in X} q_f^{ref}(\alpha) \log [ q_f^{ref}(\alpha) / p_f^{(d,e)}(\alpha) ]    (4.9)

A good pitch feature for emotion discrimination would ideally have D_f^{(d,neutral)} close to zero (the neutral speech of database d is similar to the reference corpus) and a high value for D_f^{(d,e)}, where e is any emotional category except the neutral state. Notice that if D_f^{(d,neutral)} and D_f^{(d,e)} both have high values, this test would indicate that the speech from the emotional database is different from the reference database (how neutral is the neutral speech?). Likewise, if both values were similar, this feature would not be relevant for emotion discrimination. Therefore, instead of directly comparing the KLD values, we propose to estimate the ratio between D_f^{(d,e)} and D_f^{(d,neutral)} (Equation 4.10). That is, after matching the feature distributions with the reference feature distributions, the emotional speech is directly compared with the neutral set of the same emotional database by taking the ratio. High values of this ratio will indicate that the pitch features for emotional speech are different from their neutral counterparts, and therefore are relevant to discriminate emotional speech from neutral speech.

r_f^{(d,e)} = D_f^{(d,e)} / D_f^{(d,neutral)}    (4.10)
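The following sketch illustrates the KLD ratio of Equations 4.9 and 4.10 for a single pitch feature. Note that the actual analysis uses k-nearest-neighbor (non-uniform) bins; for brevity, this illustrative version uses shared quantile bins computed over the pooled data, and all names are assumptions.

# Minimal sketch (illustrative only) of the KLD ratio in Eqs. 4.9-4.10 for one
# pitch feature, using shared quantile bins instead of the k-NN bins of the text.
import numpy as np

def kld(q, p, eps=1e-10):
    q, p = q + eps, p + eps                     # avoid log(0) / division by zero
    q, p = q / q.sum(), p / p.sum()
    return np.sum(q * np.log(q / p))

def kld_ratio(feat_ref, feat_neutral, feat_emotional, n_bins=40):
    pooled = np.concatenate([feat_ref, feat_neutral, feat_emotional])
    bins = np.quantile(pooled, np.linspace(0, 1, n_bins + 1))   # shared bin edges
    q_ref, _ = np.histogram(feat_ref, bins=bins)
    p_neu, _ = np.histogram(feat_neutral, bins=bins)
    p_emo, _ = np.histogram(feat_emotional, bins=bins)
    # Eq. 4.10: ratio of D(ref || emotional) to D(ref || neutral)
    return kld(q_ref.astype(float), p_emo.astype(float)) / \
           kld(q_ref.astype(float), p_neu.astype(float))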
Figure 4.7 shows the average ratio between the emotional and neutral KLD obtained across databases and emotional categories. The pitch features with the highest values are SVmeanMin, SVmeanMax, Sdiqr, and Smean among the sentence-level features, and Vrange, Vstd, Vdrange, and Vdiqr among the voiced-level features. As further discussed in Section 4.1.4.5, these results indicate that gross statistics of the F0 contour are more emotionally salient than the features describing the pitch shape itself. In Section 4.1.4.6, the top features from this experiment will be used for binary emotion classification.

Figure 4.7: Most emotionally prominent features according to the average KLD ratio between features derived from emotional and neutral speech. The figures show the sentence-level (top) and voiced-level (bottom) features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.

Figures 4.8, 4.9, and 4.10 show the results for the EMA, EPSAT, and GES databases, respectively. For the sake of space, these figures only display the results for the emotions anger, happiness, and sadness. They also include the average ratio across the emotional categories for each database (Emo). The figures show that the rank of the most prominent pitch features varies according to the emotional database. By analyzing different corpora, we hypothesize that the reported results will give more general insights about the emotionally salient aspects of the fundamental frequency. These figures also reveal that some emotional categories with high activation levels (i.e., high arousal), such as anger and happiness, are clearly distinguished from neutral speech using pitch-related features. However, subdued emotional categories such as sadness present pitch characteristics similar to neutral speech. This result agrees with the hypothesis that emotional pitch modulation is triggered by the activation level of the sentence [7], as mentioned in Section 4.1.4.1. Further discussion about the pitch features is given in Section 4.1.4.5.

Figure 4.8: Average KLD ratio between pitch features derived from emotional and neutral speech from the EMA corpus. The label Emo corresponds to the average results across all emotional categories. In order to keep the y-axis fixed, some of the bars were clipped. The first 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.

4.1.4.4 Experiment 2: Logistic regression analysis

The experiments presented in Section 4.1.4.3 provide insight about the pitch features from expressive speech that differ from their neutral counterparts. However, they do not directly indicate the discriminative power of these features. This section addresses this question with the use of logistic regression analysis [89].

All the experiments reported in this section correspond to binary classification (neutral versus emotional speech). Unlike Section 4.1.4.6, the emotional databases are analyzed separately. The neutral reference corpus is not used in this section. The emotional categories are also separately compared with neutral speech (i.e., neutral-anger, neutral-happiness).

Logistic regression is a well-known technique to model binary or dichotomous variables. In this technique, the conditional expectation of the variable given the input variables is modeled with the specific form described in Equation 4.11.
After applying 132 0 10 Emo 0 10 Anh 0 10 Hap 0 10 Sad VSmeanMin VSmeanMax Sdirq Smean VSmeanQ25 VSmeanRange SdQ25 Smin VSmeanQ75 SdQ75 Vrange Vstd Vdrange Vdiqr Vdstd Viqr Vmean Vmedian VQ25 Vmax Figure 4.9: Average KLD ratio between pitch features derived from emotional and neu- tral speech from the EPSAT corpus. The label Emo corresponds to the average results across all emotional categories. Only the emotional categories hot anger, happiness and sadness are displayed. In order to keep the y-axis xed, some of the bars were clipped. The rst 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6. the logit transformation (Eq. 4.12), the regression problem becomes linear in its pa- rameters ( 0 ;:::; n ). A nice property of this technique is that the signicance of the coecients can be measured using the log-likelihood ratio test between two nested mod- els (the input variables of one model are included in the other model). This procedure provides estimates about the discriminative power of each input feature. E(Yjf 1 ;:::;f n ) =(x) = e 0 + 1 x 1 +:::nxn 1+e 0 + 1 x 1 +:::nxn (4.11) g(x) = ln (x) 1(x) = 0 + 1 x 1 +::: n x n (4.12) Experiment 2.1: The rst experiment was to run logistic regression with only one pitch feature in the model at a time. The procedure is repeated for each emotional cat- egory. The goal of this experiment is to quantify the discriminative power of each pitch feature. This measure is estimated in terms of the improvement in the log-likelihood 133 0 10 Emo 0 10 Ang 0 10 Hap 0 10 Sad SVmeanMin VSmeanMax Sdirq Smean SVmeanQ25 VSmeanRange SdQ25 Smin SVmeanQ75 SdQ75 Vrange Vstd Vdrange Vdiqr Vdstd Viqr Vmean Vmedian VQ25 Vmax Figure 4.10: Average KLD ratio between pitch features derived from emotional and neutral speech from the GES corpus. The label Emo corresponds to the average results across all emotional categories. Only the emotional categories anger, happiness and sadness are displayed. In order to keep the y-axis xed, some of the bars were clipped. The rst 10 bars correspond to sentence-level features, and the last 10 to voiced-level features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6. of the model when a new variable is added (the statistic x=-2 log-likelihood ratio is approximately chi-square distributed, and can be used for hypothesis testing). Fig- ure 4.11 gives the average log-likelihood improvement across the emotional categories and databases for the top-15 sentence- and voiced-level features. The pitch features with higher score are Smedian, Smean, SVmeanQ75, and SQ75 for the sentence-level features, and VQ75, Vmean, Vmedian, and Vmax for the voiced-level feature. These features will also be considered for binary emotion recognition in Section 4.1.4.6. Al- though the order in the ranking in the F0 features is dierent in Figures 4.7 and 4.11, eight sentence- and voiced-level features are included among the top-ten features ac- cording to both criteria (experiments 1 and 2.1). This result shows the consistency of the two criteria used to identify the most emotionally salient aspects of the F0 contour (the F0 features with higher emotional/neutral KLD ratio are supposed to provide more discriminative information in the logistic regression models). 
Experiment 2.2: Some of the pitch features provide overlapping information or are highly correlated. Since the pitch features were individually analyzed in experiment 2.1, these important issues were not addressed there. Therefore, a second experiment was designed to answer this question, which is important for classification. Logistic regression analysis is used with forward feature selection (FFS) to discriminate between each emotional category and the neutral state (i.e., neutral-anger). Here, the pitch features are sequentially included in the model until the log-likelihood improvement given the new variable is not significant (chi-square statistical test). In each case, the samples are split into training (70%) and testing (30%) sets.

Figure 4.12 gives the pitch features that were most often selected in each of the 26 logistic regression tests (see Table 4.7). This figure provides insights about some pitch features that may not be good enough when considered alone, but that give supplementary information to other pitch features. Notice that in each of these experiments, the pitch features were selected to maximize the performance of that specific task. The goal of analyzing the selected features across emotional categories and databases is to identify pitch features that can be robustly used to discriminate between emotional and neutral speech in a more general fashion (for generalization).

Figure 4.12: Most frequently selected features in the logistic regression models using forward feature selection. The figures show the sentence-level (top) and voiced-level (bottom) features. The nomenclature of the F0 features is given in Tables 4.5 and 4.6.

The pitch features that were most often selected in the logistic regression experiments reported in Figure 4.12 are Smedian, Sdmedian, SVmeanRange, and SVmaxCurv among the sentence-level features, and Vcurv, Vmin, Vmedian, and VQ25 among the voiced-level features. This experiment reveals interesting results. For example, Smedian and Smean were seldom selected together, since they are highly correlated (approximately 0.96). In fact, while Smean was the second best feature according to experiment 2.1 (Fig. 4.11), it is not even among the top 15 features according to this criterion (Fig. 4.12).
Classication Power of the model Acc Pre Rec F Bas -2 Log Cox & Nagelkerke likelihood Snell R 2 R 2 EMA Anger 91.9 92.0 93.9 92.9 58.1 66.6 0.675 0.901 Happiness 96.0 95.3 98.4 96.8 64.0 16.6 0.729 0.976 Sadness 68.2 63.5 78.6 70.2 59.1 314.7 0.124 0.166 Emotional 87.1 91.0 93.2 92.1 82.2 370.3 0.341 0.491 GES Anger 98.1 97.1 100 98.5 67.3 0.0 0.740 1.000 Happiness 95.1 91.7 100 95.7 58.5 5.2 0.734 0.983 Sadness 78.4 60.0 100 75.0 54.1 67.7 0.494 0.666 Boredom 69.0 64.0 80.0 71.1 59.5 56.8 0.590 0.787 Disgust 87.1 78.6 91.7 84.6 54.8 34.3 0.596 0.824 Fear 92.9 88.0 100 93.6 59.5 25.5 0.671 0.903 Emotional 83.1 93.0 88.7 90.8 89.4 249.2 0.191 0.324 EPSAT Happiness 87.6 72.7 0.816 76.9 75.1 156.1 0.537 0.795 Sadness 80.9 25.0 0.714 37.0 77.5 452.2 0.088 0.130 Boredom 77.3 27.9 0.546 36.9 76.2 426.5 0.196 0.285 Disgust 84.0 57.1 0.757 65.1 73.8 356.5 0.322 0.466 Fear 74.7 15.9 0.438 23.3 75.8 480.1 0.133 0.192 Panic 97.1 93.8 0.909 92.3 81.2 46.8 0.638 0.946 Hot anger 97.1 97.1 0.895 93.1 79.8 37.7 0.635 0.954 Cold anger 79.0 52.1 0.610 56.2 74.2 218.4 0.458 0.683 Despair 79.3 50.0 0.732 59.4 69.7 276.4 0.368 0.553 Elation 94.5 86.7 0.907 88.6 75.4 62.5 0.625 0.927 Interest 85.6 68.4 0.796 73.6 70.8 254.2 0.430 0.634 Shame 73.4 6.5 0.333 10.9 75.0 438.2 0.077 0.115 Pride 83.3 54.8 0.677 60.5 76.7 318.5 0.296 0.446 Contempt 76.4 33.3 0.704 45.2 70.8 465.0 0.148 0.214 Emotional 80.5 97.1 0.824 89.2 82.6 1371.4 0.184 0.305 contrast, other features that were not relevant when they were individually included in the model appear to provide important supplementary information (e.g., SVmeanCurv and Vcurv). Further discussion about the features is given in Section 4.1.4.5. Tables 4.7 and 4.8 provide details about the logistic regression experiments per- formed with FFS for the sentence- and voiced-level features, respectively. In the tables, we highlight the cases when the t of the logistic regression models was considered ade- quate, according to the Nagelkerke r-square statistic (r 2 > 0:4, empirically chosen) [138]. These tables show that some emotional categories cannot be discriminated from neutral 137 speech based on these pitch features (e.g., sadness, boredom, shame). The tables also re- veal that voiced-level features provide emotional information, but the performance is in general worse than the sentence-level features. This result indicates that voiced segment region may not be long enough to capture the emotional information. An alternative hypothesis is that not all the voiced region segments present measurable emotional mod- ulation, since the emotion is not uniformly distributed in time (Section 3.3). In fact, previous studies suggest that the patterns displayed by the pitch at the end of the sen- tences are important for emotional categories such as happiness [158, 182]. Therefore, the confusion in the classication task may increase by considering each voiced region as a sample. 4.1.4.5 Analysis of pitch features On the one hand, the results presented in the previous sections reveal that pitch statis- tics such as the mean/median, maximum/upper quartile, minimum/lower quartile, and range/interquartile range, are the most emotionally salient pitch features. On the other hand, features that describe the pitch contour shape such as the slope, curvature and in exion, do not seem to convey the same measurable level of emotional modulation. These results indicate that the continuous variations of pitch level are the most salient aspects that are modulated in expressive speech. 
4.1.4.5 Analysis of pitch features

On the one hand, the results presented in the previous sections reveal that pitch statistics such as the mean/median, maximum/upper quartile, minimum/lower quartile, and range/interquartile range are the most emotionally salient pitch features. On the other hand, features that describe the pitch contour shape, such as the slope, curvature, and inflexion, do not seem to convey the same measurable level of emotional modulation. These results indicate that the continuous variations of the pitch level are the most salient aspects that are modulated in expressive speech. These results agree with previous findings reported in [7, 111], which indicate that pitch global statistics such as the mean and range are more emotionally prominent than the pitch shape itself, which is more related to the verbal context of the sentence [158].

The results of experiment 1 indicate that the standard deviation and its derivative convey measurable emotional information at the voiced-level analysis (Vstd, Fig. 4.7). This result agrees with the finding reported by Lieberman and Michaels, which suggested that fluctuations in short-time segments are indeed important emotional cues [123]. Notice that in experiments 2.1 and 2.2 reported in Section 4.1.4.4, Vstd is among the top-10 best features (Figs. 4.11 and 4.12).

The results in Figure 4.12 suggest that the curvature of the pitch contour is affected during expressive speech. Although SVmaxCurv and Vcurv were never selected as the first feature in the FFS algorithm, they are among the most frequently selected features in the sentence- and voiced-level logistic regression experiments. These results indicate that these features provide supplementary emotional information that can be used for classification purposes. For other applications such as expressive speech synthesis, changing the curvature may not significantly change the emotional perception of the speech. This result agrees with the finding reported by Bulut and Narayanan [18].

The analysis also reveals that sentence-level features derived from voiced segment statistics (Table 4.6) are important. From the top-5 sentence-level features in Figures 4.7, 4.11, and 4.12, six out of twelve features correspond to global statistics derived from voiced segments. This result suggests that variations between voiced regions convey measurable emotional modulation. Features derived from the pitch derivative are not as salient as the features derived from the pitch itself. Also, SVmeanSlope, which is related to the global trend of the pitch highlighted by Wang et al. and Paeschke [141, 182], does not seem to be an emotionally salient feature.

To build the neutral models for binary emotion recognition (Section 4.1.4.6), a subset of the pitch features was selected. Instead of finding the best features for that particular task, we decided to pre-select the top-5 sentence- and voiced-level features based on the results from experiments 1, 2.1, and 2.2 presented in Sections 4.1.4.3 and 4.1.4.4 (Figs. 4.7, 4.11, and 4.12). Some of the features were removed from the group since they presented high levels of correlation. The pitch features Sdiqr, Smedian, SQ75, SQ25, Sdmedian, SVmeanRange, and SVmaxCurv were selected as sentence-level features, and
Also, SVmeanSlope, which is related to the pitch global trend, does not seem to be an emotionally salient feature, as suggested by Wang et al. and Paeschke [141, 182]. To build the neutral models for binary emotion recognition (Section 4.1.4.6), a subset of the pitch features was selected. Instead of nding the best features for that particular task, we decided to pre-select the top-5 sentence- and voiced-level features based on results from experiments 1, 2.1 and 2.2 presented in Sections 4.1.4.3 and 4.1.4.4 (Figs. 4.7, 4.11, and 4.12). Some of the features were removed from the group since they presented high levels of correlation. The pitch features Sdirq, Smedian, SQ75, SQ25, Sdmedian, SVmeanRange, and SVmaxCurv were selected as sentence-level features, and 140 Table 4.9: Correlation of the selected pitch features. Sentence-level features Sdirq Smedian SQ75 SQ25 Sdmedian SVmeanRange SVmaxCurv Sdirq 1.000 0.709 0.751 0.641 -0.211 0.668 -0.107 Smedian 0.709 1.000 0.897 0.956 -0.268 0.520 -0.227 SQ75 0.751 0.897 1.000 0.834 -0.252 0.575 -0.176 SQ25 0.641 0.956 0.834 1.000 -0.248 0.455 -0.224 Sdmedian -0.211 -0.268 -0.252 -0.248 1.000 -0.166 0.098 SVmeanRange 0.668 0.520 0.575 0.455 -0.166 1.000 0.178 SVmaxCurv -0.107 -0.227 -0.176 -0.224 0.098 0.178 1.000 Voiced-level features Vstd Vdrange Vdiqr VQ75 Vmedian Vmax Vcurv Vstd 1.000 0.910 0.789 0.477 0.231 0.656 0.152 Vdrange 0.910 1.000 0.800 0.491 0.291 0.685 0.053 Vdiqr 0.789 0.800 1.000 0.617 0.428 0.664 -0.082 VQ75 0.477 0.491 0.617 1.000 0.952 0.937 -0.394 Vmedian 0.231 0.291 0.428 0.952 1.000 0.855 -0.525 Vmax 0.656 0.685 0.664 0.937 0.855 1.000 -0.276 Vcurv 0.152 0.053 -0.082 -0.394 -0.525 -0.276 1.000 Vstd, Vdrange, Vdiqr, VQ75,Vmedian,Vmax, and Vcurv were selected as voiced-level features. Table 4.9 gives the correlation matrix between these features. Only few pitch features present high levels of correlation. These variables were not removed since our preliminary results indicated that they improve the recognition accuracy. 4.1.4.6 Emotional discrimination results using neutral models In this section, the ideas are extended to build neutral models for the selected sentence- and voiced-level pitch features (Table 4.9). As mentioned in Section 4.1.2, the second step in the proposed approach depends on the emotional database, the speakers, and the emotional categories (Fig. 4.1). To overcome this limitation, the three emotional databases (EMA, EPSAT, and GES) were combined to train a semi-corpus-independent classier. Notice that this binary recognition task is more challenging than the logistic regression analysis presented in Section 4.1.4.4, since the emotional corpora are jointly used, and all the emotional categories (without neutral state) are grouped into a single category (emotional). 141 To build the neutral reference models, a Gaussian Mixture Model (GMM) is proposed to model each of the selected pitch features. For a given input speech, the likelihoods of the models are used as a tness measures. In the second step, a Linear Discrimi- nant Classier (LDC) was implemented to discriminate between neutral and expressive speech. While more sophisticated non-linear classiers may give better accuracy, this linear classier was preferred for the sake of generalization. The recognition results presented in this section are the average values over 400 real- izations. Since the emotional categories are grouped together, the number of emotional samples is higher than the neutral samples. 
The recognition results presented in this section are average values over 400 realizations. Since the emotional categories are grouped together, the number of emotional samples is higher than the number of neutral samples. Therefore, in each of the 400 realizations, the emotional samples were randomly drawn to match the number of neutral samples. Thus, for the experiments presented here, the priors were set equal for the neutral and emotional classes (baseline = 50%). Then, the selected samples were split into training and testing sets (70% and 30%, respectively). Notice that the three emotional corpora are considered together.

Given that some emotional categories are confused with neutral speech in the pitch feature space (Section 4.1.4.4), a subset of emotional categories was selected for each database. The criterion was based on the Nagelkerke R-square score of the logistic regression presented in Table 4.7 (R^2 > 0.4). This section presents the results in terms of all emotional categories (All emotions) and of this subset of emotional categories (Selected emotions).

An important parameter of the GMM is the number of mixtures, K. Figure 4.13 shows the performance of the GMM-based pitch neutral models for different numbers of mixtures. The figure shows that the proposed approach is not sensitive to this parameter. For the rest of the experiments, K was set to 2.

Figure 4.13: Performance of the proposed neutral model approach in terms of the number of mixtures in the GMM. The results are not sensitive to this variable. For the rest of the experiments K = 2 was selected.

Table 4.10 presents the performance of the proposed approach for the sentence- and voiced-level features. When all the emotional categories are used, the accuracy reaches 77.31% for the sentence-level features and 72% for the voiced-level features. These values increase by approximately 5% when only the selected emotional categories are considered. Notice that only pitch features are used, so these values are notably high compared to the baseline (50%).

Table 4.10: Performance of the proposed neutral model approach. The performance of the conventional LDC classifier (without neutral models) for the same task is also presented.

                             Neutral model             Conventional scheme
                          sentence    voiced           sentence    voiced
      All emotions         77.31%     72.00%            74.67%     70.96%
      Selected emotions    81.88%     76.19%            80.36%     75.42%

For comparison, Table 4.10 also presents the results for the same task using the pitch statistics as features, without the neutral models (i.e., without the first step of the proposed approach, as described in Fig. 4.14). This classifier, which is similar to the conventional frameworks used to discriminate emotions, was also implemented with an LDC. The table shows that the proposed approach achieves better performance than the conventional approach in each of the four conditions (sentence/voiced-level features; all/selected emotional categories). A paired-samples t-test was computed over the 400 realizations to measure whether the differences between these two approaches are statistically significant. The results indicate that the classifier trained with the likelihood scores (proposed approach) is significantly better than the one trained with the F0 features (conventional approach) in each of the four conditions (p-value << 0.001). Later, the neutral model approach will be compared with the conventional LDC classifier in terms of robustness.

Figure 4.14: Conventional classification scheme for automatic emotion recognition. Speech features are directly used as input to the classifier, instead of the fitness measures estimated from the neutral reference models (Fig. 4.1).

Table 4.11: Performance of the proposed neutral model approach for each emotional database.
                    Sentence level (all emotions)         Voiced level (all emotions)
                  Acc     Pre     Rec     F             Acc     Pre     Rec     F
      EMA         0.865   0.726   0.921   0.812         0.742   0.622   0.707   0.662
      GES         0.809   0.779   0.867   0.821         0.711   0.688   0.799   0.739
      EPSAT       0.740   0.733   0.757   0.745         0.710   0.623   0.777   0.691

                    Sentence level (selected emotions)    Voiced level (selected emotions)
                  Acc     Pre     Rec     F             Acc     Pre     Rec     F
      EMA         0.905   0.808   0.942   0.870         0.797   0.716   0.740   0.727
      GES         0.799   0.760   0.916   0.831         0.703   0.670   0.854   0.751
      EPSAT       0.798   0.805   0.792   0.798         0.773   0.727   0.790   0.757

In Table 4.11, the results of the proposed approach are disaggregated in terms of the databases (notice that three different emotional databases are used for training and testing). An interesting result is that the recall rate is in general high, which means that there are not many neutral samples labeled as emotional (false positives). For the sentence-level features, the performance for the EPSAT database is lower than for the other databases. This result might be explained by the short sentences used to record this corpus (Table 4.4), resulting in noisy pitch statistics.

Table 4.12 provides further details about the recognition performance of the proposed approach. In this table, the results are disaggregated in terms of the emotional categories for each database. The results are presented in terms of the precision rate (accuracy and recall values are given in Table 4.11). In most of the cases, the precision rate is equal to or better than the precision rates reported in the logistic regression experiments (Tables 4.7 and 4.8). Notice that this task is significantly harder than the task presented in Section 4.1.4.4, since the emotional categories and the emotional databases were jointly analyzed.

Table 4.12: Precision rate of the proposed neutral model approach for each emotional category.

                              All emotions             Selected emotions
                           sentence    voiced          sentence    voiced
      EMA
      Anger                 0.739      0.668            0.710      0.654
      Happiness             0.917      0.787            0.907      0.777
      Sadness               0.520      0.421            ***        ***
      GES
      Anger                 1.000      0.953            1.000      0.947
      Happiness             0.961      0.887            0.968      0.875
      Sadness               0.463      0.214            0.427      0.189
      Boredom               0.333      0.340            0.226      0.306
      Disgust               0.891      0.698            0.866      0.665
      Fear                  0.920      0.859            0.932      0.848
      EPSAT
      Happiness             0.881      0.792            0.875      0.785
      Sadness               0.555      0.391            ***        ***
      Boredom               0.424      0.338            ***        ***
      Disgust               0.610      0.545            0.606      0.534
      Fear                  0.666      0.436            ***        ***
      Panic                 0.982      0.971            0.980      0.968
      Hot anger             0.987      0.945            0.986      0.945
      Cold anger            0.778      0.648            0.760      0.632
      Despair               0.747      0.656            0.724      0.646
      Elation               0.963      0.928            0.962      0.924
      Interest              0.738      0.659            0.727      0.648
      Shame                 0.545      0.425            ***        ***
      Pride                 0.724      0.563            0.697      0.541
      Contempt              0.730      0.526            ***        ***
  Databases                                  Neutral model                   Conventional scheme
  Training               Testing             Acc    Pre    Rec    F          Acc    Pre    Rec    F        Δ Acc
  E (EPSAT,EMA)          G (GES)             0.802  0.778  0.818  0.798      0.761  0.620  0.864  0.722    4.1%
  G (GES)                E (EPSAT,EMA)       0.751  0.732  0.762  0.746      0.705  0.509  0.837  0.633    4.6%
  E (EPSAT,EMA)          S (SES)             0.782  0.739  0.809  0.772      0.604  0.412  0.668  0.510    17.9%
  G (GES)                S (SES)             0.792  0.708  0.851  0.773      0.686  0.445  0.857  0.586    10.6%
  E-G (EPSAT,EMA,GES)    S (SES)             0.794  0.729  0.838  0.780      0.649  0.420  0.775  0.545    14.5%
Robustness of the neutral model approach
As mentioned before, using neutral models for emotion recognition is hypothesized to be more robust, and therefore to generalize better, than using a direct emotion classification approach. To validate this claim, this section compares the performance of the proposed approach (Fig. 4.1) with the conventional classifier (without neutral models, Fig. 4.14) when there is a mismatch between the training and testing conditions. For this purpose, the emotional databases were separated by language into two groups: English (EPSAT, EMA) and German (GES). One of these groups was used for training, and the other one for testing. The results for the two conditions are given in Table 4.13 for the sentence-level features, and in Table 4.14 for the voiced-level features. Since the samples were randomly drawn to have an equal number of emotional and neutral samples (both in the training and testing sets), the baseline is 50%. The recognition results reported here are also average values over 400 realizations.
For sentence-level F0 features, Table 4.13 shows that the neutral model approach generalizes better than the conventional scheme. In fact, the absolute accuracy improvement over the conventional scheme is over 4%. Even though there is a mismatch between the training and testing conditions, the performance of the proposed approach does not decrease compared to the case when the same corpora are used for training and testing (no mismatch). For instance, Table 4.11 shows that the accuracy for the GES database was 80.9% when there was no training/testing mismatch. Interestingly, Table 4.13 shows that the performance for this database is still over 80% when only the English databases are used for training. When the classifier is trained with the German database and tested with the English databases, the performance is 75.1%. As mentioned before, the EPSAT database presents the lowest performance of the emotional databases considered in this section (74%, Table 4.11). Since this corpus accounts for more than 85% of the English samples (Table 4.4), the lower accuracy observed for the English databases is expected.
Table 4.14: Validating the robustness of the neutral model approach against mismatch between training and testing conditions. Voiced-level features (E=English, G=German, S=Spanish). The last column is the absolute accuracy difference between the two schemes.
  Databases                                  Neutral model                   Conventional scheme
  Training               Testing             Acc    Pre    Rec    F          Acc    Pre    Rec    F        Δ Acc
  E (EPSAT,EMA)          G (GES)             0.710  0.695  0.716  0.705      0.733  0.590  0.827  0.689    -2.4%
  G (GES)                E (EPSAT,EMA)       0.716  0.594  0.787  0.677      0.699  0.520  0.809  0.633    1.8%
  E (EPSAT,EMA)          S (SES)             0.681  0.564  0.736  0.639      0.641  0.333  0.868  0.481    4.0%
  G (GES)                S (SES)             0.692  0.518  0.794  0.627      0.634  0.314  0.872  0.462    5.8%
  E-G (EPSAT,EMA,GES)    S (SES)             0.684  0.555  0.749  0.638      0.641  0.328  0.876  0.477    4.4%
For the voiced-level F0 features, Table 4.14 shows that the performance of the proposed approach is similar to the performance of the system without any mismatch (see Table 4.11). The conventional scheme presents similar performance.
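The following sketch outlines the cross-corpus protocol used in these robustness experiments: the neutral GMM and the LDC are fitted on one language group (e.g., the English corpora EPSAT and EMA) and evaluated on a held-out language (e.g., the German corpus GES), with the metrics matching the columns of Tables 4.13 and 4.14. The variable names and the label coding are hypothetical.

```python
# Sketch of the cross-corpus robustness test: train on one language group,
# evaluate on a held-out language, and report Acc/Pre/Rec/F.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def cross_corpus_eval(neutral_train_stats, X_train, y_train, X_test, y_test):
    """y = 1 for neutral, 0 for emotional (an assumption about label coding)."""
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          random_state=0).fit(neutral_train_stats)
    score = lambda X: gmm.score_samples(X).reshape(-1, 1)   # fitness feature
    ldc = LinearDiscriminantAnalysis().fit(score(X_train), y_train)
    y_hat = ldc.predict(score(X_test))
    return {"Acc": accuracy_score(y_test, y_hat),
            "Pre": precision_score(y_test, y_hat),
            "Rec": recall_score(y_test, y_hat),
            "F":   f1_score(y_test, y_hat)}

# e.g. metrics = cross_corpus_eval(wsj1_neutral_stats,
#                                  X_english, y_english, X_german, y_german)
```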
Notice that the F0 features were selected from the analysis presented in Sections 4.1.4.3 and 4.1.4.4. The EMA, EPSAT and GES databases were considered for that analysis. To assess whether the emotional discrimination observed from these F0 features transfers to other corpora, a fourth emotional database was considered for the final experiments. For this purpose, the SES database is used, which was recorded in Spanish (Table 4.4). Notice that the SES corpus contains the emotional category surprise, which is not included in the training set.
For this experiment, the classifier of the neutral model approach was separately trained with the English (EPSAT, EMA), German (GES), and English and German databases. Tables 4.13 and 4.14 present the results for the sentence- and voiced-level F0 features, respectively. The results indicate that the accuracy of the proposed approach is over 78% for the sentence-level features, and 68% for the voiced-level features. The performance is similar to that achieved with the other emotional databases considered in this section. Interestingly, the performance of the proposed approach is about 15% (absolute) better than the one obtained with the conventional scheme. These results suggest that conventional approaches to automatically recognizing emotions are sensitive to the feature selection process (the most discriminant features from one database may not have the same discriminative power in another corpus). The performance of the proposed approach, however, remains robust against this type of variability.
In Section 4.1.4.2, we hypothesized that neutral speech prosodic models trained with English speech can be used to detect emotional speech in another language. The results presented in Tables 4.13 and 4.14 support this hypothesis. As mentioned in Section 4.1.4.2, the fundamental frequency in languages such as German, English and Spanish is largely independent of the specific lexical content of the utterance. As a result, the proposed neutral speech prosodic models present similar performance regardless of the language of the databases used to train and test the classifier.
4.1.5 Remarks from this section
Toward discriminating emotional from neutral speech, this section proposed the use of neutral models to contrast expressive speech. The approach is implemented with spectral and prosodic speech features.
For the spectral features, HMMs were built with MFB and MFCC features estimated from the TIMIT corpus. The likelihood scores were used as features. The results show that this approach can achieve accuracies up to 78% in the binary emotional classification task. These results suggest that well-trained neutral acoustic models can be effectively used as a front-end for emotion recognition. Interestingly, the models trained with conventional MFCCs are found to perform worse than the models with the original MFBs for both emotional databases, suggesting that MFB-based models will achieve better performance regardless of the speech characteristics.
The results indicated that some broad phonetic classes present more emotional differentiation than others. In general, vowels seem to have more degrees of freedom to convey emotional information than phonemes such as nasal sounds. These observations can be used for recognition by weighting the features extracted from the likelihood scores according to the observed emotional modulation conveyed in them. Notice that the detected phoneme classes are the ones that maximize the likelihood in the Viterbi decoding path.
If the transcripts of the sentences are known, which may not be true for all real-time applications, the phoneme recognition accuracy could also be used as a fitness measure for emotional discrimination. The hypothesis here is that the performance will be higher for neutral speech than for emotional speech. This approach could be extremely useful to automatically find emotional speech portions in larger scripted emotional corpora. Figures 4.2, 4.3, 4.5, and 4.6 reveal that the likelihood scores for the neutral set in the emotional corpus are close to the results for the neutral references. To reduce the mismatches between the neutral corpora even further, the acoustic models in the HMMs can be adapted to the neutral set in the emotional speech.
For the prosodic features, GMMs were trained with seven F0 statistics estimated from the WSJ1 corpus. These features were selected after a thorough analysis of different expressive pitch contour statistics. For this purpose, two experiments were proposed. In the first experiment, the distribution of different pitch features was compared with the distribution of the features derived from neutral speech using KLD. In the second experiment, the emotional discriminative power of the pitch features was quantified within a logistic regression framework. Both experiments indicate that dynamic statistics such as the mean, maximum, minimum and range of the pitch are the most salient aspects of the expressive pitch contour. The statistics were computed at the sentence and voiced-region levels. The results indicate that the system based on sentence-level features outperforms the one with voiced-level statistics both in accuracy and robustness, which facilitates turn-by-turn processing in emotion detection.
After contrasting the input speech with the neutral models, the likelihood scores were used for classification. The approach was trained and tested with three different emotional databases spanning different emotional categories, recording settings, speakers and even languages (English and German). The recognition accuracy of the proposed approach was over 77% (baseline 50%) using only pitch-related features. To validate the robustness of the approach, the system was trained and tested with different databases recorded in three different languages (English, German and Spanish). Although there was a mismatch between the training and testing conditions, the performance of the proposed framework did not degrade. In contrast, the performance of the conventional classifier without the neutral models decreased by up to 17% (absolute) for the same task using the same F0 features. These results show that this system is robust against different speakers, languages and emotional descriptors, and can generalize better than standard emotional classifiers.
Results from the analysis section indicated that emotional modulation is not uniformly distributed, in time and in space, across different communicative channels (Section 3.3). If this trend is also observed in the fundamental frequency, certain regions in the pitch contour might present stronger emotional modulation. We are planning to study this problem by comparing neutral and emotional utterances spoken with the same lexical content. With this approach, we would be able to locally compare the F0 contours between emotional and neutral speech under similar lexical constraints.
In terms of classification, we are planning to expand the proposed approach to include features related to energy and duration.
Likewise, this neutral prosodic model can be combined with the neutral spectral models presented in Section 4.1.3. By considering different emotionally salient aspects of speech, we expect to further improve the accuracy and robustness of the proposed neutral model approach.
Although this framework addresses binary emotion classification, the proposed scheme can be used as a first step in a more sophisticated emotion recognition system. After detecting emotional speech, a second-level classifier can be used to achieve a finer emotional description of the speech.
4.2 Analysis of emotion recognition using facial expressions, speech and multimodal information
4.2.1 Introduction
Inter-personal human communication includes not only spoken language but also non-verbal cues such as hand gestures, facial expressions and tone of the voice, which are used to express feelings and give feedback. However, the new trends in human-computer interfaces, which have evolved from conventional mouse and keyboard to automatic speech recognition systems and special interfaces designed for handicapped people, do not take complete advantage of these valuable communicative abilities, resulting often in a less than natural interaction. If computers could recognize these emotional inputs, they could give specific and appropriate help to users in ways that are more in tune with the user's needs and preferences.
It is widely accepted from psychological theory that human emotions can be classified into six archetypal emotions: surprise, fear, disgust, anger, happiness, and sadness. Facial motion and the tone of the speech play a major role in expressing these emotions. The muscles of the face can be changed, and the tone and the energy in the production of the speech can be intentionally modified, to communicate different feelings. Human beings can recognize these signals even if they are subtly displayed, by simultaneously processing information acquired by ears and eyes. Based on psychological studies, which show that visual information modifies the perception of speech [128], it is possible to assume that human emotion perception follows a similar trend. Motivated by these clues, De Silva et al. conducted experiments in which 18 people were required to recognize emotion using visual and acoustic information separately from an audio-visual database recorded from two subjects [49]. They concluded that some emotions are better identified with audio, such as sadness and fear, and others with video, such as anger and happiness. Moreover, Chen et al. showed that these two modalities give complementary information, by arguing that the performance of the system increased when both modalities were considered together [34]. Although several automatic emotion recognition systems have explored the use of either facial expressions [13, 71, 127, 171, 183] or speech [51, 116, 140] to detect human affective states, relatively few efforts have focused on emotion recognition using both modalities [34, 50, 192]. It is hoped that the multimodal approach may give not only better performance, but also more robustness when one of these modalities is acquired in a noisy environment [144]. These previous studies fused facial expressions and acoustic information either at a decision level, in which the outputs of the unimodal systems are integrated by the use of suitable criteria, or at a feature level, in which the data from both modalities are combined before classification.
However, none of these studies attempted to compare which fusion approach is more suitable for emotion recognition. This section evaluates these two fusion approaches in terms of the performance of the overall system.
This section analyzes the use of audio-visual information to recognize four different human emotions: sadness, happiness, anger and neutral state, using the database described in Section 2.1. As mentioned before, the database was recorded from an actress with markers attached to her face to capture visual information (the more challenging task of capturing salient visual information directly from conventional videos is a topic for future work, but is hoped to be informed by studies such as in this report). The primary purpose of this research is to identify the advantages and limitations of unimodal systems, and to show which fusion approaches are more suitable for emotion recognition.
4.2.2 Emotion recognition systems
4.2.2.1 Emotion recognition by speech
Several approaches to recognize emotions from speech have been reported. A comprehensive review of these approaches can be found in [47] and [144]. Most researchers have used global suprasegmental/prosodic features as their acoustic cues for emotion recognition, in which utterance-level statistics are calculated. For example, the mean, standard deviation, maximum, and minimum of the pitch contour and energy in the utterances are widely used features in this regard. Dellaert et al. attempted to classify four human emotions by the use of pitch-related features [51]. They implemented three different classifiers: Maximum Likelihood Bayes classifier (MLB), Kernel Regression (KR), and K-Nearest Neighbors (KNN). Roy and Pentland classified emotions using a Fisher linear classifier [155]. Using short spoken sentences, they recognized two kinds of emotions: approval or disapproval. They conducted several experiments with features extracted from measures of pitch and energy, obtaining an accuracy ranging from 65% to 88%.
The main limitation of those global-level acoustic features is that they cannot describe the dynamic variation along an utterance. To address this, for example, dynamic variation in emotion in speech can be traced in spectral changes at a local segmental level, using short-term spectral features. In [116], 13 Mel-frequency cepstral coefficients (MFCC) were used to train a Hidden Markov Model (HMM) to recognize four emotions. Nwe et al. used 12 Mel-based speech signal power coefficients to train a discrete Hidden Markov Model to classify the six archetypal emotions [140]. The average accuracy in both approaches was between 70% and 75%. Finally, other approaches have used language and discourse information, exploiting the fact that some words are highly correlated with specific emotions [114]. In this study, prosodic information is used as acoustic features, as well as the duration of voiced and unvoiced segments.
4.2.2.2 Emotion recognition by facial expressions
Facial expressions give important clues about emotions. Therefore, several approaches have been proposed to classify human affective states. The features used are typically based on the local spatial position or displacement of specific points and regions of the face, unlike the approaches based on audio, which use global statistics of the acoustic features. For a complete review of recent emotion recognition systems based on facial expressions, the readers are referred to [144].
Mase proposed an emotion recognition system that uses the major directions of specific facial muscles [127]. With 11 windows manually located on the face, the muscle movements were extracted by the use of optical flow. For classification, the K-nearest neighbor rule was used, with an accuracy of 80% for four emotions: happiness, anger, disgust and surprise. Yacoob et al. proposed a similar method [183]. Instead of using facial muscle actions, they built a dictionary to convert motions associated with the edges of the mouth, eyes and eyebrows into a linguistic, per-frame, mid-level representation. They classified the six basic emotions by the use of a rule-based system with 88% accuracy. Black et al. used parametric models to extract the shape and movements of the mouth, eyes and eyebrows [13]. They also built a mid- and high-level representation of facial actions using an approach similar to the one employed in [183], with 89% accuracy. Tian et al. attempted to recognize Action Units (AU), developed by Ekman and Friesen in 1978 [68], using permanent and transient facial features such as the lips, nasolabial furrows and wrinkles [171]. Geometrical models were used to locate the shapes and appearances of these features. They achieved an accuracy of 96%. Essa et al. developed a system that quantified facial movements based on parametric models of independent facial muscle groups [71]. They modeled the face by the use of an optical flow method coupled with geometric, physical and motion-based dynamic models. They generated spatio-temporal templates that were used for emotion recognition. Excluding sadness, which was not included in their work, a recognition accuracy of 98% was achieved. In this study, the extraction of facial features is done by the use of markers. Therefore, face detection and tracking algorithms are not needed.
4.2.2.3 Emotion recognition by bimodal data
Relatively few efforts have focused on implementing emotion recognition systems using both facial expressions and acoustic information. De Silva et al. proposed a rule-based audio-visual emotion recognition system, in which the outputs of the unimodal classifiers are fused at the decision level [50]. From audio, they used prosodic features, and from video, they used the maximum distances and velocities between six specific facial points. A similar approach was also presented by Chen et al. [34], in which the dominant modality, according to the subjective experiments conducted in [49], was used to resolve discrepancies between the outputs of the mono-modal systems. In both studies, they concluded that the performance of the system increased when both modalities were used together. Yoshitomi et al. proposed a multimodal system that not only considers speech and visual information, but also the thermal distribution acquired by an infrared camera [188]. They argue that infrared images are not sensitive to lighting conditions, which is one of the main problems when the facial expressions are acquired with conventional cameras. They used a database recorded from a female speaker who read a single word acted in five emotional states. They integrated these three modalities at the decision level using empirically determined weights. The performance of the system was better when the three modalities were used together. In [90] and [33], a bimodal emotion recognition system was proposed to recognize six emotions, in which the audio-visual data was fused at the feature level.
They used prosodic features from audio, and the position and movement of facial organs from video. The best features from both unimodal systems were used as input in the bimodal classifier. They showed that the performance significantly increased from 69.4% (video system) and 75% (audio system) to 97.2% (bimodal system). However, they used a small database with only six clips per emotion, so the generalizability and robustness of the results should be tested with a larger data set.
All these studies have shown that the performance of emotion recognition systems can be improved by the use of multimodal information. However, it is not clear which is the most suitable technique to fuse these modalities. This analysis addresses this open question, by comparing decision- and feature-level integration techniques in terms of the performance of the system.
4.2.3 Methodology
Four emotions (sadness, happiness, anger and neutral state) are recognized by the use of three different systems based on audio, facial expression and bimodal information, respectively. The main purpose is to quantify the performance of unimodal systems, recognize the strengths and weaknesses of these approaches, and compare different approaches to fuse these dissimilar modalities to increase the overall recognition rate of the system. For this analysis, the database described in Section 2.1 was used. Notice that the facial features are extracted with high precision, so this multimodal database is suitable for extracting important clues about both facial expressions and speech.
In order to compare the unimodal systems with the multimodal system, three different approaches were implemented, all using a support vector machine classifier (SVC) with 2nd-order polynomial kernel functions [20]. SVC was used for emotion recognition in our previous study, showing better performance than other statistical classifiers [114, 116]. Notice that the difference between the three approaches is in the features used as inputs, so it is possible to conclude the strengths and limitations of acoustic and facial expression features to recognize human emotions. In all three systems, the database was trained and tested using the leave-one-out cross-validation method.
4.2.3.1 System based on speech
The most widely used speech cues for audio emotion recognition are global-level prosodic features such as the statistics of the pitch and the intensity. Therefore, the means, the standard deviations, the ranges, the maximum values, the minimum values and the medians of the pitch and the energy were computed using the Praat speech processing software [14]. In addition, the voiced/speech and unvoiced/speech ratios were also estimated. By the use of a sequential backward feature selection technique, an 11-dimensional feature vector for each utterance was used as input to the audio emotion recognition system.
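The following is a minimal sketch of this acoustic classifier setup, assuming the utterance-level pitch and energy statistics have already been extracted (e.g., with Praat) into a feature matrix with one row per utterance; scikit-learn's sequential feature selector is used here as a stand-in for the backward selection procedure described above, and the variable names are placeholders.

```python
# Sketch of the acoustic emotion classifier: SVC with a 2nd-order polynomial
# kernel, backward feature selection down to 11 features, and leave-one-out
# cross-validation over the utterances.
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def acoustic_emotion_accuracy(X, y, n_features=11):
    """X: (n_utterances, n_prosodic_stats) matrix; y: emotion labels."""
    svc = SVC(kernel="poly", degree=2)
    selector = SequentialFeatureSelector(svc, n_features_to_select=n_features,
                                         direction="backward", cv=5)
    model = make_pipeline(StandardScaler(), selector, svc)
    scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    return scores.mean()   # leave-one-out accuracy

# Hypothetical usage:
# acc = acoustic_emotion_accuracy(prosodic_stats, emotion_labels)
```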
4.2.3.2 System based on facial expressions
In the system based on visual information, which is described in Figure 4.17, the spatial data collected from the markers in each frame of the video is reduced to a 4-dimensional feature vector per sentence, which is then used as input to the classifier. The facial expression system, which is shown in Figure 4.17, is described below.
After the motion data are captured, the data are normalized using an approach similar to the one used in Section 3.1.3.1: (1) all markers are translated in order to make a nose marker the local coordinate center of each frame, (2) one frame with a neutral and closed-mouth head pose is picked as the reference frame, (3) three approximately rigid markers (manually chosen) define a local coordinate origin for each frame, and (4) each frame is rotated to align it with the reference frame. Each data frame is divided into five blocks: forehead, eyebrow, low eye, right cheek and left cheek areas (see Figure 4.15). For each block, the 3D coordinates of the markers in the block are concatenated together to form a data vector. Then, Principal Component Analysis (PCA) is used to reduce the number of features per frame to a 10-dimensional vector for each area, covering more than 99% of the variation. Notice that the markers near the lips are not considered, because the articulation of the speech might be recognized as a smile, confusing the emotion recognition system [144]. In order to visualize how well these feature vectors represent the emotion classes, the first two components of the low eye area vector were plotted in Figure 4.16. As can be seen, different emotions appear in separate clusters, so important clues can be extracted from the spatial positions in this 10-dimensional feature space.
Figure 4.15: Five areas of the face considered in this study
Figure 4.16: First two components of the low eye area vector (the scatter plot shows separate clusters for anger, sadness, happiness and neutral state)
Notice that for each frame, a 10-dimensional feature vector is obtained in each block. This local information might be used to train dynamic models such as HMMs. However, in this analysis we decided to use global features at the utterance level for both unimodal systems, so these feature vectors were preprocessed to obtain a low-dimensional feature vector per utterance. In each of the five blocks, the 10-dimensional features at the frame level were classified using a K-nearest neighbor classifier (k=3), exploiting the fact that different emotions appear in separate clusters (Figure 4.16). Then, the number of frames that were classified as each emotion was counted, obtaining a 4-dimensional vector at the utterance level for each block. These feature vectors at the utterance level take advantage not only of the spatial position of the facial points, but also of the global patterns shown when emotions are displayed. For example, utterances in which happiness is displayed in more than 90 percent of the frames are classified as happy, whereas utterances are classified as sad even when sadness is displayed in only slightly more than 50 percent of the frames. The SVC classifiers use this kind of information, significantly improving the performance of the system. Also, with this approach the facial expression features and the global acoustic features do not need to be synchronized, so they can be easily combined in a feature-level fusion.
As described in Figure 4.17, a separate SVC classifier was implemented for each block, so it is possible to infer which facial area gives better emotion discrimination. In addition, the 4-dimensional feature vectors of the five blocks were added before classification, as shown in Figure 4.17. This system is referred to as the combined facial expressions classifier.
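A minimal sketch of this per-block pipeline is given below: PCA reduces each block's marker coordinates to 10 dimensions per frame, a 3-nearest-neighbor classifier labels every frame, and the per-emotion frame counts form the 4-dimensional utterance-level vector fed to the SVC. The data loading, the marker-to-block assignment, and the normalization of the counts to fractions are assumptions for illustration.

```python
# Sketch of the facial-block feature extraction: frame-level PCA + 3-NN,
# followed by per-emotion frame counting at the utterance level.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

EMOTIONS = ["anger", "sadness", "happiness", "neutral"]

def fit_block_models(train_frames, train_frame_labels, n_components=10, k=3):
    """train_frames: (F, 3*M) concatenated 3D marker coordinates of one block."""
    pca = PCA(n_components=n_components).fit(train_frames)
    knn = KNeighborsClassifier(n_neighbors=k).fit(pca.transform(train_frames),
                                                  train_frame_labels)
    return pca, knn

def utterance_vector(pca, knn, utterance_frames):
    """4-D utterance-level vector: fraction of frames assigned to each emotion
    (the text uses raw counts; fractions are used here for convenience)."""
    preds = knn.predict(pca.transform(utterance_frames))
    counts = np.array([(preds == e).sum() for e in EMOTIONS], dtype=float)
    return counts / max(len(preds), 1)
```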
4.2.3.3 Bimodal system
To fuse the facial expression and acoustic information, two different approaches were implemented: feature-level fusion, in which a single classifier with features from both modalities is used (left of Figure 4.18); and decision-level fusion, in which a separate classifier is used for each modality and the outputs are combined using some criterion (right of Figure 4.18). In the first approach, a sequential backward feature selection technique was used to find the features from both modalities that maximize the performance of the classifier. The number of features selected was 10. In the second approach, several criteria were used to combine the posterior probabilities of the mono-modal systems at the decision level: maximum, in which the emotion with the greatest posterior probability across both modalities is selected; average, in which the posterior probabilities of the two modalities are equally weighted and the maximum is selected; product, in which the posterior probabilities are multiplied and the maximum is selected; and weight, in which different weights are applied to the different unimodal systems.
Figure 4.17: System based on facial expression (block diagram: the visual data of the forehead, eyebrow, low eye, right cheek and left cheek blocks are PCA-reduced to 10-dimensional frame-level vectors, preprocessed into 4-dimensional utterance-level vectors, and fed to per-block classifiers and a combined classifier)
Figure 4.18: Feature-level and decision-level fusion (left: audio and video features feed a single classifier; right: a separate classifier per modality is followed by an integration stage)
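The decision-level rules can be summarized with the following sketch, which takes the per-class posterior probabilities of the audio and facial-expression classifiers for one utterance and applies each combining criterion; the specific weights in the weight rule are placeholders (in this work they are derived from the confusion matrices of the unimodal classifiers).

```python
# Sketch of the decision-level fusion rules for one utterance.
import numpy as np

def fuse_posteriors(p_audio, p_video, rule="product", w_audio=0.4, w_video=0.6):
    """p_audio, p_video: arrays of shape (n_classes,) that sum to 1.
    Returns the index of the selected emotion class."""
    if rule == "maximum":      # emotion with the largest posterior in either modality
        scores = np.maximum(p_audio, p_video)
    elif rule == "average":    # equally weighted posteriors
        scores = 0.5 * (p_audio + p_video)
    elif rule == "product":    # posteriors multiplied (independence-style rule)
        scores = p_audio * p_video
    elif rule == "weight":     # modality-specific weights (hypothetical values)
        scores = w_audio * p_audio + w_video * p_video
    else:
        raise ValueError(f"unknown rule: {rule}")
    return int(np.argmax(scores))
```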
Table 4.15: Confusion matrix of the emotion recognition system based on audio
              Anger   Sadness   Happiness   Neutral
  Anger       0.68    0.05      0.21        0.05
  Sadness     0.07    0.64      0.06        0.22
  Happiness   0.19    0.04      0.70        0.08
  Neutral     0.04    0.14      0.01        0.81
4.2.4 Results
4.2.4.1 Acoustic emotion classifier
Table 4.15 shows the confusion matrix of the emotion recognition system based on acoustic information, which gives details of the strengths and weaknesses of this system. The overall performance of this classifier was 70.9 percent. The diagonal components of Table 4.15 reveal that all the emotions can be recognized with more than 64 percent accuracy by using only the features of the speech. However, Table 4.15 shows that some pairs of emotions are usually confused more. Sadness is misclassified as neutral state (22%) and vice versa (14%). The same trend appears between happiness and anger, which are mutually confused (19% and 21%, respectively). These results agree with the human evaluations done by De Silva et al. [49], and can be explained by the similarity patterns observed in the acoustic parameters of these emotions [187]. For example, speech associated with anger and happiness is characterized by longer utterance duration, shorter inter-word silence, and higher pitch and energy values with wider ranges. On the other hand, in neutral and sad sentences, the energy and the pitch are usually maintained at the same level. Therefore, these emotions are difficult to classify.
4.2.4.2 System based on facial expressions
Table 4.16 shows the performance of the emotion recognition systems based on facial expressions, for each of the five facial blocks and for the combined facial expression classifier.
Table 4.16: Performance of the facial expression classifiers
                        Overall   Anger   Sadness   Happiness   Neutral
  Forehead              0.73      0.82    0.66      1.00        0.46
  Eyebrow               0.68      0.55    0.67      1.00        0.49
  Low eye               0.81      0.82    0.78      1.00        0.65
  Right cheek           0.85      0.87    0.76      1.00        0.79
  Left cheek            0.80      0.84    0.67      1.00        0.67
  Combined classifier   0.85      0.79    0.81      1.00        0.81
Table 4.17: Decision-level integration of the 5 facial blocks emotion classifiers
                        Overall   Anger   Sadness   Happiness   Neutral
  Majority voting       0.82      0.92    0.72      1.00        0.65
  Maximum               0.84      0.87    0.73      1.00        0.75
  Averaging combining   0.83      0.89    0.72      1.00        0.70
  Product combining     0.84      0.87    0.72      1.00        0.77
Table 4.16 reveals that the cheek areas give valuable information for emotion classification. It also shows that the eyebrows, which have been widely used in facial expression recognition, give the poorest performance. The fact that happiness is classified without any mistake can be explained by Figure 4.16, which shows that happiness is separately clustered in the 10-dimensional PCA space, so it is easy to recognize. Table 4.16 also reveals that the combined facial expression classifier has an accuracy of 85%, which is higher than most of the five facial block classifiers. Notice that this database was recorded from a single actress, so clearly more experiments should be conducted to evaluate these results with other subjects.
The combined facial expression classifier can be seen as a feature-level integration approach in which the features of the five blocks are fused before classification. These classifiers can also be integrated at the decision level. Table 4.17 shows the performance of the system when the facial block classifiers are fused by the use of different criteria. In general, the results are very similar. All these decision-level rules give slightly worse performance than the combined facial expression classifier.
Table 4.18: Confusion matrix of the combined facial expression classifier
              Anger   Sadness   Happiness   Neutral
  Anger       0.79    0.18      0.00        0.03
  Sadness     0.06    0.81      0.00        0.13
  Happiness   0.00    0.00      1.00        0.00
  Neutral     0.00    0.04      0.15        0.81
Table 4.18 shows the confusion matrix of the combined facial expression classifier, to analyze in detail the limitations of this emotion recognition system. The overall performance of this classifier was 85.1 percent. This table reveals that happiness is recognized with very high accuracy. The other three emotions are classified with approximately 80 percent accuracy. Table 4.18 also shows that in the facial expression domain, anger is confused with sadness (18%) and neutral state is confused with happiness (15%). Notice that in the acoustic domain, sadness/anger and neutral/happiness can be separated with high accuracy, so it is expected that the bimodal classifier will give good performance for anger and neutral state. This table also shows that sadness is confused with neutral state (13%). Unfortunately, these two emotions are also confused in the acoustic domain (22%), so it is expected that the recognition rate of sadness in the bimodal classifiers will be poor. Other discriminating information, such as contextual cues, is needed.
4.2.4.3 Bimodal system
Table 4.19 displays the confusion matrix of the bimodal system when the facial expressions and acoustic information were fused at the feature level. The overall performance of this classifier was 89.1 percent. As can be observed, anger, happiness and neutral state are recognized with more than 90 percent accuracy. As expected, the recognition rate of anger and neutral state was higher than in the unimodal systems.
Sadness is the emotion with the lowest performance, which agrees with our previous analysis. This emotion is confused with neutral state (18%), because none of the modalities we considered can accurately separate these classes. Notice that the performance of happiness significantly decreased to 91 percent.
Table 4.19: Confusion matrix of the feature-level integration bimodal classifier
              Anger   Sadness   Happiness   Neutral
  Anger       0.95    0.00      0.03        0.03
  Sadness     0.00    0.79      0.03        0.18
  Happiness   0.02    0.00      0.91        0.08
  Neutral     0.01    0.05      0.02        0.92
Table 4.20: Decision-level integration bimodal classifier with different fusing criteria
                        Overall   Anger   Sadness   Happiness   Neutral
  Maximum combining     0.84      0.82    0.81      0.92        0.81
  Averaging combining   0.88      0.84    0.84      1.00        0.84
  Product combining     0.89      0.84    0.90      0.98        0.84
  Weight combining      0.86      0.89    0.75      1.00        0.81
Table 4.20 shows the performance of the bimodal system when the acoustic emotion classifier (Table 4.15) and the combined facial expressions classifier (Table 4.18) were integrated at the decision level, using different fusing criteria. In the weight-combining rule, the modalities are weighted according to rules extracted from the confusion matrices of each classifier. This table reveals that the maximum-combining rule gives similar results to the facial expression classifier. This result suggests that the posterior probabilities of the acoustic classifier are smaller than the posterior probabilities of the facial expression classifier. Therefore, this rule is not suitable for fusing these modalities, because one modality will be effectively ignored. Table 4.20 also shows that the product-combining rule gives the best performance.
Table 4.21 shows the confusion matrix of the decision-level bimodal classifier when the product-combining criterion was used. The overall performance of this classifier was 89.0 percent, which is very close to the overall performance achieved by the feature-level bimodal classifier (Table 4.19). However, the confusion matrices of both classifiers show important differences.
Table 4.21: Confusion matrix of the decision-level bimodal classifier with the product-combining rule
              Anger   Sadness   Happiness   Neutral
  Anger       0.84    0.08      0.00        0.08
  Sadness     0.00    0.90      0.00        0.10
  Happiness   0.00    0.00      0.98        0.02
  Neutral     0.00    0.02      0.14        0.84
Table 4.21 shows that in this classifier, the recognition rates of anger (84%) and neutral state (84%) are slightly better than in the facial expression classifier (79% and 81%, Table 4.18), and significantly worse than in the feature-level bimodal classifier (95% and 92%, Table 4.19). However, happiness (98%) and sadness (90%) are recognized with high accuracy compared to the feature-level bimodal classifier (91% and 79%, Table 4.19). These results suggest that in the decision-level fusion approach, the recognition rate of each emotion is increased, improving the performance of the bimodal system.
4.2.5 Discussion
Humans use more than one modality to recognize emotions, so it is expected that the performance of automatic multimodal systems will be higher than that of automatic unimodal systems. The results reported in this work confirm this hypothesis, since the bimodal approach gave an improvement of almost 5 percent (absolute) compared to the performance of the facial expression recognition system. The results show that pairs of emotions that were confused in one modality were easily classified in the other. For example, anger and happiness, which were usually misclassified in the acoustic domain, were separated with greater accuracy in the facial expression emotion classifier.
Therefore, when these two modalities were fused at the feature level, these emotions were classified with high precision. Unfortunately, sadness is confused with neutral state in both domains, so its performance was poor.
Although the overall performance of the feature-level and decision-level bimodal classifiers was similar, an analysis of the confusion matrices of both classifiers reveals that the recognition rate for each emotion type was totally different. In the decision-level bimodal classifier, the recognition rate of each emotion increased compared to the facial expression classifier, which was the best unimodal recognition system (except happiness, which decreased by 2%). In the feature-level bimodal classifier, the recognition rates of anger and neutral state significantly increased. However, the recognition rate of happiness decreased by 9 percent. Therefore, the best approach to fuse the modalities will depend on the application.
The results presented in this research reveal that, even though the system based on audio information had poorer performance than the facial expression emotion classifier, its features carry valuable information about emotions that cannot be extracted from the visual information. These results agree with the findings reported by Chen et al. [34], which showed that audio and facial expression data present complementary information. On the other hand, it is reasonable to expect that some characteristic patterns of the emotions can be obtained by the use of either audio or visual features. This redundant information is very valuable for improving the performance of the emotion recognition system when the features of one of the modalities are inaccurately acquired. For example, if a person has a beard, mustache or eyeglasses, the facial expressions will be extracted with a high level of error. In that case, audio features can be used to overcome the limitations of the visual information.
Although the use of facial markers is not suitable for real applications, the analysis presented in this section gives important clues about the emotion discrimination contained in different blocks of the face. Although the shapes and the movements of the eyebrows have been widely used for facial expression classification, the results presented in this analysis show that this facial area gives worse emotion discrimination than other facial areas such as the cheeks. Notice that in this work only four affective states were studied, so it is possible that the eyebrows play an important role in other emotions such as surprise.
The experiments were conducted using a database based on one female speaker, so the three systems were trained to recognize her expressions. If the system is applied to detect the emotions of other people, it is expected that the performance will vary. Therefore, more data collected from other people are needed to ensure that the variability with which human beings display emotions is well represented by the database, a subject of ongoing work. Another limitation of the approach reported in this research is that the visual information was acquired by the use of markers. In real applications, it is not feasible to attach these markers to users. Therefore, automatic algorithms to extract facial motions from video without markers should be implemented. An option is to use optical flow, which has been successfully implemented in previous research [71, 127].
The next steps in this research will be to find better methods to fuse audio-visual information that model the dynamics of facial expressions and speech.
Segmental-level acoustic information can be used to trace the emotions at the frame level. Also, it might be useful to find other kinds of features that describe the relationship between both modalities with respect to temporal progression. For example, the correlation between the facial motions and the contours of the pitch and the energy might be useful to discriminate emotions.
4.3 Real-time monitoring of participants' interaction in a meeting using audio-visual sensors
4.3.1 Introduction
Group meetings are a crucial part of the planning and organization of any institution. If important meetings could be automatically recorded and annotated, the resulting archives could be used to retrieve information about the ideas and decisions that transpired, help with analyzing teamwork and collaboration strategies, and contribute to productivity. Towards that direction, new technologies in sensing, tracking, storage, and retrieval are offering exciting and challenging applications for human interaction sensing and human-centered computing. Example realms in which monitoring human interaction is very useful are retrieval [93], summarization [135] and classification [6, 131, 153] of meetings. While much of the prior work has focused on content analysis (such as speech transcription), there is increasing interest in expanding it to include meta-information such as affect, speaker dynamics, etc. Since it is expected that the number of meetings being archived will rapidly increase, especially given interactions across the globe, automatic annotation of the meta-information of human interaction will play an important role in efficient and intuitive searching for specific portions and aspects of a meeting. This leads to even more interest in novel methods for robustly monitoring and measuring human interactions. In this context, smart rooms equipped with non-intrusive multimodal sensors provide a suitable platform for automatically inferring meta-information from the participants in meeting and control room type environments. This section focuses on the meta-analysis of certain aspects of group meetings from audio-visual information obtained in a smart room.
An important component of this study is the Smart Room environment being developed at the University of Southern California (USC). This is a challenging multidisciplinary application that involves research in diverse topics including object tracking, speaker activity detection, speaker identification, human action recognition and user behavior modeling. We propose a real-time multimodal approach to determine the spatial position of the users, detect speaker activity, and additionally determine the speaker's identity, aimed at applications such as remote video-conferencing and audio-video indexing and retrieval for tasks such as meetings. The long-term objective of this project is to create a system which is cognizant of the users and can become an active but non-intrusive member of the interaction.
The conference room contains three user monitoring systems: four synchronized cameras located in the corners of the room, a full-circle 360-degree camera located at the center of the table, and an array of sixteen microphones also located at the center of the table. The location of each user is computed based on (i) the 3D polygon surface model from the 4 synchronized cameras and (ii) a face detection technique using the full-circle 360-degree camera.
Subsequently, a dynamic model, under the Gaussian distribution assumption, is used with a moving window to combine the above information and localize the participants. The speaker ID, operating on far-field sound, employs a standard Gaussian mixture model based on MFCCs. Finally, the active speaker's identity and location are estimated by fusing all the information channels.
Building upon that work, this section evaluates the use of high-level information extracted from this environment to monitor the participants' behavior. For each subject in the room, the number of turns, the percentage of time spent as active speaker, the average length of the turns and the turn-taking transition matrix between participants were measured. The results show that these high-level features provide important information about the flow of the interaction that cannot be accurately inferred from any of the individual sensors. They also provide a dynamic estimation of participant engagement/involvement during the meeting. Since these features are estimated from automatic speaker segmentations, the proposed system can monitor human interaction in real time, making feasible the interesting applications mentioned before.
Section 4.3.2 presents related work in intelligent environments and in monitoring human interaction. Section 4.3.3 describes our Smart Room system, and Section 4.3.4 explains the fusion algorithm. Section 4.3.5 analyzes the performance of the system. Finally, the results on inferring participants' behavior, strategies and engagement are presented in Section 4.3.6.
4.3.2 Related work
4.3.2.1 Intelligent environments
One of the well-studied areas in smart room technology (SRT) is the detection and tracking of user locations. Two important sources of information are the visual and the acoustic modalities. Within a multimodal framework, these two sources have been used to track a single active speaker using methods such as sequential Monte Carlo [176, 194], Kalman filtering [166] and Dynamic Bayesian Networks (DBN) [146], taking advantage of the complementary information represented by these two modalities. Recently, [32] and [76] extended these approaches to track multiple speakers using particle filtering, while at the same time achieving active speaker detection, which is another important aspect of smart room technologies. In [149], visual cues were used to track users and a microphone array to select the active speaker by computing the distance between the visual and acoustic results.
Another important aspect of SRT is speaker identification (SID), in which the identity of the user is detected. There are several additional possible biometric systems for smart room applications (e.g., retina, fingerprint), although most of them are impractical due to their invasive nature. One feasible option is to classify the user according to acoustic speech features [110] or through face recognition.
4.3.2.2 Monitoring human interaction
Recent efforts to infer high-level information from meetings include [6], where the authors evaluated the use of interactive features extracted from manual annotations to classify the meeting (e.g., discussion, presentation) and the participants' roles (e.g., presenter, listener). A similar goal was pursued in [131] and [193], in which HMM-based approaches were implemented to detect meeting actions, using features extracted from the audio-visual sensors. The influence between participants was studied in [10]. They learned behavioral models to predict who will take the next turn.
In [154], the author tracked the level of dominance of the participants. In most of these works, manual annotations were used to extract the features, which makes these approaches not directly applicable to real-time applications.
4.3.3 The Smart Room
The present initial design primarily comprises microphones and cameras for activity sensing. The microphone array consists of 16 omnidirectional microphones that process sound at a 48 kHz sampling frequency. The microphone array was placed in the middle of the table, in a 2-D structure (see Figure 4.19). In addition, an extra directional microphone at 16 kHz was added at one side of the table for speaker ID. The room is acoustically treated on three walls, has a full-wall glass window on the other side, and has ceiling panels and carpeting on the floor.
The 3D camera system consists of 4 firewire CCD cameras near the corners of the ceiling that overlook the meeting area around the main table and capture the image sequences of the meeting from multiple angles. Each camera provides 1024x768 images at 15 frames per second, but we scale them to 320x240 for real-time processing. The room is lighted with halogen lights. At the center of the meeting table, a full-circle omnidirectional (360-degree) camera captures the faces of all participants. The size of the original omnidirectional image is 1280x960.
Figure 4.19: Smart Room. The left figure shows the smart room. The right figure shows the microphone array and the omnidirectional camera.
The next subsections describe the algorithms used to process each of these raw information sources.
4.3.3.1 Microphone array
One modality of localization is sound source localization using the microphone array. The principle of sound source localization is based on the Time Difference of Arrival (TDOA) of the sound to the various sensors and the geometric inference of the source location from this TDOA. First, pair-wise delays are estimated between all the microphones ((16 x 15)/2 = 120) [77]. These delays are subsequently projected as angles into a single axis system. This results in 240 direction-of-arrival (DOA) estimates, half of them stemming from the front-to-back confusion of the microphone pairs. The density of these estimates provides a mode that corresponds to the correct DOA of the sound.
Georgiou et al. [77] have demonstrated the impulsive nature of audio signals and introduced a time delay estimation approach to mitigate its effects. The algorithm, called the Fractional Lower Order Statistics-Phase Transform method (FLOS-PHAT), is based on a signed non-linearity on the input signal that reduces the detrimental effects of outliers. As is common practice, this implementation of the FLOS-PHAT algorithm employs memory in order to approximate the expectation in the lower order statistics, and additionally the memory varies as a function of time to mitigate the temporal propagation of errors.
In the current study, the microphone array was placed in the middle of the table, in a 2-D structure (see Figure 4.19). Although this configuration does not provide depth information, the wide DOA range allows the microphone array to separate the participants' speech with higher accuracy (approximately 85%, Table 4.22). The depth information is redundant in this setting due to the 4-camera system's complementary measurements, which compensate amply for the inaccurate range information that an array can provide for a far-field source.
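As a simplified illustration of this localization principle (not of the FLOS-PHAT algorithm itself), the following sketch estimates a pairwise delay with a standard GCC-PHAT correlation, converts each delay into an angle, and takes the mode of the resulting direction histogram; the microphone geometry and the parameters are placeholders.

```python
# Simplified stand-in for the DOA estimation: GCC-PHAT pairwise delays,
# projection of each delay to an angle, and the density peak over all pairs.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def gcc_phat_delay(x1, x2, fs):
    """Time delay (seconds) between two microphone signals via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(fs)

def pair_doa(delay, mic_distance):
    """Angle of arrival (radians) for one microphone pair, up to the usual
    front-to-back ambiguity."""
    s = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.arcsin(s)

def dominant_doa(pair_angles, n_bins=72):
    """Mode of the pairwise DOA estimates (density peak over the circle)."""
    hist, edges = np.histogram(np.asarray(pair_angles), bins=n_bins,
                               range=(-np.pi, np.pi))
    k = np.argmax(hist)
    return 0.5 * (edges[k] + edges[k + 1])
```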
4.3.3.2 Speaker ID
The extra directional microphone was used for supervised speaker identification. Speaker identification was implemented by analyzing the short-time spectrum (through Mel Frequency Cepstral Coefficients, MFCCs) of the spoken phrases. In speaker recognition, the Gaussian Mixture Model (GMM), a weighted sum of Gaussian distributions, has been found to capture the speaker information in MFCCs well, and hence a GMM with 16 mixtures was used as the speaker model. Model training was accomplished with the standard Expectation-Maximization (EM) algorithm. All frames were initially divided into 16 clusters. An initial model was obtained by parameter estimation of the means and covariance matrices, which were estimated from the vectors in each cluster. The prior weights of the GMM can simply be set to the proportion of feature vectors in each cluster. Next, the feature vectors are clustered by the Maximum Likelihood (ML) method using the previously estimated model. This process is executed iteratively until the model parameters converge. Additionally, we have created a silence/background noise model.
The result of the speaker identification is given in terms of pairs (S_i, P_i), where P_i is the probability of speech activity of speaker S_i, reported for all speakers i. This information is evaluated and transmitted to the fusion algorithm every 1 second. We should note that the acoustic signal processed is a reverberant, far-field signal corrupted by noise, and so the performance of this method is expected to be lower compared to the case when a clean signal from a close-talking microphone is used.
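A minimal sketch of this speaker-ID stage is given below: one 16-mixture GMM per speaker is trained on MFCC frames with EM, and the per-speaker likelihoods over a one-second window are mapped to the (S_i, P_i) values sent to the fusion module. The MFCC extraction, the diagonal covariances, and the softmax mapping from log-likelihoods to probabilities are assumptions for illustration.

```python
# Sketch of GMM-based speaker ID over MFCC frames.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(mfcc_per_speaker, n_mixtures=16):
    """mfcc_per_speaker: dict speaker_id -> (n_frames, n_mfcc) array."""
    models = {}
    for spk, frames in mfcc_per_speaker.items():
        gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                              max_iter=100, random_state=0)
        gmm.fit(frames)   # EM training with k-means style initialization
        models[spk] = gmm
    return models

def speaker_probabilities(models, mfcc_window):
    """Average log-likelihood per speaker model over a one-second window of
    frames, mapped to a probability per speaker with a softmax (the original
    scores are not specified in this exact form)."""
    speakers = list(models)
    ll = np.array([models[s].score(mfcc_window) for s in speakers])
    p = np.exp(ll - ll.max())
    return dict(zip(speakers, p / p.sum()))
```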
4.3.3.3 Video detection
The goal of visual tracking is to detect and track the 3D locations of the participants in the meeting room using video streams acquired by multiple synchronized cameras. We use a Gaussian background-learning model to segment moving regions in the scene. When large variations from the learned Gaussian models are detected, the foreground pixels are extracted. These pixels are then merged into regions. However, this method will segment actual people as well as their shadows and reflections. In our indoor setting, the shadow regions cast by diffused light do not have strong boundaries. We eliminate the shadows by combining foreground pixel detection and edge feature detection [37] for segmenting into moving regions and corresponding cast shadows. The resulting regions are the silhouettes of the moving objects in the room.
The detected silhouettes across the views are integrated to infer the 3D visual hulls of the people in the room [113]. The silhouette contour is converted to a polygon approximation and a visual hull with a polyhedral representation is then computed directly from these polygons [129]. This polygonal 3D approximation of the shapes is fast and is done in real time. In detecting the locations of the people in the meeting room, we only need an estimation of the general location of blobs of shapes instead of a precise reconstruction. Furthermore, we want the detection to cover an area as large as possible given a limited number of cameras. For this purpose we use a variation of the visual hull method proposed by [15]: the polyhedral visual hull is required to be the integration of only a subset (at least 3 out of 4) of the silhouettes instead of all of them. The resulting visual hull shape is less accurate, but the 3D shape of all people in the room can be approximated. The computed visual hull is in a polygonal representation. We randomly sample points on the polygon surface and construct a height map of those points. This map assumes that the XY plane in the Cartesian space is the meeting room floor and the Z coordinate represents the height. The local maxima of the height are then detected and considered as the heads of the meeting participants. In this process some thresholds are applied to eliminate small regions such as moving chairs.
4.3.3.4 Full-circle 360-degree camera
We have added an omnidirectional (360-degree) camera on the meeting table to capture the faces of all participants in order to get a thumbnail representation of "who's talking". The acquired images are the result of the projection of the surrounding scene into a hemisphere; they are then unrolled and projected back onto cylinders (see Figure 4.20).
To detect the foreground region, we use a Gaussian background-learning model. All pixels in a new frame are compared to the current color distribution in order to detect moving blobs prior to capturing the faces. Morphological operators are used to group detected pixels into foreground regions, and small regions are eliminated. Pixel color distributions are updated in these regions to adapt the background model to slow variations. In these moving regions, we perform face detection. The face detector is based on Haar-like features and is implemented using Intel's open source computer vision library [108]. To accurately detect faces under low light level conditions, the color histogram of the detected regions is normalized beforehand. Detected regions are then tracked using a graph-based tracking approach [37]. These regions correspond to the upper body of the meeting participants. Spatial and temporal information of the tracked regions is combined as a graphical structure where nodes represent the detected moving regions and edges represent the relationship between two moving regions detected in two separate frames. Each newly processed frame generates a set of regions corresponding to the detected moving objects.
Figure 4.20: Omnidirectional image from the 360-degree camera and its panoramic transform
Figure 4.21: Detection of participants' faces with the 360-degree camera
The size of the original omnidirectional image is 1280x960 and the panoramic image resolution is 848x180. The average size of the detected faces is approximately 30x30 pixels. The faces are detected and tracked at approximately 13 FPS on a 2.8 GHz Pentium 4 PC. Figure 4.21 shows an example of the detection and tracking of the participants' faces during a meeting.
Figure 4.22: The system is distributed, running over TCP, with information exchange as depicted above.
4.3.3.5 Synchronization
Each modality was initially processed independently and asynchronously. Therefore, the estimated 3D coordinates from the polygonal representation (X_v) and from the microphone array (X_MA), the angles of the faces detected by the omnidirectional camera, and the speaker information from the acoustic analysis (S_i, P_i) are sent to the fusion algorithm for integration. Although the results are received in an asynchronous manner, they are transformed and processed in a synchronous fashion. The real-time system is distributed, running over TCP. The fusion algorithm makes a decision every 1 second, which is the slowest frame rate among the modalities (speaker ID).
4.3.4 Multimodal integration

The various modalities are subsequently received and processed by a fusion algorithm for the purpose of finding and tracking the participants' spatial locations and identifying who the current active speaker is and where he or she is located. Figure 4.22 shows the information flow between the various modules and what information is used for each decision.

4.3.4.1 Participant localization

It is well known that visual tracking algorithms have better spatial resolution than acoustic localization techniques [76, 149]. Hence, our algorithm for localizing all the participants employs a dynamic visual approach that uses only the information obtained by the cameras, X = (X_v, X_360). Based on the distribution of the samples X, we model the position of each speaker as a multidimensional Gaussian distribution.

A single distribution with mean M at the center of the room and a covariance K with a significant spread is initialized. As data are obtained, the mean and variance converge to the detected object's location. When information is received for a location that scores below a certain threshold of belonging to the existing distributions, a new multidimensional Gaussian initialized at (M, K) is added. The process continues sequentially until all the speakers are detected, with new data points either spawning new participant models or adapting the existing ones. In addition, temporal filtering ensures that false participant detections are identified and removed. This procedure allows us to determine not only the spatial positions of the participants (X_P), but also the number of participants in the room (N_P).

4.3.4.2 Participant identification

The spatial location of the current speaker (X_MA+P), as obtained from the microphone array (X_MA) and the participants' location information (X_P), as well as the speaker ID from the GMM algorithm (S_i, P_i), are used to determine the identities of the participants. The goal is to detect who the participants are and also to correlate their identities with their locations in space (derive the "seating arrangement"). The probability that the acoustic source comes from cluster i given X_MA, P(C_i | X_MA), was modeled as a Gaussian distribution centered at the location of participant X_P (in angle) with a constant variance.

Figure 4.23: Speaker localization system. See Sections 4.3.3.3, 4.3.3.4 and 4.3.4 for details.

Using (S, P), the probabilistic identity of the participant, along with P(C | X_MA), the probabilistic location of the current speaker, accumulated over time and subject to physical constraints (a participant can only be at one point in space at a time, and one position can only be occupied by one participant at a time), we estimate the participants' seating arrangement (L).

4.3.4.3 Speaker identification and localization

We perform active speaker detection by employing all modalities: X_MA+P, which is derived from the visual modality and the microphone array, and (S, P), obtained from the acoustic analysis of the signal. Under the assumption of independence, the information is fused as described in Equation 4.13, where r_ij is the correlation measure between the probabilities of the current speaker belonging to cluster j and being speaker i. Figure 4.23 shows a graphic representation of the system's output.

P(S_i) = P_i \sum_{j=1}^{n} r_{ij} P(C_j \mid X_{MA})     (4.13)
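The fusion rule of Equation 4.13 reduces to a few array operations. The following sketch assumes the per-speaker probabilities P_i, the per-cluster localization probabilities P(C_j | X_MA), and the correlation matrix r_ij are already available as NumPy arrays; the numbers in the example are illustrative only.

```python
import numpy as np

def active_speaker(p_speaker, p_cluster, r):
    """Score speakers with Equation 4.13 and return the most likely one.

    p_speaker : (n,) speech-activity probabilities P_i from the GMM module.
    p_cluster : (n,) localization probabilities P(C_j | X_MA).
    r         : (n, n) correlation between speaker i and cluster j (in the
                simplest case, an indicator of the seating arrangement L).
    """
    scores = p_speaker * (r @ p_cluster)  # P(S_i) = P_i * sum_j r_ij P(C_j|X_MA)
    return int(np.argmax(scores)), scores

# Illustrative example with four participants seated one per cluster:
p_id = np.array([0.40, 0.20, 0.30, 0.10])
p_loc = np.array([0.10, 0.70, 0.10, 0.10])
print(active_speaker(p_id, p_loc, np.eye(4)))
```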
4.3.5 Performance evaluation

Three 20-minute meetings with four participants were recorded and processed with the system. Since the participants were asked to speak as naturally as possible, the conversation was casual, with many interruptions, overlaps and short utterances, making this an extremely challenging task for both the microphone array and the speaker ID (e.g., the overlap between speakers was between 7% and 15%). The meetings were segmented by hand to provide ground truth, which was compared to the results given by the fusion algorithm. Two criteria were defined: strong decision, in which the detection was considered correct if the speaker was active during at least 50% of the time interval, and weak decision, in which the detection was considered correct if the speaker was active in any part of the time interval.

Table 4.22: All of the results below are obtained in real time and include the whole length of the meeting, with no time given for initial convergence. A: Speaker ID as obtained purely from the speech signal using a GMM; B: Localization obtained by the two visual information channels and the microphone array; C: Speaker identification and localization based on all information channels, assuming perfect knowledge of L, the seating arrangement of the participants; D: As C, but with the speaker-location mapping L continuously estimated from the data; E: Speaker-location mapping, L.

                                                   Session  Strong Decision  Weak Decision
A  Speaker ID (GMM based)                             1        66.13%          73.28%
                                                      2        61.27%          68.51%
                                                      3        60.10%          67.85%
B  Microphone Array + Video                           1        81.26%          86.02%
                                                      2        85.41%          92.86%
                                                      3        83.03%          89.62%
C  Microphone Array + Video + Speaker ID              1        81.55%          88.42%
   (assumes known seating arrangement L)              2        85.60%          93.56%
                                                      3        82.49%          90.32%
D  Microphone Array + Video + Speaker ID              1        80.37%          87.34%
   (participant location L learned from data)         2        78.77%          87.26%
                                                      3        82.49%          90.24%
E  Seating arrangement automatically                  1        87.78
   learned from data (L)                              2        74.60
                                                      3        97.14

As can be observed from the results in Table 4.22 (rows C and D), the speaker identification and localization based on all the modalities is fairly robust, achieving about 85% performance. This is a significant improvement, of about 30%, over the speaker ID based purely on the speech signal shown in row A, which suffers from the far-field and noisy nature of the data. Similarly, there is an improvement in the accuracy of localization compared to the performance based purely on the microphone array (row B). Notice that by adding the supervised speaker-ID modality, the system infers not only that and where someone is speaking, but also the identity of the speaker, information that is not provided by the microphone array. Finally, the identification of the participants' spatial arrangement (row E) is extremely accurate, a fact that explains the very close results observed in rows C and D.

4.3.6 Participant interaction

In this section, high-level features derived from the automatic segmentation provided by the fusion algorithm are used to infer how people interact. For each participant, we calculated the number of turns, the average duration of each turn, the amount of time used as active speaker, and the transition matrix depicting turn-taking between participants. Figures 4.24 (a-c) and Figure 4.25 (a) show the results for meeting 3. For reference and evaluation, Figures 4.24 (d-f) and Figure 4.25 (b) show the ground truth for the same data, obtained through human annotation.
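These high-level features follow directly from the active-speaker segmentation. The sketch below assumes the segmentation is available as a chronological list of (start, end, speaker) tuples; that format is an assumption made for illustration.

```python
import numpy as np

def interaction_stats(segments, n_speakers):
    """Turn counts, average turn duration, speaking-time share, transitions."""
    turns = np.zeros(n_speakers)
    talk_time = np.zeros(n_speakers)
    transitions = np.zeros((n_speakers, n_speakers))
    prev = None
    for start, end, spk in segments:
        turns[spk] += 1
        talk_time[spk] += end - start
        if prev is not None and prev != spk:
            transitions[prev, spk] += 1      # who takes the floor after whom
        prev = spk
    avg_duration = talk_time / np.maximum(turns, 1)
    row_sums = transitions.sum(axis=1, keepdims=True)
    transitions = np.divide(transitions, row_sums,
                            out=np.zeros_like(transitions), where=row_sums > 0)
    return turns, avg_duration, talk_time / talk_time.sum(), transitions
```

Estimating the same quantities inside short windows gives the dynamic activeness curves discussed later in this section.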
Interesting observations can be inferred from these high-level features. The distribution of time used as active speaker and the number of turns taken are closely related to how dominant each participant was [154]. We observe that subject 1 spoke more than 65% of the time, which suggests that he was most probably leading the discussion. This subject also presented the longest average turn duration, which suggests that his strategy was to present, elaborate and support his ideas. In contrast, the average turn duration for subject 3, who had the second largest number of turns, was only about 4 seconds, which reveals that he contributed shorter sentences to support or contradict current ideas. These interpretations agree with previous work suggesting that discussions are characterized by many short turns to show agreement (e.g., "uh-huh") and longer turns taken by mediators [19].

Figure 4.24: High-level group interaction measures estimated from automatic (a-c) and manual (d-f) speaker segmentation. The panels show, for each speaker, the average turn duration (a, d), the distribution of speaking time (b, e), and the number of turns (c, f).

The transition matrix between participants provides further information about the flow of the interaction and the turn-taking patterns. By annotating the transitions between speakers, a rough estimate can be made of who was being addressed by the speaker. To evaluate this hypothesis, we manually annotated whether the subject was speaking to all the participants or only to one of them. Table 4.23 compares the ground-truth annotations with the results provided by the transition matrix. As can be observed, the transition matrix provides a good first approximation for identifying the interlocutor dynamics.

Table 4.23: Comparison between the hand-based addressee annotations and the turn-taking transition matrix.

       Hand-based addressee annotation        Turn-taking transition matrix
       Sp1     Sp2     Sp3     Sp4            Sp1     Sp2     Sp3     Sp4
Sp1    0.00    0.31    0.44    0.25           0.03    0.34    0.46    0.17
Sp2    0.72    0.00    0.21    0.07           0.74    0.04    0.22    0.00
Sp3    0.69    0.18    0.00    0.13           0.76    0.08    0.05    0.11
Sp4    0.50    0.23    0.28    0.00           0.73    0.00    0.20    0.07

Figure 4.25: High-level interaction flow inferences: (a) estimated transitions, (b) true transitions.

In Figure 4.25, the width of the arrows increases with the number of times that one subject spoke after the other. Figure 4.25 (a) reveals that the discussion was mainly between subjects 1 and 3. In the presence of information such as context or changes in the affective states of the participants, this transition matrix could be extremely useful in determining coalition and rivalry between participants. Such information can in turn inform decision-making efficacy in organizational communication.

Figure 4.26: Dynamic behavior of speakers' activeness over time.

The same high-level features can be estimated in small windows over time to analyze the dynamic behavior of the participants' interaction. Figure 4.26 shows the percentage of time that each speaker was active, in one-minute windows. The figure also compares the results derived from automatic and manual annotations. A measure of the participants' engagement can be inferred from this dynamic feature. In this example, the figure shows that subject 4 only occasionally contributed to the discussion, suggesting that he was not engaged.
Conversely, subjects 1, 2 and 3 participated in the discussion during the entire session. For a more reliable estimation of the participants' engagement, recognition of gestures such as body posture and head orientation could be added to our current system.

Again, these features can be useful for enriching meeting information retrieval. Knowing the segments of the meeting in which a specific participant was speaking can help target the intended search. Such information can also be useful as a training tool for improving participants' skills during discussions. In post-hoc analysis, for instance, after being shown a meeting summary report, subject 1 declared that he had not been aware of how dominant he was during the meeting. Now, his strategies and style may change toward a more productive discussion in which everyone contributes ideas.

Notice that the high-level features calculated from automatic and manual annotations do not differ significantly, which indicates that the fusion algorithm provides quite accurate and useful active speaker segmentation. Therefore, the proposed system can be used for real-time monitoring of human interaction.

4.4 Conclusions

This chapter presented approaches to recognize non-linguistic cues. At the individual level, the results presented here show that it is feasible to recognize human affective states with high accuracy through the use of audio and visual modalities. A novel approach was proposed to recognize emotional versus neutral utterances from speech features. Instead of building emotional models, we proposed the use of neutral models against which the input speech is contrasted. A fitness measure was then used for this binary classification. This framework was implemented with spectral and prosodic features. The results revealed that this approach outperforms conventional approaches in terms of accuracy and robustness.

Since emotions are conveyed using different communicative channels, this chapter also analyzed multimodal emotion recognition using motion capture data. We studied the strengths and weaknesses of facial expression classifiers and acoustic emotion classifiers. In these unimodal systems, some pairs of emotions are usually misclassified. However, the results presented in this chapter show that most of these confusions can be resolved by the use of the other modality. Therefore, the performance of the bimodal emotion classifier was higher than that of each of the unimodal systems.

Two fusion approaches were compared: feature-level and decision-level fusion. The overall performance of both approaches was similar. However, the recognition rates for specific emotions presented significant discrepancies. In the feature-level bimodal classifier, anger and the neutral state were accurately recognized compared to the facial expression classifier, which was the best unimodal system. In the decision-level bimodal classifier, happiness and sadness were classified with high accuracy. Therefore, the best fusion technique will depend on the application.

At the small-group level, this chapter evaluated the use of high-level information derived from the automatic speaker segmentation estimated by our Smart Room system to infer how the participants interacted in a meeting. The results show that with these features it is possible to infer not only the flow of the discussion, but also how dominant and engaged each participant is.
This information, which cannot be accurately derived from any single sensor modality by itself, is important for many applications such as summarization, classification, retrieval and analysis of meetings.

Our ongoing research is focused on improving our Smart Room system with the long-term goal of understanding how human beings communicate, especially in multiparty interactions. We are working to improve our tracking and fusion algorithms to achieve more reliable and robust active speaker localization. We are also working on gesture recognition from the visual modalities, as well as spoken language processing, which can provide further detailed measures of participant emotions, awareness and engagement.

The results presented in this chapter suggest that the next generation of human-computer interfaces might be able to perceive human feedback, and to respond appropriately and opportunely to changes in the users' affective states, improving the performance and engagement of current interfaces.

Chapter 5: Synthesis of Human-Like Gestures

This chapter describes a data-driven approach to synthesize head motion sequences from speech. According to Section 3.1.4.4, the relationship between head motion and acoustic features is easier to model when prosodic features are used. Therefore, the proposed system receives prosodic features as input and generates head motion sequences as output, which are used to generate realistic expressive facial animations. Three questions are addressed in this chapter: (1) How important is rigid head motion for natural facial animation? (2) Does head motion change our emotional perception? (3) Can emotional and natural head motion sequences be synthesized from prosodic features alone? This chapter shows that head motion is perceptually important for expressive facial animation. It also shows that, by using HMMs to model the interrelation between head motion and prosodic features, it is possible to synthesize realistic head motion sequences that are perceived as being as natural as the original sequences.

5.1 Introduction

The development of new human-computer interfaces and of exciting applications such as video games and animated feature films has motivated the computer graphics community to generate realistic avatars with the ability to replicate and mirror natural human behavior. Since the use of large motion capture datasets is expensive and can only be applied to delicately planned scenarios, new automatic systems are needed to generate natural human facial animation.

As mentioned in Chapter 3, in normal human-human interaction, gestures and speech are intricately coordinated to express and emphasize ideas and to provide suitable feedback. These interrelations need to be considered in the design of realistic human animation to effectively engage the users. One useful and practical approach to include these relationships is to synthesize animated human faces driven by speech. The most straightforward use of speech in facial animation is lip motion synthesis, in which the acoustic phonemes are used to generate visual visemes that match the spoken sentences. Examples of these approaches include [16, 17, 38, 56, 72, 107, 125]. Speech has also been used to drive human facial expressions, under the assumption that the articulation of the mouth and jaw modifies the facial muscles, producing different face poses. Examples of these approaches are [41, 109].
One important component of our body language that has received little attention compared to other non-verbal gestures is rigid head motion. Head motion is important not only to acknowledge active listening or to replace verbal information (e.g., a nod), but also for many other aspects of human communication. Munhall et al. showed that head motion improves the acoustic perception of speech [137]. They also suggested that head motion helps to distinguish between interrogative and declarative statements. Hill and Johnston found that head motion also helps to recognize speaker identity [88]. Graf et al. showed that the timing of head motion and the prosodic structure of the text are consistent [79], suggesting that head motion is useful for segmenting the spoken content. In addition, we hypothesize that head motion provides useful information about the mood of the speaker, as suggested by [79]. We believe that people use specific head motion patterns to emphasize their affective states.

Given the importance of head motion in human communication, this aspect of non-verbal gestures should be properly included in an engaging talking avatar. The manner in which people move their head depends on several factors, such as speaker styles and idiosyncrasies [88]. However, the production of speech seems to play a crucial role in the production of rigid head motion. Kuratate et al. [109] presented preliminary results about the close relation between head motion and acoustic prosody. They concluded, based on the strong correlation between these two streams of data (r = 0.8), that the production systems of speech and head motion are internally linked. This suggests that the tone and intonation of the speech provide important cues about head motion, and vice versa [137]. Therefore, head motion can be estimated from prosodic features. Notice that, here, how the speech is uttered matters more than what is said. Thus, prosodic features (e.g., pitch and energy) are more suitable than vocal tract-based features (e.g., LPC and MFCC). Also, the results from Section 3.1.4.4 suggest that the mapping between prosodic features and head motion is easier to learn than the mapping with vocal-tract features. The work of [184] even reports that about 80% of the variance observed in the pitch can be determined from head motion.

In this chapter, an innovative technique is presented to generate natural head motion directly from acoustic prosodic features. We model the problem as classification over discrete representations of head poses, instead of estimating mapping functions between head motion and prosodic features, as in [41, 79]. First, vector quantization is used to produce a discrete representation of head poses. Then, a Hidden Markov Model (HMM) is trained for each cluster, which models the temporal relation between the prosodic features and the head motion sequence. Given that the mapping is not one-to-one, the observation probability density is modeled with a mixture of Gaussians. A smoothness constraint is imposed by defining a bi-gram model (first-order Markov model) on head poses learned from the database. Then, given new speech material, the HMMs, working as a sequence generator, produce the most likely head motion sequence. Finally, a smoothing operation based on spherical cubic interpolation is applied to generate the final head motion sequence. Notice that prosodic features predominantly describe the source of speech rather than the vocal tract.
Therefore, this head motion synthesis system is independent of the specific lexical content of what is spoken, which reduces the size of the database needed to train the models. In addition, prosodic features contain important clues about the affective state of the speaker. Consequently, the proposed model can be naturally extended to include the emotional content of the head motion sequence, by building HMMs appropriate for each emotion instead of generic models.

In this chapter, we address three fundamental questions: (1) How important is rigid head motion for natural facial animation? (2) Does head motion change our emotional perception? (3) Can emotional and natural head motion be synthesized from prosodic features alone? To answer these questions, the temporal behaviors of head motion sequences extracted from our audiovisual database were analyzed for three emotions (happiness, anger and sadness) and the neutral state. The results show that the dynamics of head motion accompanying neutral speech significantly differ from the dynamics of head motion accompanying emotional speech. These results suggest that emotional models need to be included to synthesize head motion sequences that effectively reflect these characteristics. Following this direction, a head motion synthesis approach is presented that includes emotional models that learn the temporal dynamics of real emotional head motion sequences. To investigate whether rigid head motion affects our perception of the emotion, we synthesized facial animations with deliberate mismatches between the emotional speech and the emotional head motion sequence. Human raters were asked to assess the emotional content and the naturalness of the animations. In addition, animations without head motion were also included in the evaluation. Our results indicate that head motion significantly improves the perceived naturalness of the facial animation. They also show that head motion changes the emotional content perceived from the animation, especially in the valence and activation domains. Therefore, head motion can be appropriately and advantageously included in facial animation to emphasize the emotional content of the talking avatars.

The chapter is organized as follows. Section 5.2 motivates the use of audiovisual information to synthesize expressive facial animations and presents the related work in head motion synthesis. Section 5.3 describes the audiovisual database, the head pose representation and the acoustic features used in the chapter. Section 5.4 presents statistical measures of head motion displayed during expressive speech. Section 5.5 describes the multimodal framework, based on HMMs, to synthesize realistic head motion sequences. Section 5.6 summarizes the facial animation techniques used to generate the expressive talking avatars. Section 5.7 presents and discusses the subjective evaluations employed to measure the perception of emotion and naturalness under different expressive head motion sequences. Finally, Section 5.8 gives the concluding remarks and our future research directions.

5.2 Related work

5.2.1 Emotion analysis

For engaging talking avatars, special attention needs to be given to including emotional capability in the virtual characters. Importantly, Picard has underscored that emotions play a crucial role in rational decision making, in perception and in human interaction [148].
Therefore, applications such as virtual teachers, animated films and new human-machine interfaces can be significantly improved by designing control mechanisms that animate the character to properly convey the desired emotion. Human beings are especially good not only at inferring the affective state of other people, even when emotional clues are subtly expressed, but also at recognizing non-genuine gestures, which challenges the design of these control systems.

The production mechanisms of gestures and speech are internally linked in our brain. Cassell et al. noted that they are not only strongly connected, but also systematically synchronized at different scales (phonemes, words, phrases, sentences) [29]. They suggested that hand gestures, facial expressions, head motion and eye gaze occur at the same time as speech, and that they convey information similar to that of the acoustic signal. Similar observations were made by Kettebekov et al. [101], who studied deictic hand gestures (e.g., pointing) and the prosody of the speech in the context of gesture recognition. They concluded that there is a multimodal coarticulation of gestures and speech, which are loosely coupled. From an emotional expression point of view, it has been observed that in communication human beings jointly modify gestures and speech to express emotions. Therefore, a more complete human-computer interaction system should include details of the emotional modulation of gestures and speech.

In sum, all these findings suggest that the control system used to animate virtual human-like characters needs to be closely related to, and synchronized with, the information provided by the acoustic signal. This is especially important if a believable talking avatar conveying a specific emotion is desired. Following this direction, Cassell et al. proposed a rule-based system to generate facial expressions, hand gestures and spoken intonation, which were properly synchronized according to rules [29]. Other talking avatars that take into consideration the relation between speech and gestures to control the animation were presented in [16, 86, 97, 105]. Given that head motion also presents a similarly close temporal relation with speech [79, 109, 184] (Chapter 3), this chapter proposes to use HMMs to jointly model these streams of data.

5.2.2 Head motion synthesis

Researchers have presented various techniques to model head motion. Pelachaud et al. [147] generated head motions from labeled text by predefined rules, based on Facial Action Coding System (FACS) representations [66]. Cassell et al. [29] automatically generated appropriate non-verbal gestures, including head motion, for conversational agents, but their focus was only the "nod" head motion. Graf et al. [79] estimated the conditional probability distribution of major head movements (e.g., nods) given the occurrences of pitch accents, based on their collected head motion data. Costa et al. [41] used a Gaussian Mixture Model (GMM) to model the connections between audio features and visual prosody; the connection between eyebrow movements and audio features was specifically studied in their work. Chuang and Bregler [35] presented a data-driven approach to synthesize novel head motion corresponding to input speech. They first acquired a head motion database indexed by pitch values; a new head motion sequence was then synthesized by choosing and combining the best-matched recorded head motion segments in the constructed database.
Deng et al. [54] presented an audio-driven head motion synthesis technique that synthesizes appropriate head motion with keyframing control. After an audio-head motion database was constructed, given novel speech input and user controls (e.g., specified key head poses), a guided dynamic programming technique was used to generate an optimal head motion sequence that maximally satisfies both the speech and the key-frame specifications.

In this work, we propose to use HMMs to capture the close temporal relation between head motion and acoustic prosodic features. We also propose an innovative two-step smoothing technique based on bi-gram models learned from data and on spherical cubic interpolation.

5.3 Audio-visual database

The same audiovisual database presented in Section 2.1 was used in this research. Notice that the actress did not receive any instruction about how to move her head.

In previous work, head motion has been modeled with 6 degrees of freedom (DOF), corresponding to head rotation (3 DOF) and translation (3 DOF) [54, 184]. However, for practical reasons, in this analysis we consider only head rotation. As discussed in Section 5.5, the space spanned by the head motion features is split using vector quantization. For a constant quantization error, the number of clusters needed to span the head motion space increases with the dimension of the feature vector. Since an HMM is built for each head pose cluster, it is preferable to model head motion with only a 3-dimensional feature vector, thereby decreasing the number of HMMs. Furthermore, since most avatar applications require a close view of the face, translation effects are considerably less important than the effects of head rotation. Thus, the 3 DOF of head translation are not considered here, reducing the number of required HMM models and the expected quantization errors.

Figure 5.1: Head poses using Euler angles

The 3D head rotation features, the pitch (F0) and the RMS energy were extracted using the same procedure presented in Section 3.1.3.1. The pitch (F0), the RMS energy, and their first and second derivatives were used as prosodic features.

5.4 Head motion characteristics in expressive speech

To investigate head motion in expressive speech, the audiovisual data were separated according to the four emotional categories. Different statistical measurements were computed to quantify the patterns of rigid head motion during expressive utterances.

Canonical Correlation Analysis (CCA) was applied to the audiovisual data to validate the close relation between rigid head motion and the acoustic prosodic features. CCA provides a scale-invariant optimal linear framework to measure the correlation between two streams of data with equal or different dimensionality. The basic idea is to project the features into a common space in which Pearson's correlation can be computed. The first part of Table 5.1 shows these results. A one-way Analysis of Variance (ANOVA) indicates that there are significant differences between emotional categories (F[3,640], p = 0.00013). Multiple comparison tests also show that the CCA average of neutral head motion sequences is different from the CCA means of sad (p = 0.001) and angry (p = 0.001) head motion sequences. Since the average of the first-order canonical correlation for each emotion is over r = 0.69, it can be inferred that head motion and speech prosody are strongly linked. Consequently, meaningful information can be extracted from prosodic features to synthesize rigid head motion.
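The first-order canonical correlation can be sketched with scikit-learn's CCA, which here stands in for the analysis tool actually used; the per-frame alignment of the two streams is assumed to be done upstream.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def first_order_cca(head_angles, prosody):
    """First canonical correlation between (T, 3) Euler angles and (T, d)
    prosodic features of one utterance."""
    u, v = CCA(n_components=1).fit_transform(head_angles, prosody)
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]

# Computing this value per utterance and grouping by emotion yields the
# first row of Table 5.1; a one-way ANOVA over the per-utterance values
# (e.g., scipy.stats.f_oneway) tests the differences between emotions.
```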
Table 5.1: Statistics of rigid head motion (the three rows of the motion coefficient and of the range correspond to the three Euler angles)

                                          Neu     Sad     Hap     Ang
Canonical correlation analysis            0.74    0.74    0.71    0.69
Motion coefficient [deg]                  3.32    4.76    6.41    5.56
                                          0.88    3.23    2.60    3.67
                                          0.81    2.20    2.32    2.69
Range [deg]                               9.54   13.71   17.74   16.05
                                          2.31    8.29    6.14    9.06
                                          2.27    6.52    6.67    8.21
Velocity magnitude [deg/sample]   Mean    0.08    0.11    0.15    0.18
                                  Std     0.07    0.10    0.13    0.15
Discriminant analysis             Neu     0.92    0.02    0.04    0.02
                                  Sad     0.15    0.61    0.11    0.13
                                  Hap     0.14    0.09    0.59    0.18
                                  Ang     0.14    0.11    0.25    0.50

To measure the motion activity in each of the three Euler angles, we estimated a motion coefficient, \sigma, defined as the standard deviation of the sentence-level mean-removed signal,

\sigma = \sqrt{ \frac{1}{NT} \sum_{u=1}^{N} \sum_{t=1}^{T} (x_t^u - \mu_u)^2 }     (5.1)

where T is the number of frames, N is the number of utterances, and \mu_u is the mean of sentence u. Notice that the definition of this motion coefficient is slightly different from the one presented in Section 3.1.4.1.

The results shown in Table 5.1 suggest that the head motion activity displayed when the speaker is in an emotional state (sadness, happiness or anger) is much higher than the activity displayed during neutral speech. Furthermore, it can be observed that the head motion activity associated with sadness is slightly lower than the activity for happiness or anger. As an aside, it is interesting to note that similar trends with respect to emotional state have been observed in articulatory data of tongue and jaw movement [119].

Table 5.1 also shows the average ranges of the three Euler angles that define the head poses. The results indicate that during emotional utterances the head is moved over a wider range than in normal speech, which is consistent with the results of the motion coefficient analysis.

The velocity of head motion was also computed. The average and the standard deviation of the head motion velocity magnitude are presented in Table 5.1. The results indicate that the head motion velocities for happy and angry sequences are about two times greater than those of neutral sequences. The velocities of sad head motion sequences are also greater than those of neutral head motion, but smaller than those of happy and angry sequences. In terms of variability, the standard deviation results reveal a similar trend. These results suggest that emotional head motion sequences present temporal behavior different from that of the neutral condition.

To analyze how distinct the patterns of rigid head motion for emotional sentences are, a discriminant analysis was applied to the data. The mean, standard deviation, range, maximum and minimum of the Euler angles computed at the sentence level were used as features. Fisher classification was implemented with the leave-one-out cross-validation method. Table 5.1 shows the results. On average, the recognition rate with head motion features alone was 65.5%. Notice that the emotional class with the lowest performance (anger) is still correctly classified with an accuracy higher than 50% (chance is 25%). These results suggest that there are distinguishable emotional characteristics in rigid head motion. Also, the high recognition rate of the neutral state implies that global patterns of head motion in normal speech are completely different from the patterns displayed under an emotional state.
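The discriminant analysis can be sketched as follows, with scikit-learn's linear discriminant analysis standing in for the Fisher classifier and leave-one-out cross-validation reproducing the evaluation protocol; the feature set mirrors the sentence-level statistics listed above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix

def sentence_features(euler):
    """Mean, std, range, max and min of a (T, 3) Euler-angle trajectory."""
    return np.concatenate([euler.mean(0), euler.std(0),
                           euler.max(0) - euler.min(0),
                           euler.max(0), euler.min(0)])

def emotion_confusion(utterances, labels):
    X = np.array([sentence_features(u) for u in utterances])
    y = np.array(labels)
    pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y,
                             cv=LeaveOneOut())
    return confusion_matrix(y, pred, normalize='true')
```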
These results suggest that people intentionally use head motion to express specific emotion patterns. Therefore, to synthesize expressive head motion sequences, suitable models for each emotion need to be built.

5.5 Rigid head motion synthesis

The proposed speech-driven head motion sequence generator uses HMMs because they provide a suitable framework to jointly model the temporal relation between prosodic features and head motion. Instead of estimating a mapping function [41, 79], designing rules according to the lexical content of the speech [29], or finding similar samples in the training data [54], we model the problem as classification over discrete representations of head poses, which are obtained through vector quantization. The Linde-Buzo-Gray Vector Quantization (LBG-VQ) technique [124] is used to compute K Voronoi cells in the 3D Euler angle space (Figure 5.2). The clusters are represented by their mean vectors U_i and covariance matrices \Sigma_i, with i = 1, ..., K. The pairs (U, \Sigma) define the finite and discrete set of code vectors called the codebook. For each of these clusters, V_i, an HMM is built to generate the most likely head motion sequence given the observations O, which correspond to the prosodic features. Consequently, the number of HMMs that need to be trained is given by the number of clusters (K) used to represent the head poses. The HTK toolkit is used to build these HMMs [189].

Figure 5.2: 2D projection of the Voronoi regions using 32-size vector quantization

In the quantization step, the continuous Euler angles of each frame are approximated with the closest code vector in the codebook. Two smoothing techniques are used to produce continuous head pose sequences. The first smoothing technique is imposed in the decoding step of the HMMs, by constraining the transitions between clusters. The second smoothing technique is applied during synthesis, by using spherical cubic interpolation to avoid breaks in the discrete representation. More details of these smoothing techniques are given in Sections 5.5.1 and 5.5.2, respectively.

As shown in the previous section (Section 5.4), the dynamics and the patterns of head motion sequences under emotional states are significantly different. Therefore, generic models do not reflect the specific emotional behaviors. In this chapter, the synthesis technique includes emotion-dependent HMMs. Instead of using generic models for the whole dataset, we propose building HMMs for each emotional category to incorporate into the models the emotional patterns of rigid head motion.
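A minimal sketch of the codebook construction is given below, with k-means standing in for the LBG-VQ algorithm; the resulting cluster means and covariances are the (U_i, Sigma_i) pairs used throughout this section.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(euler_frames, k=16):
    """Quantize (N, 3) per-frame Euler angles into K head-pose clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(euler_frames)
    means = km.cluster_centers_                                # U_i
    covs = np.stack([np.cov(euler_frames[km.labels_ == i].T)   # Sigma_i
                     for i in range(k)])
    return km.labels_, means, covs

# Each training frame is then labeled with its nearest code vector, and one
# HMM per cluster is trained (with HTK in this work) on the prosodic
# features aligned with that cluster.
```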
5.5.1 Learning relations between prosodic features and head motion

To synthesize realistic head motion, our approach searches for the sequence of discrete head poses that maximizes the posterior probability of the cluster models V = (V_{i_1}^{t}, V_{i_2}^{t+1}, \ldots), given the observations O = (o^{t}, o^{t+1}, \ldots):

\arg\max_{i_1, i_2, \ldots} P(V_{i_1}^{t}, V_{i_2}^{t+1}, \ldots \mid O)     (5.2)

This posterior probability is computed according to Bayes' rule as:

P(V \mid O) = \frac{P(O \mid V)\, P(V)}{P(O)}     (5.3)

P(O) is the probability of the observations, which does not depend on the cluster models; therefore, it can be treated as a constant. P(O | V) corresponds to the likelihood of the observations given the cluster models. This probability is modeled as a first-order Markov process with S states. Hence, the probability description includes only the current and previous states, which significantly simplifies the problem. For each of the S states, a mixture of M Gaussians is used to estimate the distribution of the observations. The use of mixtures of Gaussians models the many-to-many mapping between head motion and prosodic features. Under this formulation, the estimation of the likelihood reduces to computing the parameters of the HMMs, which can be estimated using standard methods such as the forward-backward and Baum-Welch re-estimation algorithms [152, 189].

P(V) in Equation 5.3 corresponds to the prior probability of the cluster models. This probability is used as a first smoothing technique to guarantee valid transitions between the discrete head poses. A first-order state machine is built to learn the transition probabilities of the clusters, using bi-gram models (similar to bi-gram language models [189]). The transitions between clusters are learned from the training data. In the decoding step of the HMMs, these bi-gram models are used to penalize or reward transitions between discrete head poses according to their occurrences in the training database. As our results suggest, the transitions between clusters are also emotion-dependent. Therefore, this prior probability is separately trained for each emotion category.

Notice that in the training procedure the segmentation of the acoustic signal is obtained from the vector quantization step. Therefore, the HMMs were initialized with this known segmentation, avoiding the forced alignment that is usually needed in speech recognition to align phonemes with the speech features.

Figure 5.3: Head motion synthesis framework.

5.5.2 Generating realistic head motion sequences

Figure 5.3 describes the proposed framework to synthesize head motion sequences. Using the acoustic prosodic features as input, the HMMs, previously trained as described in Section 5.5.1, generate the most likely head pose sequence, V̂ = (V̂_{i_1}^{t}, V̂_{i_2}^{t+1}, ...), according to Equation 5.2. After the sequence V̂ is obtained, the means of the clusters are used to form a 3D sequence, Ŷ = (U_{i_1}^{t}, U_{i_2}^{t+1}, ...), which is a first approximation of the head motion.

In the next step, colored noise is added to the sequence Ŷ according to Equation 5.4 (see Figure 5.3). The purpose of this step is to compensate for the quantization error of the discrete representation of head poses. The noise is colored with the covariance matrices of the clusters, \Sigma, so as to distribute the noise in proportion to the error introduced during vector quantization. The parameter \alpha is included in Equation 5.4 to attenuate, if desired, the level of noise used to blur the sequence Ŷ (e.g., \alpha = 0.7). Notice that this is an optional step that can be skipped by setting \alpha equal to zero. Figure 5.4 shows an example of Ẑ (blue solid lines).

\hat{Z}_i^{t} = \hat{Y}_i^{t} + \alpha\, W(\Sigma_i)     (5.4)
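The colored-noise step is a direct application of Equation 5.4. The sketch below assumes the decoded cluster sequence and the codebook are available; the symbol alpha follows the reconstruction of the equation above.

```python
import numpy as np

def add_colored_noise(cluster_seq, means, covs, alpha=0.7, seed=0):
    """Blur the piecewise-constant pose sequence with cluster-colored noise.

    cluster_seq: decoded cluster index per frame; means/covs: the codebook
    (U_i, Sigma_i).  Setting alpha to zero skips the step, as in the text.
    """
    rng = np.random.default_rng(seed)
    y_hat = means[cluster_seq]                 # first approximation, Y-hat
    noise = np.stack([rng.multivariate_normal(np.zeros(3), covs[i])
                      for i in cluster_seq])
    return y_hat + alpha * noise               # Z-hat, Equation 5.4
```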
As can be observed in Figure 5.4, the head motion sequence Ẑ shows breaks at the cluster transitions even when colored noise is added or the number of clusters is increased. To avoid these discontinuities, a second smoothing technique, based on spherical cubic interpolation [63], is applied to this sequence. With this technique, the 3D Euler angles are interpolated on the unit sphere using a quaternion representation. This approach performs better than interpolating each Euler angle separately, which has been shown to produce jerky movements and undesired effects such as gimbal lock [163].

In the interpolation step, the sequence Ẑ is down-sampled to 6 points per second to obtain equidistant frames. These frames are referred to here as key-points and are marked as circles in Figure 5.4. These 3D Euler angle points are then transformed into the quaternion representation [63]. Then, spherical cubic interpolation, squad, is applied over these quaternion points. The squad function builds upon spherical linear interpolation, slerp. The functions slerp and squad are defined by Equations 5.5 and 5.6,

\mathrm{slerp}(q_1, q_2, \tau) = \frac{\sin((1-\tau)\theta)}{\sin\theta}\, q_1 + \frac{\sin(\tau\theta)}{\sin\theta}\, q_2     (5.5)

\mathrm{squad}(q_1, q_2, q_3, q_4, \tau) = \mathrm{slerp}(\mathrm{slerp}(q_1, q_4, \tau),\, \mathrm{slerp}(q_2, q_3, \tau),\, 2\tau(1-\tau))     (5.6)

where the q_i are quaternions, \cos\theta = q_1 \cdot q_2, and \tau is a parameter that ranges between 0 and 1 and determines the frame position of the interpolated quaternion. Using these equations, the frames between key-points are interpolated by setting \tau at the specific times needed to recover the original sample rate (120 frames per second). The final step of this smoothing technique is to transform the interpolated quaternions back into the 3D Euler angle representation. Notice that the colored noise is applied before the interpolation step. Therefore, the final sequence, X̂, is a continuous and smooth head motion sequence without the jerky behavior of the noise.
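Equations 5.5 and 5.6 translate directly into code. The sketch below follows the argument convention used above, in which q1 and q4 are the end poses and q2 and q3 act as inner control quaternions; how those control quaternions are chosen is left out of the sketch, and the Euler/quaternion conversion is assumed to be handled elsewhere (for instance with scipy.spatial.transform.Rotation).

```python
import numpy as np

def slerp(q1, q2, t):
    """Spherical linear interpolation between unit quaternions (Eq. 5.5)."""
    cos_theta = np.clip(np.dot(q1, q2), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-6:                     # nearly identical orientations
        return q1
    return (np.sin((1 - t) * theta) * q1 + np.sin(t * theta) * q2) / np.sin(theta)

def squad(q1, q2, q3, q4, t):
    """Spherical cubic interpolation (Eq. 5.6)."""
    return slerp(slerp(q1, q4, t), slerp(q2, q3, t), 2 * t * (1 - t))

# Evaluating squad at the t values of the missing frames between key-points
# and converting the resulting quaternions back to Euler angles recovers the
# 120 frames-per-second head motion sequence.
```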
Figure 5.4: Example of a synthesized head motion sequence. The figure shows the 3D noisy signal Ẑ (Equation 5.4), with the key-points marked as circles, and the 3D interpolated signal X̂, used as the head motion sequence.

Figure 5.4 shows the synthesized head motion sequence for one example sentence. Finally, for animation, a blend shape face model composed of 46 blend shapes is used in this work (the eyeballs are controlled separately, as explained in Section 5.6). The head motion sequence X̂ is directly applied to the angle control parameters of the face model. The face modeling and rendering are done in Maya [130]. Details of the approach used to synthesize the face are given in Section 5.6.

5.5.3 Configuration of HMMs

The topology of an HMM is defined by the number and interconnection of its states. In this particular problem, it is not completely clear which HMM topology provides the best description of the dynamics of the head motion. The most common topologies are the left-to-right topology (LR), in which only forward transitions between adjacent states are allowed, and the ergodic topology (EG), in which the states are fully connected. The LR topology is simple and needs less data to train its parameters. The EG topology is less restricted, so it can learn a larger set of state transitions from the data. For simplicity, eight generic, emotion-independent HMM configurations, described in Table 5.2, with different topologies, numbers of models K, numbers of states S, and numbers of mixtures M, were trained.

Table 5.2: Results for different configurations using generic HMMs (D: Euclidean distance; CCA: canonical correlation analysis)

HMM configuration        D mean   D std   CCA mean   CCA std
K=16, S=5, M=2, LR       10.2     3.4     0.88       0.11
K=16, S=5, M=4, LR        9.3     3.4     0.87       0.11
K=16, S=3, M=2, LR        9.1     3.6     0.87       0.12
K=16, S=3, M=2, EG        9.1     3.4     0.87       0.10
K=16, S=3, M=4, EG        9.5     3.4     0.83       0.12
K=32, S=5, M=1, LR       12.8     4.0     0.83       0.14
K=32, S=3, M=2, LR       10.7     3.3     0.86       0.12
K=32, S=3, M=1, EG       10.4     3.1     0.86       0.11

Notice that the size of the database is not big enough to train more complex HMMs with more states, mixtures or models than those described in Table 5.2. As can be seen from Table 5.2, the performance of the different HMM topologies, in terms of Euclidean distance and canonical correlation analysis, is similar. The left-to-right HMM with a 16-size codebook, 3 states and 2 mixtures achieves the best result. However, if the database were big enough, an ergodic topology with more states and mixtures could perhaps give better results. When emotional models are used instead of generic models, the training data is even smaller, since emotion-dependent models are separately trained. Therefore, the HMMs used in the experiments were implemented using an LR topology with 2 states (S = 2) and 2 mixtures (M = 2).

Another important parameter that needs to be set is the number of HMMs, which is directly related to the number of clusters K. If K increases, the quantization error of the discrete representation of head poses decreases. However, the discrimination between models will significantly decrease, and more training data will be needed. Therefore, there is a tradeoff between the quantization error and the inter-cluster discrimination. In these experiments, a 16-word codebook (K = 16) is used.

5.5.4 Objective evaluation

Table 5.3: Canonical correlation analysis between original and synthesized head motion sequences

        Neutral   Sadness   Happiness   Anger
Mean     0.86      0.88       0.89       0.91
Std      0.12      0.11       0.08       0.08

Table 5.3 shows the average and standard deviation of the first-order canonical correlation between the original and the synthesized head motion sequences. As can be observed, the emotional sequences generated from the prosodic features are highly correlated with the original signals (r > 0.85). Notice that the first-order canonical correlation between the prosodic speech features and the original head motion sequences was about r = 0.72 (see Table 5.1). This result shows that, even though the prosodic speech features do not provide complete information to synthesize the head motion, the performance of the proposed system is notably high. This result is confirmed by the subjective evaluations presented in Section 5.7.

To compare how different the emotional HMMs presented in this work are, an analytic approximation of the Kullback-Leibler Distance (KLD) was implemented. The KLD, or relative entropy, provides the average discrimination information between the probability density functions of two random variables. Therefore, it can be used to compare distances between models. Unfortunately, there is no closed-form analytic expression for Markov chains or HMMs. Therefore, numerical approximations, such as Monte Carlo simulation, or analytic upper bounds for the KLD need to be used [59, 164]. Here, we use the analytic approximation of the Kullback-Leibler Distance rate (KLDR) presented by Do, which is fast and deterministic, and which has been shown to produce results similar to those obtained through Monte Carlo simulations [59].

Figure 5.5 shows the distance between the emotional HMMs for eight head-motion clusters. Even though some of the emotional models are close, most of them are significantly different. Figure 5.5 reveals that the happy and angry HMMs are closer to each other than any other pair of emotional categories. As discussed in Section 5.4, the head motion characteristics of happy and angry utterances are similar, so it is not surprising that they share similar HMMs.
This result indicates that a single model might be used to synthesize both happy and angry head motion sequences. However, in the experiments presented in this chapter, a separate model was built for each emotion.

Figure 5.5: Kullback-Leibler distance rate (KLDR) of the HMMs for eight head-motion clusters. Light colors mean that the HMMs are different, and dark colors mean that the HMMs are similar. The figure reveals the differences between the emotion-dependent HMMs.

Figure 5.6: Overview of the data-driven expressive facial animation synthesis system. The system is composed of three parts: recording, modeling, and synthesis.

5.6 Facial animation synthesis

Although this chapter is focused on head motion, for realistic animations every facial component needs to be modeled. Expressive visual speech and eye motion were synthesized with the techniques presented in [53, 55, 56, 57]. This section briefly describes these approaches, which are very important for creating a realistic talking avatar.

Figure 5.6 illustrates the overview of our data-driven facial animation synthesis system. In the recording stage, expressive facial motion and its accompanying acoustic signal are simultaneously recorded and preprocessed. In the modeling step, two approaches are used to learn the expressive facial animation: neutral speech motion synthesis [56] and dynamic expression synthesis [53]. The neutral speech motion synthesis approach learns explicit but compact speech co-articulation models by encoding co-articulation transition curves from recorded facial motion capture data, based on a weight-decomposition method that decomposes any motion frame into linear combinations of neighboring viseme frames. Given a new phoneme sequence, this system synthesizes the corresponding neutral visual speech motion by concatenating the learned co-articulation models. The dynamic expression synthesis approach constructs a Phoneme-Independent Expression Eigen-Space (PIEES) by a phoneme-based time warping and subtraction that extracts neutral motion signals from the captured expressive motion signals. It is assumed that this subtraction removes the "phoneme-dependent" content from the expressive speech motion capture data [53]. These phoneme-independent signals are further reduced by Principal Component Analysis (PCA) to create an expression eigen-space, referred to here as the PIEES [53].
Then, novel dynamic expression sequences are generated from the constructed PIEES by texture-synthesis approaches originally used in graphics to synthesize similar but different images from a small image sample. In the synthesis step, the synthesized neutral speech motions are weight-blended with the synthesized expression signals to generate expressive facial animation.

In addition to expressive visual speech synthesis, we used a texture-synthesis-based approach to synthesize realistic eye motion for the talking avatars [55]. Eye gaze is one of the strongest cues in human communication. When a person speaks, he or she looks at our eyes to judge our interest and attentiveness, and we look into his or her eyes to signal our intent to talk. We adapted data-driven texture synthesis approaches [121], originally used in 2D image synthesis, to the problem of realistic eye motion modeling. Eye gaze and the aligned eye blink motion are considered together as an "eye motion texture" sample. The samples are then used to synthesize novel but similar eye motions. In our work, the patch-based sampling algorithm [121] is used, due to its time efficiency. The basic idea is to generate one texture patch (of fixed size) at a time, randomly chosen from the qualified candidate patches in the input texture sample. Figure 5.7 illustrates the synthesized eye motion results.

Figure 5.7: Synthesized eye-gaze signals. Here, the solid line (blue) represents synthesized gaze signals, and the dotted line (red) represents captured signal samples.

Figure 5.8 shows frames of the synthesized data for happy and angry sentences. The texts of the sentences are "We lost them at the last turnoff" and "And so you just abandoned them?", respectively.

Figure 5.8: Synthesized sequences for a happy (top) and an angry (bottom) sentence.

5.7 Evaluation of emotional perception from animated sequences

To analyze whether head motion patterns change the perceived emotion of the speaker, various combinations of facial animations were created, including deliberate mismatches between the emotional content of the speech and the emotional pattern of the head motion, for four sentences in our database (one for each emotion). Given that the actress repeated each of these sentences under four emotional states, we generated facial animations with speech associated with one emotion and recorded head motion associated with a different emotion. Altogether, 16 facial animations were created (4 sentences x 4 emotions). The only complication was that the timing between the repetitions of these sentences was different; this was overcome by aligning the sentences using Dynamic Time Warping (DTW) [52]. After the acoustic signals were aligned, the optimal synchronization path was applied to the head motion sequences, which were then used to create the mismatched facial animations (Figure 5.9). In the DTW process, some emotional characteristics could be removed, especially for sad sentences, in which the syllable duration is inherently longer than in other emotions. However, most of the dynamic behaviors of the emotional head motion sequences are nevertheless preserved. Notice that even though lip and eye motions were also included in the animations, the only parameter that was changed was the head motion.

Figure 5.9: Dynamic Time Warping. Optimum path (left panel) and warped head motion signal (right panel).
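A sketch of the alignment step is shown below, using librosa's DTW on MFCC features as a stand-in for the actual implementation; the frame parameters are assumptions, and the head motion is assumed to have been resampled to the audio analysis frame rate beforehand.

```python
import numpy as np
import librosa

def warp_head_motion(wav_ref, wav_other, head_other, sr=16000, hop=160):
    """Warp the head motion of one repetition onto the timing of another.

    wav_ref / wav_other: audio files of the two repetitions of a sentence;
    head_other: (T, 3) Euler angles synchronized with wav_other and
    resampled to the MFCC frame rate (an assumption of this sketch).
    """
    y_ref, _ = librosa.load(wav_ref, sr=sr)
    y_oth, _ = librosa.load(wav_other, sr=sr)
    m_ref = librosa.feature.mfcc(y=y_ref, sr=sr, hop_length=hop)
    m_oth = librosa.feature.mfcc(y=y_oth, sr=sr, hop_length=hop)
    _, wp = librosa.sequence.dtw(X=m_ref, Y=m_oth)   # optimal warping path
    wp = wp[::-1]                                    # path is returned end-to-start
    warped = np.zeros((wp[-1, 0] + 1, head_other.shape[1]))
    for i, j in wp:                                  # map frames of the other take
        warped[i] = head_other[min(j, len(head_other) - 1)]
    return warped
```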
For assessment, 17 human subjects were asked to rate the emotions conveyed and the naturalness of the synthesized data, presented as short animation videos. The animations were presented to the subjects in random order. The evaluators received instructions to rate their overall impression of the animation and not individual aspects such as head movement or voice quality.

The emotional content was rated using three emotional attributes ("primitives"), namely valence, activation and dominance, following a concept proposed by Kehrein [100]. Valence describes the positive or negative strength of the emotion, activation describes the excitation level (high vs. low), and dominance refers to the apparent strength or weakness of the speaker. Describing emotions by attributes in an emotional space is a powerful alternative to assigning class labels such as sadness or happiness [44], since the primitives can easily capture emotion dynamics and speaker dependencies. Also, there are different degrees of emotion that cannot be measured if only categorical labels are used (e.g., how "happy" or "sad" the stimulus is). Therefore, these emotional attributes are more suitable for evaluating emotional salience in human perception. Notice that for animation we propose to use categorical classes, since the specifications of expressive animations are usually described in terms of emotion categories rather than emotional attributes.

As a tool for emotion evaluation, Self-Assessment Manikins (SAMs) were used [73, 81], as shown in Figure 5.10. For each emotion primitive, the evaluators had to select one out of five iconic images ("manikins"). The SAM system has previously been used successfully for the assessment of emotional speech, showing low standard deviation and high inter-evaluator agreement [81]. Also, using a text-free assessment method bypasses the difficulty that each evaluator has his or her own understanding of linguistic emotion labels. For each SAM row in Figure 5.10, the selection was mapped to the range 1 to 5 from left to right. The naturalness of the animation was also rated on a five-point scale, whose extremes were labeled robot-like (value 1) and human-like (value 5). In addition to the animations, the evaluators also assessed the underlying speech signal without the video signal. This rating was used as a reference.

Figure 5.10: Self-Assessment Manikins [73]. The rows illustrate: top, Valence [1-positive, 5-negative]; middle, Activation [1-excited, 5-calm]; and bottom, Dominance [1-weak, 5-strong].

Table 5.4: Subjective agreement evaluation, variance about the mean

Valence   Activation   Dominance   Naturalness
 0.52        0.49         0.48         0.97

Table 5.4 presents the average inter-evaluator variance of the scores given by the human subjects, in terms of emotional attributes and naturalness. These measures confirm the high inter-evaluator agreement for the emotional attributes. The results also show that the naturalness of the animations was perceived slightly differently across evaluators, which suggests that the concept of naturalness is more person-dependent than the emotional attributes. However, this variability does not bias our analysis, since we consider differences between the scores given to the facial animations.

Figures 5.11, 5.12 and 5.13 show the results of the subjective evaluations in terms of emotional perception. Each quadrant has the error bars for six different facial animations with head motion synthesized with (from left to right): the original sequence (without mismatch), three mismatched sequences (one for each emotion), the synthesized sequence, and fixed head poses. In addition, the result for the audio alone (WAV) is also included. For example, the second quadrant in the upper-left block of Figure 5.11 shows the valence assessment for the animation with neutral speech and a sad head motion sequence. To measure whether the difference between the means of two of these groups is significant, a two-tailed Student's t-test was used.
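The significance test can be sketched with SciPy; a paired (related-samples) test is consistent with the reported degrees of freedom (df = 16 for the 17 evaluators). The ratings in the example are illustrative only, not the actual evaluation data.

```python
import numpy as np
from scipy import stats

def compare_conditions(scores_original, scores_mismatched):
    """Two-tailed paired t-test over the 17 evaluators' ratings."""
    t, p = stats.ttest_rel(scores_original, scores_mismatched)   # df = n - 1
    return t, p

# Illustrative values only (one rating per evaluator and condition):
orig = np.array([2, 3, 2, 2, 3, 2, 1, 2, 3, 2, 2, 3, 2, 2, 1, 3, 2])
mism = np.array([3, 3, 2, 3, 4, 3, 2, 2, 3, 3, 2, 4, 3, 2, 2, 3, 3])
print(compare_conditions(orig, mism))
```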
As an example of how to read these plots, the second quadrant in the upper-left block of Figure 5.11 shows the valence assessment for the animation with neutral speech and a sad head motion sequence. To measure whether the difference in the means of two of these groups is significant, the two-tailed Student's t-test was used.

In general, the figures show that the emotional perception changes in the presence of different emotional head motion patterns. In the valence domain (Figure 5.11), the results show that when the talking avatar with angry speech is animated with happy head motion, the attitude of the character is perceived as more positive. The t-test result indicates that the difference in the scores between the mismatched and the original animations is statistically significant (t=2.384, df = 16, p = 0.03). The same result also holds when sad and neutral speech is synthesized with happy head motion sequences. For these pairs, the t-test results are (t=2.704, df = 16, p = 0.016) and (t=2.384, df = 16, p = 0.03), respectively. These results suggest that the temporal pattern of happy head motion gives the animation a more positive attitude. Figure 5.11 also shows that when neutral or happy speech is synthesized with angry head motion sequences, the attitude of the character is perceived as slightly more negative. However, the t-test reveals that these differences do not reach statistical significance.

Figure 5.11: Subjective evaluation of emotions conveyed in the valence domain [1-positive, 5-negative]. Panels: (a) Neutral, (b) Sadness, (c) Happiness, (d) Anger. Each quadrant shows the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).

In the activation domain (Figure 5.12), the results show that the animation with happy speech and an angry head motion sequence is perceived with a higher level of excitation. The t-test result indicates that the differences in the scores are significant (t=2.426, df = 16, p = 0.027). On the other hand, when the talking avatar with angry speech is synthesized with happy head motion, the animation is perceived as slightly calmer, as observed in Figure 5.12. Notice that in the acoustic domain, anger is usually perceived as more excited than happiness, as reported in [45, 161], and as shown in the evaluations presented here (see the last bars of (c) and (d) in Figure 5.12). Our results suggest that the same trend is observed in the head motion domain: angry head motion sequences are perceived as more excited than happy head motion sequences. When the animation with happy speech is synthesized with sad head motion, the talking avatar is perceived as more excited (t=2.184, df = 16, p = 0.044). It is not clear whether this result, which is less intuitive than the previous ones, is a true effect generated by the combination of modalities, which together produce a different percept (similar to the McGurk effect [132]), or an artifact introduced in the warping process.
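The statistical comparisons reported above can be reproduced with a standard paired test. The sketch below assumes, consistent with the reported df = 16 for 17 evaluators, that the scores for a mismatched animation and for the corresponding original animation are paired by evaluator; the ratings are synthetic and only illustrate the mechanics of the test.

    # Illustrative sketch of the significance test used above: a two-tailed paired
    # Student's t-test across the 17 evaluators, comparing the scores given to a
    # mismatched animation with those given to the original one (df = 17 - 1 = 16).
    # The numbers below are made up for the example; they are not the study's data.
    import numpy as np
    from scipy.stats import ttest_rel

    rng = np.random.default_rng(2)
    scores_original   = rng.integers(1, 6, size=17).astype(float)  # e.g., valence scores
    scores_mismatched = np.clip(scores_original + rng.normal(0.5, 0.8, size=17), 1, 5)

    t_stat, p_value = ttest_rel(scores_mismatched, scores_original)
    print(f"t = {t_stat:.3f}, df = {len(scores_original) - 1}, p = {p_value:.3f}")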
In the dominance domain, Figure 5.13 shows that the mismatched head motion sequences do not significantly modify how dominant the talking avatar is perceived to be, compared to the animations with the original head motion sequences. For example, the animation with neutral speech and happy head motion is perceived as slightly stronger. A similar result is observed when the animation with happy speech is synthesized with an angry head motion sequence. However, the t-test reveals that the differences in the mean scores between the animations with mismatched and original head motion sequences are not statistically significant: (t=-1.461, df = 16, p = 0.163) and (t=-1.289, df = 16, p = 0.216), respectively. These results suggest that head motion has less influence in the dominance domain than in the valence and activation domains. A possible explanation of this result is that human listeners may rely more on other facial gestures, such as eyebrow and forehead motion, to infer how dominant the speaker is. Also, the intonation and the energy of the speech may play a more important role than head motion in dominance perception.

Figure 5.12: Subjective evaluation of emotions conveyed in the activation domain [1-excited, 5-calm]. Panels: (a) Neutral, (b) Sadness, (c) Happiness, (d) Anger. Each quadrant shows the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).

Notice that the emotional perception of the animations synthesized without head motion usually differs from the emotion perceived from the animations with the original sequences. This is especially clear in the valence domain, as can be observed in Figure 5.11. The differences in the mean scores in Figures 5.11(a) and 5.11(b) between the fixed head motion and the original animations are statistically significant, as shown by the t-test: (a) (t=2.746, df = 16, p = 0.014) and (b) (t=2.219, df = 16, p = 0.041). For (c) and (d), the differences in the means observed in the figure do not reach significance: (c) (t=-1.144, df = 16, p = 0.269), (d) (t=2.063, df = 16, p = 0.056). This result suggests that head motion has a strong influence on the perception of how positive or negative the affective state of the avatar is.

Figure 5.13: Subjective evaluation of emotions conveyed in the dominance domain [1-weak, 5-strong]. Panels: (a) Neutral, (b) Sadness, (c) Happiness, (d) Anger. Each quadrant shows the error bars of facial animations with head motion synthesized with (from left to right): the original head motion sequence (without mismatch), three mismatched head motion sequences (one for each emotion), the synthesized sequence (SYN), and fixed head poses (FIX). The result for the audio without animation is also shown (WAV).
Figures 5.11, 5.12 and 5.13 also suggest that the emotional perception of the acoustic signal changes when facial animation is added, emphasizing the multimodal nature of human emotional expression. This is particularly noticeable in the sad sentences, for which the t-test between the mean scores of the original animation and the acoustic signal gives (t=4.190, df = 16, p = 0.01) in the valence domain and (t=2.400, df = 16, p = 0.029) in the activation domain. Notice that in this analysis, the emotional perception of the acoustic signal is directly compared with the emotional perception of the animation. Therefore, the differences in the results are due not only to the head motion, but also to the other facial gestures included in the animations (see Section 5.6). These results suggest that facial gestures (including head motion) are extremely important to convey the desired emotion.

Table 5.5: Naturalness assessment of rigid head motion sequences [1-robot-like, 5-human-like]

    Head motion data   Neutral         Sadness         Happiness       Anger
                       Mean    Std     Mean    Std     Mean    Std     Mean    Std
    Original           3.76    0.90    3.76    0.83    3.71    0.99    3.00    1.00
    Synthesized        4.00    0.79    3.12    1.17    3.82    1.13    3.71    1.05
    Fixed head         3.00    1.06    2.76    1.25    3.35    0.93    3.29    1.45

Table 5.5 shows how the listeners assessed the naturalness of the facial animations with head motion sequences generated from the original and from the synthesized data. It also shows the results for animations without head motion. These results show that head motion significantly improves the naturalness of the animation. Furthermore, with the exception of sadness, the synthesized sequences were perceived as even more natural than the real head motion sequences, which indicates that the head motion synthesis approach presented here was able to generate realistic head motion sequences.

5.8 Conclusions

Rigid head motion is an important component of human-human communication that needs to be appropriately added to computer facial animations. The subjective evaluations presented in this work show that including head motion in talking avatars significantly improves the naturalness of the animations.

The statistical measures obtained from the audiovisual database reveal that the dynamics of head motion sequences differ under different emotional states. Furthermore, the subjective evaluations also show that head motion changes the emotional perception of the animation, especially in the valence and activation domains. The implications of these results are significant: head motion can be appropriately included in the facial animation to emphasize its emotional content.

In this chapter, a head motion synthesis approach was implemented to handle expressive animations. Emotion-dependent HMMs were designed to generate the most likely head motion sequences driven by speech prosody. The objective evaluations show that the synthesized and the original head motion sequences were highly correlated, suggesting that the dynamics of head motion were successfully modeled by the use of prosodic features. Also, the subjective evaluations show that, on average, the animations with synthesized head motion were perceived as being as realistic as the animations with the original head motion sequences. We are currently working on modifying the system to generate head motion sequences that not only look natural, but also preserve the emotional perception of the input signal.
Even though the proposed approach generates realistic head motion sequences, the results of the subjective evaluations show that in some cases the emotional content of the animations was perceived slightly differently from the original sequences. Further research is needed to shed light on the underlying reasons. It may be that different combinations of modalities create different emotion percepts, similar to the famous McGurk effect [132]. Or, it may be that the models and techniques used are not accurate enough and introduce artifacts. For instance, it may be that the emotional HMMs preserve the phase but not the amplitude of the original head motion sequences. If this is the case, the amplitude of the head motion could be externally modified to match the statistics of the desired emotion category.

One limitation of this work is that the head motion sequences considered here did not include the 3 DOF of head translation. Since the human neck also translates the head, especially backward and forward, our future work will investigate how to jointly model the 6 DOF of the head. In this work, head motion sequences from a single actress were studied, which is generally enough for synthesis purposes. An open area that requires further work is the analysis of inter-person variabilities and dependencies in head motion patterns. We are planning to collect more data from different subjects to address the challenging questions triggered by this topic. We are also studying the relation between speech and other facial gestures, such as eyebrow motion. If these gestures are appropriately included, we believe that the overall facial animation will be perceived as more realistic and compelling.

Chapter 6: Conclusions and future work

Toward understanding how to model expressive human communication, this dissertation presented a multimodal analysis of verbal and non-verbal messages. Gestures and speech were jointly studied to shed light on the underlying relationships between these communicative channels during emotional utterances. Based on this analysis, results for applications in recognition and synthesis were presented.

The results showed that facial gestures and speech are affected by spatial-temporal modulations that influence the correlation levels between acoustic and facial features. A detailed analysis of these emotional modulations indicates that there is an interplay between linguistic and affective goals, which are buffered, prioritized and executed in a coherent manner. While facial areas close to the lips are strongly constrained by the underlying articulation, other facial areas have more degrees of freedom to express non-linguistic messages such as emotions.

A similar interplay between linguistic and affective goals is also observed in speech. For example, some broad phonetic classes such as front vowels present stronger emotional modulation in the spectral speech features than other phonetic classes such as nasal sounds. During those segments, the spectral features may not have enough degrees of freedom to convey emotional modulation. Likewise, gross statistics of the fundamental frequency such as mean, maximum, minimum and range are more emotionally prominent than the features describing the pitch shape. As suggested by Scherer et al., the shape of the F0 contour seems to be associated with the grammatical structure of the sentence [158].
Therefore, the interplay between affective and linguistic goals is tighter in the pitch shape than in the manipulation of gross patterns of the F0 contour, where there are more degrees of freedom.

The joint analysis of linguistic and affective goals in speech and facial expression suggested that emotional information is assigned to the modalities that are less constrained by other communicative goals. In particular, it was observed that facial expressions and pitch tend to show stronger emotional modulation when the articulatory configuration does not have enough freedom to express emotions, given the physical constraints of the speech production system. This emotional assignment compensates for the temporal limitations observed in some modalities, and plays a crucial role in the interplay between communicative goals.

Since these emotional modulations produce specific patterns in our communicative channels, it is possible to detect the affective state of the user. A novel two-step approach to discriminate emotional from neutral speech was proposed. The framework is based on contrasting the input speech with neutral reference models. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral models. In the second step, a fitness measure is used to assess whether the input speech is similar to (in the case of neutral speech) or different from (in the case of emotional speech) the reference models. The proposed approach was implemented with spectral and prosodic speech features. The results suggested that the proposed approach outperforms conventional approaches in terms of accuracy and robustness.

A multimodal approach considering facial expression and speech was also proposed. Focusing on the cheek and forehead areas, the results showed that high emotion recognition performance can be achieved. While some of the emotions are confused in one domain, they can be separated in the other domain. Therefore, a multimodal approach considering facial and acoustic features is more robust and has better performance than unimodal emotion recognition systems.

At the small-group level, the results presented in this dissertation revealed that it is possible to estimate the flow of the interaction based on high-level features automatically extracted from an intelligent environment equipped with audio-visual sensors. These results suggest that it is possible to infer how dominant and involved each participant was during the discussion.

The results presented in this dissertation also indicate that the mappings between facial and acoustic features have a structure that is preserved from sentence to sentence. This structure seems to be easier to learn when acoustic prosodic features are used to estimate the facial gestures. Motivated by these results, this dissertation proposed a data-driven approach to synthesize head motion sequences estimated from prosodic features. Hidden Markov models (HMMs) and smoothing constraints are used to synthesize the most likely head motion sequence given the prosodic features. The results indicated that the synthesized sequences follow the temporal dynamics of the emotional head motion sequences and were perceived as natural as the original sequences.

The models, hypotheses and theories presented in this dissertation open interesting new questions that will guide the future direction of this research. Section 6.1 describes some of this future work.
6.1 Future work

An interesting open question is how to jointly model different modalities within a single framework. This is a challenging goal, because gestures and speech are not synchronous and they are coupled at different resolutions. In Chapter 5, head motion and prosodic features were jointly modeled. If other gestures are included in the model, more sophisticated techniques will be needed (e.g., multiresolution analysis, graphical models, coupled HMMs).

As mentioned by Cassell et al., facial gestures and speech are not only strongly connected, but also systematically synchronized at different scales (phonemes, words, phrases, sentences) [29]. These observations were supported by the results presented in Chapter 3. In particular, we showed that the orofacial area is strongly constrained by the underlying articulatory processes. In contrast, the upper face area (e.g., the forehead), which still presents significant levels of correlation with the speech, has more degrees of freedom to communicate non-linguistic messages (Section 3.2). Also, the correlation between facial areas and speech estimated over small windows (phoneme level) was significantly higher in the lower face area than in other regions (Section 3.1.5). These results suggest that facial gestures and speech are coupled at different resolutions. Therefore, a multiresolution decomposition approach may provide a better framework to analyze the interrelation between facial and acoustic features, from coarse to fine representations. The results of this research can be used to synthesize facial gestures that are coupled with the speech at different scales.

Another interesting direction is to study the relation between high-level linguistic functions and gestures. In particular, we propose to analyze gestures that are generated as discourse functions (e.g., a head nod for "no") to improve facial animations [28]. In the results presented in Chapter 5, the head motion sequences were synthesized using only acoustic prosodic features. However, if the underlying semantic structure of the sentence is known, constraints can be imposed to respond appropriately to discourse functions. Therefore, we propose to synthesize facial expressions driven by both speech and discourse functions.

Some of the elements presented in Figure 1.1 were not addressed in this work. For example, the idiosyncratic influence on expressive human communication was not considered. An open question is whether inter-speaker differences and similarities in gestures and speech can be modeled during expressive utterances, and how we can use those models to improve applications in recognition and synthesis. For example, by learning inter-personal similarities, speaker-independent emotion recognition systems can be designed. By using models based on inter-personal differences, more human-like facial animation can be generated.

Another important aspect that needs to be considered is dyadic interaction. The gestures and speech of the speakers are affected by the feedback provided by the other interlocutors. In fact, the gestures of the listeners are an important part of the interaction that needs to be considered for more human-like virtual agents. Active listeners respond with non-verbal gestures, and these gestures are aligned with specific structures in the speaker's words [64]. This implies that the speech of the active speaker is linked to the listener's gestures. Likewise, we can analyze the gestures and speech of the subjects when they are trying to positively affect the mood of the other interlocutor.
We hypothesize that particular gestures are used, which can be learned and synthesized to improve human-computer interfaces when the affective state of the user changes. At the small-group level, our research goal is to infer meta-information from the gestures of the participants. Participants' engagement, speaker activity and emotional states are some of the cues that we hypothesize can be estimated from the participants' gestures. Also, an interesting question is whether a report provided by the Smart Room can be used as a training tool for improving participants' skills during discussions.

Bibliography

[1] S. Abrilian, L. Devillers, S. Buisine, and J.C. Martin. EmoTV1: Annotation of real-life emotions for the specification of multimodal affective interfaces. In 11th International Conference on Human-Computer Interaction (HCI 2005), pages 195-200, Las Vegas, Nevada, USA, July 2005.

[2] A. Alvarez, I. Cearreta, J.M. López, A. Arruti, E. Lazkano, B. Sierra, and N. Garay. Feature subset selection based on evolutionary algorithms for automatic emotion recognition in spoken Spanish and standard Basque language. In Ninth International Conference on Text, Speech and Dialogue (TSD 2006), pages 565-572, Brno, Czech Republic, September 2006.

[3] N. Amir, S. Ron, and N. Laor. Analysis of an emotional speech corpus in Hebrew based on objective criteria. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, pages 29-33, Newcastle, Northern Ireland, UK, September 2000.

[4] S. Ananthakrishnan and S. Narayanan. Automatic prosody labeling using acoustic, lexical, and syntactic evidence. IEEE Transactions on Speech, Audio and Language Processing, 16(1):216-228, January 2008.

[5] K.S. Arun, T.S. Huang, and S.D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell., 9(5):698-700, September 1987.

[6] S. Banerjee and A.I. Rudnicky. Using simple speech-based features to detect the state of roles of the meeting participants. In 8th International Conference on Spoken Language Processing (ICSLP 04), pages 2189-2192, Jeju Island, Korea, October 2004.

[7] T. Bänziger and K.R. Scherer. The role of intonation in emotional expressions. Speech Communication, 46(3-4):252-267, July 2005.

[8] T. Bänziger and K.R. Scherer. Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus. In A. Paiva, R. Prada, and R.W. Picard, editors, Affective Computing and Intelligent Interaction (ACII 2007), Lecture Notes in Artificial Intelligence 4738, pages 476-487. Springer-Verlag Press, Berlin, Germany, September 2007.

[9] J.P. Barker and F. Berthommier. Estimation of speech acoustics from visual speech features: A comparison of linear and nonlinear models. In Conference on Audio-Visual Speech Processing (AVSP 1999), pages 112-117, Santa Cruz, CA, USA, August 1999.

[10] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland. Towards measuring human interactions in conversational settings. In IEEE Int. Workshop on Cues in Communication, Kauai, HI, USA, December 2001.

[11] A. Batliner, K. Fischer, R. Huber, J. Spilker, and E. Nöth. Desperately seeking emotions or: actors, wizards and human beings. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, pages 195-200, Newcastle, Northern Ireland, UK, September 2000.

[12] E. Bevacqua and C. Pelachaud. Expressive audio-visual speech. Computer Animation and Virtual Worlds, 15(3-4):297-304, July 2004.

[13] M. J. Black and Y. Yacoob.
Tracking and recognizing rigid and non-rigid facial motions using local parametric model of image motion. In Fifth International Conference on Computer Vision (ICCV 1995), pages 374{381, Cambridge, MA, USA, June 1995. [14] P. Boersma and D. Weeninck. Praat, a system for doing phonetics by computer. Technical Report 132, Institute of Phonetic Sciences of the University of Amster- dam, Amsterdam, Netherlands, 1996. http://www.praat.org. [15] E. Boyer and J.-S. Franco. A hybrid approach for computing visual hulls of com- plex objects. In Computer Vision and Pattern Recognition (CVPR'03), volume I, pages 695{701, 2003. [16] M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques (SIGGRAPH 1999), pages 21{28, New York, NY, USA, 1999. [17] C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques (SIGGRAPH 1997), pages 353{360, Los Angeles, CA,USA, August 1997. 227 [18] M. Bulut and S. Narayanan. On the robustness of overall F0-only modication eects to the perception of emotions in speech. Journal of the Acoustical Society of America, 123(6):4547{4558, June 2008. [19] S. Burger, V. MacLaren, and H. Yu. The ISL meeting corpus: The impact of meeting type on speech style. In International Conference on Spoken Language (ICSLP), Denver,CO, USA, September 2002. [20] C. Burges. A tutorial on support vector machines for pattern recognition. Journal Data Mining and Knowledge Discovery, 2(2):121{167, June 1998. [21] F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, and B. Weiss. A database of German emotional speech. In 9th European Conference on Speech Communi- cation and Technology (Interspeech'2005 - Eurospeech), pages 1517{1520, Lisbon, Portugal, September 2005. [22] C. Busso and S.S. Narayanan. The expression and perception of emotions: Com- paring assessments of self versus others. In Interspeech 2008 - Eurospeech, Bris- bane, Australia, September 2008. [23] C. Busso and S.S. Narayanan. Recording audio-visual emotional databases from actors: a closer look. In Second International Workshop on Emotion: Corpora for Research on Emotion and Aect, International conference on Language Resources and Evaluation (LREC 2008), pages 17{22, Marrakech, Morocco, May 2008. [24] C. Busso and S.S. Narayanan. Scripted dialogs versus improvisation: Lessons learned about emotional elicitation techniques from the IEMOCAP database. In Interspeech 2008 - Eurospeech, Brisbane, Australia, September 2008. [25] E. Magno Caldognetto, P. Cosi, C. Drioli, G. Tisato, and F. Cavicchio. Coproduc- tion of speech and emotions: Visual and acoustic modications of some phonetic labial targets. In Audio Visual Speech Processing (AVSP 03), pages 209{214, S. Jorioz, France, September 2003. [26] R. Campbell, B. Dodd, and D. Burnham. Hearing by eye II: advances in the psychology of speechreading and auditory-visual speech. Psychology Press, Hove, East Sussex, U.K., 1998. [27] G. Caridakis, L. Malatesta, L. Kessous, N. Amir, A. Raouzaiou, and K. Karpouzis. Modeling naturalistic aective states via facial and vocal expressions recognition. In Proceedings of the 8th international conference on Multimodal interfaces (ICMI 2006), pages 146{154, Ban, Alberta, Canada, November 2006. 228 [28] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjalmsson, and H. Yan. Embodiment in conversational interfaces: Rea. 
In International Conference on Human Factors in Computing Systems (CHI-99), pages 520{527, Pittsburgh, PA, USA, May 1999. [29] J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Bechet, B. Dou- ville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression gesture and spoken intonation for multiple conversational agents. In Computer Graphics (Proc. of ACM SIGGRAPH'94), pages 413{420, Orlando, FL,USA, 1994. [30] R. Cauldwell. Where did the anger go? the role of context in interpreting emotion in speech. In ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, pages 127{131, Newcastle, Northern Ireland, UK, September 2000. [31] C. Cav e, I. Gua tella, R. Bertrand, S. Santi, F. Harlay, and R. Espesser. About the relationship between eyebrow movements and F0 variations. In International Conference on Spoken Language (ICSLP), volume 4, pages 2175{2178, Philadel- phia, PA, USA, October 1996. [32] N. Checka, K. Wilson, M. Siracusa, and T. Darrell. Multiple person and speaker activity tracking with a particle lter. In International Conference on Acoustics, Speech, and Signal Processing, volume V, pages 881{84, 2004. [33] L.S. Chen and T.S. Huang. Emotional expressions in audiovisual human computer interaction. In IEEE International Conference on Multimedia and Expo (ICME 2000), volume 1, pages 423{426, New York City, NY, USA, July-August 2000. [34] L.S. Chen, T.S. Huang, T. Miyasato, and R. Nakatsu. Multimodal human emotion / expression recognition. In Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 366{371, Nara, Japan, 1999. [35] E. Chuang and C. Bregler. Head emotion. Technical Report CS-TR-2003-02, Computer Science, Stanford University, Stanford, CA,USA, April 2003. [36] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette. The SAFE cor- pus: illustrating extreme emotions in dynamic situations. In First International Workshop on Emotion: Corpora for Research on Emotion and Aect (Interna- tional conference on Language Resources and Evaluation (LREC 2006)), pages 76{79, Genoa,Italy, May 2006. [37] I. Cohen and G. Medioni. Detecting and tracking moving objects for video surveil- lance. In Computer Vision and Pattern Recognition (CVPR'99), volume II, pages 319{325, 1999. 229 [38] M. M. Cohen and D. W. Massaro. Modeling coarticulation in synthetic visual speech. In Magnenat-Thalmann N., Thalmann D. (Editors), Models and Tech- niques in Computer Animation, Springer Verlag, pages 139{156, Tokyo, Japan, 1993. [39] J.F. Cohn, L.I. Reed, Z. Ambadar, J. Xiao, and T. Moriyama. Automatic analysis and recognition of brow actions and head motion in spontaneous facial behavior. In IEEE Conference on Systems, Man, and Cybernetic, volume 1, pages 610{616, The Hague, the Netherlands, October 2004. [40] C. Conati and H. Mclaren. Data-driven renement of a probabilistic model of user aect. In L. Ardissono, P. Brna, and A. Mitrovic, editors, Proceedings of the Tenth International Conference on User Modeling (UM2005), pages 40{49. Springer-Verlag Press, Berlin, Germany, 2005. [41] M. Costa, T. Chen, and F. Lavagetto. Visual prosody analysis for realistic motion synthesis of 3D head models. In International Conference On Augmented, Virtual Environments and Three Dimensional Imaging (ICAV3D), pages 343{346, Ornos, Mykonos, Greece, May-June 2001. [42] M. Coulson. Attributing emotion to static body postures: Recognition accuracy, confusions, and viewpoint dependence. 
Journal of Nonverbal Behavior, 28(2):117{ 139, June 2004. [43] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley- Interscience, New York, NY, USA, 2006. [44] R. Cowie and R.R. Cornelius. Describing the emotional states that are expressed in speech. Speech Communication, 40(1-2):5{32, April 2003. [45] R. Cowie, E. Douglas-Cowie, B. Apolloni, J. Taylor, A. Romano, and W. Fellenz. What a neural net needs to know about emotion words. In Circuits Systems Communications and Computers (CSCC), pages 5311{5316, Athens, Greek, July 1999. [46] R. Cowie, E. Douglas-Cowie, and C. Cox. Beyond emotion archetypes: Databases for emotion modelling using neural networks. Neural Networks, 18(4):371{388, May 2005. [47] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32{80, January 2001. [48] L.J. Cronbach. Coecient alpha and the internal structure of tests. Psychome- trika, 16:297{334, September 1951. 230 [49] L.C. De Silva, T. Miyasato, and R. Nakatsu. Facial emotion recognition using multi-modal information. In International Conference on Information, Commu- nications and Signal Processing (ICICS), volume I, pages 397{401, Singapore, 1997. [50] L.C. De Silva and P. C. Ng. Bimodal emotion recognition. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 332{ 335, Phoenix, AZ, 2000. [51] F. Dellaert and T. Polzin A. Waibel. Recognizing emotion in speech. In Interna- tional Conference on Spoken Language (ICSLP 1996), volume 3, pages 1970{1973, Philadelphia, PA, USA, October 1996. [52] J.R. Deller, J.H.L. Hansen, and J.G. Proakis. Discrete-Time Processing of Speech Signals. IEEE Press, Piscataway, NJ, USA, 2000. [53] Z. Deng, M. Bulut, U. Neumann, and S. Narayanan. Automatic dynamic expres- sion synthesis for speech animation. In IEEE 17th International Conference on Computer Animation and Social Agents (CASA 2004), pages 267{274, Geneva, Switzerland, July 2004. [54] Z. Deng, C. Busso, S. Narayanan, and U. Neumann. Audio-based head motion synthesis for avatar-based telepresence systems. In ACM SIGMM 2004 Workshop on Eective Telepresence (ETP 2004), pages 24{30, New York, NY, October 2004. ACM Press. [55] Z. Deng, J.P. Lewis, and U. Neumann. Automated eye motion using texture synthesis. IEEE Computer Graphics and Applications, 25(2):24{30, March/April 2005. [56] Z. Deng, J.P. Lewis, and U. Neumann. Synthesizing speech animation by learning compact speech co-articulation models. In Computer Graphics International (CGI 2005), pages 19{25, Stony Brook, NY, USA, June 2005. [57] Z. Deng, U. Neumann, J.P. Lewis, T.Y. Kim, M. Bulut, and S. Narayanan. Ex- pressive facial animation synthesis by learning speech co-articultion and expression spaces. IEEE Transactions on Visualization and Computer Graphics (TVCG), 12(6):1523{1534, November/December 2006. [58] L. Devillers, L. Vidrascu, and L. Lamel. Challenges in real-life emotion annotation and machine learning based detection. Neural Networks, 18(4):407{422, May 2005. [59] M.N. Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Processing Letters, 10(4):115{118, April 2003. 231 [60] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach. Emotional speech: Towards a new generation of databases. Speech Communication, 40(1-2):33{60, April 2003. [61] E. Douglas-Cowie, L. Devillers, J.C. Martin, R. Cowie, S. Savvidou, S. 
Abrilian, and C. Cox. Multimodal databases of everyday emotion: Facing up to com- plexity. In 9th European Conference on Speech Communication and Technology (Interspeech'2005), pages 813{816, Lisbon, Portugal, September 2005. [62] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classication. Wiley-Interscience, New York, NY, USA, 2000. [63] D. Eberly. 3D Game Engine Design: A Practical Approach to Real-Time Com- puter Graphics. Morgan Kaufmann Publishers, San Francisco, CA, USA, 2000. [64] P. Ekman. About brows: emotional and conversational signals. In M. von Cranach, K. Foppa, W. Lepenies, and D. Ploog, editors, Human ethology: claims and limits of a new discipline, pages 169{202. Cambridge University Press, New York, NY, USA, 1979. [65] P. Ekman. Facial expression and emotion. American Psychologist, 48(4):384{392, April 1993. [66] P. Ekman and W. V. Friesen. Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Prentice-Hall, Englewood Clis. NJ, USA, 1975. [67] P. Ekman and W.V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124{129, March 1971. [68] P. Ekman and W.V. Friesen. Facial Action Coding System: A Technique for Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, CA, USA, 1978. [69] P. Ekman and E.L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the Facial Action Coding System (FACS). Oxford University Press, New York, NY, USA, 1997. [70] F. Enos and J. Hirschberg. A framework for eliciting emotional speech: Capitaliz- ing on the actors process. In First International Workshop on Emotion: Corpora for Research on Emotion and Aect (International conference on Language Re- sources and Evaluation (LREC 2006)), pages 6{10, Genoa,Italy, May 2006. [71] I.A. Essa and A.P. A.P. Pentland. Coding, analysis, interpretation, and recogni- tion of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):757{763, July 1997. 232 [72] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. ACM Transaction on Graphics(Proceedings of ACM SIGGRAPH'02), pages 388{ 398, 2002. [73] L. Fischer, D. Brauns, and F. Belschak. Zur Messung von Emotionen in der angewandten Forschung. Pabst Science Publishers, Lengerich, 2002. [74] J.L. Fleiss. Statistical methods for rates and proportions. John Wiley & Sons, New York, NY, USA, 1981. [75] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren. Timit acoustic-phonetic continuous speech corpus, 1993. [76] D. Gatica-Perez, G. Lathoud, I. McCowan, J.-M.Odobez, and D. Moore. Audio- visual speaker tracking with importance particle lters. In International Confer- ence on Image Processing, volume III, pages 25{28, 2003. [77] P. G. Georgiou, P. Tsakalides, and C. Kyriakakis. Alpha-stable modeling of noise and robust time-delay estimation in the presence of impulsive noise. IEEE Trans- actions on Multimedia, 1(3):291{301, September 1999. [78] E. Grabe, G. Kochanski, and J. Coleman. Connecting intonation labels to math- ematical descriptions of fundamental frequency. Language and Speech, 50(3):281{ 310, October 2007. [79] H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. Visual prosody: Facial movements accompanying speech. In Proc. of IEEE International Conference on Automatic Faces and Gesture Recognition, pages 396{401, Washington, D.C., USA, May 2002. [80] B. Granstr om and D. House. 
Audiovisual representation of prosody in expressive speech communication. Speech Communication, 46(3-4):473{484, July 2005. [81] M. Grimm and K. Kroschel. Evaluation of natural emotions using self assessment manikins. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2005), pages 381{385, San Juan, Puerto Rico, December 2005. [82] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan. Primitives-based evaluation and estimation of emotions in speech. Speech Communication, 49(10-11):787{800, October-November 2007. [83] M. Grimm, K. Kroschel, and S. Narayanan. The Vera AM Mittag German audio- visual emotional speech database. In IEEE International Conference on Multi- media and Expo (ICME 2008), Hannover, Germany, June 2008. 233 [84] R. Gutierrez-Osuna, P.K. Kakumanu, A. Esposito, O.N. Garcia, A. Bojorquez, J.L. Castillo, and I. Rudomin. Speech-driven facial animation with realistic dy- namics. IEEE Transactions on Multimedia, 7(1):33{42, February 2005. [85] E. Hall. The hidden dimension. Doubleday & Company, New York, NY, USA, 1966. [86] B. Hartmann, M. Mancini, and C. Pelachaud. Formational parameters and adap- tive prototype instantiation for MPEG-4 compliant gesture synthesis. In Proceed- ings of Computer Animation, pages 111{119, Geneva, Switzerland, June 2002. [87] T.J. Hazen. Visual model structures and synchrony constraints for audio-visual speech recognition. IEEE Transactions on Audio, Speech and Language Process- ing, 14(3):1082{1089, May 2006. [88] H. Hill and A. Johnston. Categorizing sex and identity from the biological motion of faces. Current Biology, 11(11):880{885, June 2001. [89] D.W. Hosmer and S. Lemeshow. Applied logistic regression. Wiley Series in Probability and Statistics, New York, NY, USA, 2000. [90] T.S. Huang, H. Tao L.S. Chen, T. Miyasato, and R. Nakatsu. Bimodal emotion recognition by man and machine. In Proceeding of ATR Workshop on Virtual Communication Environmnts, Kyoto, Japan, 1998. [91] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and Language, 7:137{148, April 1993. [92] http://emotion-research.net/, 2007. Retrieved August 1st, 2007. [93] A. Jaimes, K. Omura, T. Nagamine, and K. Hirata. Memory cues for meeting video retrieval. In 1st ACM workshop on Continuous archival and retrieval of personal experiences (CARPE 2004), pages 74{85, 2004. [94] J. Jiang, A. Alwan, L.E. Bernstein, E.T. Auer Jr., and P.A. Keating. Predict- ing face movements from speech acoustics using spectral dynamics. In IEEE International Conference on Multimedia and Expo (ICME 2002), volume 1, pages 181{184, Lausanne, Switzerland, August 2002. [95] J. Jiang, A. Alwan, P.A. Keating, B. Chaney, E.T. Auer Jr., and L.E. Bern- stein. On the relationship between face movements, tongue movements, and speech acoustics. EURASIP Journal on Applied Signal Processing, 11:1174{1188, 2002. [96] P.N. Juslin and P. Laukka. Communication of emotions in vocal expression and music performance: dierent channels, same code? Psychological Bulletin, 129(5):770{814, September 2003. 234 [97] K. Kakihara, S. Nakamura, and K. Shikano. Speech-to-face movement synthesis based on HMMS. In IEEE International Conference on Multimedia and Expo (ICME), volume 1, pages 427{430, New York, NY, USA, April 2000. [98] R. El Kaliouby and P. Robinson. Mind reading machines: automated inference of cognitive mental states from video. 
In IEEE Conference on Systems, Man, and Cybernetic, volume 1, pages 682{688, The Hague, the Netherlands, October 2004. [99] A. Kapur, A. Kapur, N. Virji-Babul, G. Tzanetakis, and P.F. Driessen. Gesture- based aective computing on motion capture data. In 1st International Confer- ence on Aective Computing and Intelligent Interaction (ACII 2005), pages 1{8, Beijing, China, October 2005. [100] R. Kehrein. The prosody of authentic emotions. In Proceedings of the Speech Prosody, pages 423{426, Aix-en-Provence, France, April 2002. [101] S. Kettebekov, M. Yeasin, and R. Sharma. Prosody based audiovisual coanalysis for coverbal gesture recognition. IEEE Transactions on Multimedia, 7(2):234{242, April 2005. [102] M. Kipp. ANVIL - a generic annotation tool for multimodal dialogue. In European Conference on Speech Communication and Technology (Eurospeech), pages 1367{ 1370, Aalborg, Denmark, September 2001. [103] E. Klabbers and J. Van Santen. Clustering of foot-based pitch contours in expres- sive speech. In 5th ISCA Speech Synthesis Workshop, pages 73{78, Pittsburgh, PA, USA, June 2004. [104] G. Kochanski. Prosody beyond fundamental frequency. In D. Lenertov a, R. Meyer, S. Pappert, P. Augurzky, I. Mleinek, N. Richter, and J. Schliesser, editors, Methods in Empirical Prosody Research, Language, Context and Cognition Series, pages 89{122. Walter de Gruyter & Co, Berlin, Germany, April 2006. [105] S. Kopp and I. Wachsmuth. Model-based animation of co-verbal gesture. In Proceedings of Computer Animation, pages 252{257, Geneva, Switzerland, June 2002. [106] S. Kshirsagar and N. Magnenat-Thalmann. A multilayer personality model. In Proceedings of the 2nd international symposium on Smart graphics (SMART- GRAPH 2002), pages 107{115, Hawthorne, NY, USA, June 2002. ACM Press. [107] S. Kshirsagar and N. M. Thalmann. Visyllable based speech animation. Computer Graphics Forum (Proc. of Eurographics'03), 22(3):631{639, 2003. 235 [108] A. Kuranov, R. Leinhart, and V. Pisarevsky. An empirical analysis of boosting algorithms for rapid objects with an extended set of Haar-like features. In Intel Technical Report MRL-TR-July02-01, 2002. [109] T. Kuratate, K. G. Munhall, P. E. Rubin, E. Vatikiotis-Bateson, and H. Yehia. Audio-visual synthesis of talking faces from speech production correlates. In Sixth European Conference on Speech Communication and Technology, Eurospeech 1999, pages 1279{1282, Budapest, Hungary, September 1999. [110] S. Kwon and S. Narayanan. A study of generic models for unsupervised on- line speaker indexing. In Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, pages 423{28, 2003. [111] D.R. Ladd, K.E.A. Silverman, F. Tolkmitt, G. Bergmann, and K.R. Scherer. Evidence for the independent function of intonation contour type, voice quality, and F0 range in signaling speaker aect. Journal of the Acoustical Society of America, 78(2):435{444, August 1985. [112] C.R. Lansing and G.W. McConkie. Attention to facial regions in segmental and prosodic visual speech perception tasks. Journal of Speech, Language, and Hearing Research, 42:526{539, June 1999. [113] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(2):150{162, Feb 1994. [114] C. M. Lee, S.S. Narayanan, and R. Pieraccini. Classifying emotions in human- machine spoken dialogs. In IEEE International Conference on Multimedia and Expo (ICME 2002), volume 1, pages 737{740, Lausanne, Switzerland, August 2002. [115] C.M. Lee and S.S. 
Narayanan. Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2):293{303, March 2005. [116] C.M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S. Narayanan. Emotion recognition based on phoneme classes. In 8th International Conference on Spoken Language Processing (ICSLP 04), pages 889{ 892, Jeju Island, Korea, October 2004. [117] S. Lee, E. Bresch, J. Adams, A. Kazemzadeh, and S.S. Narayanan. A study of emo- tional speech articulation using a fast magnetic resonance imaging technique. In International Conference on Spoken Language (ICSLP), pages 2234{2237, Pitts- burgh, PA, USA, September 2006. 236 [118] S. Lee, E. Bresch, and S.S. Narayanan. An exploratory study of emotional speech production using functional data analysis techniques. In 7th International Seminar on Speech Production (ISSP 2006), pages 525{532, Ubatuba-SP, Brazil, December 2006. [119] S. Lee, S. Yildirim, A. Kazemzadeh, and S. Narayanan. An articulatory study of emotional speech production. In 9th European Conference on Speech Commu- nication and Technology (Interspeech'2005 - Eurospeech), pages 497{500, Lisbon, Portugal, September 2005. [120] Y. Li and H.Y. Shum. Learning dynamic audio-visual mapping with Input-Output Hidden Markov Models. IEEE Transactions on Multimedia, 8(3):542{549, June 2006. [121] L. Liang, C. Liu, Y.Q. Xu, B. Guo, and H.Y. Shum. Real-time texture synthesis by patch-based sampling. ACM Transactions on Graphics, 20(3):127{150, July 2001. [122] M. Liberman, K. Davis, M. Grossman, N. Martey, and J. Bell. Emotional prosody speech and transcripts, 2002. Linguistic Data Consortium. [123] P. Lieberman and S.B. Michaels. Some aspects of fundamental frequency and envelope amplitude as related to the emotional content of speech. Journal of the Acoustical Society of America, 34(7):922{927, July 1962. [124] Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84{95, Jan 1980. [125] W. Liu, B. Yin, and X. Jia. Audio to visual signal mappings with HMM. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), pages 885{888, Quebec, Canada, May 2004. [126] J. Lucero, A.R. Baigorri, and K.G. Munhall. Data-driven facial animation of speech using a QR factorization algorithm. In 7th International Seminar on Speech Production (ISSP 2006), pages 135{142, Ubatuba-SP, Brazil, December 2006. [127] K. Mase. Recognition of facial expression from optical ow. IEICE transactions, 74(10):3474{3483, October 1991. [128] D. W. Massaro. Illusions and issues in bimodal speech perception. In Proceed- ings of Auditory Visual Speech Perception 1998, pages 21{26, Terrigal-Sydney, Australia, December 1998. [129] W. Matusik, C. Buehler, and L. McMillan. Polyhedral visual hulls for real-time rendering. In In Proceedings of Eurographics Workshop on Rendering, 2001. 237 [130] Maya software, Alias Systems division of Silicon Graphics Limited. http://www.alias.com, 2005. [131] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):305{317, March 2005. [132] H. McGurk and J. W. MacDonald. Hearing lips and seeing voices. Nature, 264:746{748, December 1976. [133] D. McNeill. Hand and Mind: What gestures reveal about thought. The University of Chicago Press, Chicago, IL, USA, 1992. [134] W. Mendenhall and T. Sincich. 
Statistics for Engineering and the Sciences. Prentice-Hall, Upper Saddle River, NJ, USA, 2006. [135] I. Miki c, K. Huang, and M. Trivedi. Activity monitoring and summarization for an intelligent meeting room. In IEEE Workshop on Human Motion, pages 107{112, Austin, TX, USA, December 2000. [136] J.M. Montero, J. Gutirrez-Arriola, S. Palazuelos, E. Enrquez, S. Aguilera, and J.M. Pardo. Emotional speech synthesis: from speech database to TTS. In 5th International Conference on Spoken Language Processing(ICSLP 1998), pages 923{925, Sydney,Australia, November-December 1998. [137] K. G. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. Vatikiotis-Bateson. Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2):133{137, February 2004. [138] N.J.D. Nagelkerke. A note on a general denition of the coecient of determina- tion. Biometrika, 78(3):691{692, September 1991. [139] M. Nordstrand, G. Svanfeldt, B. Granstr om, and D. House. Measurements of articulatory variations and communicative signals in expressive speech. In Audio Visual Speech Processing (AVSP 03), pages 233{237, S. Jorioz, France, September 2003. [140] T.L. Nwe, F.S. Wei, and L.C. De Silva. Speech based emotion classication. In IEEE Region 10 International Conference on Electrical and Electronic Technol- ogy (TENCON 2001), volume 1, pages 297{301, Phuket Island, Langkawi Island, Singapore, August 2001. [141] A. Paeschke. Global trend of fundamental frequency in emotional speech. In Speech Prosody (SP 2004), pages 671{674, Nara, Japan, March 2004. 238 [142] A. Paeschke, M. Kienast, and W.F. Sendlmeier. F0-contours in emotional speech. In Proceedings of the 14th International Conference of Phonetic Sciences (ICPh 1999), pages 929{932, San Francisco, CA, USA, August 1999. [143] I.S. Pandzic and R. Forchheimer. MPEG-4 Facial Animation - The standard, implementations and applications. John Wiley & Sons, November 2002. [144] M. Pantic and L.J.M. Rothkrantz. Toward an aect-sensitive multimodal human- computer interaction. Proceedings of the IEEE, 91(9):1370{1390, September 2003. [145] D.B. Paul and J.M. Baker. The design for the Wall Street Journal-based CSR corpus. In 2th International Conference on Spoken Language Processing (ICSLP 1992), pages 899{902, Ban, Alberta, Canada, October 1992. [146] V. Pavlovic, A. Garg, J.M. Rehg, and T.S. Huang. Multimodal speaker detec- tion using error feedback dynamic Bayesian networks. In IEEE Conference on Computer Vision and Pattern Recognition, volume II, pages 34{41, 2000. [147] C. Pelachaud, N. Badler, and M. Steedman. Generating facial expressions for speech. Cognitive Science, 20(1):1{46, January 1996. [148] R. W. Picard. Aective computing. Technical Report 321, MIT Media Laboratory Perceptual Computing Section, Cambridge, MA,USA, November 1995. [149] G. Pingali, G. Tunali, and I. Carlbom. Audio-visual tracking for natural interac- tivity. In Proceedings of the seventh ACM international conference on Multimedia, pages 373{382, Orlando, Fl, 1999. [150] T.S. Polzin. Verbal and non-verbal cues in the communication of emotions. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2000), pages 53{59, Istanbul, Turkey, June 2000. [151] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A.W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306{1326, September 2003. [152] L.R. Rabiner. 
A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257{286, Feb 1989. [153] S. Reiter, S. Schreiber, and G. Rigoll. Multimodal meeting analysis by segmenta- tion and classication of meeting events based on a higher level semantic approach. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), volume 2, pages 161{164, Philadelphia, PA, USA, March 2005. [154] R.J. Rienks and D.K.J. Heylen. Automatic dominance detection in meetings using easily obtainable features. In Workshop on Multimodal Interaction and Related Machine Learning Algorithms, pages 76{86, Endinburgh, Scotland, October 2006. 239 [155] D. Roy and A. Pentland. Automatic spoken aect classication and analysis. In Second International Conference on Automatic Face and Gesture Recognition (FG 1996), pages 363{367, Killingon, VT,USA, October 1996. [156] K.R. Scherer. Vocal communication of emotion: A review of research paradigms. Speech Communication, 40(1-2):227{256, April 2003. [157] K.R. Scherer and G. Ceschi. Lost luggage: A eld study of emotionantecedent appraisal. Motivation and Emotion, 21(3):211{235, September 1997. [158] K.R. Scherer, D.R. Ladd, and K.E.A. Silverman. Vocal cues to speaker aect: Testing two models. Journal of the Acoustical Society of America, 76(5):1346{ 1356, November 1984. [159] K.R. Scherer, H.G. Wallbott, and A.B. Summereld. Experiencing emotion: A cross-cultural study. Cambridge University Press, Cambridge, U.K., 1986. [160] F. Schiel, S. Steininger, and U. T urk. The SmartKom multimodal corpus at BAS. In Language Resources and Evaluation (LREC 2002), Las Palmas, Spain, May 2002. [161] M. Schr oder, R. Cowie, E. Douglas-Cowie, M. Westerdijk, and S. Gielen. Acoustic correlates of emotion dimensions in view of speech synthesis. In European Con- ference on Speech Communication and Technology (Eurospeech), volume 1, pages 87{90, Aalborg, Denmark, September 2001. [162] M.H. Sedaaghi, C. Kotropoulos, and D. Ververidis. Using adaptive genetic algo- rithms to improve speech emotion recognition. In International Workshop on Mul- timedia Signal Processing (MMSP 2007), pages 461{464, Chania, Crete, Greece, October 2007. [163] K. Shoemake. Animating rotation with quaternion curves. Computer Graphics (Proceedings of SIGGRAPH85), 19(3):245{254, July 1985. [164] J. Silva and S. Narayanan. Average divergence distance as a statistical discrimi- nation measure for hidden Markov models. IEEE Transactions on Audio, Speech and Language Processing, 14(3):890{906, May 2006. [165] K. Silverman, M. Beckman, J. Pitrelli, M. Ostendorf, C. Wightman, P. Price, J. Pierrehumbert, and J. Hirschberg. ToBI: a standard for labelling english prosody. In 2th International Conference on Spoken Language Processing(ICSLP 1992), pages 867{870, Ban, Alberta, Canada, October 1992. [166] S. Spors, R.Rabenstein, and N.Strobe. Joint audio-video object localization and tracking. IEEE Signal Processing Magazine, 18(1):22{31, Jan 2001. 240 [167] S. Steidl, M. Levit, A. Batliner, E. N oth, and H. Niemann. \of all things the measure is man" automatic classication of emotions and inter-labeler consistency. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), volume 1, pages 317{320, Philadelphia, PA, USA, March 2005. [168] M. Swerts and E. Krahmer. The importance of dierent facial areas for signalling visual prominence. 
In International Conference on Spoken Language (ICSLP), pages 1280{1283, Pittsburgh, PA, USA, September 2006. [169] P. Taylor. Analysis and synthesis of intonation using the tilt model. Journal of the Acoustical Society of America, 107(3):1697{1714, March 2000. [170] A.M. Tekalp and J. Ostermann. Face and 2-d mesh animation in MPEG-4. Signal Processing: Image Communication, 15(4):387{421, January 2000. [171] Y.L. Tian, T. Kanade, and J.F. Cohn. Recognizing lower face action units for facial expression analysis. In Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG 2000), pages 484{490, Grenoble, France, March 2000. [172] http://www.ubiqus.com/, 2007. Retrieved August 1st, 2007. [173] L. Valbonesi, R. Ansari, D. McNeill, F. Quek, S. Duncan, K.E. McCullough, and R. Bryll. Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures. In European Signal Processing Conference (EUSIPCO 02), pages 75{78, Tolouse, France, September 2002. [174] E. Vatikiotis-Bateson, K.G. Munhall, Y. Kasahara, F. Garcia, and H. Yehia. Char- acterizing audiovisual information during speech. In Fourth International Con- ference on Spoken Language Processing (ICSLP 96), volume 3, pages 1485{1488, Philadelphia, PA, USA, October 1996. [175] E. Vatikiotis-Bateson and H. C. Yehia. Speaking mode variability in multimodal speech production. IEEE Transactions on Neural Networks, 13(4):894{899, July 2002. [176] J. Vermaak, M. Gangnet, A. Blake, and P. Perez. Sequential Monte Carlo fusion of sound and vision for speaker tracking. In International Conference on Computer Vision, volume I, pages 741{46, 2001. [177] D. Ververidis and C. Kotropoulos. A state of the art review on emotional speech databases. In First International Workshop on Interactive Rich Media Content Production (RichMedia-2003), pages 109{119, Lausanne, Switzerland, October 2003. 241 [178] D. Ververidis and C. Kotropoulos. Fast sequential oating forward selection ap- plied to emotional speech features estimated on DES and SUSAS data collections. In XIV European Signal Processing Conference (EUSIPCO 2006), page 929932, Florence, Italy, September 2006. [179] Vicon iQ 2.5. http://www.vicon.com/, 2007. [180] L. Vidrascu and L. Devillers. Real-life emotions in naturalistic data recorded in a medical call center. In First International Workshop on Emotion: Corpora for Research on Emotion and Aect (International conference on Language Resources and Evaluation (LREC 2006)), pages 20{24, Genoa,Italy, May 2006. [181] W. Wahlster. Towards symmetric multi-modality: Fusion and ssion of speech, gesture, and facial expression. In A. G unter, R. Kruse, and B. Neumann, editors, Proceedings of the 26th German Conference on Articial Intelligence, pages 1{18. Springer-Verlag Press, Berlin, Germany, 2003. [182] H. Wang, A. Li, and Q. Fang. F0 contour of prosodic word in happy speech of Mandarin. In J. Tao, T. Tan, and R.W. Picard, editors, Aective Computing and Intelligent Interaction (ACII 2005), Lecture Notes in Articial Intelligence 3784, pages 433{440. Springer-Verlag Press, Berlin, Germany, November 2005. [183] Y. Yacoob and L. Davis. Computing spatio-temporal representations of human faces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1994), pages 70{75, Seattle, WA, USA, June 1994. [184] H. Yehia, T. Kuratate, and E. Vatikiotis-Bateson. Facial animation and head motion driven by speech acoustics. 
Publications

Book chapters

[1] C. Busso, M. Bulut, S. Lee, and S.S. Narayanan, "Fundamental frequency analysis for speech emotion processing," in Linguistic Insights: Studies in Language and Communication, Maurizio Gotti, Ed. Peter Lang Publishing Group, Berlin, Germany, 2008.

[2] C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, "Learning expressive human-like head motion sequences from speech," in Data-Driven 3D Facial Animations, Z. Deng and U. Neumann, Eds., pp. 113-131. Springer-Verlag London Ltd, Surrey, United Kingdom, 2007.

Journal Articles

[1] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. In press, 2008.

[2] C. Busso and S. Narayanan, "Interrelation between speech and facial gestures in emotional utterances: a single subject study," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2331-2347, November 2007.

[3] C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan, "Rigid head motion in expressive speech animation: Analysis and synthesis," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 3, pp. 1075-1086, March 2007.

[4] C. Busso, Z. Deng, U. Neumann, and S.S. Narayanan, "Natural head motion synthesis driven by acoustic prosodic features," Computer Animation and Virtual Worlds, vol. 16, no. 3-4, pp. 283-290, July 2005.

[5] C. Busso, S. Lee, and S.S. Narayanan, "Expressive pitch contour revealed: Analysis of emotional modulation for recognition," IEEE Transactions on Audio, Speech and Language Processing, vol. To be submitted, 2008.

Conference Proceedings

[1] C. Busso and S.S. Narayanan, "Scripted dialogs versus improvisation: Lessons learned about emotional elicitation techniques from the IEMOCAP database," Interspeech 2008 - Eurospeech, Brisbane, Australia, September 2008.

[2] C. Busso and S.S. Narayanan, "The expression and perception of emotions: Comparing assessments of self versus others," Interspeech 2008 - Eurospeech, Brisbane, Australia, September 2008.

[3] C. Busso and S.S. Narayanan, "Recording audiovisual emotional databases from actors: a closer look," in Second International Workshop on Emotion: Corpora for Research on Emotion and Affect (International Conference on Language Resources and Evaluation (LREC 2008)), Marrakech, Morocco, May 2008.

[4] C. Busso and S.S. Narayanan, "Joint analysis of the emotional fingerprint in the face and speech: A single subject study," in International Workshop on Multimedia Signal Processing (MMSP 2007), Chania, Crete, Greece, October 2007, pp. 43-47.

[5] V. Rozgić, C. Busso, P.G. Georgiou, and S. Narayanan, "Multimodal meeting monitoring: Improvements on speaker tracking and segmentation through a modified mixture particle filter," in International Workshop on Multimedia Signal Processing (MMSP 2007), Chania, Crete, Greece, October 2007, pp. 60-65.

[6] C. Busso, S. Lee, and S.S. Narayanan, "Using neutral speech models for emotional speech analysis," in Interspeech 2007 - Eurospeech, Antwerp, Belgium, August 2007, pp. 2225-2228.

[7] C. Busso, P.G. Georgiou, and S. Narayanan, "Real-time monitoring of participants' interaction in a meeting using audio-visual sensors," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2007), Honolulu, HI, USA, April 2007, vol. 2, pp. 685-688.

[8] C. Busso and S.S. Narayanan, "Interplay between linguistic and affective goals in facial expression during emotional utterances," in 7th International Seminar on Speech Production (ISSP 2006), Ubatuba-SP, Brazil, December 2006, pp. 549-556.

[9] M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C.M. Lee, S. Lee, and S. Narayanan, "Investigating the role of phoneme-level modifications in emotional speech resynthesis," in 9th European Conference on Speech Communication and Technology (Interspeech 2005 - Eurospeech), Lisbon, Portugal, September 2005, pp. 801-804.

[10] C. Busso, S. Hernanz, C.W. Chu, S. Kwon, S. Lee, P.G. Georgiou, I. Cohen, and S. Narayanan, "Smart Room: Participant and speaker localization and identification," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), Philadelphia, PA, USA, March 2005, vol. 2, pp. 1117-1120.

[11] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," in Sixth International Conference on Multimodal Interfaces (ICMI 2004), State College, PA, October 2004, pp. 205-211, ACM Press.

[12] Z. Deng, C. Busso, S. Narayanan, and U. Neumann, "Audio-based head motion synthesis for avatar-based telepresence systems," in ACM SIGMM 2004 Workshop on Effective Telepresence (ETP 2004), New York, NY, October 2004, pp. 24-30, ACM Press.

[13] C.M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S. Narayanan, "Emotion recognition based on phoneme classes," in 8th International Conference on Spoken Language Processing (ICSLP 04), Jeju Island, Korea, October 2004, pp. 889-892.

[14] S. Yildirim, M. Bulut, C.M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and S.S. Narayanan, "An acoustic study of emotions expressed in speech," in 8th International Conference on Spoken Language Processing (ICSLP 04), Jeju Island, Korea, October 2004, pp. 2193-2196.

Abstracts

[1] C.M. Lee, S. Yildirim, M. Bulut, C. Busso, A. Kazemzadeh, S. Lee, and S. Narayanan, "Effects of emotion on different phoneme classes," J. Acoust. Soc. Am., vol. 116, pp. 2481, 2004.

[2] M. Bulut, S. Yildirim, S. Lee, C.M. Lee, C. Busso, A. Kazemzadeh, and S. Narayanan, "Emotion to emotion speech conversion in phoneme level," J. Acoust. Soc. Am., vol. 116, pp. 2481, 2004.

[3] S. Yildirim, M. Bulut, C. Busso, C.M. Lee, A. Kazemzadeh, S. Lee, and S. Narayanan, "Study of acoustic correlates associated with emotional speech," J. Acoust. Soc. Am., vol. 116, pp. 2481, 2004.
Abstract
The verbal and non-verbal channels of human communication are internally and intricately connected. As a result, gestures and speech present high levels of correlation and coordination. This relationship is greatly affected by the linguistic and emotional content of the message being communicated. The interplay is observed across the different communication channels such as various aspects of speech, facial expressions, and movements of the hands, head and body. For example, facial expressions and prosodic speech tend to have a stronger emotional modulation when the vocal tract is physically constrained by the articulation to convey other linguistic communicative goals. As a result of the analysis, applications in recognition and synthesis of expressive communication are presented.
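The abstract's central observation is that synchronized speech and gesture streams exhibit high levels of correlation. As a rough, self-contained illustration (not drawn from the dissertation's actual analysis pipeline), the Python sketch below estimates such coupling by computing global and sliding-window Pearson correlations between two one-dimensional feature streams sampled at the same frame rate. The names energy (a stand-in prosodic contour) and brow_y (a stand-in facial-marker trajectory), as well as the synthetic signals in the example, are hypothetical placeholders and not features from this work.

# Minimal sketch, assuming two synchronized 1-D feature streams of equal length.
# Feature extraction (e.g., frame-level RMS energy, motion-capture marker
# displacement resampled to the audio frame rate) is assumed to happen elsewhere.
import numpy as np

def global_correlation(speech_feat, gesture_feat):
    # Pearson correlation over the whole utterance.
    return float(np.corrcoef(speech_feat, gesture_feat)[0, 1])

def windowed_correlation(speech_feat, gesture_feat, win=50, hop=25):
    # Local correlations over sliding windows, to see how the coupling varies in time.
    corrs = []
    for start in range(0, len(speech_feat) - win + 1, hop):
        s = speech_feat[start:start + win]
        g = gesture_feat[start:start + win]
        if s.std() > 0 and g.std() > 0:  # skip flat segments where correlation is undefined
            corrs.append(float(np.corrcoef(s, g)[0, 1]))
    return np.array(corrs)

if __name__ == "__main__":
    # Synthetic stand-ins for real prosodic and facial-motion features.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 10.0, 1000)
    energy = np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.standard_normal(t.size)
    brow_y = np.sin(2 * np.pi * 0.5 * t + 0.4) + 0.3 * rng.standard_normal(t.size)
    print("global r:", round(global_correlation(energy, brow_y), 3))
    print("median windowed r:", round(float(np.median(windowed_correlation(energy, brow_y))), 3))

In practice the windowed variant is the more informative of the two, since the strength of the audio-visual coupling is expected to vary with the linguistic and emotional content of each segment.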