EMOTIONS IN ENGINEERING: METHODS FOR THE INTERPRETATION
OF AMBIGUOUS EMOTIONAL CONTENT
by
Emily K. Mower
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2010
Copyright 2010 Emily K. Mower
Dedication
I would like to dedicate this Dissertation to my family and friends.
To my parents, sister and brother, thank you. You have listened to me talk about my
work ad nauseam for years and have always managed to sound interested and excited.
You have truly helped me maintain my enthusiasm and interest and I wouldn't be where
I am today without you! To my Damen, the amount of support that you have given me
has meant the world to me. Thank you for making my dissertation process wonderful
and exciting. I can't wait to see what lies in our future! To my friends, it was so much
fun to go on this journey with you guys. I look forward to our many adventures and
discoveries!
Acknowledgements
I would like to thank my advisors, Dr. Shrikanth Narayanan and Dr. Maja Matarić,
for their guidance and support during my time at the University of Southern California.
They have both provided me with a fantastic environment in which to grow and learn as
a researcher and I could not be more grateful. I would also like to thank Dr. Sungbok Lee
for his feedback and suggestions. His commitment to the research ideal is inspirational
and has changed the way I approach research problems.
Many thanks to my committee members, Dr. C.-C. Jay Kuo and Dr. Fei Sha. Thank
you so much for your research direction suggestions; they have been fascinating.
Thanks also go to Dr. Panayiotis Georgiou and Dr. May-Chen Kuo for the template
upon which this thesis was built.
Finally, I would also like to thank my labmates and classmates who have provided me
with invaluable advice, stimulating discussions, and wonderful adventures.
Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
1.1 Emotion: Definitions, Descriptions, and Quantification
1.1.1 Dimensional Characterization
1.1.2 Categorical Characterization
1.1.3 Emotion Evaluation Structures
1.1.4 Working Definitions
1.2 Problem Statement and Methods
1.2.1 Statistical Analyses of Perception
1.2.2 Modeling Across Users
1.2.3 Emotion Recognition via Emotion Profiling
1.2.4 Emotional Data Corpora
1.3 Related Work and Contributions to this Topic
1.3.1 Related Work in the Statistical Analysis of Perception
1.3.2 Related Work in Evaluator-Specific Modeling
1.3.3 Related Work in Emotion Recognition
1.3.4 Related Work in Emotion Profiling
1.4 Contributions
1.5 Outline of the Dissertation

Chapter 2: Statistical Analysis of Emotion Perception
2.1 Audio-Visual Stimuli
2.2 Evaluation Procedure
2.3 General Results
2.4 Perception of Presentation Types
2.5 Biases in Evaluation
2.6 Analysis of VAD Ratings for Emotional Clusters
2.7 Analysis of Video Contribution to Perception
2.8 Conclusions

Chapter 3: Emotionally Salient Features
3.1 Feature Sets
3.1.1 Audio Features
3.1.2 Video Features: FACS
3.1.3 Prior Knowledge Features
3.2 Method
3.2.1 Class Definition
3.2.2 Feature Selection
3.3 Feature Selection Results
3.3.1 Feature Selection Results for the Combined Congruent-Conflicting Dataset
3.3.2 Feature Selection Results for the Congruent and Conflicting Datasets
3.4 Validation: SVM Classification
3.4.1 Validation of the Combined Congruent-Conflicting Feature Sets
3.4.2 Validation of the Congruent and Conflicting Feature Sets
3.5 Discussion
3.6 Conclusion

Chapter 4: Evaluators as Individuals
4.1 Emotional Data
4.1.1 IEMOCAP Data
4.1.2 Data Selection
4.1.3 Audio Features
4.1.4 Treatment of Evaluations
4.2 Approach
4.2.1 Naïve Bayes
4.2.2 Hidden Markov Models
4.3 Results
4.3.1 Naïve Bayes Classification of Evaluator Consistency
4.3.2 HMM Classification for Correspondence Between Content and Evaluation
4.4 Discussion
4.5 Conclusion

Chapter 5: Emotion Profiling
5.1 Data Description
5.1.1 Emotion Expression Types
5.1.2 Data Selection
5.2 Audio-Visual Feature Extraction
5.2.1 Audio Features
5.2.2 Video Features
5.2.3 Feature Extraction
5.2.4 Feature Selection
5.2.5 Final Feature Set
5.3 Classification of Emotion Perception: Emotion Profile Support Vector Machine
5.3.1 Support Vector Machine Classification
5.3.2 Creation of Emotional Profiles
5.3.3 Final Decision
5.4 Results and Discussion: the Prototypical, Non-Prototypical MV, and Mixed Datasets
5.4.1 General Results
5.4.2 Prototypical Classification
5.4.3 Non-prototypical Majority-Vote (MV) Classification
5.4.4 Emotion Profiles as a Minor Emotion Detector
5.5 Results and Discussion: the Non-Prototypical NMV Dataset
5.5.1 Experiment One: Classification
5.5.2 Experiment Two: ANOVA of EP-Based Representations
5.6 Conclusion

Chapter 6: The Robustness of Emotion Profiling
6.1 Description of Data
6.1.1 IEMOCAP Database
6.1.2 Data Definitions
6.2 Emotion Profiles
6.2.1 Construction of an EP
6.2.2 Classification with EP-Based Representations
6.2.3 Speaker-Dependent and Speaker-Independent Components
6.3 Feature Extraction and Selection
6.3.1 Feature Selection
6.4 Methods
6.5 Results
6.5.1 Classification with EP Frustration Training
6.5.2 Classification without EP Frustration Training
6.5.3 EP Representation of Frustration
6.6 Conclusions

Chapter 7: Cluster Profiling
7.1 Description of Data: The IEMOCAP Database
7.2 Emotion and Cluster Profiles
7.2.1 Description of the Train and Test Sets
7.2.2 Unsupervised Clustering for CPs
7.2.3 Construction of a Profile
7.3 Feature Extraction
7.4 Feature Selection
7.5 Experimental Methods
7.6 Results
7.6.1 EP Classification
7.6.2 CP Classification
7.7 Discussion
7.8 Conclusions

Chapter 8: Conclusions and Future Work
8.1 Emotion Perception
8.2 Evaluator Modeling
8.3 Emotion Profiles
8.4 Open Problems and Future Work

References
List Of Tables

1.1 Ekman's characteristics that provide differentiation among the basic emotions. From [39], pg. 53.
2.1 Discriminant analysis classification (A = angry, H = happy, S = sad, N = neutral) for (a) audio-only, (b) video-only, and (c) audio-visual evaluations.
2.2 ANOVA post hoc analysis (A - angry, H - happy, S - sad, N - neutral) of the three presentation conditions. The letters VAD indicate that the cluster means are significantly different (α = 0.01) in the valence, activation, and dominance dimensions.
2.3 Classification accuracy in the presence of conflicting audio-visual information: (a) angry voice held constant, (b) angry face held constant.
2.4 Cluster shift analysis with respect to the VAD dimensions (where ΔV_audio represents the shift in valence mean from the audio-only evaluation to the audio-visual evaluation). Entries in bold designate evaluations of the audio-visual presentations that are significantly different, with α ≤ 0.05, from that of either the video-only or audio-only presentations (paired t-test). Entries with a star (*) designate evaluations that are significantly different with α ≤ 0.001.
3.1 A summary of the audio and video features used in this study.
3.2 A summary of the features used in the audio-visual analysis of this study. The order of the features (left to right) indicates their relative importance. The numbers in parentheses represent the highest and lowest mean Information Gain above the threshold. Bold italic fonts represent features selected across all three dimensions; italic fonts represent features selected across two dimensions.
3.3 The audio-visual features selected in the congruent database. Features in bold are features that were selected across the valence, activation, and dominance dimensions of the Congruent database. Features in bold-italics are features that were selected in the Congruent_VAD and Conflicting_AD databases.
3.4 The audio-visual features selected in the conflicting database. Features in bold are features that were selected across the valence, activation, and dominance dimensions of the Congruent database. Features in bold-italics are features that were selected in the Congruent_VAD and Conflicting_AD databases.
3.5 The classification results (SVM) over the three presentation conditions (audio-only, video-only, audio-visual) and three dimensions (valence, activation, dominance). "Full" refers to classification performed with the original feature set. "Reduced" refers to classification performed with the feature set resulting from Information Gain feature selection.
3.6 The SVM classification accuracies (percentages) over the two database divisions (congruent, conflicting) and three dimensions (valence, activation, dominance) using feature sets reduced with the Information Gain criterion discussed in Section 3.2.2. The columns marked "Full" refer to the full feature set. The columns marked "Reduced" refer to the reduced feature set.
4.1 Mapping from phonemes to broad phoneme classes (originally presented in [19]).
4.2 Data format used for HMM categorical emotion training; original sentence: "What was that," expressed with a valence rating of one.
4.3 Confusion matrices for the categorical emotion classification task (A = angry, H = happy, S = sad, N = neutral). The results presented in this table are percentages.
4.4 Classification: valence across the three levels.
4.5 Classification: activation across the three levels.
5.1 The distribution of the classes in the emotion expression types (note: each utterance in the 2L group has two labels, thus the sum of the labels is 840 but the total number of sentences is 420). There are a total of 3,000 utterances in the prototypical and non-prototypical MV group, and 3,702 utterances in total.
5.2 The average percentage of each feature over the 40 speaker-independent emotion-specific feature sets (10 speakers * 4 emotions).
5.3 The EP and baseline classification results for three data divisions: full (a combination of prototypical and non-prototypical MV), prototypical, and non-prototypical MV. The baseline result (simplified SVM) is presented as a weighted accuracy.
5.4 The major-minor emotion analysis.
5.5 The results of the EP classification on the 2L non-prototypical NMV data. The results are the precision, or the percentage of correctly returned class designations divided by the total returned class designations.
5.6 ANOVA analysis of the difference in group means between co-occurring and non-occurring emotions within an EP set (individual EP set experiment). (-: α = 0.1, *: α = 0.05, **: α = 0.01, ***: α = 0.001)
5.7 ANOVA analyses of the differences between reported emotions in profiles in which they were reported vs. profiles in which they weren't. Note that the EP_1 vs. EP_2 result is an interaction of an ANOVA analysis of the set EP_1 vs. EP_2 and an ANOVA analysis of the representation of the individual emotions in each EP set. (-: α = 0.1, *: α = 0.05, **: α = 0.01, ***: α = 0.001)
6.1 The distribution of the emotion classes in the prototypical and nonprototypical categories.
6.2 Classification results (F-measure) across the three datasets: prototypical, combined, and nonprototypical. "EP Train" indicates five-dimensional EPs; "No EP Train" indicates four-dimensional EPs.
6.3 ANOVA analysis of the component-by-component comparison between the frustrated and other emotional EPs. The emotion components are labeled by the first letter of their class (e.g., angry EP component = 'A'). All dimensions listed in this table are statistically different with p < 0.001.
7.1 The CP-based classification results. The entries in bold font indicate the best accuracy or F-measure recorded.
List Of Figures

1.1 The relationship between the components of this thesis. The orange blocks highlight the emotion perception and recognition genres. The dark yellow blocks present the studies in the perception and recognition genres. The light yellow sections indicate work utilized, rather than work presented. The pink block is the unification of the perception and recognition results and the work utilized.
2.1 The frames of the four emotional presentations (left) and online emotion evaluation interface (right) used in this study.
2.2 The valence (x-axis) and activation (y-axis) dimensions of the evaluations; the ellipses are 50% error ellipses.
2.3 Comparison between the emotion perceptions resulting from conflicting audio-visual presentation.
3.1 Comparison between the emotion perceptions resulting from conflicting audio-visual presentation.
5.1 The location of the IR markers used in the motion capture data collection.
5.2 The FAP-inspired facial distance features utilized in classification.
5.3 The EP system diagram. An input utterance is classified using a four-way binary classification. This classification results in four output labels representing membership in the class (+1) or lack thereof (-1). This membership is weighted by the confidence (distance from the hyperplane). The final emotion label is the most highly confident assessment.
5.4 The raw distances to the hyperplane for the four emotional components of the EP.
5.5 The average emotional profiles for all (both prototypical and non-prototypical) utterances. The error bars represent the standard deviation.
5.6 The average emotional profiles for prototypical utterances. The error bars represent the standard deviation.
5.7 The average emotional profiles for non-prototypical utterances. The error bars represent the standard deviation.
5.8 The average emotional profiles for the non-prototypical NMV utterances. The error bars represent the standard deviation.
6.1 The EP based classification system diagram. This example demonstrates the correct classification of a nonprototypical angry utterance (a mixture of anger and sadness).
6.2 The EP of an utterance tagged as 'happy'. This EP has been trained without frustration data.
6.3 The average EPs for the prototypical and nonprototypical utterances when the EPs were trained without frustration data. The error bars represent the standard deviation. The happy EP is not included in this plot; the trends follow those of the angry and sad EPs.
6.4 The average EPs for the prototypical and nonprototypical utterances when the EPs were trained with frustration data. The error bars represent the standard deviation. The sad EP is not included in this plot; the trends follow those of the angry and happy EPs.
7.1 The CP based classification system diagram. This example demonstrates the correct classification of a nonprototypical angry utterance (a mixture of anger and sadness).
Abstract
Emotion has intrigued researchers for generations. This fascination has permeated the engineering community, motivating the development of affective computational models for the classification of affective states. However, human emotion remains notoriously difficult to interpret computationally, both because of the mismatch between the emotional cue generation (the speaker) and perception (the observer) processes and because of the presence of complex emotions, emotions that contain shades of multiple affective classes. Proper representations of emotion would ameliorate this problem by introducing multidimensional characterizations of the data that permit the quantification and description of the varied affective components of each utterance. Currently, the mathematical representation of emotion is an area that is under-explored.

Research in emotion expression and perception provides a complex and human-centered platform for the integration of machine learning techniques and multimodal signal processing towards the design of interpretable data representations. The focus of this dissertation is to provide a computational description of human emotion perception and combine this knowledge with the information gleaned from emotion classification experiments to develop a mathematical characterization capable of interpreting naturalistic expressions of emotion utilizing a data representation method called Emotion Profiles.

The analysis of human emotion perception provides an understanding of how humans integrate audio and video information during emotional presentations. The goals of this work are to determine how audio and video information interact during the human emotional evaluation process and to identify a subset of the features that contribute to specific types of emotion perception. We identify perceptually-relevant feature modulations and multi-modal feature integration trends using statistical analyses of the evaluator reports.

The trends in evaluator reports are analyzed using emotion classification. We study evaluator performance using a combination of Hidden Markov Models (HMM) and Naïve Bayes (NB) classification. The HMM classification is used to predict individual evaluator emotional assessments. The NB classification provides an estimate of the consistency of the evaluator's mental model of emotion. We demonstrate that evaluator reports created by evaluators with higher levels of estimated consistency are more accurately predicted than evaluator reports from evaluators that are less consistent.

The insights gleaned from the emotion perception and classification studies are aggregated to develop a novel emotional representation scheme, called Emotion Profiles (EP). The design of the EPs is predicated on the knowledge that naturalistic emotion expressions can be approximately described using one or more labels from a set of basic emotions. EPs are a quantitative measure expressing the degree of the presence or absence of a set of basic emotions within an expression. They avoid the need for a hard-labeled assignment by instead providing a method for describing the shades of emotion present in an utterance. These profiles can be used to determine a most likely assignment for an utterance, to map out the evolution of the emotional tenor of an interaction, or to interpret utterances that have multiple affective components. The Emotion-Profile technique is able to accurately identify the emotion of utterances with definable ground truths (emotions with an evaluator consensus) and is able to interpret the affective content of emotions with ambiguous emotional content (no evaluator consensus), emotions that are typically discarded during classification tasks.

The algorithms and statistical analyses presented in this work are tested using two databases. The first database is a combination of synthetic (facial information) and natural human (vocal information) cues. The affective content of the two modalities is either matched (congruent presentation) or mismatched (conflicting presentation). The congruent and conflicting presentations are used to assess the affective perceptual relevance of both individual modalities and the specific feature modulations of those modalities. The second database is an audio-visual + motion-capture database collected at the University of Southern California, the USC IEMOCAP database. This database is used to assess the efficacy of the EP technique for quantifying the emotional content of an utterance. The IEMOCAP database is also used in the classification studies to determine how well individual evaluators can be modeled and how accurately discrete emotional labels (e.g., angry, happy, sad, neutral) can be predicted given audio and motion-capture feature information.

The future directions of this work include the unification of the emotion perception, classification, and quantification studies. The classification framework will be extended to include evaluator-specific features (an extension of the emotion perception studies) and temporal features based on EP estimates. This unification will produce a classification framework that is not only more effective than previous versions, but is also able to adapt to specific user emotion production and perception styles.
Chapter 1
Introduction
Interactive technologies are becoming increasingly prevalent in society. These technologies range from simple hand-held devices to fully embodied robotic agents. Each of these interfaces contains underlying protocols that dictate the interaction behaviors upon which these technologies rely. The protocols range from simple if-then loops to complicated emotion and drive-based internal representations of agent state. Users observe manifestations of agent state through the conveyed interactive behaviors.
Increasingly, these synthetic agent behaviors are modulated by emotional qualities [7, 49, 96, 107]. The animators of Disney have long understood the importance of endowing their characters with appropriate emotional attributes. In "The Illusion of Life," Thomas and Johnston assert that, "From the earliest days, it has been the portrayal of emotions that has given the Disney characters the illusion of life" [109]. According to Bates, it is this "illusion of life" that creates a believable character, one for which the audience is willing to suspend its disbelief [7].
The creation of reliably recognized emotion expressions requires an understanding of underlying social expectations. Agents incapable of creating emotional expressions that meet (or ideally exceed) the baseline social expectations may not be capable of maintaining long-term user interactions [50, 93, 107].
However, synthetic emotion is not expressed in a vacuum. Consequently, an agent that is able to reliably produce, but not recognize, a human interaction partner's emotional state may still be unable to meet its interaction goals. Proper interpretation of user state hinges on more than task awareness; it also depends on the more subtle qualities of user expression [34, 35, 91]. In fact, there is evidence in the psychological community that even humans who lack proper emotional production and perception abilities have trouble forming and maintaining relationships [39].
Human emotion perception is a complex and dynamic process. There is much debate within the psychology and neuroscience communities regarding the true definition of emotion and how it should be quantified and analyzed. This work does not seek to develop a new definition of emotion; rather, it uses currently accepted definitions of emotion to understand how we can develop models capable of capturing the modulations present in emotional expressions. This work presents an engineering approach, focusing on the development of techniques to estimate high-level emotion labels from the low-level, feature-level modulations of utterances. While motivated by psychological theories of emotion, this mapping does not necessarily seek to produce a true human-centric model of either the human emotion perception or production processes. Rather, it seeks to provide techniques that can inform the design of emotional audio-visual synthetic behavior.
Emotional models for expression and user state interpretation are currently used in diagnostics, interactions with, and interventions for children with autism [41, 60, 68, 71]. In these scenarios the computer agent acts as either a tutor or a peer for the child. Due to the sensitive nature of the domain, the computer agent must be able to do more than convey the material; it must also be able to maintain the interest of the child. As a result, the agent needs to provide more than basic instructive information; it must also provide motivation, empathy, and encouragement while recognizing signs of child frustration, interest, and boredom. This necessitates the development of systems designed both to recognize human affect and to respond appropriately to it. Both of these systems require proper emotional models capable of operating within the scope of human social expectations.
1.1 Emotion: Definitions, Descriptions, and Quantification
In "Human Emotions," Izard states that emotions are comprised of three components: conscious feelings or experiences of emotion; processes that occur in the nervous system or brain; and observable expressive patterns (e.g., facial expressions). There are several theories that describe how emotions are realized and produced in the body. The cognitive/appraisal theories state that emotions result from two sources: an individual's physiological reaction to a situation and an individual's cognitive appraisal of that situation. The physiological theories argue that this appraisal is an intuitive and automatic process [58]. The causal theories of emotion are outside of the scope of this document and will not be discussed further.
In this thesis, we focus on methods for quantifying the emotion content of an utterance. The two most commonly used methods are the dimensional and the categorical (basic/discrete) characterizations of emotion. Categorical descriptions describe the emotion content of an utterance in terms of semantic labels of emotion (e.g., angry, happy, neutral, sad). Dimensional descriptions of emotion seek to describe emotion in terms of its underlying properties. These dimensions often include valence, describing the positivity (positive vs. negative); activation, describing the level of arousal (calm vs. excited); and dominance, describing the level of aggression (passive vs. aggressive) of the utterance.

It is not within the scope of this work to enter into the debate between these two methods. Rather, the quantification methods are chosen on a per-study basis to maximize the knowledge gained from each of the presented studies.
1.1.1 Dimensional Characterization
The dimensional view of emotion is based on the idea that emotions exist on a continuum, captured by axes with specific semantic meaning. The dimensions of emotion most commonly utilized were introduced by Schlosberg in 1954 [100]. These dimensions included pleasantness-unpleasantness, attention-rejection, and low-high activation. Abelson and Sermat [1] suggested that a two-dimensional structure is capable of capturing most emotions. They determined that the axes of pleasant-unpleasant and tension-sleep are adequate to describe human emotions after applying a multidimensional scaling approach over combinations of 13 stimuli from the Lightfoot series, a series of facial expressions. These axes, or variations of these axes, have been used in countless works, most often taking on the labels valence and activation/arousal [27, 42, 62, 72, 97, 99].
1.1.2 Categorical Characterization
The theory of a discrete characterization of emotion is based on the assumption that there is a set of emotions that can be considered "basic." A basic emotion is defined as an emotion that is differentiable from all other emotions. In Ekman's "Basic Emotions," the author elucidates the properties of emotions that allow for the differentiation between the basic emotions (Table 1.1).
1. Distinctive universal signals
2. Distinctive physiology
3. Automatic appraisal
4. Distinctive universals in antecedent events
5. Distinctive appearance developmentally
6. Presence in other primates
7. Quick onset
8. Brief duration
9. Unbidden occurrence
10. Distinctive thoughts, memories, images
11. Distinctive subjective experience

Table 1.1: Ekman's characteristics that provide differentiation among the basic emotions (from [39], pg. 53).
The set of basic emotions can be thought of as a subset of the space of human emotion, forming a "basis" for the emotional space. More complex, or secondary, emotions can be created by blending combinations of the basic emotions. For example, the secondary emotion of jealousy can be thought of as the combination of the basic emotions of anger and sadness [120].

In his critique of the theory of basic emotions, Ortony asserted that the idea of basic emotions is attractive for three main reasons: there exists a set of emotions that pervade cultural boundaries and exist even in higher animals (such as primates); they are universally recognized and associated with specific facial expressions; and they seem to provide survival advantages to either the species or the individual. However, there are high levels of variation in the makeup of the lists of basic emotions among emotion researchers, which would seem to question the validity of such an assertion [88]. The size of these lists may range from two emotions [86] to fifteen [39]. The proposed basic emotion sets still differ even when semantic variability is mitigated.

However, even in the presence of this disparity of views, there are often four emotions postulated as basic. This emotion list includes anger, happiness, sadness, and fear [88]. The basic emotions utilized in this work are a subset of this basic emotion list and include anger, happiness, sadness, and neutrality (the absence of discernible emotional content). The emotion of fear is not included in this thesis because it is not well represented in any of the included datasets.
1.1.3 Emotion Evaluation Structures
In all of the work presented in this dissertation, the emotional labels are assigned based on evaluator reports. One criticism of this style is that it relies on a fully conscious approach to attributing the emotional content of an utterance, while the process itself is both conscious and unconscious. However, this limitation is inherent in the domain, as it is not yet possible to develop a true understanding of both an individual's conscious and unconscious emotional reaction to a stimulus, although physiological emotion recognition is an ever-developing field (for a survey, please see [67]). Furthermore, there is knowledge to be gained from an individual's opinion of an emotional stimulus, even if it differs slightly from his or her true underlying reaction.
Ortony addresses this fundamental difficulty in his book, "The Cognitive Structure of Emotions." He states that:

There is yet no known objective measure that can conclusively establish that a person is experiencing some particular emotion, just as there is no known way of establishing that a person is experiencing some particular color. In practice, however, this does not normally constitute a problem because we are willing to treat people's reports of their emotions as valid. Because emotions are subjective experiences, like the sensation of color or pain, people have direct access to them, so that if a person is experiencing fear, for example, that person cannot be mistaken about the fact that he or she is experiencing fear ([89], pg. 9).

He thus asserts that the subjective reported emotion of the evaluators is acceptable evidence for an assignment of a ground truth.
1.1.4 Working Definitions
In this thesis, we cast the emotion perception process as a recognition problem, mapping between the observation and the evaluator-reported interpretation of a stimulus. This inherently conscious interpretation allows us to understand the emotional weight of a given stimulus.

We define the emotion label, or ground truth, of our emotional utterances in one of two ways. The first method defines an emotion's ground truth as the majority-voted opinion of the set of evaluators. The second method, used in user-specific modeling experiments, is the individual's opinion for a given emotion.
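As a rough illustration of the first definition, the sketch below computes a majority-voted label from a set of evaluator reports. It is not taken from the dissertation's implementation; the strict-majority rule, the function name, and the data layout are assumptions made for the example.

```python
from collections import Counter

def majority_vote_label(evaluator_labels):
    """Majority-voted ground truth for one utterance.

    evaluator_labels holds the categorical reports of all evaluators for the
    utterance, e.g. ["angry", "angry", "neutral"]. Returns None when no label
    reaches a strict majority, i.e., when there is no evaluator consensus.
    """
    label, count = Counter(evaluator_labels).most_common(1)[0]
    return label if count > len(evaluator_labels) / 2 else None
```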
Figure 1.1: The relationship between the components of this thesis. The orange blocks highlight the emotion perception and recognition genres. The dark yellow blocks present the studies in the perception and recognition genres. The light yellow sections indicate work utilized, rather than work presented. The pink block is the unification of the perception and recognition results and the work utilized.
1.2 Problem Statement and Methods
The goal of this work is to develop a mapping between emotion expression and reported emotion perception. This thesis presents two study genres, designed to forward the development of a new emotion quantification and interpretation framework. The first study genre is emotion recognition. This dissertation demonstrates that emotion recognition is affected both by the inherent naturalness of human expression and by variations in evaluator reporting styles. In the second study genre, emotion perception, this dissertation demonstrates that individuals have consistent methods for evaluating emotion expressions and specific features and modalities upon which they rely. The results of these studies are combined to forward a new emotion quantification and classification paradigm in which emotions are described based on a set of either semantic labels or data-driven components, called Emotion Profiles (EP). Please see Figure 1.1 for a pictorial description of the relationship between these studies.
1.2.1 Statistical Analyses of Perception
Evaluators do not use all information available during the emotion perception process. This thesis presents an analysis of the interaction between emotional audio (human voice) and video (simple animation) cues to further the understanding of how individuals integrate emotional information. The emotional relevance of the channels is analyzed with respect to their effect on human perception and through the study of the extracted audio-visual features that contribute most prominently to human perception. As a result of the unequal level of expressivity across the two channels, the audio biases the perception of the evaluators. However, even in the presence of a strong audio bias, the video data affect human perception. The feature selection results indicate that when presented with emotionally matched stimuli, users rely on both audio and video cues, but when presented with emotionally mismatched information, users rely solely on audio information. This result suggests that observers integrate natural audio cues and synthetic video cues only when the information expressed is in congruence. It is therefore important to properly design the presentation of audio-visual cues, as incorrect design may cause observers to ignore the information conveyed in one of the channels.
1.2.2 Modeling Across Users
Evaluators provide a ground truth for emotion recognition studies. Unfortunately, these evaluations contain large amounts of variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model reported emotion perception at the individual level. However, the perceptions of specific users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by the common technique of averaging evaluations from multiple users. We demonstrate that this averaging procedure improves classification performance when compared to classification results from models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations.
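As a minimal sketch of the averaging procedure (assuming dimensional ratings stored one row per evaluator; the function name and array layout are illustrative, not the dissertation's implementation):

```python
import numpy as np

def averaged_evaluation(ratings):
    """Average dimensional ratings over evaluators for one utterance.

    ratings is an (n_evaluators x n_dimensions) array-like, e.g. one row of
    [valence, activation, dominance] scores per evaluator. The averaged
    vector can then replace any single evaluator's report as the training
    target for acoustic models.
    """
    return np.asarray(ratings, dtype=float).mean(axis=0)

# Example: three evaluators rating one clip on 5-point VAD scales.
# averaged_evaluation([[2, 4, 3], [3, 4, 2], [2, 5, 3]]) ~= [2.33, 4.33, 2.67]
```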
1.2.3 Emotion Recognition via Emotion Profiling
Emotion recognition is complicated by more than feature modulation and user variability; it is often obfuscated by an uncertain ground truth. The ground truth is often masked, and reporting processes may not capture true perception. Further, the features involved in the production of emotion are related across multiple time scales and are correlated with internal representations that are not observable. Engineering solutions are well situated to approximate this process. Although these models may not be able to capture and model the true link between emotion production and emotion perception, they can provide insight into the relationship between these two processes.
Emotion expressions can be described by creating a representation in terms of the presence or absence of a subset of categorical emotional labels (e.g., angry, happy, sad) within the data being evaluated (e.g., a spoken utterance). This multiple-labeling representation can be expressed using Emotion Profiles (EP). EPs provide a quantitative measure for expressing the degree of the presence or absence of a set of basic emotions within an expression. They avoid the need for a hard-labeled assignment by instead providing a method for describing the shades of emotion present in the data. These profiles can be used in turn to determine a most likely assignment for an utterance, to map out the evolution of the emotional tenor of an interaction, or to interpret utterances that have multiple affective components.
Profile-based techniques have been used within the emotion research community as a method for expressing the variability inherent in multi-evaluator expressions [105]. The profiles were used to represent the distribution of reported emotion labels from a set of evaluators for a given utterance. Steidl et al. compared the entropy of their automatic classification system to that present in human evaluations. In our previous work [85], EPs were described as a method for representing the phoneme-level classification output over an utterance. These profiles described the percentage of phonemes classified as one of five emotion classes.
The profiling method described in this dissertation is derived from the combined output of a classification system composed of n binary classifiers and the confidences derived from each classification. EPs are created by weighting the output of the n classifiers by an estimate of the confidence of the assignment. EPs can be used to estimate a single emotion label by selecting the emotion class with the highest level of confidence represented in the EP, or by further classifying the generated EPs. These EPs can also be used as a unit to describe emotions that cannot be captured by a single ground-truth label. This technique provides a method for discriminately representing emotional utterances with ambiguous content.
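The construction just described can be sketched as follows. This is a minimal illustration rather than the dissertation's exact pipeline: it assumes scikit-learn's LinearSVC as the binary classifier, uses the signed SVM decision value (proportional to the distance from the hyperplane) as the confidence estimate, and the emotion set, function names, and data layout are chosen for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # illustrative class set

def train_binary_classifiers(X, y):
    """Train one binary (one-vs-rest) SVM per emotion class.

    X: (n_utterances x n_features) feature matrix; y: array of string labels.
    """
    y = np.asarray(y)
    return {emo: LinearSVC().fit(X, (y == emo).astype(int)) for emo in EMOTIONS}

def emotion_profile(classifiers, x):
    """Build an Emotion Profile for one utterance feature vector x.

    Each component is the binary membership decision weighted by a confidence
    estimate; with a linear SVM this is the signed decision value
    (positive = emotion present, negative = absent).
    """
    x = np.asarray(x, dtype=float).reshape(1, -1)
    return {emo: float(clf.decision_function(x)[0]) for emo, clf in classifiers.items()}

def most_likely_emotion(ep):
    """Single-label estimate: the most confidently present emotion."""
    return max(ep, key=ep.get)
```

The generated profiles can either be collapsed to a single label in this way or, as noted above, treated as n-dimensional representations and classified further.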
1.2.4 Emotional Data Corpora
The presented work demonstrates the development of an emotion classification framework using two databases, one with artificially created emotionally matched and mismatched stimuli, the other with actors in a motion-capture setting. The first database ("Congruent-Conflicting") is used to study the effect of varying levels of expression on the reported human perception of emotion. This database is composed of synthetically generated audio-visual expressions composed of a computer-generated face and a human voice. The expressions contain both congruent (emotionally matched audio and video information) and conflicting (emotionally mismatched audio and video information) presentations. It is used to analyze emotion expression both with respect to high-level labels and with respect to features of importance.

The second database ("USC IEMOCAP") is used to study the relationship between feature modulations at the phoneme level and utterance level and evaluator emotional reports. The USC IEMOCAP database is an audio-visual plus motion-capture database recorded using a dyadic human interaction paradigm. It is used to study both the accuracy and trade-offs of user-specific modeling and to evaluate the efficacy of the emotion profiling technique for emotion recognition.
1.3 Related Work and Contributions to this Topic
1.3.1 Related Work in the Statistical Analysis of Perception
The perception analysis was principally motivated by the McGurk effect [75]. The McGurk effect occurs when mismatched audio and video syllables are presented to a human observer. Instead of perceiving either of the two presented syllables, McGurk and MacDonald found that observers perceived a third, distinct syllable. This finding has motivated many emotion research studies designed to determine if such an effect occurs within the emotion domain. Emotional McGurk studies are primarily conducted using either discrete emotion assignment (e.g., happy or angry) [28, 29, 31, 44, 56, 74] or dimensional evaluation [80-82]. This effect has also been studied using fMRI [80] and EEG measurements [76]. In these studies, the emotion presentations have included congruent and conflicting information from the facial and vocal channels (e.g., [31]), facial channel and context (e.g., [80]), and facial and body postural/positional information (e.g., [76]).

In discrete-choice evaluations, users are asked to rate the utterance by selecting the emotional label that best fits the data. Such evaluations allow researchers to identify the point along an emotional continuum at which a given face or voice is of a sufficient emotional strength to bias the decision of the evaluators [28]. In a dimensional analysis, evaluators are asked to rate the presented stimuli according to the properties of those stimuli. Common properties (or dimensions) include valence (positive vs. negative), activation (calm vs. excited), and dominance (passive vs. aggressive). One common dimensional evaluation technique utilizes Self-Assessment Manikins (SAM) [9]. This evaluation methodology presents the dimensions of valence, activation, and dominance using a pictorial, text-free display. This display method allows evaluators to ground their assessments using the provided end-points.
In [28], de Gelder combined still images with single spoken words in three experiments. The first experiment presented images morphed from two archetypal happy and sad emotional images into a visual emotional continuum. These images were accompanied by either vocally happy or sad human utterances. The evaluators were presented with an audio-only, video-only, or combined audio-visual presentation and were asked to assign the combined presentation into one of the discrete emotional categories of "happy" or "sad." The researchers found that the evaluators were able to correctly recognize the emotion in the audio clip 100% of the time. In the combined audio-visual presentation, they found that the voice altered the probability that an evaluator would identify the presentation as "sad." The researchers then repeated the experiment, this time asking the users to judge the face and ignore the voice. They found that the emotion presented in the audio channel still had an effect on the discrete emotion assignment. However, in this experiment, they found that the effect was smaller than that seen previously. In the final experiment, the researchers created a vocal continuum ranging from happiness to fear (to allow for a more natural vocal continuum). They asked the users to attune to the voice and to ignore the face. They found that, as in the audio channel experiments, users were still affected by the visually presented emotion.
A similar study was presented in [56]. In that work, Hietanen et al. asked evaluators to rate 144 stimuli composed of still images and emotional speech using happy, angry, and neutral emotions. The evaluators were asked to respond as quickly as possible after viewing a stimulus presentation. In the first experiment, the researchers asked the evaluators to attune to the emotion presented in the facial channel. They found that the emotion presented in the vocal channel (the unattended emotion) affected the evaluators with respect to accuracy (discrete classification of facial emotion) and response time (faster for congruent presentations). When the evaluators were instead asked to attune to the facial channel, the researchers found that vocal channel mismatches decreased the facial emotion recognition accuracy and increased the response time. In the final experiment, users were presented with an emotional stimulus (a vocal utterance), a pause, and a second emotional stimulus (the facial emotion) used as a "go-signal." The researchers found that when the emotion presentations were separated by a delay, the channels no longer interacted in the emotion evaluation and the evaluators based their decisions on the vocal signal only.
Interactions between emotional channels have also been studied using emotional faces paired with contextual movies [80]. In that study, Mobbs et al. presented the evaluators with four seconds of a movie, after which the evaluators were shown a static image. The contextual emotions included positive, negative, and neutral. The emotional faces included happy, fear, and neutral. These combined presentations were rated using SAMs. The researchers found that faces presented with a positive or negative context were rated significantly differently than faces presented in a neutral context. Furthermore, the fMRI data showed that pairings between faces and emotional movies resulted in enhanced BOLD responses in several brain regions.
The McGurk effect has also been studied with respect to body posture and facial analyses [76]. In this study, Meeren et al. paired emotional faces with emotional body positions (fear and anger for both) to analyze the interplay between facial and postural information in emotion evaluation. They found that evaluators were able to assess the emotion state (using a discrete-choice evaluation) of the stimulus most quickly and accurately when viewing congruent presentations. The results showed that the analysis time for faces-only was faster than for bodies-only. However, these results suggest that facial emotional assessment is biased by the emotion embedded in body posture.
The expression of emotion has also been studied in a more localized manner. One such method utilizes a "Bubble" [48]. This method is designed to identify regions of interest that correspond to task-related performance by only permitting users to view certain areas of the stimulus. The stimulus is covered by an opaque mask. Regions are randomly shown to the evaluators by creating Gaussian "bubbles," which allow users to glimpse regions of the masked stimulus. Given an infinite number of trials, all window combinations will be explored. The "Bubble" method allows for a systematic evaluation of stimuli components, but produces results that are difficult to translate into system design suggestions.
These past studies suggest that the video and audio channels interact during human emotion processing when presented synchronously. However, due to the discrete nature of the evaluation frameworks, it is difficult to determine how the perception of the targeted emotions changes in the presence of conflicting information. The work presented in this thesis uses the dimensional evaluation method reported in [9] to ascertain the nature of the audio-visual channel interactions. This work also differs from previous work in its use of video clips rather than static photographs. The inclusion of dynamic facial stimuli in this work makes the results more transferable to the interactive design domain.
1.3.2 Related Work in Evaluator-Specific Modeling
Emotion recognition has been studied extensively [19, 24, 53, 122]. However, these studies do not provide analyses of inter-evaluator differences. In [8, 20, 111], the authors present analyses of the differences between self-evaluations and the evaluations of others. In [105], Steidl et al. present a new emotion classification accuracy metric that considers common inter-evaluator emotion classification errors. The question of inter-evaluator averaging remains unexplored.

Human evaluators are as unique as snowflakes. Consequently, one would expect that Hidden Markov Models (HMM) trained on individual-specific data would better capture the variability inherent in the individual's evaluation style. However, we demonstrate that models trained on averaged data either outperform or perform comparably to those trained solely on the individual-specific data. The results also suggest that evaluations from individuals with a higher level of internal emotional consistency are more representative of the emotional acoustic properties of the clips than those of less consistent evaluators.
1.3.3 Related Work in Emotion Recognition
Engineering models provide an important avenue through which to develop a greater understanding of human emotion. These techniques enable quantitative analysis of current theories, illuminating features that are common to specific types of emotion perception and the patterns that exist across the emotion classes. Such computational models can inform the design of automatic emotion classification systems from speech and other forms of emotion-relevant data. Multimodal classification of emotion is widely used across the community [14, 103, 115]. For a survey of the field, see [24, 59, 121].

In natural human communication, emotions do not follow a static mold. They vary temporally with speech [16], are expressed and perceived over multiple modalities [30, 108], may be inherently ambiguous [13, 33, 36], or may have emotional connotations resulting from other emotional utterances within a dialog [63]. A classification scheme designed to recognize only the subset of emotional utterances consisting of well-defined emotions will not be able to handle the natural variability in human emotional expression.
Conventionally, when training emotion recognition classifiers, researchers utilize emotional expressions that are rated consistently by a set of human evaluators. These expressions are referred to as prototypical emotion expressions. This process ensures that the models capture the emotionally-relevant modulations. However, while analyzing natural human interactions, including in an online human-computer or human-robot interaction (HCI or HRI) application, one cannot expect that every human utterance will contain clear emotional content. Consequently, techniques must be developed to handle, model, and utilize these emotionally ambiguous, or non-prototypical, utterances within the context of HCI or HRI.
1.3.4 Related Work in Emotion Profiling
Ambiguity in emotion expression and perception is a natural part of human communica-
tion. This ambiguity can be clarified by designating an utterance as either a prototypical
or non-prototypical emotional episode, terms described by Russell in [98]. These labels
can be used to provide a coarse description of the ambiguity present in an utterance.
Prototypical emotional episodes occur when all of the following elements are present: there is a consciously accessible affective feeling (defined as "core affect"); there is an obvious expression of the correct behavior with respect to an object; attention is directed toward the object, there is an appraisal of the object, and attributions of the object are constructed; the individual is aware of the affective state; and there is an alignment of the psychophysiological processes [98]. Non-prototypical emotional episodes occur when one or more of these elements are missing. Non-prototypical utterances can be differentiated from prototypical utterances by their enhanced emotional ambiguity.
Emotional ambiguity may result from the blending of emotions, masking of emotions, a cause-and-effect conflict of expression, the inherent ambiguity in emotion expression, and an expression of emotions in a sequence. Blended emotion expressions occur when two emotions are expressed concurrently. Masking occurs when one emotion (e.g., happiness) is used to mask another (e.g., anger). Cause-and-effect may result in a perception of ambiguity when the expressions have a conflict between the positive and negative characteristics of the expression (e.g., weeping for joy). Inherent ambiguity may occur when two classes of emotion are not strongly differentiated (e.g., irritation and anger). Finally, ambiguity may also occur when a sequence of emotions is expressed consecutively within the boundary of one utterance [33]. In all of these cases, the utterance cannot be well described by a single hard label.
The proper representation and classification of emotionally ambiguous utterances has recently received increased attention. At the Interspeech Conference in 2009 there was an Emotion Challenge special session focused on the classification of emotionally ambiguous utterances [102]. Similarly, at the Affective Computing and Intelligent Interaction (ACII) Conference in 2009 there was also a special session entitled, "Recognition of Non-Prototypical Emotion from Speech – The Final Frontier?" This session focused on the need to interpret non-prototypical, or ambiguous, emotional utterances. Emotional ambiguity has also been studied with respect to classification performance [53,104] and synthesis [10,64].
Emotional profiles (EP) can be used to interpret the emotion content of ambiguous utterances. EP-based methods have been used to describe the emotional content of an utterance with respect to evaluator reports [55,104], classification output [85], and perception, as a combination of multiple emotions, resulting from one group's actions towards another group [22]. EPs can be thought of as a quantified description of the properties that exist in the emotion classes considered. In [4], Barrett discusses the inherent differences that exist between classes of emotions. Between any two emotion classes, there may exist properties of those classes held in common, while the overall patterns of the classes are distinct. For instance, Barrett suggests that anger has characteristic feature modulations that are distinct from those of other classes. Thus, emotions labeled as angry must be sufficiently similar to each other and sufficiently different from the emotions labeled as another emotion. This overview suggests that in natural expressions of emotion, although there exists an overlap between the properties of distinct emotion classes, the underlying properties of two classes are differentiable. This further recommends a soft-labeling EP-based quantification for emotionally non-disjoint utterance classes.
EPs can be used to capture the emotional class properties expressed via class-specific feature modulations. Using the example of anger presented above, an angry emotion should contain feature properties that are strongly representative of the class of anger but may also contain feature properties that are weakly similar to the class of sadness. However, this similarity to sadness does not suggest an error in classification, but a
property of natural speech. Consequently, an EP representation capable of conveying
strong evidence for anger (the major emotion) and weak evidence for sadness (the minor
emotion) is well positioned to interpret the content of natural human emotional speech,
since the minor expressions of emotion may suggest how an individual will act given a
major emotion state and an event [55].
The classification technique employed in this thesis, support vector machines (SVM), has been used previously in emotion classification tasks [101,104,115]. SVM is a discriminative classification approach that identifies a maximally separating hyperplane between two classes. This method can be used to effectively separate the classes present in the data. There are two feature selection methods utilized in this thesis. The first is Information Gain, which has been used widely in the literature [101,104]; the second is Principal Feature Analysis (PFA), which has recently received attention in emotion classification [77,78]. Both PFA and Information Gain are used to estimate the importance of the features using a classifier-independent method. The purpose of this work is not to demonstrate the efficacy of the SVM, PFA, or Information Gain approaches, but instead to demonstrate the benefit of considering emotion classification output in terms of soft-labeling via relative confidences, rather than solely as hard labels.
EPs are a representation of the emotional components of an utterance using the degree of presence or absence of the emotions of angry, happy, neutral, and sad. Emotions that can be described as combinations of "basic" emotions should be characterizable using the EP representation framework. This thesis investigates the ability of EPs to represent out-of-domain secondary emotions using frustration as a case study. The results demonstrate that EPs can represent the unseen secondary emotion statistically significantly differently from the "basic" emotions of angry, happy, neutral, and sad. The EPs can then be used to classify between angry, happy, neutral, sad, and frustrated as accurately as EPs trained with a frustration component. These results suggest that EPs are a robust representation for secondary emotions.
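To make the representation concrete, a minimal sketch of a four-component EP built from binary classifiers is shown below. The feature matrix, label array, and the use of probability-calibrated SVM outputs are assumptions made for illustration; they do not reproduce the exact configuration used in this thesis.

# Minimal sketch of an EP-style soft representation, assuming utterance-level
# feature vectors X and a numpy array of labels y over {angry, happy, neutral, sad}.
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["angry", "happy", "neutral", "sad"]

def train_profile_classifiers(X, y):
    """Train one binary (presence vs. absence) SVM per emotion component."""
    classifiers = {}
    for emotion in EMOTIONS:
        binary_labels = (y == emotion).astype(int)
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, binary_labels)
        classifiers[emotion] = clf
    return classifiers

def emotion_profile(classifiers, x):
    """Return a four-dimensional profile of per-emotion confidences for one utterance."""
    x = np.asarray(x).reshape(1, -1)
    return np.array([classifiers[e].predict_proba(x)[0, 1] for e in EMOTIONS])

The resulting profile can be read directly (the largest component as the major emotion, smaller components as minor emotions) or passed to a downstream classifier.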
The correct number of profile components is not obvious. Four-dimensional EPs can accurately classify between the four classes of anger, happiness, neutrality, and sadness. However, it is not clear that the four emotions must also be the four components of the profile. This question is analyzed using a Cluster Profile (CP) representation. In CPs the underlying components are not based on the categorical labels of emotion but are instead based on data-driven clusters via the unsupervised clustering method of Agglomerative Hierarchical Clustering (AHC). The results demonstrate that CPs are as accurate at the four-way emotion classification task as EPs. The benefit of using CPs is that they do not require labeled training data for profile generation. However, CPs have a much higher dimensionality (15 components vs. four components in EPs). These results suggest that the semantic emotional class labels of angry, happy, neutral, and sad have meaning not only with respect to perception, but also with respect to the underlying feature properties.
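A comparable sketch for the CP variant follows; the 15-cluster setting mirrors the dimensionality mentioned above, while the specific clustering and classifier settings are illustrative assumptions.

# Minimal sketch of a Cluster Profile (CP) representation built from an unlabeled
# training matrix X_train: cluster without labels, then train one membership
# classifier per cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import SVC

N_CLUSTERS = 15

def build_cluster_profilers(X_train):
    """Agglomeratively cluster the data, then fit one presence/absence SVM per cluster."""
    cluster_ids = AgglomerativeClustering(n_clusters=N_CLUSTERS).fit_predict(X_train)
    profilers = []
    for c in range(N_CLUSTERS):
        in_cluster = (cluster_ids == c).astype(int)
        profilers.append(SVC(kernel="rbf", probability=True).fit(X_train, in_cluster))
    return profilers

def cluster_profile(profilers, x):
    """Return a 15-dimensional profile of cluster-membership confidences."""
    x = np.asarray(x).reshape(1, -1)
    return np.array([p.predict_proba(x)[0, 1] for p in profilers])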
The data utilized in this thesis are from the USC IEMOCAP database [13]. This database has been used for studies ranging from interaction modeling [63,65] to classification studies. In [85], the audio utterances were classified using Hidden Markov Models (HMM) into one of five states: anger, happiness, neutrality, sadness, and frustration. The accuracies ranged from 47.34% for the classification of emotionally well-defined, or prototypical, utterances, to 35.06% for the classification of emotionally ambiguous, or non-prototypical, utterances. In [78], the authors performed a profiling-based multimodal classification experiment on the IEMOCAP database. The authors utilized Mel Filterbank Coefficients (MFB), head motion features, and facial features selected using an emotion-independent Principal Feature Analysis (PFA) [69]. The authors developed four independent classifiers: an upper-face GMM, a lower-face eight-state HMM, a vocal four-state HMM, and a head-motion GMM. Each classifier output a profile expressing the soft decision at the utterance level. The output profiles were fused at the decision level using a Bayesian framework. The training and testing were completed using leave-one-speaker-out cross-validation. The overall unweighted accuracy (an average of the per-class accuracies) for this system was 62.42%.
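As a small aside on the metric, unweighted accuracy is the mean of the per-class recalls; the sketch below computes it from a confusion matrix, using a hypothetical three-class matrix purely for illustration.

# Unweighted accuracy: the average of the per-class accuracies (recalls), computed
# from a confusion matrix whose rows are true classes and columns are predictions.
import numpy as np

def unweighted_accuracy(confusion):
    confusion = np.asarray(confusion, dtype=float)
    per_class_recall = np.diag(confusion) / confusion.sum(axis=1)
    return per_class_recall.mean()

# Hypothetical example:
# unweighted_accuracy([[8, 1, 1], [2, 6, 2], [1, 1, 8]])  # approximately 0.733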
1.4 Contributions
This dissertation presents novel statistical measures of emotion perception, techniques
for evaluator modeling, and representations for affective data.
We present a statistical analysis of audio-visual emotion perception and demonstrate
that dynamic audio-visual stimuli can be used to assess the presence of an emotional
McGurk effect. We use this novel method to evaluate the audio-visual perceptual
bias. We also demonstrate that this technique can be used to identify audio-visual
features that are perceptually relevant to emotion perception.
We present a novel method to assess evaluator performance using classification techniques. This method allows us to explain evaluator-specific classification results through the estimation of evaluator consistency. The goals of this work are to model individual evaluators and to analyze the efficacy of utilizing models of an average evaluator to estimate individual evaluator behavior.
We introduce Emotion Profiles (EP), a multidimensional audio-visual representation of emotion for classification. EPs can be used to characterize the affect of an utterance, not in terms of black and white semantic labels (e.g., the speaker is angry), but instead by estimating the degree of presence or absence of multiple emotional components. EPs are n-dimensional representations, where n refers to the number of components (e.g., where one component might describe the degree of anger). Each component is estimated using a binary classifier whose output represents the estimated presence (or absence) of an emotion class, from the set of considered emotion classes, in a given utterance.
We further demonstrate that EPs can serve, not only as a representation, but also as input for classification. EPs can be used successfully as a mid-level representation in a four-class emotion task. However, EPs need not contain the same number of components as target classes. For example, four-dimensional EPs can be utilized as a mid-level representation for the classification of five emotional classes (angry, happy, neutral, sad, and frustrated) with a similar level of accuracy to a five-dimensional EP. This performance parity suggests that EPs can robustly represent emotions.
Finally, we demonstrate that EPs need not be constructed using semantic class labels. EPs can instead be generated using data-driven clusters. The training data are clustered using the unsupervised method Agglomerative Hierarchical Clustering. Classifiers are then trained over the data-driven clusters (rather than the semantic clusters). These profiles are referred to as Cluster Profiles (CP). CPs, like EPs, can be used as a mid-level representation for classification. However, with the CP representation, labels are not needed for the training of the CPs, only for the final classification stage (if required). Furthermore, the final classification performance is not affected by the use of knowledge-driven (semantic label) vs. data-driven clusters, suggesting that a large training corpus need not be labeled.
1.5 Outline of the Dissertation
The remainder of the thesis is organized as follows. Chapter 2 describes our studies of channel bias in audio-visual congruent and conflicting emotion expression. Chapter 3 describes our feature analysis studies of the "congruent-conflicting" database. Chapter 4 describes our studies of evaluator emotion reporting strategies. Chapter 5 describes our EP-based emotion recognition system. Chapter 6 describes a case study in the analysis of the robustness of the EP-based representation. Chapter 7 describes our work in utilizing data-driven clusters (rather than semantic clusters) for a profile-based representation. Finally, Chapter 8 provides discussion, conclusions, and future work.
Chapter 2
Statistical Analysis of Emotion Perception
In human-machine interaction there is an implicit assumption that the utterances and
expressions produced by the machine will be recognized in a specified manner by a set of
users. Conventionally, this assumption is validated using extensive user testing. However,
this validation process is costly and time consuming. The goal of the work presented both
in this chapter and in Chapter 3 is to develop a better understanding of human audio-
visual emotion perception using statistical analyses. Accurate computational descriptions
of the perceptual process may one day facilitate the construction of emotionally targeted
and relevant stimuli.
The work presented in this chapter was published in the following articles:
1. Emily Mower, Maja J. Matarić and Shrikanth S. Narayanan, "Human perception of audio-visual synthetic character emotion expression in the presence of ambiguous and conflicting information." IEEE Transactions on Multimedia, 11:5(843-855). August 2009.
2. Emily Mower, Sungbok Lee, Maja J. Matarić, Shrikanth Narayanan. "Joint-processing of audio-visual signals in human perception of conflicting synthetic character emotions." In Proceedings of IEEE International Conference on Multimedia & Expo (ICME), Hannover, Germany, June 2008.
3. Emily Mower, Sungbok Lee, Maja J. Matarić, Shrikanth Narayanan. "Human perception of synthetic character emotions in the presence of conflicting and congruent vocal and facial expressions." In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas, Nevada, March-April 2008.
Audio-visual emotional stimuli are often multichannel expressions composed of facial and vocal affect. This chapter presents an analysis of multichannel human emotion perception in the presence of emotionally matched ("congruent") and mismatched ("conflicting") audio-visual information in an animated display. Congruent information refers to an expression of the same emotion class across both the video and audio channels (e.g., angry face, angry voice). Conflicting information refers to the expression of different emotions across the two channels (e.g., angry face, happy voice).
This work was motivated by the well-known work of McGurk and MacDonald [75] called the McGurk Effect. As described in Chapter 1, the McGurk Effect is a multichannel syllabic perceptual phenomenon. The authors found that when presented with two distinct syllables on each of the facial and vocal channels, listeners perceived a third, distinct, syllable. This study led emotion researchers to investigate if such an effect exists for multichannel emotion perception. An understanding of the so-called "Emotional McGurk Effect" would provide researchers with crucial information regarding the process of emotion perception.
The most common method for the study of the Emotional McGurk Effect is the
presentation of concurrent emotionally evocative still photographs and vocalizations [28,
29,56,74]. In these perceptual experiments, participants are presented with either single
channel (audio or photograph) or multichannel (audio and photograph) expressions of
emotion. The participants then rate the stimuli using the forced-choice paradigm (e.g.,
happy vs. sad). The results demonstrate that the facial emotion expression tends to more
strongly bias the emotional perception of the user than the vocal emotion expression [28].
However, dynamic audio-visual stimuli have not been as thoroughly studied.
The dynamic stimuli discussed in this chapter are composed of emotional audio (vocal)
and video (facial) information. The emotions are either expressions of the same state
(congruent presentation) or are of differing emotion states (conflicting presentation). This
type of experimental construct permits an analysis of the relative importance of audio-
visual features across broader levels of combinations.
The stimuli were rated using self-report manikins [51]. These manikins present a
pictorial description of the dimensional axes of emotion primitives. They include cate-
gories of valence (positive vs. negative), activation (calm vs. excited), and dominance
(passive vs. aggressive), which will be referred to as VAD. These dimensions allow for a
continuous analysis of emotion perception rather than a discrete categorical analysis. To
our knowledge, this is the first attempt to use the dimensional approach to analyze the combined perception of conflicting audio-visual stimuli in a continuous framework. This continuous environment allows for a more fine-grained understanding of how audio and video data interact in the human emotion perception process.
One of the challenges of such a study is creating stimuli that are free from artifacts. Purely human test data present challenges in that it may be difficult for actors to express an angry vocal signal with a happy facial expression. It would be undesirable to present stimuli to evaluators containing residual facial information resulting from unintended vocal emotional expressions and vice versa. As a result, we used an animated facial display. Despite its expressivity limitations, this interface allowed for simple and artifact-free synchronization between the audio and video streams.
2.1 Audio-Visual Stimuli
The vocal prompts utilized in this experiment were recorded from a female professional
actress [118]. The actress recorded semantically neutral utterances across each of the
following emotion states: happy, angry, sad, and neutral. The sentences were then rated
by four evaluators using a forced-choice evaluation framework (happy, angry, sad, neutral,
and other). Sentences that were rated uniformly by the four evaluators across all four
emotion classes were used in the study. The resulting set was composed of nine distinct
sentences recorded across all four emotions, for a total of 36 distinct vocal utterances.
The video prompts created for this experiment were designed using the CSLU toolkit [106].
This toolkit allows a user to quickly and reliably create animations of targeted facial
emotions that are synchronized with an input speech signal. The toolkit has sliders (rep-
resenting the strength of emotion) for happy, angry, sad, and neutral emotions (for still
stereotypical examples, see Figure 2.1(a)). Each vocal utterance (36 total) was combined
with each of the four facial emotions (happy, angry, sad, and neutral) to create a total of
144 audio-visual clips.
2.2 Evaluation Procedure
The created stimuli were evaluated by 13 participants (ten male, three female) using a
web interface (Figure 2.1(b)). The stimuli included audio-only, video-only, and audio-
visual clips. These clips were randomly presented to the evaluators. There were a total
of 139 audio-visual clips, 36 audio-only clips, and 35 video-only clips (one of the sad
utterances was inadvertently, but inconsequentially, omitted due to a database error).
Figure 2.1: The frames of the four emotional presentations (left; clockwise from top left: angry, sad, neutral, happy) and the online dimensional emotion evaluation interface (right) used in this study.
Each participant evaluated 68 clips. The clip presentation order was randomized with
respect to clip type (angry, happy, sad, neutral) and to clip content (audio, video, audio-
visual). Each evaluator observed approximately 50% audio-visual clips, 25% audio clips
and 25% video clips. The evaluators were allowed to stop and start the evaluation as
many times as they desired.
The evaluation included both a Flash video player and a rating scheme (Figure 2.1(b)). Each clip was rated from 0 to 100 along three dimensions, valence, activation, and dominance (VAD), using a slider bar underneath a pictorial display of the variation along the dimension. These scores were normalized using z-score normalization along all three dimensions for each evaluator. Z-score normalization was used to mitigate the effect of the various rating styles of the evaluators and thus make the evaluations more compatible.
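A minimal sketch of this per-evaluator normalization is shown below, assuming the ratings are available as a pandas table; the column names are illustrative and are not those of the evaluation interface.

# Per-evaluator z-score normalization of the VAD ratings, assuming a DataFrame
# with columns: evaluator, clip, valence, activation, dominance.
import pandas as pd

DIMS = ["valence", "activation", "dominance"]

def z_normalize_per_evaluator(ratings: pd.DataFrame) -> pd.DataFrame:
    """Subtract each evaluator's mean and divide by their standard deviation,
    dimension by dimension, so rating styles become comparable across evaluators."""
    normalized = ratings.copy()
    grouped = ratings.groupby("evaluator")[DIMS]
    normalized[DIMS] = (ratings[DIMS] - grouped.transform("mean")) / grouped.transform("std")
    return normalized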
Figure 2.2: The valence (x-axis) and activation (y-axis) dimensions of the evaluations for (a) audio-only, (b) video-only, and (c) congruent audio-visual clips (angry, happy, sad, and neutral); the ellipses are 50% error ellipses.
2.3 General Results
The VAD ratings of the evaluators were plotted along the dimensions of valence and
activation (Figure 2.2) to observe the effect of audio-visual information on emotion per-
ception vs. that of either only video or audio information. The dominance dimension is
not shown due to its high level of correlation with the activation plots. This visualization
allows for a graphical depiction of the relationship between the emotion states and their
VAD ratings.
The separation between the clusters was higher in the audio-only evaluation (Figure 2.2(a)) than in the video-only evaluation (Figure 2.2(b)). Discriminant analysis showed that there exists a higher level of confusion in the video-only evaluation (classification rate: 71.3%) than in the audio-only evaluation (classification rate: 79.3%) (test of proportions, α = 0.1, Table 2.1). This result suggests that the emotions presented in the audio data were more highly differentiable than in the video data, possibly due to the limited expression in the animated face used in this analysis.

Table 2.1: Discriminant analysis classification (A=angry, H=happy, S=sad, N=neutral) for (a) audio-only, (b) video-only, and (c) congruent audio-visual evaluations.

(a) Audio-only (ave. = 79.3%)
      A     H     S     N
A   90.6   3.1   0     6.3
H    3.2  80.6   6.5   9.7
S    0     0    72.7  27.3
N    6.5   2    19.4  71.0

(b) Video-only (ave. = 71.3%)
      A     H     S     N
A   70.0   1.4  11.4  17.1
H    0    76.7   1.4  21.9
S   11.3   2.8  71.8  14.1
N    5.9  20.6   7.4  66.2

(c) Congruent audio-visual (ave. = 80.8%)
      A     H     S     N
A   95.0   0     2.5   2.5
H    3.6  89.3   0     7.1
S    7.7   0    69.2  23.1
N    9.7   9.7  16.1  64.5
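The discriminant analysis step can be sketched as follows, assuming the z-normalized VAD ratings and their emotion labels as inputs; the linear discriminant classifier and cross-validation settings are illustrative and may differ from the tool used in the original analysis.

# Illustrative discriminant analysis of VAD ratings into emotion classes, returning
# a row-normalized confusion matrix (rows: true class, columns: predicted class).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def discriminant_confusion(vad_ratings, emotion_labels):
    lda = LinearDiscriminantAnalysis()
    predictions = cross_val_predict(lda, vad_ratings, emotion_labels, cv=5)
    counts = confusion_matrix(emotion_labels, predictions)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)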
A discriminant analysis of the congruent audio-visual data showed that the average classification accuracy non-significantly increased (test of proportions, α = 0.1) to 80.8% (Table 2.1). The congruent angry and happy classification rates increased when compared to the video-only and audio-only classification rates. However, the neutral and sad classification rates decreased. This suggests that the audio and video data were providing emotionally confounding cues to the participant with respect to the sad and neutral emotion classes. The confusion between these two classes in the congruent audio-visual case was in between that of the audio-only (higher level of confusion) and video-only (lower level of confusion).
2.4 Perception of Presentation Types
When designing an audio-visual interactive agent, it is important to determine if the
emotion perception resulting from an audio-visual presentation of emotion will dier
signicantly from that of an audio-only or video-only presentation. We investigated the
perceptual dierences of the presentation conditions by analyzing the VAD ratings of
the three presentation conditions (audio-only, video-only, audio-visual). The dierences
between the presentation conditions were analyzed by comparing the emotion-specic
cluster means of the congruent audio-visual presentation to those of the audio-only and
video-only presentations using a one-way ANOVA analysis. The independent variables
were the z-normalized VAD ratings. The dependent variables were presentation class.
The ANOVA analysis indicated that the group means for the presentation conditions were
signicantly dierent for angry across all three VAD dimensions (F(2; 130) > 25.273; p
< 0.001), for happy across the activation dimension (F(2; 129) = 7.84; p = 0.001), for
sad across the activation and dominance dimensions (F(2; 116) > 5.769; p = 0.004), and
for neutral across the valence and activation dimensions (F(2; 127) > 6.453; p = 0.002).
This result suggested that the clusters were distinct in the three presentation conditions
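As a rough illustration of this step, the one-way ANOVA can be run per emotion and per VAD dimension with a standard statistics routine; the array names below are assumptions, and the original analysis may have been performed in a different package.

# One-way ANOVA across the three presentation conditions for one emotion and one
# VAD dimension; each argument is a 1-D array of z-normalized ratings.
from scipy.stats import f_oneway

def presentation_anova(audio_ratings, video_ratings, audiovisual_ratings):
    f_stat, p_value = f_oneway(audio_ratings, video_ratings, audiovisual_ratings)
    return f_stat, p_value

# Example: presentation_anova(angry_valence_audio, angry_valence_video,
# angry_valence_av) tests whether the angry valence means differ across conditions.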
The ANOVA analysis was repeated with emotion class as the grouping variable (the same dependent variables were used) to determine whether or not distinct emotion classes existed in the audio-visual space. The four clusters are distinct in at least two dimensions at the α = 0.01 level of significance in all three presentation conditions (ANOVA post-hoc analysis, Table 2.2).

Table 2.2: ANOVA post hoc analysis (A - angry, H - happy, S - sad, N - neutral) of the three presentation conditions. The letters VAD indicate that the cluster means are significantly different (α = 0.01) in the valence, activation, and dominance dimensions.

(a) Audio-Only
      A    H    S    N
A     -   VD   AD  VAD
H    VD    -  VAD   VA
S    AD  VAD    -   AD
N   VAD   VA   AD    -

(b) Video-Only
      A    H    S    N
A     -  VAD   AD   VD
H   VAD    -  VAD   VA
S    AD  VAD    -  VAD
N    VD   VA  VAD    -

(c) Congruent Audio-Visual
      A    H    S    N
A     -   VD   AD  VAD
H    VD    -  VAD   VA
S    AD  VAD    -  VAD
N   VAD   VA  VAD    -
2.5 Biases in Evaluation
The design of audio-visual emotional interfaces requires both knowledge of how observers interpret specific facial and vocal features and how observers weight the audio and video channels during the perceptual process. This weighting process is dependent on the relevance of the emotional information contained in the channels, and on the affective bandwidth of the channels. The affective bandwidth of a channel is defined as "... how much affective information a channel lets through" [93]. The bandwidth of the channel is a function of the physical limitations (e.g., number of degrees of freedom) and the emotional relevance of the channel (e.g., the voice is the primary source for activation differentiation but alone cannot sufficiently convey valence [54]). An understanding of the audio-visual perceptual process would allow designers to tailor the information presented to maximize the emotional information conveyed to, and recognized by, observers.

Figure 2.3: Comparison between the emotion perceptions resulting from conflicting audio-visual presentations: (a) "angry" vocal emotion held constant, facial emotion varied; (b) "angry" facial emotion held constant, vocal emotion varied (valence on the x-axis, activation on the y-axis).
2.6 Analysis of VAD Ratings for Emotional Clusters
Within this experiment, the natural audio channel dominated the perception of the users. This bias can be observed graphically (Figures 2.3(a) and 2.3(b)). The figures depict the reported valence-activation perception of the conflicting audio-visual presentations. In Figure 2.3(a), all of the conflicting audio-visual presentations have an angry voice (and angry, happy, neutral, or sad faces) while in Figure 2.3(b), all of the conflicting audio-visual presentations have an angry face (and angry, happy, neutral, or sad voices). These figures demonstrate that the perception is more strongly influenced by the vocal emotion than by the facial emotion. In Figure 2.3(a), the perceptions resulting from the varied facial emotions are more similar to the congruent angry presentation than the perceptions resulting from the varied vocal emotions of Figure 2.3(b).
Table 2.3: Classification accuracy in the presence of conflicting audio-visual information, (a) angry voice held constant, (b) angry face held constant.

(a) Confusion matrix for angry voice held constant (ave. = 40.5%)
      A     H     S     N
A   55.0   5.0  12.5  27.5
H   14.0  48.8  16.3  20.9
S   36.4   9.1  39.4  15.2
N   28.1  46.9  12.5  12.5

(b) Confusion matrix for angry face held constant (ave. = 70.5%)
      A     H     S     N
A   87.5   2.5   7.5   2.5
H   10.3  75.9   3.4  10.3
S    8.0   0    72.0  20.0
N   11.4   8.6  34.3  45.7
The presence of the audio bias can also be verified using discriminant analysis (Table 2.3). In this investigation, the conflicting audio-visual presentations were again grouped by: 1) the emotion expressed over the vocal channel (i.e., angry voice with angry, happy, neutral, and sad faces) and 2) the emotion expressed over the facial channel (i.e., angry face with angry, happy, neutral, and sad voices). The classification goal was to recognize the four emotion classes, where a class is defined as a combination of a facial and vocal emotion, e.g., angry voice and neutral face. In this analysis, a higher level of accuracy suggests that the four emotion classes are more separable (and thus more distinct) in the channel with the higher level of accuracy. The classification accuracy of the vocal emotion group was 40.5% while the classification accuracy of the facial emotion group was 70.5%. The results demonstrate that the four emotion classes are less differentiable when the audio emotion is held constant than when the video emotion is held constant (Table 2.3), providing further evidence of an audio bias.
The audio bias can also be assessed by comparing the means of the four emotion classes in the vocal and facial emotion groups using an ANOVA post hoc analysis. In the vocal emotion group, the class means were distinct only across the valence and dominance dimensions (F(3, 144) > 5.152, p = 0.002). In the facial emotion group, the cluster means were significantly different across all three dimensions (F(3, 125) > 34.239, p < 0.001). These three pieces of evidence indicate that the vocal information of the clip provided a stronger influence on the emotion perception of the participant than did the facial information. The presence of an audio bias suggests that when evaluators were presented with ambiguous or conflicting emotional information, they used the natural vocal channel to a larger degree than the synthetic facial channel to determine the emotion state.
2.7 Analysis of Video Contribution to Perception
Although the audio biased the evaluations, the video information did provide emotionally salient information. The contribution of the audio and video information to user audio-visual emotion perception can be seen by evaluating the audio-visual cluster shifts of the conflicting presentations. In this analysis, the cluster center of the audio-visual presentation (e.g., angry voice and happy face) was compared to the audio-only and video-only cluster centers (e.g., angry voice and happy face) using paired t-tests implemented in Matlab. All reported results refer to a significance of 0.05.
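A hedged sketch of this cluster-shift test is given below, assuming per-stimulus ratings aligned across presentation conditions; the original analysis was performed in Matlab, so the Python form is only illustrative.

# Paired t-test between the audio-visual ratings and the corresponding audio-only
# (or video-only) ratings for one conflicting presentation type and one dimension.
import numpy as np
from scipy.stats import ttest_rel

def cluster_shift(audiovisual_ratings, single_channel_ratings, alpha=0.05):
    """Return the mean shift and whether it is significant under a paired t-test."""
    audiovisual_ratings = np.asarray(audiovisual_ratings)
    single_channel_ratings = np.asarray(single_channel_ratings)
    shift = audiovisual_ratings.mean() - single_channel_ratings.mean()
    t_stat, p_value = ttest_rel(audiovisual_ratings, single_channel_ratings)
    return shift, p_value < alpha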
The audio biased the audio-visual perception of activation. In 10 of the 12 conflicting audio-visual presentation types, the cluster means of the audio-visual presentation were significantly different from the video-only cluster means. These same presentations were significantly different from the audio-only presentation in only one of the 12 conflicting presentations (Table 2.4). This result suggests that the audio information biased the evaluations of the users in the activation dimension.
Audio    Video     V_audio  A_audio  D_audio  V_video  A_video  D_video
Angry    Happy      0.771*  -0.097   -0.465   -1.558*   0.459*   0.967*
         Sad       -0.174   -0.045   -0.207   -0.568*   1.390*   1.693*
         Neutral    0.364   -0.250   -0.203   -1.165*   0.661    1.293*
Happy    Angry     -0.582   -0.165    0.174    1.053*   0.669*  -0.243
         Sad       -1.391*  -0.187   -0.370    0.349    1.167*   0.312
         Neutral    0.061    0.045    0.194    0.666*   0.874*   0.473
Sad      Angry     -0.001    0.137    0.922    0.005   -1.08*   -0.640
         Happy      0.717    0.399    0.429   -1.108*  -1.173*  -0.501
         Neutral    0.464    0.193    0.641   -0.562*  -1.024*  -0.225
Neutral  Angry     -0.366*   0.006    0.278    0.200   -0.452   -0.435
         Happy      0.681*   0.253    0.053   -0.583*  -0.564*  -0.028
         Sad       -0.624*  -0.243   -0.295    0.0462  -0.182    0.092

Table 2.4: Cluster shift analysis with respect to the VAD dimensions (where V_audio represents the shift in valence mean from the audio-only evaluation to the audio-visual evaluation). Entries in bold designate evaluations of the audio-visual presentations that are significantly different, at the 0.05 level, from that of either the video-only or audio-only presentations (paired t-test). Entries with a star (*) designate evaluations that are significantly different at the 0.001 level.
The valence dimension was not as strongly biased by the audio information as the activation dimension. In the valence dimension, 10 of the 12 conflicting audio-visual presentation clusters had means significantly different from those of the video-only presentations. In the valence dimension, 8 of the 12 audio-visual clusters had means significantly different from those of the audio-only presentations (Table 2.4). This suggests that in the valence dimension, the evaluators integrated both the audio and video information when making emotional assessments.
2.8 Conclusions
This chapter provided evidence supporting the joint processing of audio and visual cues in emotion perception. This was most clearly seen when comparing the cluster results from the audio-only and video-only data to the congruent audio-visual clusters.
This chapter also provided evidence suggesting that the combination of audio and
visual cues does not always result in a combined emotional rating between the ratings
of the individual channels. It would seem that the integration of these cues results in a
different experience than observing the cues individually. This has been shown previously
in [28,56] regarding facial, but not vocal prominence.
One of the limitations of this work was the limited level of expression inherent in the animated face. Users relied on the audio more than the video when making their emotional assessments. We believe that this is due in part to the highly expressive vocal information. Since the two channels did not have a similar level of expression, this may have led to the perceived importance of the audio signal. In previous studies [28] it was found that the facial information strongly influenced the perception of the audio information when photographs of human faces were used.
In future work, we will use a more expressive animated face to analyze the interplay
between the facial and vocal channel with an enhanced level of facial expression. The use
of continuous domain analysis provides a novel tool for understanding the relationship
between the level of expression and the relative strength of the emotional bias. Our
further work will also analyze a synthetic voice combined with the current animation to
determine if a combination of two channels with similar levels of expression will result
in facial information having a more prominent role in the evaluation of the emotional
display.
Chapter 3
Emotionally Salient Features
The proper expression of robotic and computer animated character emotions has the potential to influence consumer willingness to adopt technology. As technology continues to develop, robots and simulated avatars ("synthetic characters") will likely take on roles as caregiver, guide, and tutor for populations ranging from the elderly to children with autism. In these roles, it is important that robots and synthetic characters have interpretable and reliably recognized emotional expressions, which will allow target populations to more easily accept the involvement of synthetic characters in their day-to-day lives.

The work presented in this chapter was published in the following articles:
1. Emily Mower, Maja J. Matarić and Shrikanth S. Narayanan, "Human perception of audio-visual synthetic character emotion expression in the presence of ambiguous and conflicting information." IEEE Transactions on Multimedia, 11:5(843-855). August 2009.
2. Emily Mower, Maja J. Matarić, Shrikanth Narayanan. "Selection of Emotionally Salient Audio-Visual Features for Modeling Human Evaluations of Synthetic Character Emotion Displays." In Proceedings of IEEE International Symposium on Multimedia (ISM). Berkeley, California, December 2008.
3. Emily Mower, Sungbok Lee, Maja J. Matarić, Shrikanth Narayanan. "Joint-processing of audio-visual signals in human perception of conflicting synthetic character emotions." In Proceedings of IEEE International Conference on Multimedia & Expo (ICME), Hannover, Germany, June 2008.
4. Emily Mower, Sungbok Lee, Maja J. Matarić, Shrikanth Narayanan. "Human perception of synthetic character emotions in the presence of conflicting and congruent vocal and facial expressions." In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Las Vegas, Nevada, March-April 2008.
Reliable and interpretable synthetic emotion expression requires a detailed under-
standing of how users process synthetic character emotional displays. This chapter
presents a quantitative analysis of the importance of specific audio-visual features with
respect to emotion perception. Armed with this knowledge, designers may be able to
control the number of feature combinations that they explore. Instead of implementing
and testing broad combinations of features, designers may be able to concentrate on those
features upon which observers rely when making synthetic affective assessments.
As described in the previous chapter, the work of McGurk and MacDonald [75] has
provided a framework commonly employed for the study of human emotional perception.
The McGurk experimental paradigm is often employed in emotion perception research.
One common evaluation method [28,29,31,56,74] is to create an emotional continuum, an-
choring the ends with two archetypal emotional images and presenting these images with
emotional vocalizations. Subjects then identify the emotion presented from a discrete set
(e.g., angry vs. happy). This presentation framework allows the researchers to model
the perceptual influence of the two modalities. However, discrete emotional evaluation frameworks do not fully capture the interplay between the two channels. The complexities of the two channels may be better modeled using a continuous framework (e.g., valence, activation, dominance, "VAD") [9,53,81,82] rather than a discrete framework (e.g., angry, happy, sad, neutral). This framework allows users to express the complexity of an emotional presentation using the properties of the emotion, rather than the lexical description. Continuous frameworks have also been used to analyze the interplay between facial actions and personality perception [2].
In this chapter, emotionally relevant features are identified using the conflicting and congruent presentation framework discussed in the previous chapter. In a conflicting presentation, a presentation in which the emotions expressed in the facial and vocal channels do not match, the evaluators must make an assessment using mismatched emotional cues. This presentation style is an important research tool because it provides combinations of features that would not, under ordinary circumstances, be viewed concurrently, allowing for a greater exploration of the feature space. Features that are selected across both congruent and conflicting presentations are features that provide emotionally discriminative power both in the presence and absence of emotional ambiguity. The feature selection method employed in this chapter is Information Gain, which has been used previously to identify emotionally salient features [90]. The explanatory power of the resulting reduced feature set is validated using Support Vector Machine classification [113].
The results suggest that the pitch range and spectral components of the speech signal are perceptually relevant features. Design-relevant prior knowledge statistics (e.g., the average valence rating for angry speech) are also perceptually relevant. However, these prior knowledge features contribute supplementary, rather than complementary, information. The results suggest that the perceptually relevant valence features are pitch and energy ranges and facial expression features (eye shape, eyebrow angle, and lip position); the perceptually relevant activation features are energy and spectral features; and the perceptually relevant dominance features are energy, spectral, and pitch range features. The novelty of this work is in its analysis of dynamic audio-visual features and their contribution to dimensional emotional evaluation.
3.1 Feature Sets
The data used in the analyses presented in this chapter are described in Chapter 2.1. The data are dynamic presentations of emotion across an animated face and a natural human voice. The facial and vocal emotions are combined to produce congruent presentations (in which the emotions expressed in the face and voice match) and conflicting presentations (in which the emotions expressed in the face and voice do not match). In the previous chapter we demonstrated that the audio biases the perception of the evaluators; in this chapter we will investigate the contribution of specific feature types.
3.1.1 Audio Features
The audio features utilized in this experiment included 20 prosodic features and 26 spec-
tral features averaged over an utterance. The prosodic features included pitch, energy,
and timing statistics. The spectral features included the mean and standard deviation
of the first 13 MFCCs, also used in [52]. These features are summarized in Table 3.1.
It is also important to note that the selected audio features represent relevant design
parameters that can be used to modulate synthetic speech [11,87].
3.1.2 Video Features: FACS
The Facial Action Coding System (FACS) was developed by Ekman and Friesen as a
method to catalogue the muscle movements of the human facial structure [40]. These
features allow for a design-centered analysis of a video sequence through the use of actu-
ated facial units (a subset of the facial muscles acting to achieve a visually perceivable
facial movement).
This method of video analysis is important for design-centered user modeling. Since the facial features described by action units are physically realizable motions, any facial feature identified as important could, given sufficient actuation, be implemented on a synthetic character. Consequently, the method identifies salient facial motions from a set of available facial actions.
The features used in this study represent a simplified subset of the FACS action units
due to the simplicity of the input video stream. The video features employed in this
study are summarized in Table 3.1. These features include eyebrow (movements, types,
and angles), eye shape, and lip corner position features. Other areas of the face were not
analyzed because they were static with respect to emotion presentation for these data.
These features were manually coded by the author.
3.1.3 Prior Knowledge Features
In this study there were prior knowledge features included in both the audio and the video feature sets. These prior knowledge audio-visual features included average value statistics for the individual audio or video channel (e.g., the average VAD ratings of the audio-only and video-only components of the clip) and indicator variables that encode the presence or absence of an emotion in the audio and video channels (e.g., angry video: y/n, happy audio: y/n). From a design perspective these features describe the relevance of general emotion descriptors with respect to subsequent emotion perception. Although single semantic labels (e.g., "happy") do not fully describe the properties of the clip, the knowledge of this label may provide insight into the resulting perception of the evaluator.

Table 3.1: A summary of the audio and video features used in this study.

Audio
  Pitch: mean, standard deviation, median, min, max, range, upper quartile, lower quartile, quartile range
  Volume: mean, standard deviation, max, upper quartile, lower quartile, quartile range
  Rate: pause to speech ratio, speech duration mean and standard deviation, pause duration mean and standard deviation
  MFCC: 1-13, mean and standard deviation
  Prior Knowledge (Binary Emotion): angry voice, happy voice, sad voice, neutral voice
  Prior Knowledge (Mean Statistics): valence, activation, dominance of each emotion class

Video
  Eyebrow Movement: none, downward, upward, downward upward, upward downward, downward upward downward, upward downward upward
  Eyebrow Movement Type: none, once, twice, thrice
  Eyebrow Angle: flat, inner raised, inner lowered, outer raised, outer lowered
  Lip Corner Position: neutral, raised, lowered
  Eye Shape: eyes wide, top soft, top sharp, bottom soft, bottom sharp
  Prior Knowledge (Binary Emotion): angry face, happy face, sad face, neutral face
  Prior Knowledge (Mean Statistics): valence, activation, dominance of each emotion class
3.2 Method
3.2.1 Class Definition
This study was designed to identify the audio-visual features that contribute most to the emotional perception of the users. The contribution of the features was assessed using Information Gain and the feature set was reduced using a minimum gain threshold. The discriminative ability of the reduced feature set was validated using Support Vector Machine (SVM) classification, a classification tool developed by Vapnik [113]. The classification performances of the reduced feature sets were compared to the classification performances of the full feature sets using SVM. SVM is a classification algorithm that finds a maximally separating hyperplane by optionally transforming the input data into a higher-dimensional space. SVM has been employed for emotion classification tasks [5,14,66,95]. SVM was implemented here using Weka, a Java-based data mining software package [116].
The evaluations of the presented audio-visual, audio-only, or video-only clips were rated dimensionally (valence, activation, and dominance) on a scale from 0 to 100 (Chapter 2, Figure 2.1(b)). These evaluations were preprocessed using z-normalization across all three VAD dimensions to allow for inter-evaluator comparisons. The emotional VAD space was also preprocessed using a binary discretization based on the neutral VAD centroids. This binarization was used to account for the simplicity of the video information; the perception of this channel did not vary widely across emotion class. After discretization, each evaluator rating was composed of a 3-dimensional binary vector representing the VAD rating with respect to the neutral centroid (e.g., valence: positive vs. negative).
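A brief sketch of the binarization step follows, assuming the z-normalized ratings and the neutral centroid are available as arrays; the exact centroid computation is not restated here, so this is only an assumed form.

# Binarize z-normalized VAD ratings against the neutral centroid: each rating becomes
# a 3-dimensional 0/1 vector (1 = above the corresponding neutral centroid value).
import numpy as np

def binarize_vad(z_ratings, neutral_centroid):
    """z_ratings: (n_clips, 3) array of z-normalized VAD ratings.
    neutral_centroid: (3,) mean VAD rating of the neutral clips."""
    return (np.asarray(z_ratings) > np.asarray(neutral_centroid)).astype(int)

# Example: a clip rated above neutral valence but below neutral activation and
# dominance maps to [1, 0, 0].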
3.2.2 Feature selection
The goal of the feature selection analysis was to determine which audio-visual features contributed to the explanation of variance within the audio-visual perceptual evaluations. The data were prepared for feature selection by separating the data into three groups: evaluations of congruent data ("congruent database"), evaluations of conflicting data ("conflicting database"), and evaluations of congruent and conflicting data ("combined dataset"). The feature selection techniques were applied to the three data subsets separately and the results were compared. The feature set was reduced using the Information Gain Attribute Selection algorithm, an algorithm implemented in Weka. Information Gain feature selection techniques have been used previously in salient emotional feature selection [90]. Information Gain describes the decrease in the entropy of set X, H(X) (e.g., valence), when the conditional entropy of X given attribute Y, H(X|Y) (e.g., valence given the presence of a lowered eyebrow), is known (Equation 3.1) [79]. Features were retained if they contributed a gain of at least 0.1 with respect to the target class (discretized valence, activation, and dominance).
Gain(X, Y) = H(X) − H(X|Y)    (3.1)
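For concreteness, a hedged sketch of the gain computation for a single discrete attribute against a binarized target is shown below; it assumes integer-coded arrays and is not the Weka implementation used in the study.

# Information Gain of target X given attribute Y: Gain = H(X) - H(X|Y).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """x: discrete target labels (e.g., binarized valence); y: discrete feature values."""
    x, y = np.asarray(x), np.asarray(y)
    conditional = sum(np.mean(y == v) * entropy(x[y == v]) for v in np.unique(y))
    return entropy(x) - conditional

# A feature would be retained if information_gain(target, feature) >= 0.1.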
3.3 Feature Selection Results
3.3.1 Feature Selection Results for the Combined Congruent-Conflicting Dataset
The features selected for the combined audio-visual congruent-conflicting presentations can be viewed in Table 3.2. Features in italics were observed over two VAD dimensions; features in bold italics were observed across all three dimensions. The features that contribute most to the explanation of evaluator variance, as suggested by Information Gain feature selection, are the prior knowledge audio statistics (e.g., average valence rating), the high energy band spectral components, and the lower pitch quartile. The prior knowledge audio statistics include the average ratings for valence, activation, and dominance. These statistics represent the centroid of the dimensional ratings of the audio and video components of the audio-visual clip.
Feature selection was extended to the audio-only and video-only presentations to determine if the features selected as important to audio-visual perception also explain the variance inherent in the audio-only and video-only evaluations. The feature representation across the presentation conditions will be discussed using two abbreviations: VAD_Audio and VAD_Video. These abbreviations indicate the features that were selected for the valence, activation, and dominance in both the audio and audio-visual presentations or the video and audio-visual presentations. Each feature occurs in the VAD_Audio|Video sets a maximum of six times, corresponding to the valence, activation, and dominance for the audio/video-only presentations and the valence, activation, and dominance for the audio-visual presentations. The only features selected in six cases (across all audio-visual and audio-only presentations, VAD+_Audio) were the three prior knowledge audio statistics (the emotion-specific centroids for valence, activation, and dominance, e.g., ave audio val), the quartile pitch range, and a high frequency MFCC feature. The binary variable representing the presence of an angry voice was selected in five of the six dimensions.

The most highly represented video features were the mean video statistics describing the emotion-specific activation and valence evaluations. These two features were represented in four of the VAD+_Video components. The mean video dominance statistic was represented across only three of the VAD+_Video components.

Table 3.2: A summary of the features used in the audio-visual analysis of this study. The order of the features (left to right) indicates their relative importance. The numbers in parentheses represent the highest and lowest mean Information Gain above the threshold. Bold italic fonts represent features selected across all three dimensions, italic fonts represent features selected across two dimensions.

Val: ave audio val (0.159), ave audio dom, ave audio act, mfcc12 mean, vol quartlow, f0 quartlow, mfcc03 mean, ave video dom, eyebrow angle, lip corner position, happy voice, ave video val, eyebrow angle flat, eye shape bottom sharp, f0 quartup (0.1)

Act: vol quartup (0.472), vol quartrange, vol std, vol mean, mfcc01 std, mfcc07 std, vol max, mfcc01 mean, mfcc08 std, mfcc10 mean, mfcc12 std, mfcc13 mean, ave audio val, ave audio dom, ave audio act, speech duration std, mfcc03 mean, mfcc05 std, pause to speech ratio, mfcc08 mean, f0 quartrange, mfcc11 std, f0 mean, mfcc10 std, mfcc12 mean, mfcc06 std, mfcc13 std, mfcc02 mean, f0 quartlow, f0 std, mfcc11 mean, f0 range, f0 quartup, mfcc09 mean, f0 max, mfcc07 mean, mfcc09 std, mfcc03 std, mfcc04 mean, mfcc05 mean, mfcc04 std, mfcc06 mean, f0 min, f0 median, pause dur mean, sad voice, vol quartlow, mfcc02 std, pause duration std, speech duration mean, angry voice (0.165)

Dom: ave audio dom (0.204), ave audio val, vol mean, ave audio act, angry voice, mfcc12 mean, mfcc06 std, mfcc08 mean, mfcc11 std, vol max, f0 quartrange, mfcc08 std, mfcc05 std, mfcc12 std, mfcc09 mean, vol quartup, vol quartrange, f0 quartlow, mfcc01 mean, mfcc01 std, vol std, mfcc13 std, mfcc03 std (0.103)
There were several features represented once (out of a possible six times) in the VAD+_Audio or VAD+_Video sets. This set of features included three of the FACS-inspired features (a binary feature addressing eyebrow angle, a feature addressing eyebrow movement direction, and an eye shape feature). All three of the features were utilized in the video valence classification problem. This result suggests that these features provide specialized dimensional differentiation. This feature set of singularly represented features also includes two of the binary channel features (happy voice and sad voice indicators). These features were utilized in the audio-visual valence and activation classification tasks, respectively. This suggests that these features, while not applicable to channel-dependent classifications (i.e., video-only), do provide additional information with respect to multimodal discretization and disambiguation.
3.3.2 Feature Selection Results for the Congruent and Conflicting Datasets
The combined congruent-conflicting dataset was separated into two datasets: congruent presentations and conflicting presentations. Feature selection was applied to these separated datasets to analyze the features that contribute to the perception of emotionally matched and mismatched emotional expressions. In the congruent presentations, both audio and video features were selected across all three dimensions (Table 3.3). In the combined congruent-conflicting dataset no video features were selected for the dimensions of either activation or dominance. This prominence of audio features (and corresponding paucity of video features) can still be seen by observing the features selected for the conflicting database (Table 3.4).
Table 3.3: The audio-visual features selected in the congruent database. Features in bold are features that were selected across the valence, activation, and dominance dimensions of the Congruent database. Features in bold-italics are features that were selected in the Congruent_VAD and Conflicting_AD databases.

Val:
  prior knowledge: angry (face, voice), happy (face, voice), neutral (face, voice)
  average channel ratings: audio VAD, video VAD
  video: eye shape (specific, bottom sharp, bottom soft, wide eyes), eyebrow angle (general, flat, inner lowered, outer raised), eyebrow mvmt., eyebrow mvmt. timing, lip position
  pitch: quartile (low, high)
  volume: max, std, quartile (low, high, range)
  rate: speech duration std
  MFCC: mean (1, 4, 6, 7, 10, 11, 12), std (4, 9, 10)

Act:
  prior knowledge: happy (face, voice), sad (face, voice)
  average channel ratings: audio VA, video VAD
  video: eye shape (specific, bottom soft, top sharp, top soft), eyebrow angle (general, inner raised, outer lowered), eyebrow mvmt. timing, lip position
  pitch: mean, median, max, range, std, quartile (low, high, range)
  volume: mean, max, quartile (high, range)
  rate: pause duration mean, pause to speech ratio
  MFCC: mean (1, 2, 3, 5, 6, 8, 10, 11, 12, 13), std (1, 3, 4, 5, 6, 8, 9, 10, 12, 13)

Dom:
  prior knowledge: angry (face, voice), sad (face, voice)
  average channel ratings: audio VAD, video VAD
  video: eye shape (specific, top sharp, top soft), eyebrow angle (inner lowered, inner raised, outer lowered, outer raised), eyebrow mvmt., eyebrow mvmt. timing
  pitch: max, std, quartile range
  volume: mean, max, std, quartile (high, range)
  MFCC: mean (1, 3, 5, 8, 11, 12), std (4, 5, 6, 7, 8, 9, 11, 13)
In the congruent database, there were a total of fifteen features selected across the three dimensions. The video features included an eye shape feature (describing the emotional shape of the eye) and an eyebrow timing feature (describing if movement occurred within the first, second, third, or multiple thirds of the utterance). The audio features included volume features (including the utterance length maximum and quartile, representing 25%-75% of the energy, max and range), MFCC mean and standard deviation features, and prior knowledge average statistics (describing the mean valence and activation of the audio clip and the mean valence, activation, and dominance of the video clip). These features are presented in Table 3.3 in bold font.

Table 3.4: The audio-visual features selected in the conflicting database. Features in bold are features that were selected across the valence, activation, and dominance dimensions of the Congruent database. Features in bold-italics are features that were selected in the Congruent_VAD and Conflicting_AD databases.

Val: none over the threshold of 0.1

Act:
  prior knowledge: angry voice, sad voice
  average channel ratings: audio VAD
  pitch: mean, median, max, min, range, std, quartile (low, high, range)
  volume: mean, max, std, quartile (low, high, range)
  rate: pause duration (mean, std), speech duration
  MFCC: mean (1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), std (1, 2, 3, 6, 8, 9, 10, 11, 13)

Dom:
  average channel ratings: audio AD
  pitch: quartile low
  volume: mean, max, quartile (high, range)
  MFCC: mean (8, 9, 12), std (5, 6)
The features selected in the conflicting database included only audio features. There
were a total of eleven features selected across both the activation and dominance di-
mensions. There were no features selected in the valence dimension (all of the features
presented an Information Gain under the threshold of 0.1). The features selected across
the activation and dominance domains included a pitch feature (lower quartile, 25% of
the full range), volume features (mean, max, upper and range quartile features), MFCC
mean and standard deviation features, and prior knowledge average statistics (describing
the mean activation and dominance of the audio clip). These features are presented in
Table 3.4 in bold font.
There were a total of five features represented over the three dimensions of the con-
gruent database and the activation and dominance of the conflicting database. These
features included volume features (max, upper and range quartile), an MFCC mean fea-
ture, and a prior knowledge average statistic (describing the mean activation of the audio
clip). These features are presented in Tables 3.3 and 3.4 in bold italic font. This feature
set is an audio-only feature set. The common features can also be visualized in Figure 3.1.

Figure 3.1: Comparison between the emotion perceptions resulting from conflicting audio-
visual presentation. [Figure: two panels labeled "Congruent Features" and "Conflicting
Features," listing the selected features (eyebrow movement timing, specific eye shape,
volume statistics, low quantile pitch, MFCC statistics, and the audio and video average
channel ratings).]

Presentation   Dimension    With Priors (%)       Without Priors (%)
                            Full      Reduced     Full      Reduced
Audio-Visual   Valence      76.618    77.453      75.1566   75.1566
               Activation   85.8038   85.595      84.9687   85.595
               Dominance    72.0251   69.3111     73.0689   68.2672
Audio          Valence      73.2759   80.1724     79.3103   80.1724
               Activation   86.2069   87.069      87.069    87.069
               Dominance    75        73.2759     75.8621   70.6897
Video          Valence      84.4523   84.4523     84.4523   85.5124
               Activation   55.477    61.1307     56.5371   61.1307
               Dominance    57.9505   65.7244     57.9505   65.7244

Table 3.5: This table presents the classification results (SVM) over the three presen-
tation conditions (audio-only, video-only, audio-visual) and three dimensions (valence,
activation, dominance). "Full" refers to classification performed with the original feature
set. "Reduced" refers to classification performed with the feature set resulting from
Information Gain feature selection.
3.4 Validation: SVM Classification
3.4.1 Validation of the Combined Congruent - Conflicting Feature Sets
The results from the SVM classification across the three presentation conditions (audio-
visual, audio-only, video-only) are presented in Table 3.5. The valence was most ac-
curately classified in the video-only presentation (84.45%), followed by the audio-visual
presentation (76.62%), and the audio-only presentation (73.28%). The activation was
most accurately classified in the audio-only presentation (86.21%), followed by the audio-
visual presentation (85.80%), and the video-only presentation (55.48%). The dominance
was most accurately recognized in the audio-only presentation (75.00%), followed by the
audio-visual presentation (72.03%), and the video-only presentation (57.95%). These re-
sults support the channel bias results from the previous chapter, which asserted that
audio biased the perception of activation while both the video and audio information
contributed to the perception of valence (Chapter 2.7, Table 2.4).
The classification accuracies were tested using a difference of proportions test to de-
termine if the classification accuracy changed when either the feature set was reduced or
when the prior information was removed. None of these accuracies differed significantly
at the α = 0.05 level across feature set size (full vs. reduced) or based on prior knowledge
(present vs. absent). This result suggests that the reduced feature sets without prior
knowledge statistics can explain the variance in the user evaluations with accuracy similar
to that of the full feature set.
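The difference of proportions test used throughout this chapter compares two classification accuracies by treating each as a binomial proportion. As a rough illustration only (not the exact implementation used in this work), the two-proportion z-test can be sketched as follows; the accuracy values and test-set size are hypothetical placeholders.

```python
from math import sqrt
from scipy.stats import norm

def difference_of_proportions(p1, n1, p2, n2):
    """Two-proportion z-test: are two classification accuracies significantly
    different? p1, p2 are accuracies in [0, 1]; n1, n2 are test-set sizes."""
    # Pooled proportion under the null hypothesis of equal accuracy.
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical example: full vs. reduced feature set accuracies on 958 test utterances.
z, p = difference_of_proportions(0.766, 958, 0.775, 958)
print(f"z = {z:.2f}, p = {p:.3f}")  # compare p against alpha = 0.05
```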
Presentation   Dimension    With Priors (%)      Without Priors (%)   Baseline
                            Full     Reduced     Full     Reduced
Congruent      Valence      88.71    89.52       88.71    89.52       54.84
               Activation   84.68    86.29       84.68    86.29       58.87
               Dominance    78.23    78.23       77.42    78.23       58.87
Conflicting    Valence      71.23    -           71.27    -           58.03
               Activation   84.79    84.51       83.67    84.23       52.96
               Dominance    64.79    67.32       63.66    67.89       55.21

Table 3.6: The SVM classification accuracies (percentages) over the two database divi-
sions (congruent, conflicting) and three dimensions (valence, activation, dominance) using
feature sets reduced with the Information Gain criterion discussed in Section 3.2.2. The
columns marked "Full" refer to the full feature set. The columns marked "Reduced" refer
to the reduced feature set.

3.4.2 Validation of the Congruent and Conflicting Feature Sets
The performance of the SVM classifier was affected by presentation type. In general,
the congruent presentations were more accurately classified than the conflicting presenta-
tions (Table 3.6). Using the full feature set, valence was more accurately classified in the
congruent presentations than in the conflicting presentations (88.71% vs. 71.23%, respec-
tively). The activation of the congruent and conflicting presentations was recognized very
similarly (84.68% vs. 84.79%, respectively). The dominance of the congruent presenta-
tions was more accurately recognized than the dominance of the conflicting presentations
(78.23% vs. 64.79%, respectively).
The differences in classification accuracy were analyzed using the difference of pro-
portions test. The classification accuracy (full feature set) for the valence dimension
was significantly lower in the conflicting presentations than in the congruent presentations
(α = 0.05). The classification accuracy for the dominance dimension was also signifi-
cantly lower in the conflicting presentations than in the congruent presentations for both
the full and the reduced feature sets (dominance, reduced feature set: α = 0.05). The
difference in the classification accuracy of the activation dimension was not significant
across either presentation type or feature set size.
The classification accuracies were not affected by prior knowledge across either dimen-
sion or presentation type (congruent vs. conflicting). This suggests that the information
contained within the prior knowledge features (semantic labels) is also contained within
the audio and video features. The classification performance was also not affected by fea-
ture set size (full vs. reduced) in any dimension. In all conditions (except for conflicting
valence and both full feature sets for conflicting dominance), the SVM performance beat
the baseline with a significance of α = 0.01.
3.5 Discussion
The results of the SVM analysis support the feature selection results regarding the reliance
upon audio and video information demonstrated in the previous chapter. The SVM
classification results for the activation domain do not change significantly between the
congruent and conflicting presentations (Tables 3.5 and 3.6). This is expected since
humans tend to rely upon audio for activation detection, and since when presented with
conflicting information, evaluators were shown to rely primarily upon the audio channel.
Therefore, when asked to determine the activation, the users were well prepared in either
presentation condition and the variance in the evaluations did not increase enough to
decrease the SVM classification results.
In the valence dimension, humans tend to utilize facial information. In the audio-
visual domain, the evaluators were able to integrate and utilize both the audio and the
video information when making their valence assessment. However, as previously stated,
when observing conflicting emotional presentations, evaluators tended to rely on the audio
information. Therefore, when the evaluators attempted to analyze the valence dimension
based primarily on the audio signal, the variance in the evaluations increased and the
performance of the SVM classification using the full feature set decreased.
The VAD ratings of the dominance dimension are affected by both the audio and
video channels. It is therefore expected that the results of the SVM classification would
lie in between those of the activation and valence dimensions with respect to performance
decrease. The SVM classification results are in accordance with the VAD shift analysis
and also support the hypothesis that the variance in the evaluation of dominance is
affected by both the audio and video channels.
3.6 Conclusion
This chapter presented a channel-level analysis of an animated character emotional pre-
sentation. This work identified the video and audio features that are utilized during
emotional evaluations of both congruent and conflicting presentations.
Classification tasks have previously been used for perceptual experiments. In [12],
speech synthesis parameters were selected using classification techniques. The features
that were selected in this process were used to synthesize and modify speech. The cre-
ation of this feature subset allowed the researchers to minimize the utterances to be rated
by evaluators. In future studies, this same technique will be applied to determine which
combinations of facial and vocal channel capabilities should be utilized for emotion recog-
nition applications. This presentation framework will allow for the study of how various
audio-visual combinations affect human emotional perception.
The results of the SVM classification on the full and reduced feature sets suggest that
it is possible to identify a reduced feature set with emotional explanatory power in the
congruent presentations and in the activation and dominance dimensions of the conflicting
presentations. SVM classification performance on this reduced feature set indicated that
there was not a significant decline in performance. These feature sets also support the
finding that users rely upon audio information to detect activation information [45]. How-
ever, the audio channel does not provide sufficient valence differentiation and observers
must thus rely upon other modes of affective communication (video, context, etc.) [54].
In [17], the authors presented an analysis of audio-visual emotion modulation. The
data were segmented by utterances and utterances were compared across four emotion
classes (angry, happy, sad, neutral). The utterances were further segmented by phoneme
boundaries. The phonemes of the emotional utterances were compared to a neutral base-
line. The data suggested that phonemes that contained little emotion modulation (the
feature values of the emotional utterance were not significantly different from those of the
neutral baseline) were accompanied by more facial movement than those phonemes
that were more strongly emotionally modulated. This recasts the emotion production
problem as an emotional bit allocation problem in which the information is transmitted
across the two channels based on the channel bandwidth available. In the work presented
in this chapter, such emotional subtleties were not included due to the limited nature of
the video channel and the nonlinearities inherent in a fusion between audio and video
channels. Future experiments will utilize audio-visual data augmented with motion cap-
ture recording [13]. This will allow for a closer study of the channel modulation, fusion,
and perceptual integration resulting from the natural expression of audio-visual emotional
utterances.
This work was limited by the expressivity constraints on the video channel. Given
the expressivity inequalities and the single instantiation of the emotional interface, it
is difficult to generalize these results broadly without follow-up investigations. Future
studies will include a four-by-four factorial design including both synthetic and human
faces and voices. These four combinations will begin to illustrate the interaction between
audio and video information across varying levels of emotional expressivity. The human
data for these future studies will be created using dynamic time warping. This method can
be used to align the phoneme duration of a target sentence (e.g., angry) given the phoneme
duration of another sentence (e.g., neutral). This manipulation permits a combination
of audio and video data streams with the same lexical content, but different emotional
realizations, allowing for a comparison across purely human audio and video features.
This chapter presented a quantitative analysis of the features important to emotional
audio-visual perception. A reduced feature set was created and validated using SVM
classification. However, more analyses are required to determine the impact of these
features on emotion perception, rather than on emotion classification. Future work will
explore this avenue using analysis-by-synthesis techniques to compare the emotional
salience of features as a function of their identified relevance.
Chapter 4
Evaluators as Individuals
Quantitative models of user perception have the potential to facilitate the design of syn-
thetic emotional expressions. These models could lead to computer agents and robots that
more naturally and functionally blend into human society [26,92]. User-specific emotion
modeling and synthesis requires an understanding of human emotion perception, often
measured using stimuli presentation experiments. Unfortunately, the evaluation process
is non-stationary. Subjective emotion appraisals of evaluators change as the evaluators
tire and as they are exposed to increasing numbers of emotional utterances. It is com-
mon practice to estimate emotional ground truth by averaging evaluations from multiple
evaluators. The question remains as to whether this averaging between evaluators with
different internal representations of emotion sacrifices important individual information.
The field of emotion classification has been studied extensively [19,24,53,122]. How-
ever, we believe that it is important to have a stronger understanding of how evaluators
differ in their emotion reporting style and how these differences affect consequent clas-
sification accuracies. The differences between the way an individual perceives his/her
portrayal of emotion and the way other evaluators perceive these same displays have been
studied [8, 20, 111] and the results suggest that there is a difference between how these
groups view the affective content of stimuli. However, the effect of individual evaluation
styles on classification accuracy has not been sufficiently addressed.

The work presented in this chapter was published in the following article:
1. Emily Mower, Maja J. Matarić, Shrikanth Narayanan. "Evaluating Evaluators: A Case Study in
Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling." In Proceedings of Interna-
tional Speech Communication Association (InterSpeech). Brighton, England, September 2009.
This chapter presents an analysis of human emotion evaluation. The foci of this
chapter are: the measurement of the consistency between evaluators and the evaluation
of evaluators based on automatic emotion recognition. Firstly, we study the consis-
tency between the categorical emotion labels (e.g., angry, happy, sad, neutral) of the
utterances and the evaluators' internal representation of valence (positive vs. negative)
and activation (calm vs. excited), using Naïve Bayes classification. This classification
framework is used to estimate the categorical emotion of an utterance given individual
evaluators' ratings of valence and activation. It is hypothesized that higher performance
will be observed for evaluators with internally consistent representations of the dimen-
sional emotional space. It is possible to map from dimensional properties of emotion to
categorical labels [98]. Therefore, it is hypothesized that those with an easily modeled
mapping will be more accurately classified. Secondly, we model the relationship between
the temporal acoustic properties of a clip and the subjective valence or activation rating
using Hidden Markov Models.
The conventional approach in emotion recognition is to use subjective evaluations
to measure the performance of the system. We propose the opposite approach: to use
the results of an emotion recognition system to measure the consistency of subjective
evaluations. We hypothesize that the performance of the automatic system will increase if
the emotion labels are consistent with respect to an evaluator's internal model of emotion.
These results are compared across three conditions: a) training and testing on individual
data; b) training and testing on averaged data; and c) training on averaged data and
testing on individual data.
As stated in Chapter 1, humans and their methods of reporting the emotional content
of affective stimuli are unique. Therefore, we hypothesized that models trained on the
evaluations of individuals would result in higher classification results than models trained
on an average evaluator or models trained on an average evaluator and tested on individuals.
However, we found that models trained on averaged data and tested on either individual
data or averaged data outperformed the individual-specific train-test scenarios. This
study also suggested that individuals who are more consistent in their appraisal of emotion
are more accurately modeled than those individuals who are less consistent.
4.1 Emotional Data
4.1.1 IEMOCAP Data
The database utilized in this study was the USC IEMOCAP database, collected at the
University of Southern California [13]. The USC IEMOCAP database is an audio-visual
database, augmented with motion capture recording. It contains approximately 12 hours
of data recorded from five male-female pairs of actors (ten actors total). The goal of the
data collection was to elicit natural emotion expressions within a controlled setting. The
benefit of the acted dyadic emotion elicitation strategy is that it permits the collection
of a wide range of varied emotion expressions. The actors were asked to perform from
(memorized) emotionally evocative scripts and to improvise upon given emotional targets.
The emotional freedom provided to the actors allowed for the collection of a wide range
of emotional interpretations. The benefits of utilizing acted data are discussed more fully
in [3,18,43].
The data were evaluated using two evaluation structures: categorical evaluation and
dimensional evaluation. In both evaluation structures, the evaluators observed the audio-
visual clips in temporal order (with context). In the categorical evaluations, evaluators
were asked to rate the categorical emotion present from the set of: angry, happy, neutral,
sad, frustrated, excited, disgusted, fearful, surprised, and other. The evaluators could tag
an utterance with as many categorical labels as they deemed appropriate. There were
a total of six categorical evaluators who evaluated overlapping subsets of the database.
Each emotion was labeled by at least three categorical evaluators. In the dimensional
evaluations, the evaluators were asked to rate the clip according to its valence, activation,
and dominance properties. Valence describes the positive vs. negative aspect of the
emotion [1 = most negative, 5 = most positive]. Activation describes the calm vs. excited
aspect of the emotion [1 = most calm, 5 = most excited]. The dimensional evaluation task
was completed by a separate set of six evaluators, again evaluating overlapping subsets
of the data. Each emotional utterance within the database was labeled by at least two
dimensional evaluators [13].
In each evaluation task, the disparate evaluators were combined into a single rating
to determine an overall ground truth. The categorical ground truth was established using
majority voting over all of the reported categorical labels. The dimensional ground truth
was established by averaging (without rounding) over the dimensional evaluators [13].
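A minimal sketch of how such per-utterance ground truths could be assembled is shown below; the label sets, tie handling, and function names are illustrative assumptions rather than the exact procedure used for IEMOCAP.

```python
from collections import Counter

def categorical_ground_truth(label_lists):
    """Majority vote over evaluator label sets for one utterance.
    label_lists: e.g. [["angry"], ["angry", "frustrated"], ["frustrated"]].
    Returns the most frequently reported label, or None if no label has a majority."""
    counts = Counter(label for labels in label_lists for label in labels)
    label, count = counts.most_common(1)[0]
    return label if count > len(label_lists) / 2 else None

def dimensional_ground_truth(ratings):
    """Average (without rounding) of the dimensional ratings, e.g. valence on a 1-5 scale."""
    return sum(ratings) / len(ratings)

print(categorical_ground_truth([["angry"], ["angry", "frustrated"], ["frustrated"]]))  # angry
print(dimensional_ground_truth([2, 3]))  # 2.5
```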
4.1.2 Data Selection
In this chapter we present dimensional classification analyses. We used the audio files
from the five female actresses in the IEMOCAP data. We considered only clips labeled
(using majority voting over the categorical evaluators) as angry, happy, sad, or neutral.
We present results using dimensional evaluations from the two evaluators who evaluated
the largest quantity of data (evaluators one and two). Evaluator one analyzed 1,773 clips
and evaluator two analyzed 1,682 clips from the angry, happy, sad, neutral set. The
clips evaluated by evaluator two are a subset of those evaluated by evaluator one. Please
see [13] for more database details.
4.1.3 Audio Features
We extracted 13 Mel Filterbank (MFB) coefficients, along with their delta and acceleration
coefficients, from the audio files. MFBs model the human auditory system by creating
filterbanks of increasing width as the frequency increases. This structure approximates
the decreasing sensitivity of human hearing to deviations in frequency as the frequency
content of the signal increases. Mel Frequency Cepstral Coefficients (MFCC), commonly
used in automatic speech recognition, are calculated by taking the Discrete Cosine
Transform (DCT) of the MFBs. MFBs have been shown to contain more emotional
information than MFCCs [19].
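As a rough illustration of this feature pipeline (not the extraction code actually used in this work), log Mel filterbank energies and their delta and acceleration coefficients can be computed with a library such as librosa; the file name, sampling rate, and frame settings below are assumptions.

```python
import librosa
import numpy as np

# Hypothetical file path; 16 kHz speech is assumed.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 Mel filterbank (MFB) energies per 25 ms frame with a 10 ms shift.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=13)
log_mfb = np.log(mel + 1e-10)                    # log filterbank energies
delta = librosa.feature.delta(log_mfb)           # first-order (delta) coefficients
accel = librosa.feature.delta(log_mfb, order=2)  # acceleration coefficients

# MFCCs are the DCT of the log MFBs; the DCT decorrelates the features.
mfcc = librosa.feature.mfcc(S=log_mfb, n_mfcc=13)

features = np.vstack([log_mfb, delta, accel])    # 39 x num_frames
```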
4.1.4 Treatment of Evaluations
This chapter presents two types of evaluator studies: individual and averaged. The
experiments based upon individual evaluations study the dimensional evaluator behaviors
of the two evaluators, evaluator one and evaluator two, separately. These evaluations
were neither averaged nor normalized. The averaged evaluations were the averages of the
valence and activation ratings of evaluators one and two. The averaged rating was always
rounded up to the nearest integer.
4.2 Approach
There are two points of interest that arise when considering evaluator performance: eval-
uator consistency (how similarly evaluators rate clips of the same semantic label) and
evaluator reliability (how representative the labels are of the acoustic properties of the
utterance). To answer these questions we consider two probabilistic modeling techniques:
Naïve Bayes and Hidden Markov Models, respectively. Previous work has demonstrated
the efficacy of utilizing Naïve Bayes to recognize the emotional content of speech [114]
and Hidden Markov Models to capture its underlying temporal properties [19].
4.2.1 Naïve Bayes
Research has shown that categorical emotions can be depicted as occupying specific por-
tions of a dimensional space defined by valence and activation [23,53,83,98]. For example,
archetypal angry emotions lie within an area defined by negative valence and high activa-
tion, while archetypal happy emotions lie within an area defined by positive valence and
mid to high activation. This suggests that given only an evaluator's subjective evalua-
tion of the valence and activation of an utterance, it should be possible to estimate the
categorical emotion label [53].
One measure of evaluator consistency is to determine how well a simple classification
algorithm predicts the categorical emotion label of a clip given the subjective evaluation
of valence and activation. We used Naïve Bayes classification for this analysis. In this
classification task, evaluator one's and two's subjective valence and activation ratings are
used to predict the majority-voted categorical label.
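A minimal sketch of this consistency check, using a Gaussian Naïve Bayes classifier from scikit-learn rather than the PRTools implementation described later, is given below; the ratings and labels are invented placeholders.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Each row: [valence rating, activation rating] reported by one evaluator (1-5 scale).
X = np.array([[1, 4], [2, 5], [5, 4], [4, 3], [2, 2], [1, 1], [3, 3], [3, 2]])
# Majority-voted categorical label for the same clips (placeholder data).
y = np.array(["angry", "angry", "happy", "happy", "sad", "sad", "neutral", "neutral"])

# Cross-validated accuracy of predicting the categorical label from the evaluator's
# dimensional ratings: a proxy for evaluator consistency. Five folds were used in the
# actual experiments; cv=2 here only because the toy set is tiny.
nb = GaussianNB()
scores = cross_val_score(nb, X, y, cv=2)
print("mean accuracy:", scores.mean())
```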
4.2.2 Hidden Markov Models
In the emotion evaluation process, there exists a dependency between the assigned evaluation
and the temporal acoustic properties of an utterance. We used Hidden Markov Models
(HMM) to model the relationship between the temporal fluctuations of the acoustic prop-
erties and the resulting reported emotion perception. The HMM classification accuracies
provide insight regarding how representative the subjective dimensional tags of the eval-
uators are of the underlying emotional acoustic properties of the utterances. The accu-
racies were also used to analyze the effectiveness of averaging subjective, unnormalized
evaluations obtained from multiple individuals.
We describe two separate classification tasks: valence and activation. In these tasks,
the original five-point scale was collapsed into a three-point scale to combat a data sparsity
issue. Classes one (low valence/activation) and five (high valence/activation) were not
tagged with sufficient frequency to form models across both evaluators. Class three,
representing neutral (for both valence and activation), remained unchanged. Classes one
and two (either negative valence or low activation) were collapsed into a single class,
and classes four and five (either positive valence or high activation) were collapsed into
a single class. This resulted in three model groups for the valence dimension and three
model groups for the activation dimension classification tasks.

Broad Phoneme Class   Description        Phonemes Included
B                     Back/mid vowels    aa ah ao ax axh ax-h uh uw ux
C                     Fricatives         ch dh f hh hv j jh s sh th v z zh
D                     Diphthong          aw ay ey ow oy
F                     Front vowels       ae eh ih ix iy
L                     Liquid and glide   axr el er l r w y
N                     Nasal              em en eng m n ng nx
T                     Stop               b bcl d dcl dx epi g gcl k kcl p pcl q qcl t tcl
S                     Silence            h# #h pau sil

Table 4.1: Mapping from phonemes to broad phoneme classes (originally presented in [19]).
In each model group the data were modeled at the phoneme level. The phonemes
were clustered into seven classes to ensure an adequate quantity of training data for
each phoneme class. The seven classes included: front vowels, back/mid vowels, diph-
thong, liquid, nasal, stop consonants, and fricatives (for a detailed mapping see Table 4.1
and [19]). The utterances had accompanying transcription files at the word and phoneme
level, generated using forced alignment [13]. The phoneme-level transcription files were
modified for each utterance, replacing the original phonemes with phoneme classes (see
Table 4.2 for an example). In each classification task (valence and activation) there were
seven phoneme class models for each of the three model groups, plus emotion-independent
models for silence and laughter, for a total of 23 models.
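A minimal sketch of this label preparation, assuming hypothetical helper names and the mapping from Table 4.1, is shown below; it collapses the five-point ratings into the three model groups and rewrites individual phonemes as emotional phoneme classes.

```python
# Collapse a 1-5 valence or activation rating into the three model groups.
def collapse_rating(rating):
    if rating <= 2:
        return "low"      # classes one and two
    if rating >= 4:
        return "high"     # classes four and five
    return "neutral"      # class three

# Mapping from Table 4.1 (phoneme -> broad phoneme class).
BROAD_CLASS = {}
for cls, phones in {
    "back_mid": "aa ah ao ax axh ax-h uh uw ux",
    "fricative": "ch dh f hh hv j jh s sh th v z zh",
    "diphthong": "aw ay ey ow oy",
    "front": "ae eh ih ix iy",
    "liquid": "axr el er l r w y",
    "nasal": "em en eng m n ng nx",
    "stop": "b bcl d dcl dx epi g gcl k kcl p pcl q qcl t tcl",
    "sil": "h# #h pau sil",
}.items():
    for p in phones.split():
        BROAD_CLASS[p] = cls

def emotional_phoneme_label(phoneme, rating):
    """HMM model name, e.g. ('AH', 2) -> 'low_back_mid'; silence is emotion-independent."""
    cls = BROAD_CLASS[phoneme.lower()]
    return cls if cls == "sil" else f"{collapse_rating(rating)}_{cls}"

print(emotional_phoneme_label("AH", 2))  # low_back_mid
```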
Original Data             Transformed Data
Time        Content       Time        Phoneme   Emotional Phoneme Class
0 - 31      silence       0 - 47      sil       sil
48 - 57     what          48 - 51     W         one liquid
                          52 - 54     AH        one back/mid
                          55 - 57     T         one stop
58 - 70     was           58 - 60     W         one liquid
                          61 - 63     AX        one back/mid
                          64 - 70     Z         one fricative
71 - 87     that          71 - 75     DH        one fricative
                          76 - 83     AE        one front
                          84 - 87     TD        one stop
88 - 145    silence       88 - 145    sil       sil

Table 4.2: Data format used for HMM categorical emotion training, original sentence:
"What was that," expressed with a valence rating of one.

The HMMs were trained using HTK [119]. Each model had three states and eight
mixture components. The HMMs were trained in two ways: a) using individual-specific
evaluations, and b) using averaged evaluations. The individual-specific HMMs were
tested using individual-specific evaluations. The averaged HMMs were tested using both
individual-specific evaluations and averaged evaluations. The testing procedure utilized
word-level forced alignment using -I in HVite. This focused the classification task on the
identification of the correct emotional phoneme class, rather than the correct phoneme
and the correct emotion class.
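HTK was used for the actual training; purely as an illustrative stand-in, an equivalent topology (three states, eight Gaussian mixture components per state) can be sketched with the hmmlearn package, which is not what was used in this work, and with placeholder features.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# One model per emotional phoneme class, e.g. "low_back_mid".
# X is a (num_frames x 39) array of MFB + delta + acceleration features
# concatenated over all training segments; lengths gives each segment's frame count.
X = np.random.randn(500, 39)          # placeholder features
lengths = [100, 150, 250]             # placeholder segment lengths

model = GMMHMM(n_components=3,        # three states
               n_mix=8,               # eight mixture components per state
               covariance_type="diag",
               n_iter=20)
model.fit(X, lengths)

# At test time, the per-class log-likelihood of a segment supports
# choosing the best-fitting emotional phoneme class.
print(model.score(X[:100]))
```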
The output of the HMM classification consisted of a transcript file containing the
estimated emotional phoneme states over specified time windows. The final emotion of
the utterance was assigned using majority voting over the estimated emotional phonemes,
weighted by the time duration of each assigned emotional phoneme class. The emotion
class represented most frequently in the output transcription was assigned as the final
class label.
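A minimal sketch of this duration-weighted vote over a decoded transcript is given below; the transcript format is a simplified stand-in for the HTK output.

```python
from collections import defaultdict

def utterance_label(transcript):
    """transcript: list of (start, end, emotional_phoneme_class) tuples, e.g.
    [(0, 47, 'sil'), (48, 51, 'low_liquid'), ...].
    Returns the emotion class with the largest total duration (silence ignored)."""
    durations = defaultdict(float)
    for start, end, label in transcript:
        if label == "sil":
            continue
        emotion = label.split("_")[0]          # e.g. 'low' from 'low_back_mid'
        durations[emotion] += end - start      # weight the vote by duration
    return max(durations, key=durations.get)

print(utterance_label([(0, 47, "sil"), (48, 57, "low_liquid"),
                       (58, 70, "high_back_mid"), (71, 87, "low_fricative")]))  # low
```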
(a) Confusion matrix for evaluator 1, Accuracy = 59.51%
     A    H    S    N
A    61   4    13   22
H    6    74   0    20
S    35   8    23   34
N    10   12   6    73

(b) Confusion matrix for evaluator 2, Accuracy = 66.80%
     A    H    S    N
A    81   2    1    15
H    0    56   0    44
S    23   4    39   35
N    6    3    9    82

Table 4.3: Confusion matrices for the categorical emotion classification task (A = angry,
H = happy, S = sad, N = neutral). The results presented in this table are percentages.
4.3 Results
4.3.1 Naïve Bayes Classification of Evaluator Consistency
The subjective appraisals of valence and activation are linked to the categorical emotion
label [98]. This link can be simply modeled using Naïve Bayes (NB). We used an NB clas-
sifier, implemented in the Matlab pattern recognition toolkit, PRTools [38], to predict the
categorical emotion label of the clip given only the subjective valence and activation eval-
uations of: a) evaluator one, and b) evaluator two. In all cases, the clips were chosen from
the set evaluated by both evaluators one and two and with a categorical label of angry,
happy, sad, or neutral. The analysis was performed using five-fold cross-validation. The
results show that evaluator one's valence and activation ratings predicted the correct cat-
egorical label 59.51% of the time while evaluator two's dimensional evaluations predicted
the correct categorical label 66.80% of the time (see Table 4.3 for the evaluator-specific
confusion matrices).
4.3.2 HMM Classification for Correspondence Between Content and
Evaluation
In this section, models are referred to as "A - B", where "A" represents the training set
and "B" represents the testing set. The accuracies of the models trained with averaged
data (models "Ave - (Ave or Ind)" in Tables 4.4 and 4.5) are either better than or compara-
ble to the accuracies of the models trained with individual data (models "Ind - Ind" in
Tables 4.4 and 4.5). The models trained and tested on averaged data (models "Ave -
Ave" in Tables 4.4 and 4.5) had a higher accuracy than either of the individual models
(models "Ind - Ind" and "Ave - Ind" in Tables 4.4 and 4.5) for both valence and acti-
vation. The classification performance of the "Ave - Ave" model improves significantly
only with respect to evaluator one's activation and evaluator two's valence (α = 0.01, dif-
ference of proportions). In all other conditions, the change in classification performance
between the averaged and individual models is not significant (α = 0.01, 0.05, difference
of proportions).
The models trained and tested on individual data performed unequally for the valence
and activation classification tasks. The evaluator one model outperformed the evaluator
two model for the valence task (α = 0.01, difference of proportions). The evaluator
two model outperformed the evaluator one model for the activation task (α = 0.01,
difference of proportions). The models trained and tested on individual data did not
perform significantly differently than the models trained on averaged data and tested on
individual data (α = 0.01, 0.05, difference of proportions).
Type        Evaluator     Level 1 (%)   Level 2 (%)   Level 3 (%)   Total (%)
Ind - Ind   Evaluator 1   50.00         65.90         23.71         52.18
Ind - Ind   Evaluator 2   37.47         60.28         46.65         44.33
Ave - Ave   Average       47.86         64.70         40.00         52.68
Ave - Ind   Evaluator 1   44.28         61.28         38.44         50.91
Ave - Ind   Evaluator 2   36.01         69.72         41.34         44.39

Table 4.4: Classification: valence across the three levels.

Type        Evaluator     Level 1 (%)   Level 2 (%)   Level 3 (%)   Total (%)
Ind - Ind   Evaluator 1   64.55         23.16         66.76         47.79
Ind - Ind   Evaluator 2   68.81         39.37         62.83         55.79
Ave - Ave   Average       64.50         47.00         65.93         56.86
Ave - Ind   Evaluator 1   41.18         42.20         71.60         47.55
Ave - Ind   Evaluator 2   60.57         49.00         63.70         57.70

Table 4.5: Classification: activation across the three levels.
4.4 Discussion
The NB and HMM classification indicated that the evaluation styles and strengths of the
two evaluators differed across tasks. However, when the evaluations from both evalua-
tors were combined, the HMM classification accuracies across the valence and activation
classification problem either improved or did not change significantly. This suggests that
models constructed from averaged evaluator data may capture the emotional acoustic
properties of the utterance more closely even given different evaluation styles and inter-
nal representations of the relationship between the dimensional and categorical emotion
labels.
The NB results suggest reasons for the discrepancies between the classification per-
formances for the HMMs modeled on evaluator-specific data. The NB classification for
evaluator one indicates that evaluator one's internal representation of valence was more
strictly defined than that of evaluator two (Table 4.3). Evaluator one's confusion matrix
demonstrates that, based on the subjective valence and activation ratings, there exists a
smaller confusion between happiness and other emotions than is observed for evaluator
two. Happiness is the only emotion with positive valence and should be differentiable
based on the valence rating. It should be noted that evaluator one's confusion ma-
trix suggests that there is an increased confusion between happiness and sadness. This
may be due to a misrepresentation of activation, discussed in the following paragraph.
The differences between the inter-evaluator dimensional consistency may explain why the
individual-specific HMM valence model for evaluator one outperformed that of evaluator
two.
The NB results show an opposite trend for the dimensional ratings of activation. These
results suggest that evaluator two's dimensional rating of activation was more internally
consistent when compared to that of evaluator one. For example, evaluator one's results
indicate that there exists a higher level of confusion between anger and sadness than is
observed in evaluator two's results. Anger and sadness are emotion classes that should
be differentiable based on their activation (high vs. low, respectively). The difference in
evaluator activation consistency is supported by the HMM classification accuracies. The
HMM activation classification performance is higher for evaluator two than for evaluator
one.
The comparisons between the NB and HMM results suggest that evaluators one and
two have different evaluation styles and internal dimensional representations of emotion.
However, when the ratings of these two evaluators are combined, the performance of
the HMM classification on valence and activation improved (significantly with respect to
evaluator one activation and evaluator two valence). This suggests that even given large
quantities of data, it may be more beneficial to create averaged models of dimensional
evaluation, rather than evaluator-specific models (given evaluation styles that are not
divergent).
The user evaluations utilized in this study were not normalized per evaluator. While
the results of normalized evaluations may improve overall classification accuracies, such
techniques are not necessarily representative of real-world user interactions. It is not good
practice to discount the feedback of a user regarding emotion expression. It is important,
from a user initiative standpoint, to work with the evaluations as provided. Furthermore,
given new users in a human-computer or human-robot interaction scenario, it may not
be possible to develop normalization constants in real time, necessitating the use of raw
user input.
It is also important to note that the data utilized in this experiment come from only
partially emotionally constrained dyadic acted speech (both scripted and improvised).
The utterances were not recorded on a turn-by-turn basis with rigid emotional targets. As
a result, the emotional utterances in this database are not archetypal emotion expressions.
Consequently, one cannot expect the classification accuracies of these more natural and
subtle human emotional expressions to match those of classifications performed on read
speech databases.
4.5 Conclusion
This chapter presented evidence suggesting that even given different evaluation styles and
different levels of evaluator consistency, averaged models of emotion perception could still
outperform individual models. As we move towards a society with ever-increasing com-
puting power, we will begin to see emotionally personalized technology. These systems
must be able to meet both the interaction needs and expectations of the users with whom
they work. This necessitates an understanding of, and an ability to anticipate, these prefer-
ences. Initially it may seem wise to model these expectations at a per-user level. However,
this work suggests that the variability of individuals with respect to their dimensional
appraisal may lead to inaccuracies due to user self-misrepresentation. To mitigate this
problem, it may be beneficial to adapt averaged models of user perception to accommo-
date individual users.
Additional work is needed to determine how to integrate and interpret raw user eval-
uations. Researchers [105] have suggested that new evaluation metrics should be created.
Experimental techniques for emotion evaluation should also be updated. This may lead
to the creation of new emotional ground-truthing techniques that are more evaluator-
intuitive and evaluator-independent.
A weakness of the work presented in this chapter is the small number of dimensional
evaluators considered. Future work includes incorporating the evaluations of additional
evaluators. The IEMOCAP database contains both audio and facial motion capture
information. Future work also includes utilizing the video information to improve the
accuracies and to understand the temporal interaction between the audio information,
video information, and user perception.
Chapter 5
Emotion Profiling
The proper design of affective agents requires an a priori understanding of human emotion
perception. Models used for the automatic recognition of emotion can provide designers
with a method to estimate how an affective interface may be perceived given the feature
modulations present in the stimuli. An understanding of the mapping between feature
modulation and human perception can foster design improvements in the creation of emo-
tionally relevant and targeted expressions for use in human-computer and human-robot
interaction. This understanding will further improve human-centered design, necessary
for wide-spread adoption of this affective technology [121].
Human perception of naturalistic expressions of emotion is difficult to estimate. This
difficulty is in part due to the presence of complex emotions, emotions containing shades
of multiple affective classes [73,94,103,104]. In [73], the authors detail a scenario in which
evaluators view a clip of a woman learning that her father will remain in jail. Human
evaluators tagged these clips with labels including anger, disappointment, sadness, and
despair [73]. The lack of emotional purity in natural expressions of emotion must be
considered when designing systems to anticipate human perception of non-stereotypical
emotional speech. Classification systems designed to output one emotion label per input
utterance may perform poorly if the expressions cannot be well captured by a single
emotional label.

The work presented in this chapter was published in the following articles:
1. Emily Mower, Maja J. Matarić and Shrikanth S. Narayanan, "A Framework for Automatic Human
Emotion Classification Using Emotional Profiles." IEEE Transactions on Audio, Speech and Language
Processing. Accepted for publication, August 2010.
2. Emily Mower, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee,
Shrikanth Narayanan. "Interpreting Ambiguous Emotional Expressions." In Proceedings of ACII
Special Session: Recognition of Non-Prototypical Emotion from Speech - The Final Frontier?.
Amsterdam, The Netherlands, September 2009.
Naturalistic emotions can be described by detailing the presence/absence of a set of
basic emotion labels (e.g., angry, happy, sad) within the data being evaluated (e.g., a
spoken utterance). This multiple labeling representation can be expressed using Emotion
Profiles (EP). EPs provide a quantitative measure for expressing the degree of the presence
or absence of a set of basic emotions within an expression. They avoid the need for
a hard-labeled assignment by instead providing a method for describing the shades of
emotion present in the data. These profiles can be used in turn to determine a most
likely assignment for an utterance, to map out the evolution of the emotional tenor of an
interaction, or to interpret utterances that have multiple affective components.
EPs have been used within the community as a method for expressing the variability
inherent in multi-evaluator expressions [105]. These EPs represent the distribution of
reported emotion labels from a set of evaluators for a given utterance. Steidl et al.
compared the entropy of their automatic classification system to that present in human
evaluations. We introduced the notion of EPs for classification in a position paper [85].
We described EPs as a method for representing the emotion content of an utterance
in terms of the phoneme-level emotion classification output over the utterance. These
profiles described the percentage of phonemes classified as one of five emotion classes.
In this chapter, profiles describe the emotion-specific classifier confidence. Thus, these
new profiles can provide a more natural approximation of human emotion, approximating
blends of emotion, rather than time-percentage breakdowns of classification or reported
evaluator perception.
EPs are an effective representation for emotion. In this chapter we present an im-
plementation of emotion classification from vocal and motion-capture cues using EPs as
an intermediary step. The data are modeled at the utterance level, where an utterance is
defined as one sentence within a continuous speaker turn or, if there is only one sentence
in the turn, the entire speaker turn. The EPs are composed of four binary support vector
machine (SVM) classifiers, one for each of the emotions considered (anger, happiness,
sadness, neutrality), used to create an estimate of the presence or absence of classes of
emotion using classifier confidence. There are two methods that can be used to assign a
final label. The first method assigns a class label based on the emotion class with the
highest level of confidence, represented by the EP. The second method involves a sec-
ondary classification in which the profiles serve as a mid-level representation to a final
classification stage. The first method is employed in this chapter. The second method is
employed in the following two chapters.
Three data types of varying levels of ambiguity are used in the EP analyses. These
data types are based on evaluator reports. They include unambiguous ("prototypical",
total evaluator agreement), slightly ambiguous ("non-prototypical majority-vote consen-
sus"), and highly ambiguous ("non-prototypical non-majority-vote consensus") presenta-
tions, along with a mixed set ("full dataset", both total agreement and majority-vote
consensus). We demonstrate that the use of feature selection in conjunction with EP
representation results in an overall accuracy of 68.2% and an average of per-class accu-
racies (unweighted accuracy) of 64.5%, which is comparable to a previous audio-visual
study resulting in an unweighted accuracy of 62.4% [78]. The results are compared to a
simplified four-way SVM in which confidences were not taken into account. In all cases,
the overall accuracy of the presented method outperforms the simplified system. We also
demonstrate that the EP-based system can be extended to interpret utterances lacking
a well-defined ground truth. The results suggest that EPs can be used to discriminate
between types of highly ambiguous utterances.
This work is novel in that it presents a classification system based on the creation of
EPs and uses this technique to interpret emotionally ambiguous utterances. EPs represent
emotions as complex blends, rather than discrete assignments. Furthermore, these profiles
can be used to disambiguate the emotional content of utterances in expressions that would
not otherwise be classified as a single expression of emotion.
5.1 Data Description
The database utilized in this study is the USC IEMOCAP database, collected at the
University of Southern California [13]. The USC IEMOCAP database is an audio-visual
database, augmented with motion capture recording. The data were evaluated using cat-
egorical and dimensional evaluation frameworks. There were at least three evaluators per
categorical label and at least two per dimensional label. The database is fully described
in Chapter 4, Section 4.1.1.
5.1.1 Emotion Expression Types
Emotional data can be described by the level of evaluator agreement, allowing the data
to be considered either as a cohesive whole or as merged sub-sets of data. The subsets
considered in this work are prototypical, non-prototypical majority-vote consensus ("non-
prototypical MV"), and non-prototypical non-majority-vote consensus ("non-prototypical
NMV"). These three emotional gradations are derivations of Russell's prototypical and
non-prototypical definitions (Chapter 1, Section 1.3.4) and are used to describe the clarity
of the emotion presentations.
The three emotion expression types are defined with respect to the categorical emo-
tional evaluators. Prototypical emotion expressions are expressions with clear, well-agreed-
upon emotional content. During the categorical emotion labeling task, these utterances
were assigned a categorical label that was assigned by all evaluators (e.g., for three evalu-
ators, all three evaluators tagged the emotion "angry"). Non-prototypical MV emotions
are utterances with identifiable, but ambiguous, emotional content. During the cate-
gorical evaluation, there was no single label at the intersection of all of the evaluators'
assignments. However, these utterances were tagged by a majority of the evaluators with
a single emotional label (e.g., two evaluators tagged an emotion as "angry" and one tagged
the emotion as "disgusted"). The final emotional group, the non-prototypical NMV emo-
tions, were tagged with an inconsistent set of emotional labels (e.g., one evaluator tagged
the emotion as "angry", another as "disgusted", and the final as "sad"). As a result, it
was not possible to define a ground-truth label for this group of emotions. It is difficult
to make a strong assertion regarding the prototypical or non-prototypical nature of an
utterance since there are, on average, only three evaluators per utterance. However, the
results suggest that the designations are representative of differing amounts of variability
within the emotion classes.

Expression Type           Angry   Happy   Neutral   Sad   Total
Prototypical              284     708     121       309   1422
Non-prototypical MV       316     496     451       315   1578
Non-prototypical NMV 1L   173     17      47        45    282
Non-prototypical NMV 2L   174     142     350       174   420

Table 5.1: The distribution of the classes in the emotion expression types (note: each
utterance in the 2L group has two labels, thus the sum of the labels is 840 but the total
number of sentences is 420). There are a total of 3,000 utterances in the prototypical and
non-prototypical MV groups, and 3,702 utterances in total.
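A minimal sketch of how utterances could be routed into these expression types from their raw evaluator labels is shown below; the label sets are invented examples, and the thresholding simply follows the definitions above.

```python
def expression_type(label_lists):
    """label_lists: one list of categorical labels per evaluator for a single utterance.
    Returns 'prototypical', 'non-prototypical MV', or 'non-prototypical NMV'."""
    n_evaluators = len(label_lists)
    all_labels = set().union(*[set(labels) for labels in label_lists])
    for label in all_labels:
        if sum(label in labels for labels in label_lists) == n_evaluators:
            return "prototypical"               # every evaluator reported this label
    for label in all_labels:
        if sum(label in labels for labels in label_lists) > n_evaluators / 2:
            return "non-prototypical MV"        # majority, but not unanimous
    return "non-prototypical NMV"               # no majority label

print(expression_type([["angry"], ["angry"], ["angry"]]))          # prototypical
print(expression_type([["angry"], ["angry"], ["disgusted"]]))      # non-prototypical MV
print(expression_type([["angry"], ["disgusted"], ["sad"]]))        # non-prototypical NMV
```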
In this chapter the utterances considered were tagged with at least one emotion
from the emotional set: angry, happy, neutral, sad, excited. In all cases, the classes
of happy and excited were merged into a group referred to as "happy" to combat data
sparsity issues. In the prototypical and non-prototypical MV data, all the utterances
had labels from this emotional set. In the non-prototypical NMV group, only utterances
tagged by at least one evaluator as angry, happy, neutral, sad, or excited were considered
(the classes of happy and excited were again merged). This group is described as either
1L, indicating that one of the labels is in the emotional set, 2L, indicating that two of
the labels are in this set, or nL, indicating that there were more than two labels from the
set. The 1L data were extremely biased towards the class of anger (Table 5.1) and there
were only 80 utterances in the nL group; therefore, this study will focus only on the 2L
emotions. Table 5.1 shows the distribution of the data across the three expression classes.
5.1.2 Data Selection
This work utilized a subset of the USC IEMOCAP database. During the data collection,
only one actor at a time was instrumented with motion capture markers. This decision
allowed for an increase in the motion capture marker coverage on the actors' faces. Con-
sequently, only half of the utterances in the database are accompanied by motion capture
recordings.
The dataset size was further diminished by eliminating utterances without a single
voiced segment. This eliminated utterances of sighs, breaths, and low whispers.
Finally, the dataset size was reduced by the evaluator-reported affective label. As
previously stated, all utterances analyzed in this chapter are tagged with at least one
label from the set: angry, happy/excited, neutral, sad.
5.2 Audio-Visual Feature Extraction
The features utilized in this study were chosen for their perceptual relevance. The ini-
tial feature set contained audio and video (motion-capture extracted) features. All fea-
tures were extracted at the utterance level and were normalized for each subject using
z-normalization. The feature set was reduced to create four emotion-specific feature sets
using Information Gain.
5.2.1 Audio Features
The audio features include both prosodic and spectral envelope features. The prosodic
features include pitch and energy. These features have been shown to be relevant to
emotion perception [101, 103, 104, 115]. The spectral features include Mel Filterbank
Coefficients (MFBs). As stated in the previous chapter, MFBs approximate humans'
sensitivity to changes in frequencies. As the frequency of a signal increases, humans
become less able to differentiate between two distinct frequencies. MFBs capture this
property by binning the signal with triangular bins of increasing width as the frequency
increases. Mel Filterbank Cepstral Coefficients (MFCC) are commonly used in both
speech and emotion classification. MFCCs are discrete cosine transformed (DCT) MFBs.
The DCT decorrelates the feature space. Previous work has demonstrated that MFBs
may contain more emotionally relevant information than Mel Filter Cepstral Coefficients
(MFCC) across all phoneme classes, due to the lack of the final de-correlating step of the
MFCC calculation [19].

Figure 5.1: The location of the IR markers used in the motion capture data collection.
5.2.2 Video Features
The definition of the video features was motivated by Facial Animation Parameters
(FAPs). FAPs express distances (x, y, z) between points on the face. The features uti-
lized in this study were based on the features found in [112], adapted to the current facial
motion capture configuration. These features were extracted using motion capture mark-
ers (Figures 5.1 and 5.2). The cheek features include the distance from the top of the
cheek to the eyebrow (approximating the squeeze present in a smile); the distance from
the cheek to the mouth, nose, and chin; cheek relative distance features; and an average
position. The mouth features contain distances correlated with the mouth opening and
closing, the lips puckering, and features detailing the distance of the lip corner and top of
lip to the nose (correlated with smiles and frowns). The forehead features include features
describing the relative distances between points on the forehead and the distance from
one of the forehead points to the region between the eyebrows. The eyebrow features
include features describing the up-down motion of the eyebrows, features describing eye-
brow squish, and features describing the distance to the center of the eyebrows. Each
distance is expressed in three features defining the x, y, and z-coordinates in space.

Figure 5.2: The FAP-inspired facial distance features utilized in classification.
(a) Cheek. (b) Mouth. (c) Forehead. (d) Eyebrow.
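A minimal sketch of how such FAP-inspired distance features could be computed from per-frame marker coordinates is given below; the marker names are hypothetical placeholders for the actual motion capture configuration.

```python
import numpy as np

def marker_distance(frames, marker_a, marker_b):
    """frames: dict mapping marker name -> (num_frames x 3) array of x, y, z positions.
    Returns the per-frame x, y, z separation (num_frames x 3) between two markers."""
    return frames[marker_a] - frames[marker_b]

# Hypothetical markers: a 100-frame utterance with random placeholder positions.
frames = {name: np.random.randn(100, 3)
          for name in ["cheek_top_left", "eyebrow_outer_left",
                       "mouth_corner_left", "nose_tip"]}

# Cheek-to-eyebrow separation (approximates the cheek "squeeze" present in a smile)
# and lip-corner-to-nose separation (correlated with smiles and frowns).
cheek_eyebrow = marker_distance(frames, "cheek_top_left", "eyebrow_outer_left")
lip_nose = marker_distance(frames, "mouth_corner_left", "nose_tip")

# Each distance contributes three feature streams (x, y, z), later summarized
# with the utterance-level statistics of Section 5.2.3.
print(cheek_eyebrow.shape, lip_nose.shape)  # (100, 3) (100, 3)
```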
5.2.3 Feature Extraction
The utterance-length feature statistics include mean, variance, range, quartile maximum,
quartile minimum, and quartile range. The quartile features were used instead of the
maximum, minimum, and range because they tend to be less noisy. The pitch features
were extracted only over the voiced regions of the signal. The video motion-capture
derived features were occasionally missing values due to camera error or obstructions. To
combat this missing data problem, the features were extracted only over the recorded
data for each utterance. These audio-visual features have been used in previous emotion
classification problems [53].
The features were normalized over each speaker using z-normalization. The speaker
mean and standard deviation were calculated over all of the speaker-specific expressions
within the dataset (thus, over all of the emotions). Both the normalized and non-
normalized features were included in the feature set.
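A minimal sketch of these utterance-level statistics and the per-speaker z-normalization is shown below; the use of the 25th and 75th percentiles as the quartile minimum and maximum is an assumption.

```python
import numpy as np

def utterance_statistics(values):
    """values: 1-D array of frame-level feature values (e.g., pitch over voiced frames).
    Returns the utterance-length statistics used here."""
    q_min, q_max = np.percentile(values, [25, 75])
    return {
        "mean": np.mean(values),
        "variance": np.var(values),
        "range": np.max(values) - np.min(values),
        "quartile_max": q_max,
        "quartile_min": q_min,
        "quartile_range": q_max - q_min,
    }

def z_normalize_per_speaker(features, speakers):
    """features: (num_utterances x num_features); speakers: per-utterance speaker ids.
    Normalizes each feature with the speaker's own mean and standard deviation,
    computed over all of that speaker's utterances (all emotions pooled)."""
    normalized = np.empty_like(features, dtype=float)
    for spk in np.unique(speakers):
        rows = speakers == spk
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0) + 1e-10
        normalized[rows] = (features[rows] - mu) / sigma
    return normalized
```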
5.2.4 Feature Selection
There were a total of 685 features extracted. However, there were only 3,000 prototypical
and non-prototypical MV utterances utilized for testing and training. The feature set was
reduced using Information Gain on a per-emotion-class basis (e.g., the features for the
class of anger differed from those of happiness). Information Gain describes the difference
between the entropy of the labels in the dataset (e.g., "happy") and the entropy of the
labels when the behavior of one of the features is known (e.g., "happy" given that the
distance between the mouth corner and nose is known) [79]. This feature selection method
permits a ranking of the features by the amount of emotion-class-related randomness that
they explain. The top features were selected for the final emotion-specific feature sets.
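A minimal sketch of the Information Gain criterion, IG(Y; X) = H(Y) - H(Y | X), for a continuous feature is shown below; the actual selection was performed in Weka, and the discretization into bins is an assumption made only for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature, n_bins=10):
    """IG(Y; X) = H(Y) - H(Y | X), with the continuous feature X discretized into bins."""
    bins = np.digitize(feature, np.histogram_bin_edges(feature, bins=n_bins))
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for b in np.unique(bins):
        mask = bins == b
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return h_y - h_y_given_x

# Hypothetical usage: rank features for the "angry vs. not angry" task.
# gains = [information_gain(is_angry, X[:, j]) for j in range(X.shape[1])]
# top_features = np.argsort(gains)[::-1][:85]
```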
The feature selection was implemented in Weka, a Java-based data mining software
package [116]. Information Gain has previously been used to select a relevant feature
subset in [90] and in the work discussed in Chapters 2 and 3. Information Gain does
not create an uncorrelated feature set. An uncorrelated feature set is often preferable for
classification algorithms. However, humans rely on a redundant and correlated feature
set for recognizing expressions of emotion, and thus Information Gain was chosen to
approximate the feature redundancy of human emotion processing.
The features were selected in a speaker-independent fashion. For example, the Infor-
mation Gain for the emotion-specific features to be used for speaker 1 was calculated over
a database constructed of speakers 2-10 using ten-fold cross-validation.

Emotion    Cheek   Eyebrow   Forehead   Mouth   Energy   MFB
Angry      0.03    -         0.04       0.02    0.04     0.87
Happy      0.48    0.11      0.11       0.30    -        -
Neutral    0.48    0.10      0.10       0.28    -        0.05
Sad        -       -         -          -       0.04     0.96

Table 5.2: The average percentage of each feature over the 40 speaker-independent
emotion-specific feature sets (10 speakers * 4 emotions).
5.2.5 Final Feature Set
The number of features was determined empirically, optimizing for accuracy. The final
feature set included the top 85 features (see Table 5.2 for the feature types selected)
for each emotion class. The feature sets for anger and sadness were primarily composed
of MFBs. The feature sets of happiness and neutrality were composed primarily of a
mixture of cheek and mouth features. The high representation of audio features in the
angry and sad feature sets and the low representation in the happy and neutral feature
sets reinforce previous findings that anger and sadness are well captured using audio data
while happiness is poorly captured using audio data alone [14,85].
Figure 5.3: The EP system diagram. An input utterance is classified using a four-
way binary classification. This classification results in four output labels representing
membership in the class (+1) or lack thereof (-1). This membership is weighted by the
confidence (distance from the hyperplane). The final emotion label is the most highly
confident assessment.
5.3 Classification of Emotion Perception: Emotion Profile Support Vector Machine
The EP representation utilized in this chapter consists of four binary Support Vector
Machines (SVM). The EPs were created using the four binary outputs and a measure of
classifier confidence. The final label of the utterance is the most confident assignment in
the EP (see Figure 5.3 for the system diagram and an example).
5.3.1 Support Vector Machine Classification
SVMs transform input data from the initial feature space into a higher-dimensional space to
find an optimal separating hyperplane. SVMs have been used effectively in emotion
classification [5, 14, 66, 83, 95]. The SVMs used in this study were implemented using
MATLAB's Bioinformatics Toolbox. The kernel used was a Radial Basis Function (RBF)
with a sigma of eight, determined empirically. The hyperplane was found using Sequential
Minimal Optimization with no data points allowed to violate the Karush-Kuhn-Tucker
(KKT) conditions (see [25] for a more detailed explanation of SVM convergence using
the KKT conditions).
There were four emotion-specific SVMs trained using the emotion-specific (and speaker-independent)
feature sets selected using Information Gain (Chapter 3, Section 3.2.2).
Each of the emotional SVMs was trained discriminatively using a self vs. other training
strategy (e.g., angry or not angry). The output of each of the classifications included a
±1 label and the distance from the hyperplane. This training structure is similar to the one
utilized in [6], in which Bartlett et al. estimated the emotion state of a set of speakers
from a video signal. The authors transformed the distances from each of the self vs. other
SVM classifiers into probability distributions using a softmax function. In the present
work, the distances were not transformed because pilot studies demonstrated the efficacy
of retaining the distance variations inherent in the outputs of each of the four emotion-specific
SVM models. The models were trained and tested using leave-one-speaker-out
cross-validation on the emotion-specific feature sets.
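A minimal sketch of this training stage follows, using scikit-learn in place of the MATLAB toolbox; mapping the reported sigma of eight onto scikit-learn's gamma parameter (gamma = 1/(2*sigma^2)) and approximating the hard-margin constraint with a large C are assumptions of the sketch.

import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["angry", "happy", "neutral", "sad"]
SIGMA = 8.0                       # RBF kernel width reported above
GAMMA = 1.0 / (2.0 * SIGMA ** 2)  # scikit-learn's RBF parameterization

def train_emotion_svms(features_by_emotion, labels):
    """Train one self vs. other RBF SVM per emotion.

    features_by_emotion: dict mapping emotion -> (num_utterances, num_features)
                         matrix built with that emotion's selected feature set
    labels:              (num_utterances,) array of categorical emotion labels
    """
    models = {}
    for emotion in EMOTIONS:
        X = features_by_emotion[emotion]
        y = np.where(labels == emotion, 1, -1)        # self vs. other targets
        # A large C approximates the hard-margin training described in the text.
        models[emotion] = SVC(kernel="rbf", gamma=GAMMA, C=1e6).fit(X, y)
    return models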
5.3.2 Creation of Emotional Profiles
EPs express the confidence of each of the four emotion-specific binary decisions. The
emotions of anger, happiness, and sadness are often postulated as basic emotions [88]. We
therefore considered it important to include these emotions, augmented by neutrality, as
the four profile components. Each of the classifiers was trained using an emotion-specific
feature set (e.g., the feature set for the angry classifier differs from that for the happy
classifier). The outputs of each of these classifiers included a value indicative of how well
the models created by each classifier fit the test data. This goodness of fit measure was
used to assess which model fits the data most accurately.
The SVM goodness of fit measure used in this study was the raw distance from the
hyperplane. SVM is a maximum margin classifier whose decision hyperplane is chosen
to maximize the separability of the two classes. The distance from the margin of each
emotion-specific classifier provides a measure of the classifier confidence. The profile
components are calculated by weighting each emotion-specific classifier output (±1) by
the absolute value of the distance from the hyperplane (the goodness of fit measure).
This formulation renders the EPs representative of the confidence of each binary yes-no
emotion class membership assignment.
The intuition behind this decision came from the nature of the SVM classifier. SVM
identifies a class label using position relative to a separating boundary. Data points that
are close to the boundary suggest that, in the feature space (or projected feature space),
the class labels of the data points are more easily confused than those of points further away from
the boundary. Points that lie far from the separating hyperplane are examples of data
that are more differentiable, or are less confusable examples of a given class, than data
that lie close to the hyperplane. For example, in the binary angry classification task
a point that is far from the decision hyperplane may be a strong example of "angry",
suggesting that the data point is in fact not "not angry."
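Because each profile component is the binary label (+1 or -1) weighted by the absolute distance from the hyperplane, the component reduces to the signed decision value itself. A sketch follows, reusing the hypothetical models dictionary from the previous example.

def emotion_profile(models, features_by_emotion):
    """Build the EP of one utterance from the four binary SVMs.

    features_by_emotion: dict mapping emotion -> a 1 x num_features row for this
                         utterance, built with that emotion's feature set.
    """
    profile = {}
    for emotion, model in models.items():
        distance = float(model.decision_function(features_by_emotion[emotion])[0])
        label = 1 if distance >= 0 else -1
        profile[emotion] = label * abs(distance)   # equals the signed distance
    return profile

# Hard label: the most confident assessment is the largest profile component;
# the all-rejection special case is handled by the neutral rule of Section 5.3.3.
# predicted = max(profile, key=profile.get)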
Raw distances to the hyperplane are effective measures of the strength of emotion.
The accuracy of this statement was assessed by analyzing the distance to the hyperplane
as a function of location in a valence-activation plot. If the raw distance to the hyperplane
is an appropriate measure, one would expect to see that utterances in regions associated
with strong expressions of one of the basic emotions considered will be further away
from the hyperplane than utterances located in other regions of the valence-activation
space. In Figure 5.4 the raw distances to the hyperplane are plotted in the valence-activation
space for Speaker 1. The four subplots present the distance to the hyperplane
for the four emotional components of the EP (anger, happiness, neutrality, and sadness).
In the figure, dark red represents distances associated with a strong assertion of class
membership while dark blue represents distances associated with a strong rejection of
class membership. For example, in Figure 5.4, the "Happy Component" subplot (upper
right) shows that utterances that are positively valenced (ranging from calm happiness
to excitation) are dark red, the "Angry Component" subplot (upper left) demonstrates
that the negatively valenced and highly activated utterances are dark red, and the "Sad
Component" subplot (lower right) illustrates that the negatively valenced utterances with
low activation are dark red. The "Neutral Component" subplot (lower left) shows that
the utterances with neutral valence and lower activation are dark red. The classification
between neutral and sad data is notoriously difficult. The comparison between the neutral
and sad subplots illustrates that even given a component representation the classes remain
confusable.
SVMs were chosen for the EP backbone based on experimental evidence suggesting
that this algorithm had the highest performance when compared to other discriminative
techniques. Thus, the main results are presented using the SVMs as a backbone. However,
EPs can be created using K-Nearest Neighbors (KNN), Linear Discriminant Analysis
(LDA), or any classifier that returns a goodness of fit measure (including generative
classifiers). Both KNN and LDA have been used in emotion recognition studies [6, 32, 61].
Figure 5.4: The raw distances to the hyperplane for the four emotional components of
the EP.
5.3.3 Final Decision
An emotional utterance was assigned to an emotion class depending on the representation
of the emotions within the EP. The inherent benefit of such a classification system is that
it can handle assessments of emotional utterances with ambiguous emotional content. The
four binary classifiers may all return a value suggesting that the emotion is not a member
of any of the modeled emotion classes when the emotional content of an utterance is
unclear. Absent an EP based system, it would be difficult to assign an emotional label to
such an utterance. However, even in this scenario, it is possible to reach a final emotion
assignment by considering the confidences of each of the no-votes.
By definition, ambiguous or non-prototypical emotional utterances fit poorly in the
categorical emotion classes. This mismatch may be because the emotional content was
from an emotion class not considered. It may also be because the utterance contained
shades of multiple subtly-expressed emotion classes. However, the EP based classifier is
able to recognize these slight presences by a low-confidence rejection. Consequently, even
given four no-votes, a final emotion assignment can still be made.
The neutral emotion class is difficult to classify because there exists a large variability
in the emotion expression within this class. Neutral expressions may be colored
by shades of anger, happiness, or sadness. Evaluators also may assign a class of neutrality
when no other emotion is distinctly expressed. EPs can also be used to capture this phenomenon.
If all of the rejection confidences described by the EP are above a threshold,
then there is a strong indication that there is no clear emotion expressed. These utterances
are assigned to the class of neutrality. The threshold is defined for each emotion as
the mean of the confidences minus one standard deviation. If the emotion with the highest
confidence in the EP is chosen with a no-vote confidence outside this threshold (i.e., the
profile expressed high confidence that the utterance did not contain that emotion), it is
assumed that there is no clear emotion present, and the utterance is assigned to the class
of neutrality. This neutral assignment method is similar to the one implemented in [78].
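One reading of this rule is sketched below; the text does not fully specify the set of confidences over which the mean and standard deviation are computed, so the per-emotion threshold used here is an assumption.

import numpy as np

def assign_label(profile, reference_profiles):
    """Assign a hard label from an EP, backing off to neutral when the winning
    component is itself a confident rejection.

    profile:            dict emotion -> signed confidence for the test utterance
    reference_profiles: list of such dicts used to estimate per-emotion thresholds
    """
    thresholds = {}
    for emotion in profile:
        values = np.array([p[emotion] for p in reference_profiles])
        thresholds[emotion] = values.mean() - values.std()   # mean minus one std

    best = max(profile, key=profile.get)
    # A no-vote whose confidence lies beyond the threshold signals that no clear
    # emotion is present, so the utterance is assigned to neutrality.
    if profile[best] < 0 and profile[best] < thresholds[best]:
        return "neutral"
    return best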
5.4 Results and Discussion: the Prototypical, Non-Prototypical
MV, and Mixed Datasets
The results presented describe the system performance over utterances labeled as angry,
happy (the merged happy–excited class), neutral, or sad. The results are divided into
three categories: general results, prototypical emotion results, and non-prototypical MV
results.
The general, prototypical, and non-prototypical results are compared to a baseline
classification system and chance. The baseline is a simplified version of the EP classifier.
In this baseline, instead of utilizing the EP representation (weighting the output by
the distance from the boundary), the decisions are made using three steps. If only one
classifier returns a value of +1, then the emotion label is assigned to this class. If multiple
classifiers return +1, the utterance is assigned to the selected class with the higher prior
probability. If no classifiers return +1, the emotion is assigned to the class with the
highest prior probability (of the four emotion classes).
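The baseline decision rule can be sketched directly from the binary votes; treating the class priors as training-set frequencies is an assumption of the sketch.

def baseline_label(binary_votes, class_priors):
    """Simplified-SVM baseline: use only the +1/-1 votes, breaking ties and
    all-rejection cases with the class priors.

    binary_votes: dict emotion -> +1 or -1 from the four binary SVMs
    class_priors: dict emotion -> prior probability (e.g., training frequency)
    """
    accepted = [e for e, vote in binary_votes.items() if vote == +1]
    if len(accepted) == 1:                              # exactly one "yes" vote
        return accepted[0]
    if len(accepted) > 1:                               # tie: highest-prior accepted class
        return max(accepted, key=class_priors.get)
    return max(class_priors, key=class_priors.get)      # no "yes" votes: most frequent class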
The baseline represents SVM classification without considering relative confidences.
Emotion is often expressed subtly, and this subtle expression of emotion is often not well
recognized by classifiers trained to produce a binary decision (acceptance vs. rejection).
The comparison between the EP and the baseline will demonstrate the importance of
considering the confidence of classification results (e.g., a weak rejection by one of the
classifiers may indicate a subtle expression of emotion, not the absence of the emotion)
rather than just the binary result. The chance classification result assigns all utterances
to the emotion most highly represented within each (i.e., general, prototypical, and
non-prototypical MV) data sub-set.
5.4.1 General Results
The first set of classification results was obtained by training and testing on the full
dataset (prototypical and non-prototypical MV utterances).
Data Type       Emotion   Precision   Recall   F
Full EP         Angry     0.67        0.75     0.71
                Happy     0.77        0.81     0.79
                Neutral   0.54        0.28     0.37
                Sad       0.60        0.75     0.67
                Weighted: 0.68   Unweighted: 0.65   Baseline: 0.59
Prot EP         Angry     0.75        0.80     0.77
                Happy     0.89        0.88     0.88
                Neutral   0.65        0.34     0.45
                Sad       0.76        0.89     0.82
                Weighted: 0.82   Unweighted: 0.72   Baseline: 0.76
NonProt MV EP   Angry     0.58        0.71     0.64
                Happy     0.60        0.70     0.65
                Neutral   0.46        0.39     0.42
                Sad       0.55        0.41     0.47
                Weighted: 0.55   Unweighted: 0.55   Baseline: 0.42

Table 5.3: The EP and baseline classification results for three data divisions: full (a
combination of prototypical and non-prototypical MV), prototypical, and non-prototypical
MV. The baseline result (simplified SVM) is presented as a weighted accuracy.
The overall classification
accuracy using the EP representation was 68.2% (Table 5.3). This outperformed both
chance (40.1%) and the simplified SVM (55.9%). The difference between the EP method
and the baseline method was significant at the 0.001 level (difference of proportions test). The
unweighted accuracy (an average of the per-class accuracies) was 64.5%. This result was
comparable to the work of Metallinou et al. [78] (described in Chapter 1, Section 1.3.4),
which reported an unweighted accuracy of 62.4%, demonstrating the efficacy of the approach for a
dataset with varying levels of emotional ambiguity.
The average profiles for all utterances demonstrate that in the classes of angry, happy,
and sad there is a clear difference between the representation of the reported and non-reported
emotions within the average profiles (Figure 5.5).
[Figure: four panels of average profiles, (a) Angry, (b) Happy, (c) Neutral, (d) Sad; each panel plots the average distance from the hyperplane (range -2 to 2) for the angry, happy, neutral, and sad EP components.]
Figure 5.5: The average emotional profiles for all (both prototypical and non-prototypical)
utterances. The error bars represent the standard deviation.
All four average profiles demonstrate the necessity of considering confidence in addition to the binary yes-no label
in the classification of naturalistic human data. For example, the average angry EP
indicates that even within one standard deviation of the average confidence, the angry
classifier returned a label of "not angry" for angry utterances. The use of, and comparison
between, the four emotional confidences allowed the system to determine that, despite the
lack of a perfect match between the angry training and testing data, the evidence indicated
that the expressed emotion was angry (F-measure = 0.71).
As mentioned earlier, the EP technique can be performed using a variety of classification
algorithms. The results are presented using an SVM backbone. Results can also be
presented for an EP KNN (k = 35, 66.4%) and an EP LDA (diagonal covariance matrix,
60.3%).
5.4.2 Prototypical Classification
The prototypical classification scenario demonstrated the ability of the classifier to correctly
recognize utterances rated consistently by evaluators. The overall accuracy for the
prototypical EP classifier was 81.7% (Table 5.3). This outperformed chance (49.8%) and
the simplified SVM (75.5%). The difference between the EP and the baseline was significant
at the 0.001 level (difference of proportions test). The high performance of the simplified
SVM was due in part to the prevalence of happiness in the prototypical data (49.8%).
This bias affected the final results because both ties were broken and null-results were
converted to a class assignment using class priors.
The simplified SVM left 391 utterances unclassified (all classifiers returned -1), representing
27.5% of the data.
The average profiles for prototypical utterances (Figure 5.6) demonstrated that there
was a difference between the representation of the reported emotion and non-reported
emotions in the EPs for the classes of angry, happy, and sad. The barely-differentiated
neutral EP clearly demonstrated the causes behind the poor classification performance
of the neutral data. The performance increase in the angry, happy, and sad classifications
can be visually explained by comparing Figures 5.5(a) to 5.6(a), 5.5(b) to 5.6(b),
and 5.5(d) to 5.6(d). The mean confidence values for the angry, happy, and sad data were
higher when training and testing on prototypical data.
[Figure: four panels of average profiles for prototypical utterances, (a) Angry, (b) Happy, (c) Neutral, (d) Sad; each panel plots the average distance from the hyperplane (range -2 to 2) for the angry, happy, neutral, and sad EP components.]
Figure 5.6: The average emotional profiles for prototypical utterances. The error bars
represent the standard deviation.
5.4.3 Non-prototypical Majority-Vote (MV) Classification
The classification of non-prototypical MV utterances using EPs resulted in an overall
accuracy of 55.4%. This accuracy was particularly noteworthy when compared to the
simplified SVM baseline classification, whose overall accuracy was 42.2%. This difference was
significant at the 0.001 level (difference of proportions test). The EP also outperformed chance
(31.4%). The class-by-class comparison can be seen in Table 5.3. In 62.3% of the data
(983 utterances), none of the binary classifications in the simplified SVM classifier
returned any values of +1. This indicates that the four-way binary classification alone is
not sufficient to detect the emotion content of ambiguous emotional utterances.
[Figure: four panels of average profiles for non-prototypical utterances, (a) Angry, (b) Happy, (c) Neutral, (d) Sad; each panel plots the average distance from the hyperplane (range -2 to 2) for the angry, happy, neutral, and sad EP components.]
Figure 5.7: The average emotional profiles for non-prototypical utterances. The error
bars represent the standard deviation.
In the EP method there was a higher level of confusion between all classes and the class of
neutrality. This suggests that the emotional content of utterances defined as "neutral"
may not belong to a well-defined emotion class, but may instead be representative of the
lack of any clear and emotionally meaningful information.
The average profiles for non-prototypical MV utterances (Figure 5.7) demonstrate that
the EP representation strongly differentiates between reported and non-reported emotions
given non-prototypical MV data in the classes of anger and happiness. The average
non-prototypical EPs also provide additional evidence for the importance of comparing
the confidences in emotional assessments between multiple self vs. other classification
schemes. The simplified baseline demonstrated that in 62.3% of the data all four binary
classifiers returned non-membership labels, indicating that, in this subset, the feature
properties of the training and testing data differ more markedly here than in the prototypical
training–testing scenario. However, the feature properties of a specific emotion class
in the training data were closer to those of the same emotion class in the testing data
than to those of a different emotion class. This suggests that the
EP based method is more robust to the differences in within-class emotional modulation
than conventional SVM techniques.
The neutral classification of the non-prototypical MV data was more accurate than
that of either the prototypical or full datasets. The neutral EP modeled using the non-prototypical
MV data (Figure 5.7(c)) was better able to capture the feature properties of
the neutral data than those modeled using either the prototypical or full data (compare
Figures 5.5(c) and 5.6(c) to 5.7(c)). This suggests that it may be beneficial to create models
based on emotionally variable data (e.g., the non-prototypical MV data) when considering
inherently ambiguous emotion classes, such as neutrality.
5.4.4 Emotion Profiles as a Minor Emotion Detector
The previous sections demonstrated that EP based representations can be used in a
classification framework. This section will assess the ability of the EPs to capture both
the majority and minority reported emotions (e.g., for the rating "angry-angry-sad", the
major emotion is anger and the minor emotion is sadness). In this assessment, the EP
was trained and tested on the non-prototypical MV data.
The ability of the profile to correctly represent the major and minor emotions is studied
in two ways: first, using utterances whose major emotion was correctly identified by
the EP and whose minor emotion is from the set of angry, happy, neutral, and sad and
second, using utterances whose major and minor emotions were the two most confident
assessments (in either order). There are a total of 748 utterances with minor emotions
in the targeted set. Utterances with minor labels outside of this set were not considered,
as the EPs only include representations of anger, happiness, neutrality, and sadness
confidences and could not directly represent emotions outside of this set.
The major–minor emotion trends can be seen in Table 5.4(a). The proportion of the
non-prototypical MV emotions with secondary emotions from the considered set differs
with respect to the majority label. For example, 81.04% of the original happy data was
included in the new set, while only 6.01% of the angry data had secondary labels in the
set. The most common secondary label for the angry data was frustration (74.05%),
an emotion not considered in this study due to a large degree of overlap between the
classes. The distribution of the secondary emotions shows that the most common
combination in the considered affective set was a majority label of happy and a minority
label of neutral (Table 5.4(a)). This combination represented 51.06% of the major–minor
emotion combinations in the considered set. It should also be noted that across all major
emotions, the most common co-occurring emotion was neutrality.
In an ideal case, the EP would be able to represent both the majority and the minority
emotions correctly, with the majority emotion as the most confident assessment and the
minority emotion as the second most confident assessment. There were a total of 211
profiles (28.2% of the data) that correctly identified the major and the minor emotions.
Over the non-prototypical MV data, there were 406 utterances with a correctly identified
major label. Thus, the 211 profiles represented 52.0% of the correctly labeled data.
(a) Total number of emotions with secondary labels in the angry, happy, neutral, sad set.
Major↓    Angry   Happy   Neutral   Sad   Total
Angry     –       4       13        2     19
Happy     4       –       382       16    402
Neutral   12      129     –         56    197
Sad       11      25      94        –     130
Total     27      158     489       74    748

(b) Results where the major and minor emotions are correctly identified.
Major↓    Angry   Happy   Neutral   Sad   Total
Angry     –       2       6         1     9
Happy     1       –       136       5     142
Neutral   1       9       –         19    29
Sad       1       2       28        –     31
Total     3       13      170       25    211

(c) Results where the major and minor emotions were both in the top two reported labels.
Major↓    Angry   Happy   Neutral   Sad   Total
Angry     –       2       6         1     9
Happy     2       –       161       6     169
Neutral   2       28      –         33    63
Sad       1       11      47        –     59
Total     5       41      214       40    300

Table 5.4: The major–minor emotion analysis.
This indicated that the majority of profiles that correctly identified the major emotion also
correctly identified the minor emotion. This suggests that EPs can accurately assess
emotionally clear and emotionally subtle aspects of affective communication. The major–minor
pairing results can be found in Table 5.4(b).
In emotion classification, a commonly observed error is the disordering of the major
and minor emotions (i.e., the major emotion is reported as the minor and vice versa).
This phenomenon was also studied. Table 5.4(c) presents the utterances whose major and
minor emotions were recognized in the two most confidently returned emotion labels (in
either order). The results demonstrate that of the utterances with minor emotions in the
target affective set, the EPs identified both the major and minor emotions in the top
two components 40.1% of the time. This percentage varied across the major labels. The
angry non-prototypical MV data had both components recognized in 47.4% of the generated
EPs, while they were both represented in 42.0% of the happy EPs, 32% of the neutral
EPs, and 45% of the sad EPs.
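A sketch of the two agreement checks used in this analysis follows; the profile dictionary and the major/minor label arguments are hypothetical.

def top_two(profile):
    """Return the two most confident emotions in an EP, ordered by confidence."""
    ranked = sorted(profile, key=profile.get, reverse=True)
    return ranked[0], ranked[1]

def major_minor_agreement(profile, major, minor):
    """Classify how well the EP captures the evaluators' major and minor labels."""
    first, second = top_two(profile)
    if (first, second) == (major, minor):
        return "exact order"       # counted in Table 5.4(b)
    if {first, second} == {major, minor}:
        return "either order"      # counted in Table 5.4(c)
    return "no match"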
These results suggest that the EP technique is capable of representing subtle emotional
information. It is likely that this method does not return a higher level of accuracy
because the expression of the major emotion was already subtle. The expression
of the minor emotion was therefore not only subtle, but also not observed by all evaluators,
and this minority assessment may have been due to a characteristic of the data or the attention
level of the evaluator. In this light, the ability of the EP method to capture these
extremely subtle, and at times potentially tenuous, bits of emotional information should
further recommend the method for the quantification of emotional information.
5.5 Results and Discussion: the Non-Prototypical NMV
Dataset
One of the hallmarks of the EP method is its ability to interpret emotionally ambiguous
utterances. EPs can be used to extract whatever emotional information is recoverable from inherently
ambiguous data. In this chapter, the goal was to utilize utterances that have at least one
label from the target emotional set and to identify at least one of the emotions reported
by the evaluators.
The non-prototypical NMV utterances have no majority-voted label. These utterances
were labeled with one (or more) of the labels from the set: angry, happy, neutral, sad.
No ground-truth could be defined for these utterances because there was no evaluator
agreement. EPs are ideally suited to work with this type of data because they provide
information describing the emotional makeup of the utterances rather than a single hard
label.
Two experiments were conducted on the non-prototypical NMV data. The first experiment
was a classification study in which the non-prototypical NMV data were classified
using models trained on the full dataset, the prototypical data only, and the non-prototypical
MV data only. This study determines how well suited the EP representation
method, trained on labeled data, is for recognizing ambiguous emotional content. This
problem is difficult because the classifiers must be able to identify the categorical emotion
labels when the evaluators themselves could not. The evaluator confusion implies that
the feature properties of the utterances are not well described by a single label. The
second experiment was a statistical study designed to understand the differences in the
representations of the emotions within the EPs. This study provides evidence validating
the returned EPs. It demonstrates the differences that exist between EPs of specific
ambiguous emotion classes. These results suggest that the EP method returns meaningful
information in the presence of emotional noise.
There were a total of 420 non-prototypical NMV 2L utterances considered in the two
experiments.
5.5.1 Experiment One: Classification
In the classification study, three train-test scenarios were analyzed. In each study, the
modeling goal was to recognize at least one of the labels tagged by the evaluators in the
2L dataset using the EPs. This modeling demonstrates the ability of the EPs to capture
emotional information in the presence of highly ambiguous emotional content. Classifier
success was defined as the condition in which the classified emotion (the top estimate
from the EP) was in the set of labels reported by the categorical evaluators. In the 2L
dataset, there were two possible correct emotions, as explained in Section 5.1.
There were three training-testing scenarios. In the first scenario, the models were
trained on the full dataset (prototypical and non-prototypical MV). In the second scenario,
the models were trained on only prototypical utterances. In the final experiment,
the models were trained only on non-prototypical MV utterances. The three experiments
analyze the generalizability of the models trained on utterances with varying levels of
ambiguity in expression.
The results demonstrate (Table 5.5) that the emotional profiling technique is able
to effectively capture the emotion content inherent even in ambiguous utterances. In all
results presented, the per-class evaluation measure is precision, and the overall measure is
accuracy. This decision was motivated by the evaluation structure. The goal of the system
was to correctly identify either one or both of the two possible answers. Consequently, a
per-class measure that requires counting all of the utterances tagged as a certain
emotion is not appropriate, because the two correct labels would then be in direct opposition
for the per-class accuracy measure. However, accuracy over the entire classified set is
relevant because a classifier that returns either of the two listed classes can be defined
as performing correctly.
Dataset   Train      Angry   Happy   Neutral   Sad    Accuracy
2L        All        0.76    0.61    0.90      0.65   0.71
          Prot       0.77    0.60    0.88      0.59   0.65
          Non-prot   0.70    0.55    0.89      0.68   0.73
          Baseline                                    0.42

Table 5.5: The results of the EP classification on the 2L non-prototypical NMV data.
The results are the precision, or the percentage of correctly returned class designations
divided by the total returned class designations.
The chance accuracy (assigning a specific utterance to the class with the highest prior
probability) of the 2L dataset was calculated by finding the emotion
that co-occurred with the other emotion labels most commonly. The class of neutrality
occurred in 41.6% of the labels. Thus, chance was 41.6%.
The maximal accuracy of the 2L dataset was achieved in the non-prototypical MV
training scenario with 72.6% (Table 5.5). The class-by-class precision results demonstrate
that specific data types are more suited to identifying the affective components of
emotionally ambiguous utterances.
The results indicate that anger was precisely identified 70-77% of the time. This is
of particular note because in these data, humans could not agree on the label; yet, when
training with the non-prototypical MV data, the EP could precisely identify the presence
of anger.
The results further indicate that the EP was able to reliably detect one of the emotional
labels of the utterances from the 2L dataset. The overall accuracy of 72.6% is far
above the chance accuracy of 41.6%. Furthermore, since the chance classifier was only
capable of detecting neutrality, this further supports the EP's ability to precisely detect a
range of emotions.
The EP method is able to capture information that cannot be captured by the simplified
baseline SVM discussed earlier.
[Figure: six panels of average profiles for the 2L utterances, (a) Neutral–Angry, (b) Neutral–Happy, (c) Neutral–Sad, (d) Angry–Happy, (e) Angry–Sad, (f) Happy–Sad; each panel plots the average distance from the hyperplane (range -2 to 2) for the angry, happy, neutral, and sad EP components.]
Figure 5.8: The average emotional profiles for the non-prototypical NMV utterances. The
error bars represent the standard deviation.
In Figure 5.8 all of the histograms demonstrate
that, on average, all four binary classifiers return non-membership results (-1). The
confidence component allows the EP to disambiguate the subtle emotional content of the
non-prototypical NMV utterances. The average profiles of Figure 5.8 demonstrate that
the EPs are able to capture the emotion content of these utterances.
5.5.2 Experiment Two: ANOVA of EP based representations
In this statistical study, two ANOVA analyses were performed on the profiles to determine
the statistical significance of the representations of the emotions within the profiles. These
studies investigated the ability of the EPs to differentiate between the reportedly present
and absent emotional content.
The results presented in this section are two-tailed ANOVAs. These analyses were
performed on the 2L dataset with EP models trained using the full dataset (prototypical
and non-prototypical MV). This study demonstrates that EPs are able to capture multiple
reported emotions. In the results described below, the two reported labels for an utterance
in the 2L dataset are referred to as the co-occurring labels or group (e.g., neutral and
angry). Labels that are not reported are referred to as the non-occurring labels or
group (e.g., happy and sad). Each ANOVA analysis studies sets of EPs grouped by the
co-occurring emotions (e.g., the neutral–angry group). These groups are referred to as
EP sets.
The first analysis studies the representation of pairs of emotions in individual EP
sets by comparing the co-occurrence (e.g., neutral and angry) group mean to the non-occurrence
(e.g., happy and sad) group mean. This study asks whether the EP representation
is able to capture the difference between reportedly present and absent emotions.
This analysis is referred to as the Individual EP Set experiment.
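A sketch of the Individual EP Set comparison is given below, with SciPy's one-way ANOVA standing in for whichever statistics package was actually used; the data structures are assumptions.

import numpy as np
from scipy.stats import f_oneway

def individual_ep_set_anova(ep_set, co_occurring, non_occurring):
    """Compare co-occurrence vs. non-occurrence component values within one EP set.

    ep_set:        list of profiles (dict emotion -> confidence) whose utterances
                   share the same pair of reported labels, e.g. neutral-angry
    co_occurring:  the two reported emotions, e.g. ("neutral", "angry")
    non_occurring: the two unreported emotions, e.g. ("happy", "sad")
    """
    present = np.array([p[e] for p in ep_set for e in co_occurring])
    absent = np.array([p[e] for p in ep_set for e in non_occurring])
    return f_oneway(present, absent)   # F statistic and p-value for the group means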
The Individual EP Set experiment demonstrates that, in general, the representation
of the co-occurrence group in an EP set differs from that of the non-occurrence group.
In the angry–sad EP set this difference was significant at the 0.001 level; in the neutral–happy
and neutral–sad EP sets, at the 0.01 level; and in the neutral–angry EP set, at the 0.1 level.
In the angry–happy and happy–sad EP sets, the difference was not significant. This
suggests that in the majority of the cases, the individual EP sets were able to
differentiate between the presence and absence of the co-occurrence labels in the emotional
utterances (Table 5.6).
The next study builds on the Individual EP Set results to determine if
the representation of these co-occurring emotion groups differs between their native EP
set and a different (non-native) EP set
Co-occurring emotions        P-value
Emotion 1     Emotion 2
Neutral       Angry          -
Neutral       Happy          **
Neutral       Sad            **
Angry         Happy
Angry         Sad            ***
Happy         Sad

Table 5.6: ANOVA analysis of the difference in group means between co-occurring and
non-occurring emotions within an EP set (Individual EP Set experiment). (significance
levels: - = 0.1, * = 0.05, ** = 0.01, *** = 0.001)
(e.g., compare the representation of neutral and
angry in the neutral–angry EP set to the neutral–angry representation in the happy–sad
EP set). This is referred to as the Group experiment.
The Group experiment found that in most cases, the co-occurrence group mean differed
between the native EP set and a non-native EP set when the co-occurrence
emotions of the non-native set were disjoint from the co-occurrence emotions of the native
set. This was observed most starkly with the Angry–Sad EP set: the representation of
its co-occurrence emotions differed from their native EP set only when compared with
their representation in the Neutral–Happy EP set, the one set whose co-occurrence emotions
were entirely disjoint. This demonstrates that EP sets must be differentiated based on
more than their co-occurrence emotions (Table 5.7).
The following two analyses determine if the representation of the individual co-occurring
emotions differs between the native and non-native sets. These are referred
to as the Emo1 and Emo2 experiments (e.g., compare the representation of neutral in the
neutral–angry EP set to the neutral representation in the happy–sad EP set).
The Emo1 and Emo2 experiments demonstrate that the difference in the representation
of the individual co-occurrence emotions of anger, happiness, and sadness between
their native and non-native EP sets occurred most frequently and most significantly when
the paired co-occurrence emotion was neutrality (Table 5.7).
The Individual EP Set, Group, Emo1, and Emo2 analyses demonstrated that aspects
of the EP sets are differentiable. The final analysis compares the differences between
the EP set representations as a whole. The result is an interaction term between the
analysis of the difference between the representation of each emotion in the EP sets and
the difference between the two EP sets' values when grouped together (e.g., compare the
neutral–angry EP set to the happy–sad EP set). This is referred to as the EP1 vs. EP2
experiment.
The EP1 vs. EP2 experiment demonstrates that in 26 of the 30 cases, the representation
of the EP sets differed significantly between the sets. Furthermore, the cases in which
the EP sets were not significantly different occurred when the emotions represented by the
two co-occurrence pairs shared a similar co-occurrence emotion. The co-occurrence pairs
of angry–happy and angry–sad can both represent tempered anger. Consequently, the
EP sets' inability to strongly differentiate between the two emotion types should not be
seen as a failure, but instead as the EP sets' ability to recognize the inherent similarity
in the emotion expression types (Table 5.7).
These results suggest that certain EP sets distinctly represent the underlying emotions
reported by evaluators. This further suggests that these EP sets (rather than a
single confident label) can be used during classification to detect the differences between
ambiguous emotional utterances. This application of EP sets will be explored in future
work.
5.6 Conclusion
Natural human expressions are combinations of underlying emotions. Models aimed at
automated processing of these emotions should reflect this aspect. Conventional classification
techniques provide single emotion class assignments. However, this assignment
can be very noisy when there is not a single label that accurately describes the presented
emotional content. Instead, these utterances should be described using a method that
identifies multiple emotion hypotheses. If a single label is necessary, it can be divined
from the information contained within the EP. However, when a hard label is not required,
the entirety of the emotional content can be retained for higher-level emotional
interpretation.
The EP technique performs reliably for both prototypical and non-prototypical emotional
utterances. The results also demonstrate that the presented EP based technique
can capture the emotional content of utterances with ambiguous affective content. EPs
describe the emotional components of an utterance in terms of a pre-defined affective set,
providing a method for identifying either the most probable emotional label for a given
utterance or the relative confidence of the available emotional tags.
The neutral emotion class is difficult to classify because there exists a wide range in
the variability of emotion expressed within this class. Neutral expressions may be colored
by shades of anger, happiness, or sadness. Evaluators also may assign a class of neutrality
when no other emotion is distinctly expressed.
One of the strengths of the EP based method is its relative insensitivity to the selection
of the base classifier. This study presented results utilizing an SVM four-way binary
classifier. The SVM classifier can be replaced by any classifier that returns a measure of
confidence. The results demonstrate that other classification methods (KNN, LDA) can
also serve as the backbone of an EP based method.
Future work will include several investigations of the utility of the EP representation.
In the presented work, the final emotion assessments are made by selecting the most
confident emotion assessment from the generated profile. However, this does not take
into account the relationship between the individual emotional components of the EP.
Chapters 6 and 7 will investigate classification of the generated profiles. Frustration is
not included in either the EP testing or training. Chapter 6 will also investigate whether
frustration should be included as a component of the profile, or if the EP representation is
sufficiently powerful to represent frustration without including it as a component. Finally,
Chapter 7 will also investigate the utility of emotion-based components for representation,
rather than data-driven clusters, as the relevant components in the profile construction.
These analyses will provide further evidence regarding the efficacy of profile-based representations
of emotion.
This chapter presented a novel Emotional Profiling method for automatic emotion
classification. The results demonstrate that these profiles can be used to accurately
interpret naturalistic and emotionally ambiguous human expressions and to generate both
hard and soft labels for emotion classification tasks. Furthermore, EP-based methods
are relatively robust to classifier selection. Future work will include utilizing EPs to
interpret dialog-level emotion expression and utilizing EPs for user-specific modeling.
EP Set 1       EP Set 2       Significance markers: Co-occurring Group, Emo1, Emo2,
(Emo1, Emo2)   (Other1, Other2)   and EP1 vs. EP2 (Interaction Term)
Neu-Ang        Neu-Hap        ***  *  ***  ***
               Neu-Sad        ***  -  -  ***
               Ang-Hap        *  **
               Ang-Sad        -  -  *
               Hap-Sad        **  -  -  ***
Neu-Hap        Neu-Ang        ***  *  ***  ***
               Neu-Sad        ***  *  ***  ***
               Ang-Hap        *
               Ang-Sad        *  *  ***
               Hap-Sad
Neu-Sad        Neu-Ang        ***  ***  ***
               Neu-Hap        ***  *  ***  ***
               Ang-Hap        ***  *  ***  ***
               Ang-Sad        ***  -  ***  ***
               Hap-Sad        *  -  -  **
Ang-Hap        Neu-Ang        *  **  **
               Neu-Hap        ***  *
               Neu-Sad        ***  ***  **  ***
               Ang-Sad
               Hap-Sad        -  *  **
Ang-Sad        Neu-Ang        -  *
               Neu-Hap        **  ***  ***
               Neu-Sad        ***  ***  ***
               Ang-Hap
               Hap-Sad        *  *  **
Hap-Sad        Neu-Ang        **  -  *  ***
               Neu-Hap
               Neu-Sad        *  -  **
               Ang-Hap        *  **
               Ang-Sad        *  *  **

Table 5.7: ANOVA analyses of the differences between reported emotions in profiles in
which they were reported vs. profiles in which they weren't. Note that the EP1 vs. EP2
result is an interaction of an ANOVA analysis of the set EP1 vs. EP2 and an ANOVA
analysis of the representation of the individual emotions in each EP set. (significance
levels: - = 0.1, * = 0.05, ** = 0.01, *** = 0.001)
Chapter 6
The Robustness of Emotion Profiling
In the previous chapter EPs were shown to be effective representations for emotional utterances.
The components of the profile allowed for a descriptive characterization of the
emotion components present in the utterance. However, it is necessary to demonstrate
that the EPs can also represent the component properties of out-of-domain emotions.
Ambiguous emotional expressions are a natural part of human communication. Consequently,
in a human-machine interaction (HMI), a system's affective awareness capabilities
are limited both by its ability to recognize emotions on which it has been trained
and to reconcile emotions that it has not previously observed. This chapter will assess
the ability of EPs to discriminatively represent emotions unseen during training.
The work presented in this chapter was published in the following articles:
1. Emily Mower, Maja J. Matarić, and Shrikanth S. Narayanan, "A Framework for Automatic Human
Emotion Classification Using Emotional Profiles." IEEE Transactions on Audio, Speech and
Language Processing. Accepted for publication, August 2010.
2. Emily Mower, Maja J. Matarić, and Shrikanth S. Narayanan, "Robust Representations for Out-of-Domain
Emotions Using Emotion Profiles." In Proceedings of the IEEE Workshop on Spoken Language
Technology (SLT), Berkeley, CA, December 2010.
3. Emily Mower, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok
Lee, and Shrikanth Narayanan, "Interpreting Ambiguous Emotional Expressions." In Proceedings of
ACII Special Session: Recognition of Non-Prototypical Emotion from Speech - The Final Frontier?,
Amsterdam, The Netherlands, September 2009.
Emotion classification requires the quantification of affective utterances via mathematical
representation. These representations attempt to disambiguate affective data by
maintaining the flexibility needed to capture the essence of the expression while allowing
for the variance inherent in human emotions. However, during an interaction with
a human, a system will invariably be faced with representing an emotion unseen during
its training. The representation employed by the machine must be able to capture the
emotional content of the data in a way that will allow for future classification, even if
the emotional category has not previously been observed. This ability to characterize
utterances may allow future HMI systems to adapt to the emotion speaking style of their
users.
EPs have been used to fuse different modalities in classification [78]. EP-like representations
have also been used to represent the evaluations of a set of evaluators [55, 104]
and to represent perception based on actions (as a function of multiple emotions) [22]. In
this chapter, we further analyze this technique, studying its ability to represent
out-of-domain data.
In the previous chapter the EPs were four-dimensional and the classification was four-way.
In this chapter the utility of adding additional representations will be explored. The
driving hypothesis is that the four emotional components of the EP should be able to
represent emotions that are combinations of the components. This chapter will demonstrate
that four semantically meaningful "ideal" clusters (e.g., angry, happy) can be used
to represent five separate emotion categories, using frustration as a case study. The EPs
will again be composed of angry, happy, neutral, and sad emotional components with
classications will demonstrate the robustness of the EPs for the representation of out-of-
domain emotional data. The classication accuracies did not signicantly dier between
the four and ve dimensional EP classications suggesting that EPs need only contain
the emotions necessary to \span" the emotional space. The classication results and
statistical analyses presented in this chapter suggest that EPs are a robust representation
for emotion, both in- and out-of-domain.
The results demonstrate that the EP representation can be effectively used to characterize
the data in an n-way (where n = 4, 5) speaker-dependent emotion classification
task using Naïve Bayes. This speaker-dependent classification is representative of the user
personalization component inherent in long-term human-machine interaction. The presented
classification framework obtains an accuracy of 68.43% over the four-class emotion
classification problem (angry, happy, neutral, and sad) over the full dataset. However, its
true power lies in its ability to characterize emotions unseen during the generation of the
representation. EPs trained only on angry, happy, neutral, and sad data can classify a
test set composed of angry, happy, neutral, sad, and frustrated utterances with a classification
accuracy of 58.20%. This represents a performance decrease of only 0.35% when
compared to the results obtained by including frustration in the EP training. This study's
novelty is in its demonstration that EPs, a new representation for emotional utterances,
can be used to discriminatively characterize emotions unseen during the training of the
EPs.
6.1 Description of Data
6.1.1 IEMOCAP Database
The representative capability of the EP representation was evaluated using the USC
IEMOCAP Dataset collected at the University of Southern California [13]. This dataset
contains data from five mixed-gender pairs of actors (10 actors total). The data include
video, audio, and motion-capture recordings. A full description of the data can be found
in Chapter 4, Section 4.1.1.
6.1.2 Data Definitions
As discussed in the previous chapter, the data were partitioned into groups defined by
the level of agreement between evaluators. These groups were labeled prototypical and
nonprototypical majority-vote (hereafter referred to as nonprototypical). These definitions
are derived from those of Russell [98]. Prototypical utterances have clear emotional
content with total evaluator agreement; the utterance's majority emotional tag was selected
by all of the evaluators. The nonprototypical utterances have emotional content
that is less clear than that of the prototypical utterances; the majority emotion tag was
selected by only a majority of the evaluators. The distribution of the data can
be seen in Table 6.1.
6.2 Emotion Profiles
One theory of emotion asserts that there exist "basic emotions". An emotion is basic if it
is differentiable from all other emotions [39].
Data Type          Angry     Happy     Neutral   Sad       Frustrated
Prototypical       284       709       121       309       353
                   15.99%    39.92%    6.81%     17.40%    19.88%
Nonprototypical    316       496       451       315       598
                   14.52%    22.79%    20.73%    14.48%    27.48%
Combined           600       1205      572       624       951
                   15.18%    30.49%    14.47%    15.79%    24.06%

Table 6.1: The distribution of the emotion classes in the prototypical and nonprototypical
categories.
The set of basic emotions can be thought of
as a subset of the space of human emotion, forming an approximate basis for the emotional
space. More complex, or secondary, emotions can be created by blending combinations
of the basic emotions. For example, the secondary emotion of jealousy can be thought of
as the combination of the basic emotions of anger and sadness [120]. Four emotions are
often postulated as basic: anger, happiness, sadness, and fear. The emotions utilized in
this work are a subset of this list (anger, happiness, and sadness), together with an
additional category, neutrality, usually defined as the absence of discernible emotional
content.
Thus, EPs represent emotional utterances using a set of emotional bases. The EPs
quantify the presence or absence of a set of emotions in a given utterance. This subset of
emotional labels is chosen to minimize class overlap and correlation. This work assesses
the utility of extending the EP representation to include additional emotions that are
correlated with the emotional bases previously described.
6.2.1 Construction of an EP
As discussed in the previous chapter, the EPs were constructed using Support Vector
Machines (SVM) and are speaker-independent.
[Figure: system diagram. Labeled data from a disjoint speaker set is used to train binary SVMs (Angry vs. Not, Happy vs. Not, Neutral vs. Not, Sad vs. Not). Labeled training data and input test utterances from the test speaker are passed through the trained system, producing labeled EPs and a test EP (signed distances from the hyperplane, e.g., Angry: +1, dist 2; Happy: -1, dist 2; Neutral: -1, dist 1; Sad: +1, dist 1), which are classified with Naïve Bayes (4-way or 5-way).]
Figure 6.1: The EP based classification system diagram. This example demonstrates
the correct classification of a nonprototypical angry utterance (a mixture of anger and
sadness).
The models were trained using a disjoint
speaker set (e.g., the EP for Speaker 1 was generated using data from Speakers 2-10).
These training data were clustered into the semantic classes using the labels angry, happy,
neutral, sad, and, when applicable, frustrated. The EPs were constructed by testing the
held out speaker data (e.g., Speaker 1) on the trained SVM models (Figure 6.1). Each EP
contained n components, one for the output of each emotion-specific SVM. The number
of components was either four (angry, happy, neutral, and sad) or five (angry, happy,
neutral, sad, and frustrated). See Figure 6.2 for an example of a four-dimensional EP.
6.2.2 Classification with EP Based Representations
There are two ways to transform an n-dimensional EP into a final classification label.
[Figure: a four-component bar plot (Angry, Happy, Neutral, Sad) of distance from the hyperplane (range -2 to 2) for an utterance labeled "Happy"; the EP was generated without frustration training.]
Figure 6.2: The EP of an utterance tagged as 'happy'. This EP has been trained without
frustration data.
The simpler of the two approaches is to assign a label to an input utterance based on
the maximal component of the profile (e.g., in Figure 6.2 the label would be happy).
This approach was employed in Chapter 5. However, as observed in the previous chapter
(Section 5.4.4), the minority components also contain emotionally relevant information.
Voting-based labeling does not take advantage of the information in the minority components.
Instead of relying on choosing the maximal confidence, the final emotion can
be selected by classifying the generated profile in a speaker-dependent fashion. In this
work, we use Naïve Bayes classification.
6.2.3 Speaker-Dependent and Speaker-Independent Components
The classification framework employed in this study is motivated by speaker personalization.
Speaker personalization involves two stages: a speaker-independent stage and a
speaker-dependent stage. In speaker personalization, a system is initialized with a baseline set
of models. The personalization stage is then the process of adapting the system's models
to the current speaker. Speaker personalization is important in emotion-aware technology
as emotion production varies across individuals.
In this framework the classification system was composed of the described speaker-independent
and speaker-dependent components. In the speaker-independent stage, emotion-specific
SVMs were trained using the labeled emotional (angry, happy, neutral, sad, and
frustrated, if applicable) data from nine speakers. These four or five emotion-specific
SVMs were used to generate the four- or five-dimensional EPs for the held out speaker.
These EPs were used as the features in the speaker-dependent classification stage. In
the speaker-dependent classification stage, the held out speaker's EPs were classified in
a speaker-dependent fashion using Naïve Bayes (Figure 6.1). The results were assessed
using leave-one-utterance-out cross-validation over the generated EPs for each speaker.
For example, Speaker 1 had m EPs after the speaker-independent EP construction. The
final emotion class assignment of an utterance (represented by an EP) was determined by
training a Naïve Bayes classifier on the remaining m - 1 EPs. This process was repeated
over all of the generated EPs. Preliminary results suggested that Naïve Bayes classification
was more effective in this task than K-Nearest Neighbors, Discriminant Analysis,
and Gaussian Mixture Models.
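A minimal sketch of this speaker-dependent stage follows, assuming the speaker's EPs are stacked in an m x n array and using scikit-learn's Gaussian Naive Bayes as a stand-in for the Naive Bayes variant actually employed.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def speaker_dependent_accuracy(ep_matrix, labels):
    """Leave-one-utterance-out Naive Bayes classification of one speaker's EPs.

    ep_matrix: (m, n) array, one n-dimensional EP per utterance of this speaker
    labels:    (m,) array of emotion labels for those utterances
    """
    correct = 0
    m = len(labels)
    for held_out in range(m):
        train = np.arange(m) != held_out
        model = GaussianNB().fit(ep_matrix[train], labels[train])
        if model.predict(ep_matrix[held_out:held_out + 1])[0] == labels[held_out]:
            correct += 1
    return correct / m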
6.3 Feature Extraction and Selection
The features utilized in this study were extracted from the audio and motion-capture
information. In both cases utterance-level features were used. The statistics used in this
study include: mean, variance, upper quartile, lower quartile, and quartile range.
The audio features include the first thirteen Mel Filterbank Coefficients (MFB), pitch,
and intensity. Pitch and intensity are commonly used in emotion classification tasks and
have been found to be effective [101, 103, 104, 115]. As discussed in the previous chapter,
Mel Frequency Cepstral Coefficients (MFCC) are also commonly used in both speech and
emotion recognition. MFCCs are not used because previous work has demonstrated that
MFBs are more effective for emotion classification than MFCCs [15].
As stated in the last chapter, the video features are based on Facial Animation Parameters
(FAP) [112]. These features were adapted for the motion capture configuration
present in the USC IEMOCAP dataset. FAPs specify the (x, y, z) distances between specific
points on the face. The video features were broken down into regions defined by the
cheeks, eyebrow, forehead, and mouth. A more detailed description of the video features
can be found in Chapter 5, Section 5.2.2.
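The utterance-level statistics listed above are straightforward to compute; the sketch below assumes the frame-level features of one utterance are stacked in a NumPy matrix.

import numpy as np

def utterance_statistics(frames):
    """Collapse a (num_frames, num_features) matrix of frame-level features
    (MFBs, pitch, intensity, FAP-style distances) into utterance-level statistics:
    mean, variance, upper quartile, lower quartile, and quartile range."""
    upper = np.percentile(frames, 75, axis=0)
    lower = np.percentile(frames, 25, axis=0)
    return np.concatenate([
        frames.mean(axis=0),
        frames.var(axis=0),
        upper,
        lower,
        upper - lower,     # quartile range
    ])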
6.3.1 Feature Selection
The initial feature set consisted of 685 features. The feature selection method utilized
was Principal Feature Analysis (PFA) [69]. PFA is an extension of Principal Component
Analysis (PCA) that returns interpretable features (from the original feature space) rather
than linear combinations of features. In PFA, as in PCA, the eigenvalues and eigenvectors
are calculated. The features are then clustered in the PCA space, and the features closest to
the mean of each of the clusters are returned as the final feature set. The PFA feature
selection was speaker-independent (e.g., features were selected for Speaker 1 using Speakers
2-10) over the prototypical and nonprototypical utterances labeled as angry, happy,
neutral, or sad. The final feature set contained 30 features for each speaker. This feature
selection algorithm has been used in emotion classification tasks on the USC IEMOCAP
dataset [77, 78].
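A sketch of this selection procedure follows; the use of k-means for the clustering step and the number of retained principal components are assumptions of the sketch rather than details taken from [69].

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def principal_feature_analysis(features, num_selected=30, num_components=10):
    """Select original features via PFA: cluster the feature loadings in the PCA
    space and keep the feature closest to each cluster mean.

    features: (num_utterances, num_features) matrix of (normalized) features
    Returns the indices of the selected features (duplicates are merged).
    """
    pca = PCA(n_components=num_components).fit(features)
    loadings = pca.components_.T                        # one row per original feature
    clusters = KMeans(n_clusters=num_selected, n_init=10).fit(loadings)
    selected = []
    for center in clusters.cluster_centers_:
        distances = np.linalg.norm(loadings - center, axis=1)
        selected.append(int(np.argmin(distances)))      # feature nearest the cluster mean
    return sorted(set(selected))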
6.4 Methods
There were two train-test scenarios presented to analyze the ability of the EP tool to
generalize to unseen data. In both scenarios, the EP performance when the training
and test sets contained the same emotions was used as a benchmark. In the first scenario, the
EPs were augmented to include a frustration component; in the second scenario, the EPs
contained only the angry, happy, neutral, and sad data. In both conditions, the EPs were
tested on the angry, happy, neutral, sad, and frustrated data. The goal was to assess
the ability of the EP to uniquely represent unseen test data. The hypothesis was that
frustration test utterances would be represented in the EPs sufficiently differently from
the other affective classes. This result was anticipated because frustration has a
high degree of overlap with the classes of anger, happiness, and sadness. Consequently,
EPs trained on the set of angry, happy, neutral, and sad emotions should be able to
represent frustration. This result would suggest that EPs used for n-way classification
need not contain n components.
6.5 Results
This section demonstrates the efficacy of EP based representation for the emotional classes
of angry, happy, neutral, sad, and frustrated. The classification performance is analyzed
121
across three data conditions: prototypical only, combined prototypical and nonprototyp-
ical, and nonprototypical. The classication results on a baseline set of angry, happy,
neutral, and sad data are provided as a reference. In the previous chapter and previously
published work [78] the classication was entirely speaker-independent. Consequently, the
results presented in this study cannot be compared directly to any of the published work
due to the nal user-dependent classication step. However, in [78] the authors obtained
a speaker-independent unweighted accuracy of 62.42% (accuracy across the four emotion
categories) on combined prototypical and nonprototypical data across the classes of an-
gry, happy, neutral, and sad using a fused GMM-HMM approach. The authors used a
prole-based technique to fuse the facial (motion-capture) and vocal modalities. The cur-
rent unweighted accuracy of 66.52% is not directly comparable, however, it demonstrates
that the EP based classication technique is eective for this database.
6.5.1 Classification with EP Frustration Training
This set of results demonstrates the classification performance when a five-dimensional EP representation is employed. The hypothesis is that training EPs with frustration will not provide significant benefit to the overall five-class classification accuracy when compared with the five-class classification of the data without first training the EPs on the frustration data.
In this scenario, both the EPs and Naïve Bayes classifier were trained with data from the set of angry, happy, neutral, sad, and frustrated utterances. The results demonstrate that over both the prototypical and combined datasets the classification performance for each of the emotions decreased when the train and test sets were augmented with frustration (Table 6.2, compare the left-most and middle result columns). These results were anticipated due to the high degree of overlap with the angry, sad, and neutral emotional classes. In [13] the authors demonstrated that within the human evaluations frustration overlaps with the classes of anger, sadness, and neutrality. In the human evaluations, utterances labeled as frustration were also labeled as anger, happiness, neutrality, and sadness 11%, 0%, 7% and 4% of the time, respectively. Utterances labeled as anger, happiness, neutrality, and sadness were also labeled as frustration 17%, 1%, 13%, and 8% of the time, respectively. Consequently, one would expect the classification performance of those three classes to decrease when frustration was added to the train and test sets. In the nonprototypical dataset there was also a decrease in performance in the happy classification. This may be due to the increasingly vague definition of the emotion of happiness.
6.5.2 Classification without EP Frustration Training
In the final scenario the EPs were trained only with angry, happy, neutral, and sad data, while the Naïve Bayes classifier had to classify emotions from all five categories. In this training scenario, the EPs had to distinctly represent an emotion not seen during training. The results are compared to the previous training scenario in which frustration was used to train the EPs. The hypothesis was that since frustration overlaps with the other classes already represented in the profile, the profile does not need a frustration component, as that information is redundant.
Prototypical                Four-class EP   Frustration Augmentation
                                            EP Train    No EP Train
F-measure   Angry           0.82            0.69        0.71
            Happy           0.90            0.86        0.85
            Neutral         0.59            0.51        0.53
            Sad             0.82            0.80        0.78
            Frustrated      --              0.58        0.56
Weighted Accuracy (%)       83.69           74.32       73.54
Unweighted Accuracy (%)     79.29           69.09       69.01

Combined                    Four-class EP   Frustration Augmentation
                                            EP Train    No EP Train
F-measure   Angry           0.73            0.54        0.56
            Happy           0.78            0.75        0.75
            Neutral         0.45            0.27        0.30
            Sad             0.67            0.61        0.61
            Frustrated      --              0.50        0.46
Weighted Accuracy (%)       68.43           58.55       58.20
Unweighted Accuracy (%)     66.52           54.19       54.30

Nonprototypical             Four-class EP   Frustration Augmentation
                                            EP Train    No EP Train
F-measure   Angry           0.66            0.37        0.40
            Happy           0.61            0.58        0.57
            Neutral         0.47            0.29        0.33
            Sad             0.54            0.49        0.48
            Frustrated      --              0.45        0.42
Weighted Accuracy (%)       56.53           44.72       44.44
Unweighted Accuracy (%)     57.89           44.42       43.83

Table 6.2: Classification results (F-measure) across the three datasets: prototypical, combined, and nonprototypical. "EP Train" indicates five-dimensional EPs, "No EP Train" indicates four-dimensional EPs.
The results demonstrate that there is no significant difference between including frustration in the training of the profiles and merely training on the profiles resulting from only the angry, happy, neutral, and sad training. The greatest performance disparity occurred in the prototypical dataset, where the weighted accuracy decreased by only 0.78% and the unweighted by 0.08%. In the combined and nonprototypical datasets, the weighted accuracy decreased by 0.35% and 0.28%, respectively (Table 6.2). These performance differences are not significant at α = 0.05. The small discrepancies in performance suggest that the EPs are a robust representation for emotion.

[Figure: four panels, (a) Angry, (b) Neutral, (c) Frustrated, and (d) Sad, each plotting the distance from the hyperplane for the angry, happy, neutral, and sad EP components.]
Figure 6.3: The average EPs for the prototypical and nonprototypical utterances when the EPs were trained without frustration data. The error bars represent the standard deviation. The happy EP is not included in this plot; the trends follow those of the angry and sad EPs.
6.5.3 EP Representation of Frustration
Previous work has demonstrated that the utterances labeled as frustrated in this database
are confused both by human evaluators [13] and by machine learning algorithms [85]
(audio-only analysis). The graphs of Figures 6.3 and 6.4 further support the inherent difficulty in characterizing this ambiguous emotion. Figure 6.4 demonstrates that on average the emotion of "frustration" is represented as not present for emotions labeled as frustration. However, in both training conditions, frustration is recognized well above the chance level, which is 19.88% for prototypical data and 27.48% for nonprototypical data (Table 6.2). This indicates that the feature variations characteristic of frustration are captured by both methods. This supports the assignment of frustration to a secondary, rather than a basic, emotion since it can be similarly described using a combination of basic emotions. This further supports the idea that an emotional utterance should be characterized not only by what is present, but also by what is confidently identified as absent.

[Figure: four panels, (a) Angry, (b) Neutral, (c) Frustrated, and (d) Sad, each plotting the distance from the hyperplane for the angry, happy, neutral, sad, and frustrated EP components.]
Figure 6.4: The average EPs for the prototypical and nonprototypical utterances when the EPs were trained with frustration data. The error bars represent the standard deviation. The sad EP is not included in this plot; the trends follow those of the angry and happy EPs.
It should be noted that frustration, even when not modeled during the construction of the EP, can be more accurately characterized than neutral utterances, which have been historically difficult to characterize in this database [78,84,85].

EP Type   Angry   Happy   Neutral   Sad
4-Dim     ANS     AHNS    AS        AHNS
5-Dim     ANSF    AHNSF   ANS       ANSF
Table 6.3: ANOVA analysis of the component-by-component comparison between the frustrated and other emotional EPs. The emotion components are labeled by the first letter of their class (e.g., angry EP component = 'A'). All dimensions listed in this table are statistically different with p < 0.001.
The average EPs of Figures 6.3 and 6.4 suggest that there is not a large difference between the characterization of neutral and frustrated data. Such a finding would imply that frustration, like neutrality, is not so much captured as defaulted to a generic "none of the above" representation. However, statistical analyses support the differentiation of these two emotion classes, in line with their semantic interpretations. In the four-dimensional EPs the frustration EPs were differentiated from the neutrality EPs along the anger and sadness dimensions with p < 0.001 (ANOVA, Table 6.3), where anger was more strongly represented and sadness was less strongly represented in the frustration EP than in the neutrality EP (p < 0.001, one-tailed t-test, difference of means). This suggests that frustrated utterances can be differentiated from neutral utterances based on the presence of angry components, although these components are less strongly defined when compared to the angry utterances (p < 0.001, one-tailed t-test, difference of means). It is also interesting to note that the comparison of the sad components in the frustration and anger EPs suggests that sadness is represented more strongly in frustrated utterances than in angry utterances (p < 0.001, one-tailed t-test, difference of means).
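For concreteness, the component-wise comparisons described above could be run roughly as follows; the array names and placeholder data are assumptions, and SciPy 1.6 or later is assumed for the one-tailed alternative keyword.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
frustration_eps = rng.normal(size=(200, 4))   # placeholder (n_utterances x 4) EP values
neutral_eps = rng.normal(size=(180, 4))       # columns: [angry, happy, neutral, sad]

for i, name in enumerate(["angry", "happy", "neutral", "sad"]):
    # ANOVA asks whether the component differs between the two emotion classes.
    _, p_anova = stats.f_oneway(frustration_eps[:, i], neutral_eps[:, i])
    # A one-tailed two-sample t-test asks whether frustration carries a larger mean value.
    _, p_one_tailed = stats.ttest_ind(frustration_eps[:, i], neutral_eps[:, i],
                                      equal_var=False, alternative="greater")
    print(f"{name}: ANOVA p = {p_anova:.3g}, one-tailed t-test p = {p_one_tailed:.3g}")
```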
6.6 Conclusions
This chapter demonstrated the efficacy of EP based classification for out-of-domain audio-visual emotional data. In all three data types there was no significant difference between the classification accuracies (weighted or unweighted) of the EPs trained on frustrated data and those trained only on angry, happy, neutral, and sad data. The decrease in the emotion-specific F-measures between the EPs trained and not trained on frustration was less than or equal to 0.04 in all cases, and in some cases the F-measure increased (prototypical anger and neutrality, combined neutrality, nonprototypical anger and neutrality). It should be noted that all emotions were recognized above the chance level. This indicates that EPs whose components span the target emotional space are sufficiently flexible to represent unseen emotions and offer robust representations for emotional communication.

The representative power of an EP is dependent on the employed emotional basis. The EPs in this study were able to represent frustration because frustration could be described as a combination of the emotion classes included in the EPs. The ability of the EPs to distinctly represent emotions that do not overlap with the EP components has not yet been assessed. Future work will include the investigation of techniques to derive additional component representations for EPs. Future work will also include analyses of the ability of the EPs to represent additional, more highly ambiguous emotion classes.

The F-measures for the classes of neutral and frustration were comparatively low. This may be a result of ambiguous class definitions, the ambiguous expression of neutral and frustrated speech prevalent in human interactions, or perhaps a suboptimal feature set. Future work includes the investigation of techniques to improve these accuracies.
However, the success of such future work is not guaranteed. The lower performance of frustration classification can be explained in part by the high degree of overlap in human evaluations between the classes of frustration and anger, neutrality, and sadness. Such a large degree of overlap suggests that there is a lower upper bound for frustration classification.

This work demonstrates a method for quantifying out-of-domain emotional data. Such representations are necessary as human-machine interactive technology continues to develop and speaker personalization becomes increasingly important. As human-interactive technologies become more prevalent, interfaces must be able to interpret truly ambiguous information: utterances without human-labeled ground truths. Future work includes extending this representation to the domain of these truly ambiguous emotional utterances.
Chapter 7
Cluster Profiling
The previous two chapters introduced and described Emotion Profiles (EP), a method for representing the affective content of human utterances. This representation is important for interactive affective technologies, which require detailed models of human emotion for accurate user state determination. These models are commonly trained using supervised learning algorithms. However, such algorithms typically require labeled training corpora, the collection of which is often expensive and time-intensive. This chapter presents a system-level heuristic semi-supervised approach to user-specific emotion classification using a novel Cluster-Profile (CP) representation of emotion.
The work presented in this chapter was published in the following articles:
1. Emily Mower, Maja J. Matarić and Shrikanth S. Narayanan, "A Framework for Automatic Human Emotion Classification Using Emotional Profiles." IEEE Transactions on Audio, Speech and Language Processing. Accepted for publication, August 2010.
2. Emily Mower, Kyu Jeong Han, Sungbok Lee and Shrikanth S. Narayanan, "A Cluster-Profile Representation of Emotion Using Agglomerative Hierarchical Clustering." In Proceedings of International Speech Communication Association (InterSpeech), Makuhari, Japan, September 2010.
3. Emily Mower, Angeliki Metallinou, Chi-Chun Lee, Abe Kazemzadeh, Carlos Busso, Sungbok Lee, Shrikanth Narayanan, "Interpreting Ambiguous Emotional Expressions." In Proceedings of ACII Special Session: Recognition of Non-Prototypical Emotion from Speech - The Final Frontier?, Amsterdam, The Netherlands, September 2009.
In user-adapted emotion classification systems, two types of data are necessary: a large amount of emotional data from multiple speakers and a smaller amount of data from the target speaker. The labels from the target speaker are directly relevant to the classification task while those from the disjoint speakers are needed only for training. An approach requiring only the labels of the target speaker's utterances would drastically reduce the time needed for database preparation.
In the previous chapters the efficacy of an Emotion-Profile (EP) based representation for classification was demonstrated. EPs were described as a quantitative representation of the affective content of an utterance in terms of the presence or absence of a set of component emotions. The components of the profile were the semantic, or categorical, labels: angry, happy, neutral, and sad. However, it is not clear that the profiles must be constructed using these types of semantic components.
In this chapter we investigate a system-level heuristic semi-supervised approach for emotion classification. The classification system is broken down into four steps: speaker-independent feature selection, speaker-independent clustering, speaker-independent profile generation, and speaker-dependent classification. The feature selection method is the method utilized in the previous chapter, unsupervised Principal Feature Analysis (PFA), an extension of Principal Component Analysis, also used in [77,78]. The data are clustered using unsupervised agglomerative hierarchical clustering of the emotional space. These clusters are used to train cluster-specific Support Vector Machines (SVM) whose outputs are the components of the CPs. Finally, the emotion content of the utterance is assessed by classifying over the generated CPs. The system is a heuristic semi-supervised approach because the feature selection, clustering, and profile generation are unsupervised while the final classification step is supervised. The unsupervised portion establishes a data-dependent representation for the affective test data using the majority of the training data. The final supervised classification utilizes the generated CPs for Naïve Bayes classification.
The CP classification method outperforms the EP classification by 0.88% absolute (69.25% vs. 68.37%). This result demonstrates the efficacy of the CP based classification system. The CPs represent emotional utterances in n components, where n is the number of clusters. This comparable performance of the CP and EP representations suggests that, given training sets with expressions from a non-disjoint set of emotion classes, it may be necessary to label only a subset of the large training data. These results cannot be compared directly to any published work due to the final speaker-dependent classification step. However, this performs comparably to the fused GMM-HMM method presented in [78] (62.42%). The novelty of the current work lies in its new definition of a profile and an assessment of the necessity of the semantic profile dimensions utilized in the EPs.
7.1 Description of Data: The USC IEMOCAP Database
The discriminative power of the CP representation is evaluated using the USC IEMOCAP
database, collected at the University of Southern California (USC) [13] and described in
Chapter 4, Section 4.1.1.
7.2 Emotion and Cluster Profiles
As described in the previous two chapters, profile-based representations describe affective utterances over a set of affective components rather than in terms of a single mathematical (e.g., a valence of '3') or semantic label (e.g., 'angry'). This added flexibility is beneficial when the emotional character of the speech is subtle. In the previous two chapters the EPs were implemented as either four- or five-dimensional representations of emotion. The dimensions expressed the degree of presence or absence of each of the emotions: angry, happy, neutral, sad, and optionally frustrated. This subset was chosen to minimize affective overlap in our experimental dataset. In this chapter, we explore profile generation using an unsupervised component-generation approach (Figure 7.1).
7.2.1 Description of the Train and Test Sets
The dataset consisted of 4,806 utterances across the ten emotional labels and ten speakers. The profile generation ("training") was speaker-independent while the final classification ("testing") was speaker-dependent (Figure 7.1). For each speaker, the training data (for unsupervised clustering and profile generation) consisted of all of the utterances not spoken by the speaker. These data contained unlabeled emotions from all ten emotion categories. The testing data consisted only of utterances spoken by the speaker from the set: angry, happy, neutral, and sad.
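A minimal sketch, under assumed array names, of this split for one target speaker: the unlabeled training pool contains every utterance from the other nine speakers, and the labeled test set contains the target speaker's angry, happy, neutral, and sad utterances.

```python
import numpy as np

def split_for_speaker(features, labels, speaker_ids, target,
                      test_classes=("angry", "happy", "neutral", "sad")):
    """features: (n, d); labels and speaker_ids: length-n arrays."""
    train_mask = speaker_ids != target                              # unlabeled training pool
    test_mask = (speaker_ids == target) & np.isin(labels, test_classes)
    return features[train_mask], features[test_mask], labels[test_mask]
```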
[Figure: system diagram. Unlabeled training data are clustered with AHC; one binary SVM is trained per cluster (C1 vs. not, ..., C6 vs. not in this example); the signed, distance-weighted SVM outputs form the generated CP for an input test utterance; and a Naïve Bayes classifier, trained on the labeled CPs of the target speaker's data, produces the final emotion label.]
Figure 7.1: The CP based classification system diagram. This example demonstrates the correct classification of a nonprototypical angry utterance (a mixture of anger and sadness).
7.2.2 Unsupervised Clustering for CPs
The feature space was clustered using unsupervised agglomerative hierarchical clustering (AHC) over the unlabeled training data. This hierarchical clustering strategy circumvents the initialization issues common to other clustering approaches (e.g., k-means or GMM-EM [37,117]). AHC is a bottom-up process, which is more computationally efficient than top-down (divisive) clustering. AHC has been shown to be effective across many clustering tasks and is particularly popular in the field of speaker clustering and diarization [110].
Initially, AHC considers each data point a cluster. Then, at every iteration, it selects the closest pair of clusters to merge. This merging procedure continues until a pre-set stopping criterion is satisfied. The generalized likelihood ratio (GLR) [46] is used to measure inter-cluster distance at every stage of AHC. The stopping criterion is a manually pre-set number of clusters, n. This chapter explores the utility of considering different numbers of clusters in the CP construction.
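The clustering step could be prototyped along these lines; note that Ward linkage stands in for the GLR distance purely for illustration, and the data and cluster count are placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
train_features = rng.normal(size=(500, 20))   # placeholder speaker-independent features

n_clusters = 15                               # pre-set stopping criterion
merge_tree = linkage(train_features, method="ward")           # bottom-up merging
cluster_ids = fcluster(merge_tree, t=n_clusters, criterion="maxclust")
print("cluster sizes:", np.bincount(cluster_ids)[1:])
```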
7.2.3 Construction of a Profile
EPs and CPs were both constructed using the output from Support Vector Machines (SVM). As described in Chapter 5, Section 5.3, SVM is a maximum-margin classifier that projects input data into a higher dimensional space to find an optimal separating hyperplane between two classes. The distance from one point in the projected space to the hyperplane can be interpreted as the confidence of the classifier's assessment. Points closer to the hyperplane are representative of data that are more easily confused in the projected space. These points represent utterances that cannot be as confidently labeled as utterances further from the decision hyperplane.
In the CP approach, n speaker-independent binary self vs. other SVMs were trained, one for each of the clusters generated using AHC. Each cluster-specific SVM returned a membership value (±1) and a distance from the hyperplane. The profiles were created by weighting the membership by the raw distance from the hyperplane. A sigmoid function is often used to convert the range of SVM hyperplane distances to the range 0 to 1. However, the raw distances were retained because pilot studies demonstrated the efficacy of utilizing the raw, rather than the sigmoid-transformed, distances in the profile-based representations (see Chapter 5, Section 5.3 and Figure 5.4). The final profile was an n-dimensional representation of the n classifier confidences. The performance of the cluster profile representation is compared to that of the emotion-profile representation (Chapter 5). The EPs were constructed in the same manner as seen in Chapter 5.
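A minimal sketch of the profile construction, with assumed variable names rather than the author's code: one self-vs.-other SVM per cluster, and the signed raw hyperplane distance of each SVM forms one profile component.

```python
import numpy as np
from sklearn.svm import SVC

def train_profile_svms(train_features, cluster_ids):
    """One binary self-vs.-other linear SVM per cluster."""
    svms = []
    for c in np.unique(cluster_ids):
        in_cluster = (cluster_ids == c).astype(int)
        svms.append(SVC(kernel="linear").fit(train_features, in_cluster))
    return svms

def build_profile(svms, utterance):
    """The signed hyperplane distances are the n profile components."""
    x = np.atleast_2d(utterance)
    # decision_function carries both the membership sign and the raw distance.
    return np.array([clf.decision_function(x)[0] for clf in svms])

rng = np.random.default_rng(1)
feats = rng.normal(size=(300, 20))            # placeholder training features
cids = rng.integers(1, 7, size=300)           # placeholder cluster labels (6 clusters)
profile = build_profile(train_profile_svms(feats, cids), feats[0])
print(profile)                                # a 6-dimensional CP for one utterance
```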
The final step is performing classification over the generated profiles (both CP and EP). This n-dimensional classification is performed using Naïve Bayes. Gaussian Mixture Models, KNN, and Discriminant Analysis were also explored, but were not as effective. Only Naïve Bayes results will be reported.
7.3 Feature Extraction
The EPs and CPs were trained using utterance-level features extracted from the audio
and motion-capture modalities. The statistics used in this study include: mean, vari-
ance, upper quartile, lower quartile, and quartile range. All features were normalized
using speaker-dependent z-normalization. Utterances were rejected if any of the audio or
motion-capture features were undefined. The features utilized in the CP analysis are the
same as those utilized in the previous chapter.
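The speaker-dependent z-normalization could look roughly like the sketch below, with assumed array names: each feature is standardized with the mean and standard deviation computed from that speaker's own utterances.

```python
import numpy as np

def speaker_z_normalize(features, speaker_ids):
    """features: (n_utterances, n_features); speaker_ids: length-n array of speaker labels."""
    normalized = np.empty_like(features, dtype=float)
    for speaker in np.unique(speaker_ids):
        rows = speaker_ids == speaker
        mu = features[rows].mean(axis=0)
        sigma = features[rows].std(axis=0)
        # Guard against zero-variance features to avoid division by zero.
        normalized[rows] = (features[rows] - mu) / np.where(sigma == 0, 1.0, sigma)
    return normalized
```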
The set of audio features included: intensity, pitch, and the first 13 Mel Filterbank Coefficients (MFB). Intensity and pitch have been used successfully in emotion classification studies [101,103,104,115]. As discussed in previous chapters, MFBs are also effective for emotion classification.
The motion-capture features utilized in this work were derived from Facial Animation
Parameters (FAP) [112]. These features are part of the MPEG-4 standard and represent
distances between points on the face. The FAPs were adapted to the motion-capture
configuration used in the USC IEMOCAP data recording. The facial features were bro-
ken into groups by facial region. These regions included: mouth, cheeks, forehead, and
eyebrows. These features were also used in Chapters 5 and 6.
7.4 Feature Selection
The initial feature set contained 685 features. The feature set size was reduced using the unsupervised method of Principal Feature Analysis (PFA) [69] (Chapter 6, Section 6.3.1). The feature sets were identified in a speaker-independent fashion. For example, the features for Speaker 1 were selected using the data from Speakers 2-10. The final feature set contained 20 features.
7.5 Experimental Methods
The goal of this analysis was to determine if an unsupervised data clustering algorithm
can find relevant clusters within the data for use in the profile-based classification. A
successful result would indicate that exhaustive labeling of the training space is not
necessary. Instead, the data-dependent clusters inherent in the space can be used as
components of the profile for a final supervised training on a much smaller proportion of
the data.
We hypothesized that the data-driven CP representation would be more accurately
classified than the EP representation. The CP can represent more emotion-specific fluctuations than the EP because it can contain a larger number of components. This should allow the CP to capture more of the inherent variation in the affective data. A negative result would suggest that the semantic emotional labels are more effective at clustering the affective space than the employed data-driven technique.
The EP based classification is presented as a baseline performance metric. We hypothesized that the EP representation would be a more compact representation of the affective components of human speech. Semantic emotion labels describe clusters of the data that are objectively recognized by large numbers of people. Consequently, it was expected that the clusters generated using these affective labels would be highly representative of the feature-level properties of the emotional utterances.
The EPs were trained in a speaker-independent fashion (e.g., EPs for Speaker 1 are trained using the data from Speakers 2-10) over the semantic labels of angry, happy, neutral, and sad. In CP based classification the speaker-independent training data were first clustered into n clusters using the aforementioned clustering approach. The CPs were then constructed using the output from the n SVMs trained on each cluster's data (one SVM for each of the n clusters). In both profile-based methods, the final emotion assessment was made using Naïve Bayes over the generated profiles. The performance of the Naïve Bayes classifier was assessed using leave-one-out cross-validation (see system diagram, Figure 7.1).
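The final supervised step might be prototyped as below (assumed placeholder data, not the reported experiment): Gaussian Naïve Bayes over the generated profiles, scored with leave-one-out cross-validation on the target speaker's labeled utterances.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
speaker_profiles = rng.normal(size=(120, 15))     # placeholder CPs (15 clusters)
speaker_labels = rng.integers(0, 4, size=120)     # 0=angry, 1=happy, 2=neutral, 3=sad

scores = cross_val_score(GaussianNB(), speaker_profiles, speaker_labels, cv=LeaveOneOut())
print(f"leave-one-out accuracy: {scores.mean():.3f}")
```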
7.6 Results
7.6.1 EP Classification
The EP based classification serves as a comparative baseline for the CP based classification results. In the EP based classification, the accuracy was 68.37%. The emotion-specific results can be seen in Table 7.1. The classes of anger, happiness, and sadness
were well recognized (F-measure > 0.69). The class of neutrality was relatively poorly
recognized. This trend is common in this database, where neutrality remains an emotion
class that is not well understood [78,84].
Emotion              EP     Number of Clusters
                            3     5     7     9     11    13    15    17    19
F-measure  Angry     0.73   0.51  0.60  0.64  0.64  0.68  0.68  0.69  0.68  0.69
           Happy     0.77   0.74  0.76  0.76  0.76  0.77  0.77  0.77  0.77  0.77
           Neutral   0.45   0.34  0.40  0.49  0.49  0.53  0.54  0.54  0.53  0.53
           Sad       0.69   0.52  0.60  0.67  0.67  0.70  0.70  0.70  0.70  0.70
Weighted Accuracy    0.68   0.57  0.62  0.66  0.66  0.69  0.69  0.69  0.69  0.69
Table 7.1: The CP based classification results. The entries in bold font indicate the best accuracy or F-measure recorded.
7.6.2 CP Classification
In this task, the maximal accuracy occurred with 15 clusters. The maximal accuracy
was 69.25% (Table 7.1). The emotions of anger, happiness, and sadness were again well
recognized (F-measure > 0.69). It should be noted that in the CP representation, the
F-measure for the class of neutrality increased to 0.54. This represents a 9% absolute
and 20.00% relative improvement. This result suggests that CP based representations
are more effective for capturing inherently ambiguous classes of emotion than EP based representations.
It should be further noted that the CP based classification outperformed the EP based classification by 0.88% absolute (1.29% relative). This result is not statistically significant at α = 0.05, indicating that the CP and EP representations are both effective for emotion classification. This equivalence suggests that it is not necessary to exhaustively label a large dataset for user-adapted emotion classification tasks.
7.7 Discussion
The studies in this chapter were motivated by two hypotheses: 1) the CP based technique would outperform the EP based technique and 2) the EP representation would offer a more compact representation of affective content. The results demonstrate that the CP based classification outperforms EP based classification by 0.88% absolute (1.29% relative) with 15 clusters. This suggests that the CP based representation can adequately represent emotion in an unsupervised manner. However, the assertion that the CP representation can more accurately represent the emotion content of utterances cannot be supported at this time.
The second hypothesis is supported. CP based classification required at least 11 clusters to match the accuracy obtained by EP based classification. The F-measures obtained in the EP based classification for angry, happy, neutral, and sad were never matched by the CP based classification for anger, and required 11, 13, and 11 clusters, respectively, for the classes of happiness, neutrality, and sadness. This suggests that the EP based representation is more compact than this implementation of the CP based representation. This further suggests that the clusters generated from the semantic labels of angry, happy, neutral, and sad are effective for capturing the affective feature properties of the utterances, supporting the use of the components of anger, happiness, neutrality, and sadness in the EP based representation.
7.8 Conclusions
This chapter presented a novel system-level heuristic semi-supervised, profile-based technique to classify the emotion content of utterances. The CP based classification outperformed the EP based classification by a statistically non-significant 0.88% with 15 clusters. This suggests that both data-driven and knowledge-driven clusters are effective for profile generation. The CP based representation alleviates the need for exhaustive labeling of the training corpus, requiring instead labels for only a small subset of the data.
Although, as stated earlier, the results presented in this chapter cannot be directly compared to previously published methods, both the EP and CP based classification systems produce accuracies similar to those seen in the literature (62.42%) [78]. This demonstrates that both profile-based representations are effective for emotion classification tasks.
The results are presented on the USC IEMOCAP database. Future research will investigate the relative robustness of the EP and CP methods across multiple databases. The lower complexity of the EP representation suggests that the emotional clusters (angry, happy, neutral, and sad) may be a more orthogonal "basis" representation in the IEMOCAP database. This may indicate that the EP components are a more perceptually salient representation than the CP components. However, the CP representation in this database provides better functional definitions for the components. Future work will investigate the relevance of the EP and CP representations with respect to human perception. Future work will also include the analysis of additional clustering methods to determine the effect of these techniques on the classification accuracy of the system.
This chapter presented a foray into heuristic semi-supervised learning for emotion classification. Semi-supervised emotion classification has the potential to make user personalization more tractable by incorporating unlabeled emotional data for deriving an affective representation. As affective interactive technologies continue to grow in popularity, these techniques will only become more important.
Chapter 8
Conclusions and Future Work
This dissertation presents the following novel contributions:
- A congruent-conflicting emotion stimuli presentation paradigm used to study the ways in which individuals process and interpret emotional cues and trends in user evaluation styles
- An evaluator analysis framework for understanding evaluator performance
- An Emotion Profile representation method for quantifying and identifying the emotion present in naturalistic human utterances
8.1 Emotion Perception
This thesis explored the link between reported perception and feature modulation using a database composed of congruent and conflicting emotional cues. We introduced a novel stimuli presentation framework in which the stimuli were composed of synthetic video and human audio cues. We utilized the McGurk experimental paradigm and created two sets of stimuli: congruent (emotionally matched audio and video information) and conflicting (emotionally mismatched audio and video). We found that the perceptual judgements of evaluators were biased by the information in the audio channel. We performed a statistical analysis of the reported perception and found that in emotionally congruent utterances, utterances in which the emotion expressed in the audio and video channels match, evaluators integrated the information expressed within the channels to arrive at an emotional description dependent on both the audio and video affective expressions. This highlights the importance of the proper design of both the audio and video components of emotional expression even given a simplified (synthetic) video channel. We also found that given emotionally conflicting utterances, utterances in which the emotions expressed in the audio and video channels do not match, evaluators tended to rely more heavily on the more expressive channel than the less expressive channel. However, this result varies across the emotion dimensions. Our results suggest that evaluators tended to rely on audio (the more expressive modality and the modality correlated with activation) for activation information and both the audio and video channels for valence information. This finding further stresses the importance of proper video design, even when the video is much less expressive than the audio, for recognizable emotion expression.
8.2 Evaluator Modeling
The thesis also presented methods to evaluate evaluators based on classification metrics. Our results indicate that it is often more accurate to model an averaged evaluator than an individual evaluator. These studies also demonstrated that there is no significant decrease in accuracy when an individual evaluator is classified utilizing models trained on an averaged evaluator's data. Furthermore, the pervasive difficulty in classification suggests that both categorical and dimensional descriptions of emotion do not fully describe the emotional landscape. These findings highlighted the importance of developing an understanding of the relationship between feature modulation, evaluator state, and the resulting reported emotion perception.
8.3 Emotion Profiles
The results from the perceptual and evaluator-modeling experiments suggested that traditional characterizations of emotion (i.e., dimensional or categorical representations) are not sufficient to describe the emotion content of affective utterances. This insight led to our development of Emotion Profiles (EP), a new method for quantifying emotions. EPs are a multidimensional representation of emotion that incorporate aspects of the categorical and dimensional representations to arrive at a rich and interpretable method for expressing the affective content of utterances.
In human emotion expression, naturally occurring utterances are often complex expressions of emotion. These complex emotional utterances can appear as combinations or blendings of several emotions, combinations which cannot be well captured by a single label. Furthermore, these complex utterances may not be well described by dimensional labels because separate subtle emotion classes may overlap in the dimensional space, leading to interpretability problems.
The complexity inherent in emotional expressions inspired our development of EP based quantification techniques. These techniques are an integration of the categorical and dimensional descriptions of emotion. They describe the affective content of natural utterances in terms of the degree of presence or absence of a set of emotions, leading to a richer emotional description. The results presented in this thesis demonstrate that EPs are an effective measure for quantifying reported human emotion perception. EPs can be used as an intermediary step during classification or as a method to characterize highly ambiguous emotional utterances. The EP method is able to not only accurately classify emotions with affective ground truths, but is also able to interpret the affective content of emotionally ambiguous utterances, allowing for their inclusion in natural human-machine interactions.
This thesis also presented detailed analyses of the EP representations through an investigation of the robustness of the representation and through a study analyzing the efficacy of using data-driven, rather than semantic emotional, components of the EP. The results suggest that EPs can robustly represent unseen secondary emotions that can be described as a combination of the profile components. We demonstrate that frustration, described as a combination of sadness and anger, can be represented in a similarly discriminative fashion using four-dimensional EPs (angry, happy, neutral, sad) or five-dimensional EPs (angry, happy, neutral, sad, frustrated). This result suggests that EPs can be used to represent emotions that are combinations of the profile components.
However, EP components are not required to be semantic emotional labels. Although semantic components allow the profile to represent and quantify emotion in an interpretable way, the generation of these components requires a large amount of labeled training data in order to accurately model the confidence of the component assertions (i.e., the degree of presence or absence of a given emotion class). We demonstrate that profiles can be created using data-driven components. These components represent emotions in terms of the presence vs. absence, not of specific emotion classes, but of clusters within the feature space. By modeling emotion as a collection of these cluster-level confidences it is possible to characterize emotion without requiring the input training data to have emotional labels. Like EPs, these profiles, called Cluster Profiles (CP), can be used as mid-level representations in a classification system. The results suggest that EPs and CPs function as similarly effective mid-level representations. This suggests that both semantic emotional clusters and data-driven clusters can be used to characterize the affective makeup of the utterance.
8.4 Open Problems and Future Work
This thesis demonstrated that conventional methods for quantifying emotional content do not fully describe the emotional space. This finding motivated the development of the EP based frameworks. This technique was of particular use for characterizing highly ambiguous emotional utterances. These utterances are rarely handled in conventional emotion recognition frameworks. By definition, these utterances either have an unclear ground truth or no ground truth at all and are not often considered in the testing phase. However, in natural human communication, many such utterances are of this form.
In "Basic Emotions," Ekman states that "Emotions obviously do occur without any evident signal, because we can, to a very large extent, inhibit the appearance of a signal. Also, a threshold may need to be crossed to bring about an expressive signal, and that threshold may vary across individuals [39]." This suggests that an emotion recognition
system, built to analyze natural human speech, must be able to identify emotions ranging
from subtle presentations to clear displays. Future work will explore how to analyze,
quantify, and measure the subtleties inherent in natural human communication. One of
the open problems in emotion recognition is the definition of the class of neutrality. This
class is often difficult to model and recognize. Future work should identify what a label
of neutrality means and how neutral emotions are used in human communication. Such
models may allow us to better understand how to interpret and describe this complex
and perceptually amorphous emotion.
Complexities in emotion interpretation also result from inherent differences in individual expression and perception of emotion. This problem is complicated by the non-stationarity of emotion production and perception. Both of these processes are affected by the context in which an emotion exists. In our future work we will explore emotion in a situated framework by developing an understanding of an emotional grammar. Such a grammar would describe patterns affecting, or potentially constraining, the manner through which users transition between emotion classes and degrees of single emotion classes within an utterance and within a dialog. An understanding of an emotional grammar would allow researchers to develop dynamic modeling practices that incorporate temporal flow to arrive at more accurate emotion classification and user understanding.
Finally, we anticipate demonstrating the effectiveness of our studies in an application context. Previous work has demonstrated that children with an Autism Spectrum Disorder (ASD) diagnosis have a difficult time identifying the emotional content of expressions [21,47,57,70]. We are developing an interactive computer avatar designed to assist children with ASD in recognizing socially relevant emotion states. We believe that the inclusion of an avatar will allow the children to learn to recognize and, in the future, utilize these emotion states.
References
[1] R. Abelson and V. Sermat, "Multidimensional scaling of facial expressions," Journal of Experimental Psychology, vol. 63, no. 6, pp. 546-554, 1962.
[2] A. Arya, L. N. Jefferies, J. T. Enns, and S. DiPaola, "Facial actions as visual cues for personality," Computer Animation and Virtual Worlds, vol. 17, no. 3-4, pp. 371-382, 2006.
[3] T. Bänziger and K. Scherer, "Using actor portrayals to systematically study multimodal emotion expression: The GEMEP corpus," Lecture Notes in Computer Science, vol. 4738, p. 476, 2007.
[4] L. Barrett, "Are emotions natural kinds?" Perspectives on Psychological Science, vol. 1, no. 1, pp. 28-58, 2006.
[5] M. S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan, "Recognizing facial expression: machine learning and application to spontaneous behavior," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, San Diego, CA, USA, June 2005, pp. 568-573.
[6] M. Bartlett, G. Littlewort, C. Lainscsek, I. Fasel, and J. Movellan, "Machine learning methods for fully automatic recognition of facial expressions and facial actions," in IEEE International Conference on Systems, Man and Cybernetics, vol. 1, Hawaii, USA, October 2004, pp. 592-597.
[7] J. Bates, "The role of emotion in believable agents," Communications of the ACM, vol. 37, no. 7, pp. 122-125, 1994.
[8] S. Biersack and V. Kempe, "Tracing vocal emotion expression through the speech chain: do listeners perceive what speakers feel," in ISCA Workshop on Plasticity in Speech Perception, London, UK, June 2005, pp. 211-214.
[9] M. M. Bradley and P. J. Lang, "Measuring emotion: The self-assessment manikin and the semantic differential," Journal of Behavior Therapy and Experimental Psychiatry, vol. 25, no. 1, pp. 49-59, 1994.
[10] S. Buisine, S. Abrilian, R. Niewiadomski, J. Martin, L. Devillers, and C. Pelachaud, "Perception of blended emotions: From video corpus to expressive agent," Lecture Notes in Computer Science, vol. 4133, p. 93, 2006.
[11] M. Bulut, C. Busso, S. Yildirim, A. Kazemzadeh, C. Lee, S. Lee, and S. Narayanan, "Investigating the role of phoneme-level modifications in emotional speech resynthesis," in Interspeech, Lisbon, Portugal, September 4-8, 2005, pp. 801-804.
[12] M. Bulut, S. Lee, and S. Narayanan, "Recognition for synthesis: automatic parameter selection for resynthesis of emotional speech from neutral speech," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, April 2008, pp. 4629-4632.
[13] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, November 2008.
[14] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," in International Conference on Multimodal Interfaces, State Park, PA, October 2004, pp. 205-211.
[15] C. Busso, S. Lee, and S. Narayanan, "Using neutral speech models for emotional speech analysis," in InterSpeech, Antwerp, Belgium, August 2007, pp. 2225-2228.
[16] C. Busso and S. Narayanan, "Interrelation between speech and facial gestures in emotional utterances: a single subject study," IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 8, pp. 2331-2347, November 2007.
[17] ---, "Joint analysis of the emotional fingerprint in the face and speech: A single subject study," in IEEE International Workshop on Multimedia Signal Processing (MMSP), Chania, Greece, October 2007, pp. 43-47.
[18] ---, "Recording audio-visual emotional databases from actors: a closer look," in Second International Workshop on Emotion: Corpora for Research on Emotion and Affect, International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco, May 2008, pp. 17-22.
[19] C. Busso, S. Lee, and S. Narayanan, "Using neutral speech models for emotional speech analysis," in InterSpeech, Antwerp, Belgium, August 2007, pp. 2225-2228.
[20] C. Busso and S. S. Narayanan, "The expression and perception of emotions: Comparing assessments of self versus others," in InterSpeech, Brisbane, Australia, September 2008, pp. 257-260.
[21] G. Celani, M. Battacchi, and L. Arcidiacono, "The understanding of the emotional meaning of facial expressions in people with autism," Journal of Autism and Developmental Disorders, vol. 29, no. 1, pp. 57-66, 1999.
[22] C. Cottrell and S. Neuberg, "Different emotional reactions to different groups: A sociofunctional threat-based approach to prejudice," Journal of Personality and Social Psychology, vol. 88, no. 5, pp. 770-789, 2005.
[23] R. Cowie and R. Cornelius, "Describing the emotional states that are expressed in speech," Speech Communication, vol. 40, no. 1-2, pp. 5-32, 2003.
[24] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, January 2001.
[25] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[26] K. Dautenhahn, C. Numaoka, and AAAI, Socially Intelligent Agents. Springer, 2002.
[27] J. Davitz, The Language of Emotion. Academic Press, 1969.
[28] B. de Gelder, "The perception of emotions by ear and by eye," Cognition & Emotion, vol. 14, no. 3, pp. 289-311, 2000.
[29] B. de Gelder, K. Böcker, J. Tuomainen, M. Hensen, and J. Vroomen, "The combined perception of emotion from voice and face: early interaction revealed by human electric brain responses," Neuroscience Letters, vol. 260, no. 2, pp. 133-136, 1999.
[30] L. De Silva, T. Miyasato, and R. Nakatsu, "Facial emotion recognition using multimodal information," in International Conference on Information, Communications and Signal Processing (ICICS), vol. I, Singapore, 1997, pp. 397-401.
[31] B. DeGelder and P. Bertelson, "Multisensory integration, perception, and ecological validity," Trends in Cognitive Sciences, vol. 7, no. 10, pp. 460-467, October 2003.
[32] F. Dellaert, T. Polzin, and A. Waibel, "Recognizing emotion in speech," in International Conference on Spoken Language Processing (ICSLP), vol. 3, October 1996, pp. 1970-1973.
[33] L. Devillers, L. Vidrascu, and L. Lamel, "Challenges in real-life emotion annotation and machine learning based detection," Neural Networks, vol. 18, no. 4, pp. 407-422, May 2005.
[34] S. D'Mello, R. Picard, and A. Graesser, "Toward an affect-sensitive AutoTutor," IEEE Intelligent Systems, vol. 22, no. 4, pp. 53-61, July/August 2007.
[35] S. D'Mello, S. Craig, B. Gholson, S. Franklin, R. Picard, and A. Graesser, "Integrating affect sensors in an intelligent tutoring system," in Affective Interactions: The Computer in the Affective Loop Workshop at the International Conference on Intelligent User Interfaces, 2005, pp. 7-13.
[36] E. Douglas-Cowie, L. Devillers, J. Martin, R. Cowie, S. Savvidou, S. Abrilian, and C. Cox, "Multimodal databases of everyday emotion: Facing up to complexity," in Interspeech, Lisbon, Portugal, September 2005, pp. 813-816.
[37] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience, 2000.
[38] R. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. De Ridder, and D. Tax, "PRTools, a MATLAB toolbox for pattern recognition," Delft University of Technology, 2004.
[39] P. Ekman, "Basic emotions," Handbook of Cognition and Emotion, pp. 45-60, 1999.
[40] P. Ekman and W. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1978.
[41] R. el Kaliouby and M. Goodwin, "iSET: interactive social-emotional toolkit for autism spectrum disorder," in Proceedings of the 7th International Conference on Interaction Design and Children. ACM, 2008, pp. 77-80.
[42] T. Engen, N. Levy, and H. Schlosberg, "The dimensional analysis of a new series of facial expressions," Journal of Experimental Psychology, vol. 55, no. 5, pp. 454-458, 1958.
[43] F. Enos and J. Hirschberg, "A framework for eliciting emotional speech: Capitalizing on the actor's process," in First International Workshop on Emotion: Corpora for Research on Emotion and Affect (International Conference on Language Resources and Evaluation (LREC)), Genoa, Italy, May 2006, pp. 6-10.
[44] S. Fagel, "Emotional McGurk Effect," in International Conference on Speech Prosody, vol. 1, Dresden, 2006.
[45] N. Fragopanagos and J. Taylor, "Emotion recognition in human-computer interaction," Neural Networks, vol. 18, no. 4, pp. 389-405, 2005.
[46] H. Gish, M. Siu, and R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1991, pp. 873-876.
[47] O. Golan, S. Baron-Cohen, J. Hill, and M. Rutherford, "The Reading the Mind in the Voice test-revised: A study of complex emotion recognition in adults with and without autism spectrum conditions," Journal of Autism and Developmental Disorders, vol. 37, no. 6, pp. 1096-1106, 2007.
[48] F. Gosselin and P. G. Schyns, "Bubbles: a technique to reveal the use of information in recognition tasks," Vision Research, vol. 41, no. 17, pp. 2261-2271, 2001.
[49] J. Gratch, W. Mao, and S. Marsella, Modeling Social Emotions and Social Attributions. Cambridge University Press, 2006, pp. 219-251.
[50] J. Gratch and S. Marsella, "A domain-independent framework for modeling emotion," Cognitive Systems Research, vol. 5, no. 4, pp. 269-306, 2004.
[51] M. Grimm and K. Kroschel, "Evaluation of natural emotions using self assessment manikins," in IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Puerto Rico, November 2005, pp. 381-385.
[52] ---, "Rule-based emotion classification using acoustic features," in Conference on Telemedicine and Multimedia Communication, Kajetany, Poland, October 2005, p. 56.
[53] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, "Primitives-based evaluation and estimation of emotions in speech," Speech Communication, vol. 49, no. 10-11, pp. 787-800, 2007.
[54] A. Hanjalic, "Extracting moods from pictures and sounds: Towards truly personalized TV," IEEE Signal Processing Magazine, vol. 23, no. 2, pp. 90-100, March 2006.
[55] U. Hess, S. Sénécal, G. Kirouac, P. Herrera, P. Philippot, and R. Kleck, "Emotional expressivity in men and women: Stereotypes and self-perceptions," Cognition and Emotion, vol. 14, no. 5, pp. 609-642, 2000.
[56] J. Hietanen, J. Leppänen, M. Illi, and V. Surakka, "Evidence for the integration of audiovisual emotional information at the perceptual level of processing," European Journal of Cognitive Psychology, vol. 16, no. 6, pp. 769-790, 2004.
[57] R. Hobson, J. Ouston, and A. Lee, "Emotion recognition in autism: Coordinating faces and voices," Psychological Medicine, vol. 18, no. 4, pp. 911-923, 1988.
[58] C. Izard, Human Emotions. Springer, 1977.
[59] A. Jaimes and N. Sebe, "Multimodal human-computer interaction: A survey," Computer Vision and Image Understanding, vol. 108, no. 1-2, pp. 116-134, 2007.
[60] E. Konstantinidis, M. Hitoglou-Antoniadou, A. Luneski, P. Bamidis, and M. Nikolaidou, "Using affective avatars and rich multimedia content for education of children with autism," in International Conference on Pervasive Technologies Related to Assistive Environments (PETRA). Corfu, Greece: ACM, June 2009, pp. 1-6.
[61] O. Kwon, K. Chan, J. Hao, and T. Lee, "Emotion recognition by speech signals," in Conference on Speech Communication and Technology, 2003, pp. 32-35.
[62] R. Lazarus, J. Averill, and E. Opton Jr, "Towards a cognitive theory of emotion," in Feeling and Emotion: The Loyola Symposium, 1970, pp. 207-232.
[63] C.-C. Lee, C. Busso, S. Lee, and S. S. Narayanan, "Modeling mutual influence of interlocutor emotion states in dyadic spoken interactions," in InterSpeech, Brighton, UK, September 2009.
[64] C.-C. Lee, E. Mower, C. Busso, S. Lee, and S. Narayanan, "Emotion recognition using a hierarchical binary decision tree approach," in Interspeech, Brighton, UK, September 2009, pp. 320-323.
[65] C.-C. Lee, S. Lee, and S. S. Narayanan, "An analysis of multimodal cues of interruption in dyadic spoken interactions," in InterSpeech, Brisbane, Australia, September 2008, pp. 1678-1681.
[66] Y. Lin and G. Wei, "Speech emotion recognition based on HMM and SVM," in International Conference on Machine Learning and Cybernetics, vol. 8, August 2005, pp. 4898-4901.
[67] C. L. Lisetti and F. Nasoz, "Using noninvasive wearable computers to recognize human emotions from physiological signals," EURASIP Journal on Applied Signal Processing, pp. 1672-1687, September 2004.
[68] C. Liu, K. Conn, N. Sarkar, and W. Stone, "Affect recognition in robot assisted rehabilitation of children with autism spectrum disorder," in IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, April 2007, pp. 1755-1760.
[69] Y. Lu, I. Cohen, X. S. Zhou, and Q. Tian, "Feature selection using principal feature analysis," in International Conference on Multimedia. New York, NY, USA: ACM, 2007, pp. 301-304.
[70] H. Macdonald, M. Rutter, P. Howlin, P. Rios, A. Le Conteur, C. Evered, and S. Folstein, "Recognition and expression of emotional cues by autistic and normal adults," Journal of Child Psychology and Psychiatry, vol. 30, no. 6, pp. 865-877, 1989.
[71] M. Madsen, R. El Kaliouby, M. Goodwin, and R. Picard, "Technology for just-in-time in-situ learning of facial affect for persons diagnosed with an autism spectrum disorder," in Assets '08: International ACM SIGACCESS Conference on Computers and Accessibility, Halifax, Nova Scotia, Canada, October 2008, pp. 19-26.
[72] G. Mandler, Mind and Emotion. Wiley, 1975.
[73] J. Martin, R. Niewiadomski, L. Devillers, S. Buisine, and C. Pelachaud, "Multimodal complex emotions: Gesture expressivity and blended facial expressions," International Journal of Humanoid Robotics, vol. 3, no. 3, pp. 269-292, 2006.
[74] D. Massaro, "Fuzzy logical model of bimodal emotion perception: Comment on 'The perception of emotions by ear and by eye' by de Gelder and Vroomen," Cognition & Emotion, vol. 14, no. 3, pp. 313-320, 2000.
[75] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, pp. 746-748, 1976.
[76] H. K. M. Meeren, C. C. R. J. van Heijnsbergen, and B. de Gelder, "Rapid perceptual integration of facial expression and emotional body language," Proceedings of the National Academy of Sciences, vol. 102, no. 45, pp. 16518-16523, 2005.
[77] A. Metallinou, C. Busso, S. Lee, and S. S. Narayanan, "Visual emotion recognition using compact facial representations and viseme information," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, March 2010, pp. 2474-2477.
[78] A. Metallinou, S. Lee, and S. Narayanan, "Decision level combination of multiple modalities for recognition and analysis of emotional expression," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, March 2010, pp. 2462-2465.
[79] T. M. Mitchell, Machine Learning. McGraw Hill, 1997.
[80] D. Mobbs, N. Weiskopf, H. C. Lau, E. Featherstone, R. J. Dolan, and C. D. Frith, "The Kuleshov Effect: the influence of contextual framing on emotional attributions," Social Cognitive and Affective Neuroscience, vol. 1, no. 2, pp. 95-106, 2006.
[81] E. Mower, S. Lee, M. J. Matarić, and S. Narayanan, "Human perception of synthetic character emotions in the presence of conflicting and congruent vocal and facial expressions," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Las Vegas, NV, April 2008, pp. 2201-2204.
[82] ---, "Joint-processing of audio-visual signals in human perception of conflicting synthetic character emotions," in IEEE International Conference on Multimedia & Expo (ICME), Hannover, Germany, 2008, pp. 961-964.
[83] E. Mower, M. Matarić, and S. Narayanan, "Human perception of audio-visual synthetic character emotion expression in the presence of ambiguous and conflicting information," IEEE Transactions on Multimedia, vol. 11, no. 5, pp. 843-855, 2009.
[84] ---, "A framework for automatic human emotion classification using emotion profiles," IEEE Transactions on Audio, Speech, and Language Processing, accepted for publication, 2010.
[85] E. Mower, A. Metallinou, C.-C. Lee, A. Kazemzadeh, C. Busso, S. Lee, and S. Narayanan, "Interpreting ambiguous emotional expressions," in ACII Special Session: Recognition of Non-Prototypical Emotion from Speech - The Final Frontier?, Amsterdam, The Netherlands, September 2009, pp. 1-8.
[86] O. Mowrer, Learning Theory and Behavior. Wiley New York, 1960.
[87] M. Nicolao, C. Drioli, and P. Cosi, "Voice GMM modelling for FESTIVAL/MBROLA emotive TTS synthesis," in International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 2006, pp. 1794-1797.
[88] A. Ortony and T. Turner, "What's basic about basic emotions," Psychological Review, vol. 97, no. 3, pp. 315-331, 1990.
[89] A. Ortony and A. Collins, The Cognitive Structure of Emotions. Cambridge University Press, 1988.
[90] P. Oudeyer, "The production and recognition of emotions in speech: features and algorithms," International Journal of Human-Computer Studies, vol. 59, no. 1-2, pp. 157-183, 2003.
[91] M. Pantic, A. Pentland, A. Nijholt, and T. Huang, \Human computing and ma-
chine understanding of human behavior: a survey," Artical Intelligence for Human
Computing, pp. 47{71, 2007.
[92] M. Pantic, N. Sebe, J. Cohn, and T. Huang, \Aective multimodal human-computer
interaction," in ACM international conference on Multimedia. ACM New York,
NY, USA, 2005, pp. 669{676.
[93] R. W. Picard, Aective Computing. MIT Press, 1997.
[94] R. Plutchik, Emotion: A psychoevolutionary synthesis. Harper & Row, New York,
1980.
[95] P. Rani, C. Liu, and N. Sarkar, \An empirical study of machine learning techniques
for aect recognition in human{robot interaction," Pattern Analysis & Applications,
vol. 9, no. 1, pp. 58{69, May 2006.
[96] P. Robbel, M. Hoque, and C. Breazeal, \An integrated approach to emotional
speech and gesture synthesis in humanoid robots," in Proceedings of the Interna-
tional Workshop on Aective-Aware Virtual Agents and Social Robots. ACM, 2009,
pp. 1{4.
[97] J. Russell, \A circumplex model of aect," Journal of personality and social psy-
chology, vol. 39, no. 6, pp. 1161{1178, 1980.
[98] J. Russell and L. Barrett, \Core aect, prototypical emotional episodes, and other
things called emotion: Dissecting the elephant," Journal of Personality and Social
Psychology, vol. 76, no. 5, pp. 805{819, 1999.
[99] S. Schacter and J. Singer, \Cognitive and emotional determinants of emotional
states," Psychological Review, vol. 69, pp. 379{399, 1962.
[100] H. Schlosberg, \Three dimensions of emotion," Psychological review, vol. 61, no. 2,
pp. 81{88, 1954.
[101] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers,
L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, \The relevance of feature
type for the automatic classication of emotional user states: Low level descriptors
and functionals," in Interspeech, Antwerp, Belgium, August 2007, pp. 2253{2256.
[102] B. Schuller, S. Steidl, and A. Batliner, \The Interspeech 2009 emotion challenge,"
in Interspeech, Brighton, UK, September 2009, pp. 312{315.
[103] N. Sebe, I. Cohen, T. Gevers, and T. Huang, \Emotion recognition based on joint
visual and audio cues," in International Conference on Pattern Recognition, 2006,
pp. 1136{1139.
[104] D. Seppi, A. Batliner, B. Schuller, S. Steidl, T. Vogt, J. Wagner, L. Devillers,
L. Vidrascu, N. Amir, and V. Aharonson, \Patterns, Prototypes, Performance:
Classifying Emotional User States," in InterSpeech, Brisbane, AU, 2008, pp. 601{
604.
157
[105] S. Steidl, M. Levit, A. Batliner, E. Noth, and H. Niemann, \\Of all things the mea-
sure is man": Automatic classication of emotions and inter-labeler consistency,"
in ICASSP, 2005., vol. 1, 2005, pp. 317{320.
[106] S. Sutton, R. Cole, J. de Villiers, J. Schalkwyk, P. Vermeulen, M. Macon, Y. Yan,
E. Kaiser, B. Rundle, K. Shobaki, P. Hosom, A. Kain, J. Wouters, D. Massaro, and
M. Cohen, \Universal speech tools: the CSLU toolkit," in International Conference
on Spoken Language Processing (ICSLP), Sydney, Australia, November { December
1998, pp. 3221{3224.
[107] W. Swartout, J. Gratch, R. Hill, E. Hovy, S. Marsella, J. Rickel, and D. Traum,
\Toward virtual humans," AI Magazine, vol. 27, no. 2, pp. 96{108, 2006.
[108] M. Swerts and E. Krahmer, \The importance of dierent facial areas for signalling
visual prominence," in International Conference on Spoken Language Processing
(ICSLP), Pittsburgh, PA, USA, 2006, pp. 1280{1283.
[109] F. Thomas and O. Johnston, Disney animation: The illusion of life. Abbeville
Press New York, 1981.
[110] S. E. Tranter and D. A. Reynolds, \An overview of automatic speaker diarization
systems," IEEE Transactions Audio, Speech, Language Processing, vol. 14, no. 5,
pp. 1557{1565, 2006.
[111] K. Truong, M. Neerincx, and D. van Leeuwen, \Assessing agreement of observer-
and self-annotations in spontaneous multimodal emotion data," in Interspeech, Bris-
bane, Australia, 2008, pp. 318{321.
[112] N. Tsapatsoulis, A. Raouzaiou, S. Kollias, R. Cowie, and E. Douglas-Cowie, \Emo-
tion recognition and synthesis based on MPEG-4 FAP's," in MPEG-4 Facial
Animation: The Standard, Implementation, and Applications, I. S. Pandzic and
R. Forchheimer, Eds. John Wiley & Sons, Ltd., 2002, ch. 9, pp. 141{167.
[113] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[114] T. Vogt and E. Andre, \Comparing feature sets for acted and spontaneous speech
in view of automatic emotion recognition," in IEEE International Conference on
Multimedia & Expo (ICME), Los Alamitos, CA, USA, 2005, pp. 474{477.
[115] M. Wimmer, B. Schuller, D. Arsic, G. Rigoll, and B. Radig, \Low-level fusion
of audio and video feature for multi-modal emotion recognition," in International
Conference on Computer Vision Theory and Applications (VISAPP), Madeira, Por-
tugal, 2008, pp. 145{151.
[116] I. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. Cunningham, \Weka:
Practical machine learning tools and techniques with Java implementations," in
ANNES International Workshop on Emerging Engineering and Connectionnist-
based Information Systems, vol. 99, Dunedin, New Zealand, 1999, pp. 192{196.
[117] R. Xu and D. Wunsch, Clustering. Wiley-IEEE Press, 2008.
158
[118] S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and
S. Narayanan, \An acoustic study of emotions expressed in speech." in International
Conference on Spoken Language Processing (ICSLP), Jeju Island, South Korea,
2004, pp. 2193{2196.
[119] S. Young, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK book.
Entropic Cambridge Research Laboratory Cambridge, England, 1997.
[120] J. Zelenski and R. Larsen, \The distribution of basic emotions in everyday life: A
state and trait perspective from experience sampling data," Journal of Research in
Personality, vol. 34, no. 2, pp. 178{197, 2000.
[121] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, \A survey of aect recognition
methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39{58, January 2009.
[122] Z. Zeng, J. Tu, B. Pianfetti, and T. Huang, \Audio{Visual Aective Expression
Recognition Through Multistream Fused HMM," IEEE Transactions on Multime-
dia, vol. 10, no. 4, pp. 570{577, 2008.
159
Abstract
Emotion has intrigued researchers for generations. This fascination has permeated the engineering community, motivating the development of affective computational models for the classification of affective states. However, human emotion remains notoriously difficult to interpret computationally, both because of the mismatch between the emotional cue generation (the speaker) and perception (the observer) processes and because of the presence of complex emotions: emotions that contain shades of multiple affective classes. Proper representations of emotion would ameliorate this problem by introducing multidimensional characterizations of the data that permit the quantification and description of the varied affective components of each utterance. Such mathematical representations of emotion remain under-explored.
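The multidimensional characterization described above can be pictured as a soft profile over several affective classes rather than a single hard label. The following is a minimal illustrative sketch only, not the thesis's actual emotion-profile implementation: the class set, the confidence scores, and the ambiguity margin are hypothetical choices made for the example.

```python
# Illustrative sketch only: a toy "emotion profile" in which one utterance is
# described by its degree of membership in several affective classes rather
# than by a single hard label. The class set and the example confidence
# scores below are hypothetical and are not taken from the thesis.
from dataclasses import dataclass
from typing import Dict


@dataclass
class EmotionProfile:
    """Soft, multidimensional description of one utterance's emotional content."""
    scores: Dict[str, float]  # e.g., per-class classifier confidences

    def normalized(self) -> Dict[str, float]:
        """Rescale the scores so they sum to one, giving a profile over classes."""
        total = sum(self.scores.values())
        if total == 0:
            # Uninformative input: fall back to a uniform profile.
            return {label: 1.0 / len(self.scores) for label in self.scores}
        return {label: value / total for label, value in self.scores.items()}

    def dominant(self) -> str:
        """Return the class with the highest weight in the profile."""
        return max(self.scores, key=self.scores.get)

    def is_ambiguous(self, margin: float = 0.1) -> bool:
        """Treat the utterance as ambiguous when the top two classes are close."""
        top_two = sorted(self.normalized().values(), reverse=True)[:2]
        return (top_two[0] - top_two[1]) < margin


if __name__ == "__main__":
    # Hypothetical per-class confidences for one utterance.
    profile = EmotionProfile({"angry": 0.45, "happy": 0.05, "neutral": 0.40, "sad": 0.10})
    print(profile.normalized())    # the multidimensional characterization
    print(profile.dominant())      # "angry"
    print(profile.is_ambiguous())  # True: anger and neutrality are nearly tied
```

The design point of the sketch is that retaining the full profile, instead of collapsing it to the dominant class, preserves exactly the ambiguity and blended affective content that a hard label would discard.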